pack

pack TEMPLATE,LIST

Takes a LIST of values and converts it into a string using the rules given by the TEMPLATE. The resulting string is the concatenation of the converted values. Typically, each converted value looks like its machine-level representation. For example, on 32-bit machines an integer may be represented by a sequence of 4 bytes, which will in Perl be presented as a string that's 4 characters long.
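As a minimal sketch of that idea (the variable names are illustrative, not part of the original text), the following packs one integer with the fixed-size N format and inspects the resulting 4-character string:

```perl
use strict;
use warnings;

# Pack the integer 258 (0x0102) as a 32-bit big-endian value ("N").
my $packed = pack("N", 258);

# Dump the packed string as hexadecimal digits, high nybble first.
my $hex = unpack("H*", $packed);

# unpack() with the same template recovers the original number.
my ($value) = unpack("N", $packed);

printf "length=%d hex=%s value=%d\n", length($packed), $hex, $value;
```

N is used here rather than a native format so that the result does not depend on the machine's byte order.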

See perlpacktut for an introduction to this function.

The TEMPLATE is a sequence of characters that give the order and type of values, as follows:

a  A string with arbitrary binary data, will be null padded.
A  A text (ASCII) string, will be space padded.
Z  A null-terminated (ASCIZ) string, will be null padded.

b  A bit string (ascending bit order inside each byte,
   like vec()).
B  A bit string (descending bit order inside each byte).
h  A hex string (low nybble first).
H  A hex string (high nybble first).

c  A signed char (8-bit) value.
C  An unsigned char (octet) value.
W  An unsigned char value (can be greater than 255).

s  A signed short (16-bit) value.
S  An unsigned short value.

l  A signed long (32-bit) value.
L  An unsigned long value.

q  A signed quad (64-bit) value.
Q  An unsigned quad value.
     (Quads are available only if your system supports 64-bit
      integer values _and_ if Perl has been compiled to support
      those.  Raises an exception otherwise.)

i  A signed integer value.
I  An unsigned integer value.
     (This 'integer' is _at_least_ 32 bits wide.  Its exact
      size depends on what a local C compiler calls 'int'.)

n  An unsigned short (16-bit) in "network" (big-endian) order.
N  An unsigned long (32-bit) in "network" (big-endian) order.
v  An unsigned short (16-bit) in "VAX" (little-endian) order.
V  An unsigned long (32-bit) in "VAX" (little-endian) order.

j  A Perl internal signed integer value (IV).
J  A Perl internal unsigned integer value (UV).

f  A single-precision float in native format.
d  A double-precision float in native format.

F  A Perl internal floating-point value (NV) in native format.
D  A float of long-double precision in native format.
     (Long doubles are available only if your system supports
      long double values _and_ if Perl has been compiled to
      support those.  Raises an exception otherwise.
      Note that there are different long double formats.)

p  A pointer to a null-terminated string.
P  A pointer to a structure (fixed-length string).

u  A uuencoded string.
U  A Unicode character number.  Encodes to a character in
   character mode and UTF-8 (or UTF-EBCDIC in EBCDIC platforms)
   in byte mode.

w  A BER compressed integer (not an ASN.1 BER, see perlpacktut
   for details).  Its bytes represent an unsigned integer in
   base 128, most significant digit first, with as few digits
   as possible.  Bit eight (the high bit) is set on each byte
   except the last.

x  A null byte (a.k.a. ASCII NUL, "\000", chr(0)).
X  Back up a byte.
@  Null-fill or truncate to absolute position, counted from the
   start of the innermost ()-group.
.  Null-fill or truncate to absolute position specified by
   the value.
(  Start of a ()-group.
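The base-128 encoding used by the w format listed above may be easier to follow with a worked value. This is only an illustrative sketch; the numbers 300 and 127 are arbitrary choices:

```perl
use strict;
use warnings;

# 300 = 2 * 128 + 44, so its base-128 digits are 2 and 44.
# The first byte has the high bit set (0x80 | 2 = 0x82);
# the last byte (44 = 0x2c) does not.
my $ber     = pack("w", 300);
my $ber_hex = unpack("H*", $ber);

# Values below 128 fit in a single byte.
my $small_hex = unpack("H*", pack("w", 127));

# unpack() reverses the encoding.
my ($round_trip) = unpack("w", $ber);

print "$ber_hex $small_hex $round_trip\n";
```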

One or more modifiers below may optionally follow certain letters in the TEMPLATE (the second column lists letters for which the modifier is valid):

!   sSlLiI     Forces native (short, long, int) sizes instead
               of fixed (16-/32-bit) sizes.

!   xX         Make x and X act as alignment commands.

!   nNvV       Treat integers as signed instead of unsigned.

!   @.         Specify position as byte offset in the internal
               representation of the packed string.  Efficient
               but dangerous.

>   sSiIlLqQ   Force big-endian byte-order on the type.
    jJfFdDpP   (The "big end" touches the construct.)

<   sSiIlLqQ   Force little-endian byte-order on the type.
    jJfFdDpP   (The "little end" touches the construct.)

The > and < modifiers can also be used on () groups to force a particular byte-order on all components in that group, including all its subgroups.

The following rules apply:

  • Each letter may optionally be followed by a number indicating the repeat count. A numeric repeat count may optionally be enclosed in brackets, as in pack("C[80]", @arr). The repeat count gobbles that many values from the LIST when used with all format types other than a, A, Z, b, B, h, H, @, ., x, X, and P, where it means something else, described below. Supplying a * for the repeat count instead of a number means to use however many items are left, except for:

    • @, x, and X, where it is equivalent to 0.

    • ., where it means relative to the start of the string.

    • u, where it is equivalent to 1 (or 45, which here is equivalent).

    One can replace a numeric repeat count with a template letter enclosed in brackets to use the packed byte length of the bracketed template for the repeat count.

    For example, the template x[L] skips as many bytes as in a packed long, and the template "$t X[$t] $t" unpacks twice whatever $t (when variable-expanded) unpacks. If the template in brackets contains alignment commands (such as x![d] ), its packed length is calculated as if the start of the template had the maximal possible alignment.

    When used with Z , a * as the repeat count is guaranteed to add a trailing null byte, so the resulting string is always one byte longer than the byte length of the item itself.

    When used with @ , the repeat count represents an offset from the start of the innermost () group.

    When used with ., the repeat count determines the starting position to calculate the value offset as follows:

    • If the repeat count is 0 , it's relative to the current position.

    • If the repeat count is * , the offset is relative to the start of the packed string.

    • And if it's an integer n, the offset is relative to the start of the nth innermost ( ) group, or to the start of the string if n is bigger than the group level.

    The repeat count for u is interpreted as the maximal number of bytes to encode per line of output, with 0, 1 and 2 replaced by 45. The repeat count should not be more than 65.
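    A short sketch of the repeat-count forms described in this item follows. The sample values and variable names are made up for illustration:

```perl
use strict;
use warnings;

my @bytes = (65, 66, 67, 68);

my $three   = pack("C3",   @bytes);   # takes three values: "ABC"
my $bracket = pack("C[3]", @bytes);   # bracketed count, same result
my $all     = pack("C*",   @bytes);   # * takes every remaining value: "ABCD"

# x[N] skips as many bytes as a packed N occupies (4), giving
# one byte, four null bytes, then another byte.
my $skipped = pack("C x[N] C", 1, 2);

# Z* always appends a trailing null byte.
my $asciz = pack("Z*", "abc");

printf "%s %s %s %d %d\n",
    $three, $bracket, $all, length($skipped), length($asciz);
```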

  • The a, A, and Z types gobble just one value, but pack it as a string of length count, padding with nulls or spaces as needed. When unpacking, A strips trailing whitespace and nulls, Z strips everything after the first null, and a returns data with no stripping at all.

    If the value to pack is too long, the result is truncated. If it's too long and an explicit count is provided, Z packs only $count-1 bytes, followed by a null byte. Thus Z always packs a trailing null, except when the count is 0.
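    The padding, truncation, and stripping rules above can be compared side by side. This is a small sketch with arbitrary sample strings:

```perl
use strict;
use warnings;

# Packing: one value each, padded or truncated to the count.
my $a_packed = pack("a5", "hi");            # "hi" plus three null bytes
my $A_packed = pack("A5", "hi");            # "hi" plus three spaces
my $Z_packed = pack("Z5", "hello world");   # "hell" plus a null byte

# Unpacking: a keeps everything, A strips trailing spaces and nulls,
# Z stops at the first null.
my ($a_out) = unpack("a5", "hi\0\0\0");
my ($A_out) = unpack("A5", "hi   ");
my ($Z_out) = unpack("Z5", "hi\0xx");

printf "%d %d %d\n", length($a_out), length($A_out), length($Z_out);
```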

  • Likewise, the b and B formats pack a string that's that many bits long. Each such format generates 1 bit of the result. These are typically followed by a repeat count like B8 or B64 .

    Each result bit is based on the least-significant bit of the corresponding input character, i.e., on ord($char)%2. In particular, characters "0" and "1" generate bits 0 and 1, as do characters "\000" and "\001" .

    Starting from the beginning of the input string, each 8-tuple of characters is converted to 1 character of output. With format b , the first character of the 8-tuple determines the least-significant bit of a character; with format B , it determines the most-significant bit of a character.

    If the length of the input string is not evenly divisible by 8, the remainder is packed as if the input string were padded by null characters at the end. Similarly during unpacking, "extra" bits are ignored.

    If the input string is longer than needed, remaining characters are ignored.

    A * for the repeat count uses all characters of the input field. On unpacking, bits are converted to a string of 0s and 1s.
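    To make the two bit orders concrete, here is a sketch that builds the same byte (0x41, the letter "A" on ASCII systems) both ways. The input strings are illustrative:

```perl
use strict;
use warnings;

# B: the first input character is the most-significant bit.
my $msb_first = pack("B8", "01000001");

# b: the first input character is the least-significant bit,
# so the same byte is written with its bits reversed.
my $lsb_first = pack("b8", "10000010");

# A count that is not a multiple of 8 is padded with zero bits:
# "1111" becomes 11110000, that is, 0xF0.
my $padded = pack("B4", "1111");

# Unpacking turns each bit back into a "0" or "1" character.
my ($bits_B) = unpack("B8", "A");
my ($bits_b) = unpack("b8", "A");

print "$bits_B $bits_b\n";
```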

  • The h and H formats pack a string that many nybbles (4-bit groups, representable as hexadecimal digits, "0".."9", "a".."f") long.

    For each such format, pack() generates 4 bits of result. With non-alphabetical characters, the result is based on the 4 least-significant bits of the input character, i.e., on ord($char)%16. In particular, characters "0" and "1" generate nybbles 0 and 1, as do bytes "\000" and "\001" . For characters "a".."f" and "A".."F" , the result is compatible with the usual hexadecimal digits, so that "a" and "A" both generate the nybble 0xA==10 . Use only these specific hex characters with this format.

    Starting from the beginning of the input string, each pair of characters is converted to 1 character of output. With format h, the first character of the pair determines the least-significant nybble of the output character; with format H, it determines the most-significant nybble.

    If the length of the input string is not even, it behaves as if padded by a null character at the end. Similarly, "extra" nybbles are ignored during unpacking.

    If the input string is longer than needed, extra characters are ignored.

    A * for the repeat count uses all characters of the input field. For unpack(), nybbles are converted to a string of hexadecimal digits.
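    The nybble ordering can be checked with a short sketch like the one below. The hex strings are arbitrary examples:

```perl
use strict;
use warnings;

# H: high nybble first, the usual way of writing hex.
my $high_first = pack("H4", "4142");   # bytes 0x41 0x42

# h: low nybble first, so each pair of digits is swapped.
my $low_first = pack("h4", "1424");    # also bytes 0x41 0x42

# An odd count is padded with a zero nybble: "414" gives 0x41 0x40.
my $odd = pack("H3", "414");

# Unpacking yields hexadecimal digits again.
my ($hex_H) = unpack("H*", $high_first);
my ($hex_h) = unpack("h*", $high_first);

print "$hex_H $hex_h\n";
```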

  • The p format packs a pointer to a null-terminated string. You are responsible for ensuring that the string is not a temporary value, as that could potentially get deallocated before you got around to using the packed result. The P format packs a pointer to a structure of the size indicated by the length. A null pointer is created if the corresponding value for p or P is undef; similarly with unpack(), where a null pointer unpacks into undef.

    If your system has a strange pointer size--meaning a pointer is neither as big as an int nor as big as a long--it may not be possible to pack or unpack pointers in big- or little-endian byte order. Attempting to do so raises an exception.

  • The / template character allows packing and unpacking of a sequence of items where the packed structure contains a packed item count followed by the packed items themselves. This is useful when the structure you're unpacking has encoded the sizes or repeat counts for some of its fields within the structure itself as separate fields.

    For pack, you write length-item/sequence-item, and the length-item describes how the length value is packed. Formats likely to be of most use are integer-packing ones like n for Java strings, w for ASN.1 or SNMP, and N for Sun XDR.

    For pack, sequence-item may have a repeat count, in which case the minimum of that and the number of available items is used as the argument for length-item. If it has no repeat count or uses a '*', the number of available items is used.

    For unpack, an internal stack of integer arguments unpacked so far is used. You write /sequence-item and the repeat count is obtained by popping off the last element from the stack. The sequence-item must not have a repeat count.

    If sequence-item refers to a string type ("A" , "a" , or "Z" ), the length-item is the string length, not the number of strings. With an explicit repeat count for pack, the packed string is adjusted to that length. For example:

    This code:                             gives this result:
    
    unpack("W/a", "\004Gurusamy")          ("Guru")
    unpack("a3/A A*", "007 Bond  J ")      (" Bond", "J")
    unpack("a3 x2 /A A*", "007: Bond, J.") ("Bond, J", ".")
    
    pack("n/a* w/a","hello,","world")     "\000\006hello,\005world"
    pack("a/W2", ord("a") .. ord("z"))    "2ab"
    

    The length-item is not returned explicitly from unpack.

    Supplying a count to the length-item format letter is only useful with A , a , or Z . Packing with a length-item of a or Z may introduce "\000" characters, which Perl does not regard as legal in numeric strings.
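    One common use of / is a round trip through length-prefixed fields. The sketch below combines it with a repeated () group; the data and variable names are invented for illustration:

```perl
use strict;
use warnings;

# Each string is stored as a 16-bit big-endian length followed by its bytes.
my @names  = ("alpha", "be", "gamma");
my $packed = pack("(n/a*)*", @names);
my @back   = unpack("(n/a*)*", $packed);

# For a non-string sequence-item, the prefix is the number of items.
my $counted = pack("C/N*", 10, 20, 30);   # one count byte, then three N values
my @numbers = unpack("C/N", $counted);    # no repeat count on unpack

print "@back | @numbers\n";
```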

  • The integer types s, S, l, and L may be followed by a ! modifier to specify native shorts or longs. As the format table above notes, a bare l means exactly 32 bits, although the native long as seen by the local C compiler may be larger. This is mainly an issue on 64-bit platforms. You can see whether using ! makes any difference this way:

    printf "format s is %d, s! is %d\n",
        length pack("s"), length pack("s!");

    printf "format l is %d, l! is %d\n",
        length pack("l"), length pack("l!");
    

    i! and I! are also allowed, but only for completeness' sake: they are identical to i and I .

    The actual sizes (in bytes) of native shorts, ints, longs, and long longs on the platform where Perl was built are also available from the command line:

    $ perl -V:{short,int,long{,long}}size
    shortsize='2';
    intsize='4';
    longsize='4';
    longlongsize='8';
    

    or programmatically via the Config module:

    use Config;
    print $Config{shortsize},    "\n";
    print $Config{intsize},      "\n";
    print $Config{longsize},     "\n";
    print $Config{longlongsize}, "\n";
    

    $Config{longlongsize} is undefined on systems without long long support.

  • The integer formats s, S , i , I , l , L , j , and J are inherently non-portable between processors and operating systems because they obey native byteorder and endianness. For example, a 4-byte integer 0x12345678 (305419896 decimal) would be ordered natively (arranged in and handled by the CPU registers) into bytes as

    0x12 0x34 0x56 0x78  # big-endian
    0x78 0x56 0x34 0x12  # little-endian
    

    Basically, Intel and VAX CPUs are little-endian, while everybody else, including Motorola m68k/88k, PPC, Sparc, HP PA, Power, and Cray, are big-endian. Alpha and MIPS can be either: Digital/Compaq uses (well, used) them in little-endian mode, but SGI/Cray uses them in big-endian mode.

    The names big-endian and little-endian are comic references to the egg-eating habits of the little-endian Lilliputians and the big-endian Blefuscudians from the classic Jonathan Swift satire, Gulliver's Travels. This entered computer lingo via the paper "On Holy Wars and a Plea for Peace" by Danny Cohen, USC/ISI IEN 137, April 1, 1980.

    Some systems may have even weirder byte orders such as

    0x56 0x78 0x12 0x34
    0x34 0x12 0x78 0x56
    

    These are called mid-endian, middle-endian, mixed-endian, or just weird.

    You can determine your system endianness with this incantation:

    printf("%#02x ", $_) for unpack("W*", pack L=>0x12345678);
    

    The byteorder on the platform where Perl was built is also available via Config:

    use Config;
    print "$Config{byteorder}\n";
    

    or from the command line:

    $ perl -V:byteorder
    

    Byteorders "1234" and "12345678" are little-endian; "4321" and "87654321" are big-endian. Systems with multiarchitecture binaries will have "ffff", signifying that static information doesn't work and that one must use runtime probing.

    For portably packed integers, either use the formats n, N, v, and V or else use the > and < modifiers described immediately below. See also perlport.
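    The difference between fixed-order and native formats can be seen in a sketch like this one. The runtime probe at the end is one possible approach, not the only one:

```perl
use strict;
use warnings;

my $value = 0x12345678;

# N and V always produce the same bytes, whatever the host CPU.
my $big    = unpack("H*", pack("N", $value));
my $little = unpack("H*", pack("V", $value));

# L> and L< are equivalent spellings for the unsigned 32-bit case.
my $same_big    = pack("L>", $value) eq pack("N", $value);
my $same_little = pack("L<", $value) eq pack("V", $value);

# Plain L follows the host, so comparing it with N and V probes
# the byte order at run time.
my $native = pack("L", $value);
my $order  = $native eq pack("V", $value) ? "little-endian"
           : $native eq pack("N", $value) ? "big-endian"
           :                                "mixed-endian";

print "$big $little $order\n";
```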

  • Floating-point numbers also have endianness. Usually (but not always) this agrees with the integer endianness. Even though most platforms these days use the IEEE 754 binary format, there are differences, especially if long doubles are involved. You can inspect the Config variables doublekind and longdblkind (also doublesize, longdblsize): the "kind" values are enums, unlike byteorder.

    Portability-wise, the best option is probably to keep to IEEE 754 64-bit doubles of an agreed-upon endianness. Another possibility is the "%a" format of printf.

  • Starting with Perl 5.10.0, integer and floating-point formats, along with the p and P formats and () groups, may all be followed by the > or < endianness modifiers to respectively enforce big- or little-endian byte-order. These modifiers are especially useful given how n , N , v , and V don't cover signed integers, 64-bit integers, or floating-point values.

    Here are some concerns to keep in mind when using an endianness modifier:

    • Exchanging signed integers between different platforms works only when all platforms store them in the same format. Most platforms store signed integers in two's-complement notation, so usually this is not an issue.

    • The > or < modifiers can only be used on floating-point formats on big- or little-endian machines. Otherwise, attempting to use them raises an exception.

    • Forcing big- or little-endian byte-order on floating-point values for data exchange can work only if all platforms use the same binary representation such as IEEE floating-point. Even if all platforms are using IEEE, there may still be subtle differences. Being able to use > or < on floating-point values can be useful, but also dangerous if you don't know exactly what you're doing. It is not a general way to portably store floating-point values.

    • When using > or < on a () group, this affects all types inside the group that accept byte-order modifiers, including all subgroups. It is silently ignored for all other types. You are not allowed to override the byte-order within a group that already has a byte-order modifier suffix.
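    As a sketch of these modifiers on signed integers and on a group, assuming a two's-complement platform and Perl 5.10.0 or later (sample values are arbitrary):

```perl
use strict;
use warnings;

# Signed 32-bit big-endian: -2 is ff ff ff fe in two's complement.
my $l_big = unpack("H*", pack("l>", -2));

# Signed 16-bit little-endian: the low byte comes first.
my $s_little = unpack("H*", pack("s<", -2));

# A modifier on a group applies to every type inside it.
my $group = unpack("H*", pack("(s l)>", 1, 2));

# Unpacking with the same modifier restores the sign.
my ($restored) = unpack("l>", "\xff\xff\xff\xfe");

print "$l_big $s_little $group $restored\n";
```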

  • Real numbers (floats and doubles) are in native machine format only. Due to the multiplicity of floating-point formats and the lack of a standard "network" representation for them, no facility for interchange has been made. This means that packed floating-point data written on one machine may not be readable on another, even if both use IEEE floating-point arithmetic (because the endianness of the memory representation is not part of the IEEE spec). See also perlport.

    If you know exactly what you're doing, you can use the > or < modifiers to force big- or little-endian byte-order on floating-point values.

    Because Perl uses doubles (or long doubles, if configured) internally for all numeric calculation, converting from double into float and thence to double again loses precision, so unpack("f", pack("f", $foo)) will not in general equal $foo.
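    The precision loss is easy to observe. In this sketch, 0.1 is chosen because it has no exact binary representation, and 0.5 because it does; both are illustrative values:

```perl
use strict;
use warnings;

my $x = 0.1;

# Round-trip through a single-precision float.
my ($via_float) = unpack("f", pack("f", $x));

# The narrower float keeps fewer significant bits, so the value changes.
my $lost = ($via_float != $x) ? 1 : 0;

# A value that fits exactly in single precision survives unchanged.
my ($exact) = unpack("f", pack("f", 0.5));

printf "%.17g became %.17g\n", $x, $via_float;
```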

  • Pack and unpack can operate in two modes: character mode (C0 mode) where the packed string is processed per character, and UTF-8 byte mode (U0 mode) where the packed string is processed in its UTF-8-encoded Unicode form on a byte-by-byte basis. Character mode is the default unless the format string starts with U . You can always switch mode mid-format with an explicit C0 or U0 in the format. This mode remains in effect until the next mode change, or until the end of the () group it (directly) applies to.

    Using C0 to get Unicode characters while using U0 to get non-Unicode bytes is not necessarily obvious. Probably only the first of these is what you want:

    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' | 
      perl -CS -ne 'printf "%v04X\n", $_ for unpack("C0A*", $_)'
    03B1.03C9
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' | 
      perl -CS -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
    CE.B1.CF.89
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' | 
      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("C0A*", $_)'
    CE.B1.CF.89
    $ perl -CS -E 'say "\x{3B1}\x{3C9}"' | 
      perl -C0 -ne 'printf "%v02X\n", $_ for unpack("U0A*", $_)'
    C3.8E.C2.B1.C3.8F.C2.89
    

    Those examples also illustrate that you should not try to use pack/unpack as a substitute for the Encode module.

  • You must yourself do any alignment or padding by inserting, for example, enough "x"es while packing. There is no way for pack() and unpack() to know where characters are going to or coming from, so they handle their output and input as flat sequences of characters.

  • A () group is a sub-TEMPLATE enclosed in parentheses. A group may take a repeat count either as postfix, or for unpack(), also via the / template character. Within each repetition of a group, positioning with @ starts over at 0. Therefore, the result of

    pack("@1A((@2A)@3A)", qw[X Y Z])
    

    is the string "\0X\0\0YZ" .
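    A sketch of a repeated group, together with a check of the @ positioning result stated above, might look like this. Single quotes are used for the second template so that Perl does not try to interpolate the @ signs:

```perl
use strict;
use warnings;

# The group (A2 n) is applied twice: a 2-character tag, then a 16-bit count.
my $records = pack("(A2 n)2", "ab", 1, "cd", 2);
my @fields  = unpack("(A2 n)2", $records);

# @ positions restart at 0 inside each group.
my $positioned = pack('@1A((@2A)@3A)', qw[X Y Z]);

print "@fields ", length($positioned), "\n";
```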

  • x and X accept the ! modifier to act as alignment commands: they jump forward or back to the closest position aligned at a multiple of count characters. For example, to pack() or unpack() a C structure like

    struct {
        char   c;    /* one signed, 8-bit character */
        double d;
        char   cc[2];
    }
    

    one may need to use the template c x![d] d c[2] . This assumes that doubles must be aligned to the size of double.

    For alignment commands, a count of 0 is equivalent to a count of 1; both are no-ops.
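    The effect of x! can be sketched as follows. The expected struct length is computed from the platform's own packed double size rather than assumed to be 8:

```perl
use strict;
use warnings;

# x!4 pads with nulls up to the next multiple of 4.
my $aligned = pack("C x!4 C", 1, 2);

# For the struct template, the padding depends on the size of a double.
my $dsize  = length pack("d", 0);
my $struct = pack("c x![d] d c[2]", 1, 2.5, 3, 4);

# One char padded out to $dsize, the double itself, then two chars.
my $expected = 2 * $dsize + 2;

printf "%d %d %d\n", length($aligned), length($struct), $expected;
```

    Note that a C compiler may also add trailing padding to the struct, which this template does not account for.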

  • n, N, v, and V accept the ! modifier to represent signed 16-/32-bit integers in big-/little-endian order. This is portable only when all platforms sharing packed data use the same binary representation for signed integers; for example, when all platforms use two's-complement representation.
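    A brief sketch of the signed variants, assuming two's-complement integers:

```perl
use strict;
use warnings;

# -2 as a signed 16-bit value in each byte order.
my $net_hex = unpack("H*", pack("n!", -2));
my $vax_hex = unpack("H*", pack("v!", -2));

# The same two bytes read back as signed and as unsigned.
my ($signed)   = unpack("n!", "\xff\xfe");
my ($unsigned) = unpack("n",  "\xff\xfe");

print "$net_hex $vax_hex $signed $unsigned\n";
```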

  • Comments can be embedded in a TEMPLATE using # through the end of line. White space can separate pack codes from each other, but modifiers and repeat counts must follow immediately. Breaking complex templates into individual line-by-line components, suitably annotated, can do as much to improve legibility and maintainability of pack/unpack formats as /x can for complicated pattern matches.
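    For instance, a record layout might be written as an annotated, multi-line template. The field names in the comments are hypothetical:

```perl
use strict;
use warnings;

# Each line holds one pack code followed by a comment describing the field.
my $template = q{
    n     # record type, 16-bit big-endian
    C     # flags, one octet
    A4    # tag, text padded with spaces to 4 characters
};

my $record = pack($template, 1, 0x80, "ab");
my @fields = unpack($template, $record);

print unpack("H*", $record), " @fields\n";
```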

  • If TEMPLATE requires more arguments than pack() is given, pack() assumes additional "" arguments. If TEMPLATE requires fewer arguments than given, extra arguments are ignored.
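    Both cases can be sketched with string and octet formats (the values are arbitrary):

```perl
use strict;
use warnings;

# Two fields requested, one value given: the second is packed from "".
my $short = pack("A2 A2", "ab");

# One field requested, three values given: the extras are ignored.
my $extra = pack("C", 1, 2, 3);

printf "%d %d\n", length($short), length($extra);
```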

  • Attempting to pack the special floating point values Inf and NaN (infinity, also in negative, and not-a-number) into packed integer values (like "L" ) is a fatal error. The reason for this is that there simply isn't any sensible mapping for these special values into integers.

Examples:

$foo = pack("WWWW",65,66,67,68);
# $foo eq "ABCD"
$foo = pack("W4",65,66,67,68);
# same thing
$foo = pack("W4",0x24b6,0x24b7,0x24b8,0x24b9);
# same thing with Unicode circled letters.
$foo = pack("U4",0x24b6,0x24b7,0x24b8,0x24b9);
# same thing with Unicode circled letters.  You don't get the
# UTF-8 bytes because the U at the start of the format caused
# a switch to U0-mode, so the UTF-8 bytes get joined into
# characters
$foo = pack("C0U4",0x24b6,0x24b7,0x24b8,0x24b9);
# $foo eq "\xe2\x92\xb6\xe2\x92\xb7\xe2\x92\xb8\xe2\x92\xb9"
# This is the UTF-8 encoding of the string in the
# previous example

$foo = pack("ccxxcc",65,66,67,68);
# $foo eq "AB\0\0CD"

# NOTE: The examples above featuring "W" and "c" are true
# only on ASCII and ASCII-derived systems such as ISO Latin 1
# and UTF-8.  On EBCDIC systems, the first example would be
#      $foo = pack("WWWW",193,194,195,196);

$foo = pack("s2",1,2);
# "\001\000\002\000" on little-endian
# "\000\001\000\002" on big-endian

$foo = pack("a4","abcd","x","y","z");
# "abcd"

$foo = pack("aaaa","abcd","x","y","z");
# "axyz"

$foo = pack("a14","abcdefg");
# "abcdefg\0\0\0\0\0\0\0"

$foo = pack("i9pl", gmtime);
# a real struct tm (on my system anyway)

$utmp_template = "Z8 Z8 Z16 L";
$utmp = pack($utmp_template, @utmp1);
# a struct utmp (BSDish)

@utmp2 = unpack($utmp_template, $utmp);
# "@utmp1" eq "@utmp2"

sub bintodec {
    unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}

$foo = pack('sx2l', 12, 34);
# short 12, two zero bytes padding, long 34
$bar = pack('s@4l', 12, 34);
# short 12, zero fill to position 4, long 34
# $foo eq $bar
$baz = pack('s.l', 12, 4, 34);
# short 12, zero fill to position 4, long 34

$foo = pack('nN', 42, 4711);
# pack big-endian 16- and 32-bit unsigned integers
$foo = pack('S>L>', 42, 4711);
# exactly the same
$foo = pack('s<l<', -42, 4711);
# pack little-endian 16- and 32-bit signed integers
$foo = pack('(sl)<', -42, 4711);
# exactly the same

The same template may generally also be used in unpack().

doc_perl
2016-12-06 03:21:56