Quote and Quote-like Operators

While we usually think of quotes as literal values, in Perl they function as operators, providing various kinds of interpolating and pattern matching capabilities. Perl provides customary quote characters for these behaviors, but also provides a way for you to choose your quote character for any of them. In the following table, a {} represents any pair of delimiters you choose.

    Customary  Generic        Meaning        Interpolates
        ''       q{}          Literal             no
        ""      qq{}          Literal             yes
        ``      qx{}          Command             yes*
                qw{}         Word list            no
        //       m{}       Pattern match          yes*
                qr{}          Pattern             yes*
                 s{}{}      Substitution          yes*
                tr{}{}    Transliteration         no (but see below)
                 y{}{}    Transliteration         no (but see below)
        <<EOF                 here-doc            yes*

* unless the delimiter is ''.
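The rows of the table can be tried out directly. The following is a minimal sketch, with variable names chosen only for illustration, that compares a few customary forms with their generic equivalents:

```perl
use strict;
use warnings;

my $name = "world";

# '' and q{} are literal: $name is left as-is.
my $single  = 'hello $name';
my $generic = q{hello $name};

# "" and qq{} interpolate; any delimiter pair can be chosen.
my $double   = "hello $name";
my $qq_curly = qq{hello $name};
my $qq_pipe  = qq|hello $name|;

# qw{} builds a list of whitespace-separated words without interpolation.
my @words = qw{alpha beta $name};

# m{} is the generic form of //, and s{}{} performs a substitution.
my $is_match = "hello world" =~ m{wor} ? 1 : 0;
(my $swapped = $double) =~ s{world}{there};

print "$single\n$double\n@words\n$swapped\n";
```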

Non-bracketing delimiters use the same character fore and aft, but the four sorts of ASCII brackets (round, angle, square, curly) all nest, which means that

q{foo{bar}baz}

is the same as

'foo{bar}baz'

Note, however, that this does not always work for quoting Perl code:

$s = q{ if($x eq "}") ... }; # WRONG

is a syntax error. The Text::Balanced module (standard as of v5.8, and from CPAN before then) is able to do this properly.
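For simple cases there are two lighter workarounds than Text::Balanced. This sketch, which is only an illustration, escapes the unbalanced brace or switches to a delimiter that the quoted text does not contain:

```perl
use strict;
use warnings;

# Balanced brackets nest, so no escaping is needed here.
my $nested = q{foo{bar}baz};

# An unbalanced closing brace can be backslash-escaped...
my $escaped = q{ if ($x eq "\}") ... };

# ...or avoided entirely by picking a delimiter the text does not contain.
my $other = q[ if ($x eq "}") ... ];

print "$nested\n$escaped\n$other\n";
```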

There can be whitespace between the operator and the quoting characters, except when # is being used as the quoting character. q#foo# is parsed as the string foo, while q #foo# is the operator q followed by a comment; its argument will be taken from the next line. This allows you to write:

s {foo}  # Replace foo
  {bar}  # with bar.
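A small runnable version of both points, assuming nothing beyond core Perl (the variable names are illustrative):

```perl
use strict;
use warnings;

# '#' works as a delimiter only when it directly follows the operator.
my $hash_quoted = q#foo#;

# With bracketing delimiters, whitespace and comments may separate the parts.
my $text = "foo fighters";
$text =~ s {foo}  # Replace foo
           {bar}; # with bar.

print "$hash_quoted $text\n";
```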

The following escape sequences are available in constructs that interpolate, and in transliterations:

Sequence     Note  Description
\t                  tab               (HT, TAB)
\n                  newline           (NL)
\r                  return            (CR)
\f                  form feed         (FF)
\b                  backspace         (BS)
\a                  alarm (bell)      (BEL)
\e                  escape            (ESC)
\x{263A}     [1,8]  hex char          (example: SMILEY)
\x1b         [2,8]  restricted range hex char (example: ESC)
\N{name}     [3]    named Unicode character or character sequence
\N{U+263D}   [4,8]  Unicode character (example: FIRST QUARTER MOON)
\c[          [5]    control char      (example: chr(27))
\o{23072}    [6,8]  octal char        (example: SMILEY)
\033         [7,8]  restricted range octal char  (example: ESC)
  • [1]

    The result is the character specified by the hexadecimal number between the braces. See [8] below for details on which character.

    Only hexadecimal digits are valid between the braces. If an invalid character is encountered, a warning will be issued and the invalid character and all subsequent characters (valid or invalid) within the braces will be discarded.

    If there are no valid digits between the braces, the generated character is the NULL character (\x{00}). However, an explicit empty brace (\x{}) will not cause a warning (currently).

  • [2]

    The result is the character specified by the hexadecimal number in the range 0x00 to 0xFF. See [8] below for details on which character.

    Only hexadecimal digits are valid following \x. When \x is followed by fewer than two valid digits, any valid digits will be zero-padded. This means that \x7 will be interpreted as \x07, and a lone "\x" will be interpreted as \x00. Except at the end of a string, having fewer than two valid digits will result in a warning. Note that although the warning says the illegal character is ignored, it is only ignored as part of the escape and will still be used as the subsequent character in the string. For example:

    Original    Result    Warns?
    "\x7"       "\x07"    no
    "\x"        "\x00"    no
    "\x7q"      "\x07q"   yes
    "\xq"       "\x00q"   yes
    
  • [3]

    The result is the Unicode character or character sequence given by name. See charnames.

  • [4]

    \N{U+hexadecimal number} means the Unicode character whose Unicode code point is hexadecimal number.

  • [5]

    The character following \c is mapped to some other character as shown in the table:

    Sequence   Value
      \c@      chr(0)
      \cA      chr(1)
      \ca      chr(1)
      \cB      chr(2)
      \cb      chr(2)
      ...
      \cZ      chr(26)
      \cz      chr(26)
      \c[      chr(27)
                        # See below for chr(28)
      \c]      chr(29)
      \c^      chr(30)
      \c_      chr(31)
      \c?      chr(127) # (on ASCII platforms; see below for link to
                        #  EBCDIC discussion)
    

    In other words, the result is the character whose code point is that of the uppercased character with 64 xor'd into it. \c? is DELETE on ASCII platforms because ord("?") ^ 64 is 127, and \c@ is NULL because the ord of "@" is 64, so xor'ing 64 itself produces 0.

    Also, \c\X yields chr(28) . "X" (that is, chr(28) followed by X) for any X, but it cannot come at the end of a string, because the backslash would be parsed as escaping the end quote.

    On ASCII platforms, the resulting characters from the list above are the complete set of ASCII controls. This isn't the case on EBCDIC platforms; see OPERATOR DIFFERENCES in perlebcdic for a full discussion of the differences between these for ASCII versus EBCDIC platforms.

    Use of any other character following the "c" besides those listed above is discouraged, and as of Perl v5.20, the only characters actually allowed are the printable ASCII ones, minus the left brace "{". For any of the other allowed characters, the value is derived by xor'ing with the seventh bit, which is 64, and a warning is raised if enabled. Using a non-allowed character generates a fatal error.

    To get platform independent controls, you can use \N{...}.

  • [6]

    The result is the character specified by the octal number between the braces. See [8] below for details on which character.

    If a character that isn't an octal digit is encountered, a warning is raised, and the value is based on the octal digits before it, discarding it and all following characters up to the closing brace. It is a fatal error if there are no octal digits at all.

  • [7]

    The result is the character specified by the three-digit octal number in the range 000 to 777 (but best to not use above 077, see next paragraph). See [8] below for details on which character.

    Some contexts allow 2 or even 1 digit, but any usage without exactly three digits, the first being a zero, may give unintended results. (For example, in a regular expression it may be confused with a backreference; see Octal escapes in perlrebackslash.) Starting in Perl 5.14, you may use \o{} instead, which avoids all these problems. Otherwise, it is best to use this construct only for ordinals \077 and below, remembering to pad to the left with zeros to make three digits. For larger ordinals, either use \o{}, or convert to something else, such as to hex and use \N{U+} (which is portable between platforms with different character sets) or \x{} instead.

  • [8]

    Several constructs above specify a character by a number. That number gives the character's position in the character set encoding (indexed from 0). This is called synonymously its ordinal, code position, or code point. Perl works on platforms that have a native encoding currently of either ASCII/Latin1 or EBCDIC, each of which allows specification of 256 characters. In general, if the number is 255 (0xFF, 0377) or below, Perl interprets this in the platform's native encoding. If the number is 256 (0x100, 0400) or above, Perl interprets it as a Unicode code point and the result is the corresponding Unicode character.

    For example, \x{50} and \o{120} both are the number 80 in decimal, which is less than 256, so the number is interpreted in the native character set encoding. In ASCII the character in the 80th position (indexed from 0) is the letter "P", and in EBCDIC it is the ampersand symbol "&". \x{100} and \o{400} are both 256 in decimal, so the number is interpreted as a Unicode code point no matter what the native encoding is. The name of the character in the 256th position (indexed from 0) in Unicode is LATIN CAPITAL LETTER A WITH MACRON.

    There are a couple of exceptions to the above rule. \N{U+hex number} is always interpreted as a Unicode code point, so that \N{U+0050} is "P" even on EBCDIC platforms. And if use encoding is in effect, the number is considered to be in that encoding, and is translated from that into the platform's native encoding if there is a corresponding native character; otherwise to Unicode.
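The notes above can be checked with a short script. This is a sketch that assumes an ASCII platform and Perl 5.14 or later (for \o{}); the variable names are illustrative only:

```perl
use strict;
use warnings;

# Several spellings of ESC (code point 27 on ASCII platforms).
my @esc = ("\e", "\x1b", "\x{1b}", "\033", "\o{33}", "\c[");

# Several spellings of U+263A (the SMILEY example from the table).
my @smiley = ("\x{263A}", "\o{23072}", "\N{U+263A}");

# \cX is the uppercased X with 64 xor'd into its code point.
my $ctrl_a   = chr(ord("A") ^ 64);   # same as "\cA" and "\ca"
my $ctrl_del = chr(ord("?") ^ 64);   # same as "\c?"

# Below 256 the number is native; \N{U+...} is always Unicode.
my $native  = "\x{50}";      # "P" on ASCII, "&" on EBCDIC
my $unicode = "\N{U+0050}";  # "P" on every platform

printf "%d %d %s %s\n", ord($esc[0]), ord($smiley[0]), $native, $unicode;
```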

NOTE: Unlike C and other languages, Perl has no \v escape sequence for the vertical tab (VT, which is 11 in both ASCII and EBCDIC), but you may use \N{VT}, \ck, \N{U+0b}, or \x0b. (\v does have meaning in regular expression patterns in Perl, see perlre.)
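As a quick check, this sketch (assuming an ASCII platform) confirms that several of these spellings produce the same character:

```perl
use strict;
use warnings;

# Perl strings have no "\v", so VT (code point 11) is spelled another way.
my @vt = ("\x0b", "\ck", "\cK", "\N{U+0b}", chr(11));

print join(",", map { ord } @vt), "\n";
```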

The following escape sequences are available in constructs that interpolate, but not in transliterations.

  \l          lowercase next character only
  \u          titlecase (not uppercase!) next character only
  \L          lowercase all characters till \E or end of string
  \U          uppercase all characters till \E or end of string
  \F          foldcase all characters till \E or end of string
  \Q          quote (disable) pattern metacharacters till \E or
              end of string
  \E          end either case modification or quoted section
              (whichever was last seen)

See quotemeta for the exact definition of characters that are quoted by \Q.

\L, \U, \F, and \Q can stack, in which case you need one \E for each. For example:

say "This \Qquoting \ubusiness \Uhere isn't quite\E done yet,\E is it?";
This quoting\ Business\ HERE\ ISN\'T\ QUITE\ done\ yet\, is it?
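The case-modification escapes can be exercised one at a time. This is a minimal sketch, assuming Perl 5.16 or later (when \F was introduced) and using made-up sample strings:

```perl
use v5.16;      # Perl 5.16+, which introduced \F; also enables strict
use warnings;

my $word = "pERL";

my $lower_first = "\lABC";        # only the first character is lowercased
my $title_first = "\uabc";        # only the first character is titlecased
my $all_lower   = "\L$word\E!";   # lowercase up to \E
my $all_upper   = "\U$word\E!";   # uppercase up to \E
my $folded      = "\F$word";      # foldcase to the end of the string
my $capitalised = "\u\L$word";    # lowercase everything, then titlecase first

# Stacked: \Q and \U each need their own \E.
my $stacked = "\Qa.b \Uc.d\E e.f\E g.h";

say for $lower_first, $title_first, $all_lower, $all_upper,
        $folded, $capitalised, $stacked;
```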

If a use locale form that includes LC_CTYPE is in effect (see perllocale), the case map used by \l, \L, \u, and \U is taken from the current locale. If Unicode (for example, \N{} or code points of 0x100 or beyond) is being used, the case map used by \l, \L, \u, and \U is as defined by Unicode. That means that case-mapping a single character can sometimes produce a sequence of several characters. Under use locale, \F produces the same results as \L for all locales but a UTF-8 one, where it instead uses the Unicode definition.

All systems use the virtual "\n" to represent a line terminator, called a "newline". There is no such thing as an unvarying, physical newline character. It is only an illusion that the operating system, device drivers, C libraries, and Perl all conspire to preserve. Not all systems read "\r" as ASCII CR and "\n" as ASCII LF. For example, on the ancient Macs (pre-MacOS X) of yesteryear, these used to be reversed, and on systems without a line terminator, printing "\n" might emit no actual data. In general, use "\n" when you mean a "newline" for your system, but use the literal ASCII when you need an exact character. For example, most networking protocols expect and prefer a CR+LF ("\015\012" or "\cM\cJ" ) for line terminators, and although they often accept just "\012" , they seldom tolerate just "\015" . If you get in the habit of using "\n" for networking, you may be burned some day.
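For network output, the terminator can be spelled out explicitly. The sketch below builds a hypothetical HTTP-style request; the host name and request line are placeholders and nothing is actually sent:

```perl
use strict;
use warnings;

# Spell the line terminator out instead of relying on "\n".
my $CRLF = "\015\012";   # the same two characters as "\cM\cJ"

# Hypothetical request; the host name is only a placeholder.
my $request = join $CRLF, "GET / HTTP/1.1", "Host: example.com", "", "";

printf "%d bytes, ends with CRLF CRLF: %s\n",
    length($request), ($request =~ /\015\012\015\012\z/ ? "yes" : "no");
```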

For constructs that do interpolate, variables beginning with "$" or "@" are interpolated. Subscripted variables such as $a[3] or $href->{key}[0] are also interpolated, as are array and hash slices. But method calls such as $obj->meth are not.

Interpolating an array or slice interpolates the elements in order, separated by the value of $", so is equivalent to interpolating join $", @array. "Punctuation" arrays such as @* are usually interpolated only if the name is enclosed in braces @{*}, but the arrays @_, @+, and @- are interpolated even without braces.
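These interpolation rules can be sketched as follows. The Greeter class and all variable names are hypothetical, and the @{[ ... ]} construct is a common idiom for embedding a method call rather than something described above:

```perl
use strict;
use warnings;

{
    package Greeter;                 # hypothetical class for illustration
    sub new  { return bless {}, shift }
    sub meth { return "hi" }
}

my @array = (1, 2, 3);
my %color = (r => 10, g => 20);
my $href  = { key => ["deep"] };
my $obj   = Greeter->new;

my $list    = "@array";                        # elements joined by a space
my $element = "$array[0] $href->{key}[0]";     # subscripts interpolate
my $slices  = "@array[0,1] @color{'r','g'}";   # so do array and hash slices
my $no_call = "$obj->meth";                    # only $obj interpolates
my $call    = "@{[ $obj->meth ]}";             # idiom that does call meth

my $custom;
{
    local $" = ", ";                           # change the list separator
    $custom = "@array";
}

print "$list | $element | $slices | $call | $custom\n";
```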

For double-quoted strings, the quoting from \Q is applied after interpolation and escapes are processed.

"abc\Qfoo\tbar$s\Exyz"

is equivalent to

"abc" . quotemeta("foo\tbar$s") . "xyz"

For the pattern of regex operators (qr//, m// and s///), the quoting from \Q is applied after interpolation is processed, but before escapes are processed. This allows the pattern to match literally (except for $ and @). For example, the following matches:

'\s\t' =~ /\Q\s\t/

Because $ or @ trigger interpolation, you'll need to use something like /\Quser\E\@\Qhost/ to match them literally.

Patterns are subject to an additional level of interpretation as a regular expression. This is done as a second pass, after variables are interpolated, so that regular expressions may be incorporated into the pattern from the variables. If this is not what you want, use \Q to interpolate a variable literally.
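The difference between \Q in strings and in patterns can be sketched like this, with sample values chosen only for illustration:

```perl
use strict;
use warnings;

my $s = "a.b";

# In a double-quoted string, \Q acts after interpolation and escapes.
my $with_q  = "abc\Qfoo\tbar$s\Exyz";
my $by_hand = "abc" . quotemeta("foo\tbar$s") . "xyz";

# In a pattern, an interpolated variable is read as a regex unless \Q is used.
my $as_regex   = ("axb" =~ /\A$s\z/)     ? 1 : 0;   # "." matches any character
my $as_literal = ("axb" =~ /\A\Q$s\E\z/) ? 1 : 0;   # "." must be a real dot
my $exact      = ("a.b" =~ /\A\Q$s\E\z/) ? 1 : 0;

# \Q is applied before escapes here, so \s\t matches those four characters.
my $escapes = ('\s\t' =~ /\Q\s\t/) ? 1 : 0;

# "@" would interpolate, so it is escaped outside the \Q...\E runs.
my $address = ('user@host' =~ /\Quser\E\@\Qhost/) ? 1 : 0;

print "$as_regex $as_literal $exact $escapes $address\n";
```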

Apart from the behavior described above, Perl does not expand multiple levels of interpolation. In particular, contrary to the expectations of shell programmers, back-quotes do NOT interpolate within double quotes, nor do single quotes impede evaluation of variables when used within double quotes.
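A short sketch of these three points, using illustrative variable names:

```perl
use strict;
use warnings;

my $outer = "X";
my $inner = '$outer';            # literal text, not the variable

my $one_level = "val: $inner";   # interpolation is not repeated
my $quoted    = "'$outer'";      # single quotes do not stop interpolation
my $backticks = "`echo hi`";     # no command is run inside double quotes

print "$one_level\n$quoted\n$backticks\n";
```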

doc_perl
2016-12-06 03:26:53