Gory details of parsing quoted constructs

When presented with something that might have several different interpretations, Perl uses the DWIM (that's "Do What I Mean") principle to pick the most probable interpretation. This strategy is so successful that Perl programmers often do not suspect the ambiguity of what they write. But from time to time, Perl's notions differ substantially from what the author honestly meant.

This section hopes to clarify how Perl handles quoted constructs. The most common reason to learn this is to unravel labyrinthine regular expressions, but because the initial steps of parsing are the same for all quoting operators, they are all discussed together.

The most important Perl parsing rule is the first one discussed below: when processing a quoted construct, Perl first finds the end of that construct, then interprets its contents. If you understand this rule, you may skip the rest of this section on the first reading. The other rules are likely to contradict the user's expectations much less frequently than this first one.

Some passes discussed below are performed concurrently, but because their results are the same, we consider them individually. For different quoting constructs, Perl performs different numbers of passes, from one to four, but these passes are always performed in the same order.

  • Finding the end

    The first pass is finding the end of the quoted construct. This results in saving to a safe location a copy of the text (between the starting and ending delimiters), normalized as necessary to avoid needing to know what the original delimiters were.

    If the construct is a here-doc, the ending delimiter is a line that has a terminating string as the content. Therefore <<EOF is terminated by EOF immediately followed by "\n" and starting from the first column of the terminating line. When searching for the terminating line of a here-doc, nothing is skipped. In other words, lines after the here-doc syntax are compared with the terminating string line by line.

    For the constructs except here-docs, single characters are used as starting and ending delimiters. If the starting delimiter is an opening punctuation (that is (, [, {, or < ), the ending delimiter is the corresponding closing punctuation (that is ), ], }, or >). If the starting delimiter is an unpaired character like / or a closing punctuation, the ending delimiter is the same as the starting delimiter. Therefore a / terminates a qq// construct, while a ] terminates both qq[] and qq]] constructs.
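
    As a small illustration of the delimiter rules above (the variable names are only for this sketch), each of the following constructs should yield the same three-character string:

```perl
use strict;
use warnings;

# An opening punctuation is closed by its counterpart.
my @paired = (qq(abc), qq[abc], qq{abc}, qq<abc>);

# An unpaired character, or a closing punctuation, is closed by itself.
my @same = (qq/abc/, qq!abc!, qq]abc]);

print join(",", @paired, @same), "\n";
```

    The qq]abc] form is legal but hard to read; it is shown only to make the rule concrete.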

    When searching for single-character delimiters, escaped delimiters and \\ are skipped. For example, while searching for terminating /, combinations of \\ and \/ are skipped. If the delimiters are bracketing, nested pairs are also skipped. For example, while searching for a closing ] paired with the opening [, combinations of \\ , \], and \[ are all skipped, and nested [ and ] are skipped as well. However, when backslashes are used as the delimiters (like qq\\ and tr\\\), nothing is skipped. During the search for the end, backslashes that escape delimiters or other backslashes are removed (exactly speaking, they are not copied to the safe location).
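
    The skipping rules can be checked with q//, where the final value is easy to inspect. This is a minimal sketch; the variable names are illustrative:

```perl
use strict;
use warnings;

my $nested    = q[a[b]c];   # the inner [ ] pair is skipped; value is a[b]c
my $slash     = q/a\/b/;    # \/ does not end the construct; value is a/b
my $bracket   = q[a\]b];    # \] does not end the construct; value is a]b
my $backslash = q/a\\b/;    # \\ is skipped as a unit; value is a\b

print "$nested $slash $bracket $backslash\n";
```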

    For constructs with three-part delimiters (s///, y///, and tr///), the search is repeated once more. If the first delimiter is not an opening punctuation, the three delimiters must be the same, such as s!!! and tr))), in which case the second delimiter terminates the left part and starts the right part at once. If the left part is delimited by bracketing punctuation (that is () , [] , {} , or <> ), the right part needs another pair of delimiters such as s(){} and tr[]//. In these cases, whitespace and comments are allowed between the two parts, although the comment must follow at least one whitespace character; otherwise a character expected as the start of the comment may be regarded as the starting delimiter of the right part.
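
    A brief sketch of the three-part forms described above, using made-up data. The comment between the two bracketed parts is preceded by whitespace, as the rule requires:

```perl
use strict;
use warnings;

my $text = "hello world";

# Unpaired delimiter: the middle ! ends the pattern and starts the replacement.
(my $same = $text) =~ s!world!there!;

# Bracketing delimiters: each part gets its own pair.
(my $paired = $text) =~ s{world}   # whitespace first, then the comment
                         {there};

# The two parts may use different delimiters.
(my $upper = $text) =~ tr[a-z]/A-Z/;

print "$same | $paired | $upper\n";
```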

    During this search no attention is paid to the semantics of the construct. Thus:

    "$hash{"$foo/$bar"}"
    

    or:

    m/
      bar	# NOT a comment, this slash / terminated m//!
     /x
    

    do not form legal quoted expressions. The quoted part ends on the first " and /, and the rest happens to be a syntax error. Because the slash that terminated m// was followed by a SPACE , the example above is not m//x, but rather m// with no /x modifier. So the embedded # is interpreted as a literal # .
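
    One way to sidestep both problems is to pick delimiters that do not occur in the content. The following is a sketch with hypothetical data, not the only possible fix:

```perl
use strict;
use warnings;

my %hash = ("a/b" => "found");
my ($foo, $bar) = ("a", "b");

# qq[] ends at the matching ], so the inner double quotes are ordinary content.
my $value = qq[$hash{"$foo/$bar"}];

# With brace delimiters, the slash inside the comment cannot end the pattern,
# and the x modifier directly follows the closing brace.
my $matched = "foobar" =~ m{
    bar   # a slash / is harmless here
}x ? 1 : 0;

print "$value $matched\n";
```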

    Also no attention is paid to \c\ (multichar control char syntax) during this search. Thus the second \ in qq/\c\/ is interpreted as a part of \/, and the following / is not recognized as a delimiter. Instead, use \034 or \x1c at the end of quoted constructs.
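
    For instance, both recommended spellings produce the character with code 28 without placing a backslash next to the closing delimiter (a minimal sketch):

```perl
use strict;
use warnings;

my $octal = qq/\034/;
my $hex   = qq/\x1c/;

print ord($octal), " ", ord($hex), "\n";
```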

  • Interpolation

    The next step is interpolation in the text obtained, which is now delimiter-independent. There are multiple cases.

    • <<'EOF'

      No interpolation is performed. Note that the combination \\ is left intact, since escaped delimiters are not available for here-docs.
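
      The difference from an interpolating here-doc can be seen by feeding the same line to both forms. This is an illustrative sketch; $name is a made-up variable:

```perl
use strict;
use warnings;

my $name = "x";

# Single-quoted terminator: the line is kept exactly as written.
my $raw = <<'EOF';
$name \\ \n
EOF

# Double-quoted terminator: the variable and the escapes are processed.
my $cooked = <<"EOF";
$name \\ \n
EOF

print length($raw), " ", length($cooked), "\n";
```

      In $raw both backslashes of \\ survive, whereas in $cooked they collapse to one and \n becomes a newline.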

    • m'', the pattern of s'''

      No interpolation is performed at this stage. Any backslashed sequences, including \\ , are handled at the stage of parsing regular expressions.

    • '' , q//, tr''', y''', the replacement of s'''

      The only interpolation is removal of \ from pairs of \\ . Therefore "-" in tr''' and y''' is treated literally as a hyphen and no character range is available. \1 in the replacement of s''' does not work as $1 .
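
      A short sketch of this behaviour, with illustrative names. Note that $y is never declared, which is harmless here because it is not treated as a variable:

```perl
use strict;
use warnings;

my $single = 'a\\b\n';   # \\ becomes one backslash; \n stays as two characters
my $q      = q/a\\b\n/;  # same result with q//

# With single-quote delimiters the replacement is taken literally.
(my $subst = "x") =~ s'x'$y';

print length($single), " ", length($q), " $subst\n";
```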

    • tr///, y///

      No variable interpolation occurs. String modifying combinations for case and quoting such as \Q , \U , and \E are not recognized. The other escape sequences such as \200 and \t and backslashed characters such as \\ and \- are converted to appropriate literals. The character "-" is treated specially and therefore \- is treated as a literal "-" .
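
      The following sketch, using made-up strings, contrasts a range with an escaped hyphen and shows that $ is an ordinary character in a transliteration:

```perl
use strict;
use warnings;

(my $range   = "abc-")     =~ tr/a-c/A-C/;    # a-c is a range
(my $literal = "abc-")     =~ tr/a\-c/A_C/;   # \- is a literal hyphen
(my $tab     = "a\tb")     =~ tr/\t/ /;       # \t is converted to a TAB
(my $dollar  = 'cost: $x') =~ tr/$x/#y/;      # $x is not a variable here

print "$range $literal $tab $dollar\n";
```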

    • "" , `` , qq//, qx//, <file*glob> , <<"EOF"

      \Q , \U , \u , \L , \l , \F (possibly paired with \E ) are converted to corresponding Perl constructs. Thus, "$foo\Qbaz$bar" is converted to $foo . (quotemeta("baz" . $bar)) internally. The other escape sequences such as \200 and \t and backslashed characters such as \\ and \- are replaced with appropriate expansions.

      Let it be stressed that whatever falls between \Q and \E is interpolated in the usual way. Something like "\Q\\E" has no \E inside. Instead, it has \Q , \\ , and E , so the result is the same as for "\\\\E" . As a general rule, backslashes between \Q and \E may lead to counterintuitive results. So, "\Q\t\E" is converted to quotemeta("\t"), which is the same as "\\\t" (since TAB is not alphanumeric). Note also that:

      $str = '\t';
      return "\Q$str";
      

      may be closer to the conjectural intention of the writer of "\Q\t\E" .
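
      These claims can be checked directly. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $inline   = "\Q\t\E";    # quotemeta applied to a real TAB character
my $str      = '\t';        # two characters: backslash and t
my $from_var = "\Q$str";    # quotemeta applied to those two characters
my $no_end   = "\Q\\E";     # \Q, then \\, then a plain E

print length($inline), " ", length($from_var), " ", length($no_end), "\n";
```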

      Interpolated scalars and arrays are converted internally to the join and "." catenation operations. Thus, "$foo XXX '@arr'" becomes:

      $foo . " XXX '" . (join $", @arr) . "'";
      

      All operations above are performed simultaneously, left to right.
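
      The equivalence can be observed by comparing the interpolated string with the explicit expression. The values below are made up for the sketch:

```perl
use strict;
use warnings;

my $foo = "x";
my @arr = (1, 2, 3);

my $interpolated = "$foo XXX '@arr'";
my $explicit     = $foo . " XXX '" . (join $", @arr) . "'";

print "$interpolated\n$explicit\n";
```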

      Because the result of "\Q STRING \E" has all metacharacters quoted, there is no way to insert a literal $ or @ inside a \Q\E pair. If protected by \ , $ will be quoted to become "\\\$" ; if not, it is interpreted as the start of an interpolated scalar.

      Note also that the interpolation code needs to make a decision on where the interpolated scalar ends. For instance, whether "a $x -> {c}" really means:

      "a " . $x . " -> {c}";
      

      or:

      "a " . $x -> {c};
      

      Most of the time, the interpolated expression is taken to be the longest possible text that does not include spaces between components and that contains matching braces or brackets. Because the outcome may be determined by voting based on heuristic estimators, the result is not strictly predictable. Fortunately, it's usually correct for ambiguous cases.
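
      When the intent matters, it can be spelled out so that no guessing is involved. A sketch with hypothetical variables:

```perl
use strict;
use warnings;

my $ref  = { c => "value" };
my $name = "node";

# No spaces between components: the whole dereference is interpolated.
my $element = "a $ref->{c}";

# Explicit concatenation: only the scalar is used, the rest is plain text.
my $literal = "a " . $name . " -> {c}";

print "$element\n$literal\n";
```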

    • the replacement of s///

      Processing of \Q , \U , \u , \L , \l , \F and interpolation happens as with qq// constructs.

      It is at this step that \1 is begrudgingly converted to $1 in the replacement text of s///, in order to correct the incorrigible sed hackers who haven't picked up the saner idiom yet. A warning is emitted if the use warnings pragma or the -w command-line flag (that is, the $^W variable) was set.
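
      The saner idiom refers to captures as $1, $2, and so on in the replacement text. A minimal sketch:

```perl
use strict;
use warnings;

# Swap two words using $1 and $2 rather than \1 and \2.
(my $swapped = "abc def") =~ s/(\w+) (\w+)/$2 $1/;

print "$swapped\n";
```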

    • RE in ?RE? , /RE/ , m/RE/, s/RE/foo/

      Processing of \Q , \U , \u , \L , \l , \F , \E , and interpolation happens (almost) as with qq// constructs.

      Processing of \N{...} is also done here, and compiled into an intermediate form for the regex compiler. (This is because, as mentioned below, the regex compilation may be done at execution time, and \N{...} is a compile-time construct.)

      However any other combinations of \ followed by a character are not substituted but only skipped, in order to parse them as regular expressions at the following step. As \c is skipped at this step, @ of \c@ in RE is possibly treated as an array symbol (for example @foo ), even though the same text in qq// gives interpolation of \c@ .

      Code blocks such as (?{BLOCK}) are handled by temporarily passing control back to the perl parser, much as an interpolated array subscript expression such as "foo$array[1+f("[xyz")]bar" would be.

      Moreover, inside (?{BLOCK}), (?# comment ), and a # -comment in a /x-regular expression, no processing is performed whatsoever. This is the first step at which the presence of the /x modifier is relevant.

      Interpolation in patterns has several quirks: $| , $( , $) , @+ and @- are not interpolated, and constructs of the form $var[SOMETHING] are voted (by several different estimators) to be either an array element or $var followed by an RE alternative. This is where the notation ${arr[$bar]} comes in handy: /${arr[0-9]}/ is interpreted as array element -9 , not as a regular expression from the variable $arr followed by a digit, which would be the interpretation of /$arr[0-9]/ . Since voting among different estimators may occur, the result for the unbraced form is not predictable.
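
      Braces can force either reading. The sketch below uses a scalar and an array that share a name, purely for illustration:

```perl
use strict;
use warnings;

my $arr = "id";
my @arr = ("zero", "one");

# ${arr} followed by [0-9]: the scalar, then a character class.
my $scalar_then_class = "id7" =~ /^${arr}[0-9]$/ ? 1 : 0;

# ${arr[1]}: an element of the array.
my $array_element = "one" =~ /^${arr[1]}$/ ? 1 : 0;

print "$scalar_then_class $array_element\n";
```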

      The lack of processing of \\ creates specific restrictions on the post-processed text. If the delimiter is /, one cannot get the combination \/ into the result of this step. / will finish the regular expression, \/ will be stripped to / on the previous step, and \\/ will be left as is. Because / is equivalent to \/ inside a regular expression, this does not matter unless the delimiter happens to be a character special to the RE engine, such as in s*foo*bar*, m[foo], or ?foo? ; or an alphanumeric char, as in:

      m m ^ a \s* b mmx;
      

      In the RE above, which is intentionally obfuscated for illustration, the delimiter is m, the modifier is mx , and after delimiter-removal the RE is the same as for m/ ^ a \s* b /mx . There's more than one reason you're encouraged to restrict your delimiters to non-alphanumeric, non-whitespace choices.

    This step is the last one for all constructs except regular expressions, which are processed further.

  • Parsing regular expressions

    Previous steps were performed during the compilation of Perl code, but this one happens at run time, although it may be optimized to be calculated at compile time if appropriate. After preprocessing described above, and possibly after evaluation if concatenation, joining, casing translation, or metaquoting are involved, the resulting string is passed to the RE engine for compilation.
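
    A pattern that interpolates a variable shows the sequence: metaquoting and concatenation are evaluated first, and the resulting string is then handed to the RE engine. This is a sketch with an illustrative variable, $word:

```perl
use strict;
use warnings;

my $word = "a.b";

# \Q...\E becomes a quotemeta call; the RE engine compiles the result.
my $re = qr/^\Q$word\E$/;

my $exact = "a.b" =~ $re ? 1 : 0;
my $loose = "axb" =~ $re ? 1 : 0;

print "$exact $loose\n";
```

    Because the dot was quoted before compilation, it matches only a literal dot.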

    Whatever happens in the RE engine might be better discussed in perlre, but for the sake of continuity, we shall do so here.

    This is another step where the presence of the /x modifier is relevant. The RE engine scans the string from left to right and converts it into a finite automaton.

    Backslashed characters are either replaced with corresponding literal strings (as with \{), or else they generate special nodes in the finite automaton (as with \b ). Characters special to the RE engine (such as |) generate corresponding nodes or groups of nodes. (?#...) comments are ignored. All the rest is either converted to literal strings to match, or else is ignored (as is whitespace and # -style comments if /x is present).

    Parsing of the bracketed character class construct, [...] , is rather different than the rule used for the rest of the pattern. The terminator of this construct is found using the same rules as for finding the terminator of a {} -delimited construct, the only exception being that ] immediately following [ is treated as though preceded by a backslash.
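
    For example, a ] placed first in a character class is an ordinary member of the class (a minimal sketch):

```perl
use strict;
use warnings;

# The class contains the two characters ] and a.
my $class = qr/^[]a]+$/;

my $hit  = "]a]" =~ $class ? 1 : 0;
my $miss = "b"   =~ $class ? 1 : 0;

print "$hit $miss\n";
```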

    The terminator of runtime (?{...}) is found by temporarily switching control to the perl parser, which should stop at the point where the logically balancing terminating } is found.

    It is possible to inspect both the string given to the RE engine and the resulting finite automaton. See the arguments debug/debugcolor in the use re pragma, as well as Perl's -Dr command-line switch documented in Command Switches in perlrun.
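
    A small sketch of the pragma in use follows. The debugging dump goes to standard error, and its exact format varies between Perl versions, so none is reproduced here:

```perl
use strict;
use warnings;

my $matched;
{
    # Lexically scoped: only patterns compiled inside this block are traced.
    use re 'debug';
    $matched = "aab" =~ /a+b/ ? 1 : 0;
}

print "$matched\n";
```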

  • Optimization of regular expressions

    This step is listed for completeness only. Since it does not change semantics, details of this step are not documented and are subject to change without notice. This step is performed over the finite automaton that was generated during the previous pass.

    It is at this stage that split() silently optimizes /^/ to mean /^/m .
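
    The effect can be seen by splitting a multi-line string on /^/ (a minimal sketch with made-up input):

```perl
use strict;
use warnings;

# /^/ is treated as /^/m, so the string is split at the start of every line.
my @lines = split /^/, "a\nb\nc\n";

print scalar(@lines), "\n";
```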

doc_perl
2016-12-06 03:20:13