Appendix

Regular Expressions in Perl



Perl's regular expression handling is one of its most powerful features and is one of the main reasons that Perl is a nearly ideal language for CGI programming. Text manipulation is central to many CGI applications, and the proficiency with which regular expressions search and replace text is without parallel. To beginners, however, the terse notation, the many options, and the alternative ways of forming a pattern can be somewhat daunting. If you are new to regular expressions, don't expect to master all the rules on your first pass through the summary below. As is the case with most other features in Perl, you don't need to know everything there is to know about regular expressions in order to begin using them effectively. If, on the other hand, you are already familiar with regular expression matching, you will still find this appendix helpful for its description of the usage of regular expressions in Perl 5 as opposed to earlier implementations (such as Henry Spencer's original design, used by the Unix egrep command).

The Rules of Regular Expression Matching

A regular expression consists of zero or more alternative patterns, which are strings of elements. Patterns are separated by the vertical bar character (|), and the whole expression is usually delimited by forward slashes (/), followed by zero or more of the option characters g, i, m, o, s, or x. Regular expressions almost always appear within delimiters, and these delimiters are spoken of as if they were a part of the regular expression itself, even though they do not participate in the matching. An element is either an atom, quantified or unquantified, or an assertion. An unquantified atom always matches a single character, whereas a quantified atom can match zero or more characters. An assertion matches a contextual condition, such as the beginning or end of a string, and does not absorb any of the matched string's characters. A regular expression matches a string if any one of its patterns matches some part of that string, element-for-element. Testing always proceeds from left to right and stops at the first complete match. The individual elements match as described in the following sections.

Unquantified Atoms

As an unquantified atom, each character matches itself, unless it is one of the special characters +, ?, ., *, ^, $, (, ), [, ], {, }, |, or \ (not including the commas, which are used here only for readability). The actual meanings of these special characters will become apparent below. To match one of them as a literal character, you can precede it with a backslash to "escape" its special meaning. For example, the special character . (period) is a wildcard that matches any single character, but \. matches only a period. In general, a preceding \ escapes the special meaning of any non-alphanumeric character, but it converts most alphanumeric characters into special atoms or assertions. Thus you can also use \ on itself, or on /, which is a special character only when it is being used as the delimiter; for instance, the /-delimited regular expression /\/\\/ matches /\ inside any string. (For an explanation of using other non-alphanumeric characters as delimiters, see the sections on Perl's m// and s/// operators, later in this appendix.) All of the special atoms are enumerated below, and match as follows:

. (period)    Matches any character except a newline. Will match a newline if option s (single-line match) is specified.
\w    Matches any alphanumeric character, including _.
\W    Matches any non-alphanumeric character, excluding _.
\s    Matches one whitespace character; that is, a tab, newline, form feed, carriage return, or space (ASCII 9, 10, 12, 13, and 32), which are individually matched by \t, \n, \f, \r, and \040, respectively. Starting with Perl 5.18, \s also matches the vertical tab (ASCII 11, \013).
\S    Matches one non-whitespace character.
\d    Matches a digit, 0 through 9.
\D    Matches any character that is not a digit.
\NNN    Matches the character specified by the 2- or 3-digit octal number NNN, unless it would be interpreted as a back-reference (see the definition of \N below). For example, \177 matches the DEL character (ASCII 127).
\xXX    Matches the character represented by hexadecimal value XX; for example, \xA9 matches the copyright character © (ISO Latin-1 169).
\cC    Matches the control character Ctrl-C, where C is any single character; for example, \cH matches a backspace (ASCII 8). This atom is the same as \NNN, where NNN is the octal value of ord(C) - 64 for an uppercase letter C.
[S]    Matches any character in the class S, where S is specified as a string of literal characters (as in [abc$%^&]), a range of characters in ASCII order (as in [a-z]), or any combination thereof (as in [a-c$-&^]). Most of the special characters lose their special meanings inside the square brackets, but a literal hyphen must be escaped as \- (or placed first or last in the class), the \b character matches a backspace (\010), and most other backslashed characters retain their special meanings as atoms.
(E)    Matches any regular expression E and stores the substring matching the Nth parenthesized expression in the special read-only memory variable $N (that is, in $1, $2, etc.). The parentheses serve both to group a string of elements or patterns into one atom and to mark that atom for future reference.
\N    Matches whatever the Nth parenthesized atom actually matched, where N = 1, 2, 3, ... up to the total number of preceding parenthesized atoms. Such an atom is called a back-reference to a subexpression.
(?:E)    Matches the regular expression E but does not store the match in any $N variable for back-referencing.
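
The atoms above can be tried out with a short script. The following is only a minimal sketch; the sample strings and variable names are invented for the illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only to exercise the atoms listed above.
my $line = "id=42 id=42\tend";

# \d+ is a quantified digit atom; the parentheses store what it matched.
my ($number) = $line =~ /id=(\d+)/;

# \1 is a back-reference: it matches whatever the first group matched.
my $repeated = ($line =~ /(id=\d+) \1/) ? 1 : 0;

# (?:...) groups without storing, so the only capture here is "bar".
my ($word) = "foobar" =~ /(?:foo)(\w+)/;

# \cH, \010, and \x08 all denote the backspace character (ASCII 8).
my $same_char = ("\cH" =~ /^\010$/ && "\cH" =~ /^\x08$/) ? 1 : 0;

print "number=$number repeated=$repeated word=$word same_char=$same_char\n";
```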

Quantifiers and Quantified Atoms

The regular expression quantifiers are the special characters +, *, ?, and the expressions {N}, {N,}, and {N,M}. A quantified atom is an atom that is followed by a quantifier. If A is any atom, A+ matches A one or more times; that is, it matches one or more adjacent substrings that each match A individually. Similarly, A* matches A zero or more times, and A? matches zero or one occurrence of A. Furthermore, A{N} matches A exactly N times, A{N,} matches A N or more times, and A{N,M} matches a minimum of N and a maximum of M occurrences of A. A quantified atom matches as many characters as possible, unless a ? is appended to the quantifier, in which case the atom matches the smallest substring allowed by the context. Thus /(ab+)([bc])/ and /(ab+?)([bc])/ both match abbc, but the first expression sets $1 and $2 to abb and c, respectively, whereas the second expression sets $1 and $2 to ab and b (for the meanings of $1 and $2, see the parenthesis rule above, in the table entry for (E)).
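
The difference between the two forms of quantifier can be checked directly. This sketch simply runs the two expressions discussed above against abbc; the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "abbc";

# Greedy: b+ takes as many b's as it can while still letting [bc] match.
my @greedy = $text =~ /(ab+)([bc])/;

# Minimal: b+? takes as few b's as it can.
my @minimal = $text =~ /(ab+?)([bc])/;

print "greedy:  $greedy[0] $greedy[1]\n";     # abb c
print "minimal: $minimal[0] $minimal[1]\n";   # ab b
```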

Assertions

An assertion is different from an atom in that it doesn't match any characters but rather matches a contextual condition, such as a difference between two adjacent characters. Because assertions cannot add any characters to a matched substring, they are said to have zero width. Assertions match as follows:

\A    Matches the beginning of a string.
\Z    Matches the end of a string, or just before a newline at the end of the string.
^ and $    These are like \A and \Z except that in multiline mode (option m), ^ and $ match the beginning and end of every line (that is, after and before every newline character), respectively.
\b    Matches a word boundary.
\B    Matches a non-boundary.
\G    Matches the point at which the previous global search (option g) left off.
(?=E)    Matches the beginning of the regular expression E, without including E as part of the matched substring. In other words, E must be present for (?=E) to match, but the match has no effect on subsequent matching or processing. This is called a zero-width positive look-ahead assertion.
(?!E)    Matches the absence of the regular expression E. This is called a zero-width negative look-ahead assertion.
(?#T)    Matches anything and nothing; T is only an embedded comment. That is, /(?#T)/ always returns 1, no matter what string is searched, but the matched substring is always null.
(?M)    Matches anything, like (?#T), except that M is an embedded pattern-match modifier, namely one or more of the options i, m, s, or x. (For a description of these options, see the discussion of m// below.) The specified option(s) affect the entire search, as if they were appended to the ending delimiter as modifiers.
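
Because assertions consume no characters, their effect is easiest to see by counting matches. The sketch below uses a made-up two-line string; none of the names come from the text above.

```perl
use strict;
use warnings;

my $text = "abe lincoln\nabecedarian\n";

# \b: "abe" as a whole word occurs once (not inside "abecedarian").
my @whole = $text =~ /\babe\b/g;

# ^ with option m matches after every newline, so both lines qualify.
my @starts = $text =~ /^abe/mg;

# \A matches only at the very beginning of the string, even with option m.
my @first = $text =~ /\Aabe/mg;

# (?=E): match "abe" only when "ced" follows, without absorbing "ced".
my ($ahead) = $text =~ /(abe(?=ced))/;

# (?!E): match "abe" only when "ced" does not follow.
my @not_followed = $text =~ /abe(?!ced)/g;

print scalar(@whole), " ", scalar(@starts), " ", scalar(@first),
      " $ahead ", scalar(@not_followed), "\n";
```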

Examples of Regular Expressions

The following list of examples and the descriptions of what they match covers the essentials of regular expression matching in Perl 5.

Regular Expression    What It Matches in Perl
/abc/    abc anywhere in the search string.
/^abc/    abc at the beginning of the string.
/abc$/    abc at the end of the string.
/(abc)/    abc anywhere in the string; the matched expression is stored in $1.
/ab|cd/    ab or cd, whichever comes first.
/a(b|c)d/    a followed by b or c, then d (abd or acd, not abcd).
/ab{3}c/    a followed by exactly 3 b's, then by c. This is the same as /abbbc/.
/ab{1,3}c/    a followed by 1, 2, or 3 b's, then by c. This is the same as /abb?b?c/.
/ab?c/    a followed by c with an optional b in between (ac or abc). This is the same as /ab{0,1}c/.
/ab*c/    a followed by zero or more b's, then c (ac, abc, abbc, etc.). This is the same as /ab{0,}c/.
/ab+c/    a followed by one or more b's, then c (abc, abbc, etc.). This is the same as /ab{1,}c/.
/[abc]/    Any single character in the bracketed class, namely, a or b or c. This is the same as /[a-c]/ and /a|b|c/.
/[abc]+/    Any string of one or more characters from the bracketed class (a, b, c, aa, ab, ac, ba, bb, bc, etc.).
/[^abc]/    Any single character not in the class inside the brackets. (Note that the ^ character has a different special meaning at the beginning of a character class than at the beginning of a pattern. In the interior of a character class, or as an element in the interior of a pattern and not preceded by \n, ^ matches itself.)
/\w+/    Any string of alphanumeric characters, including _. This is the same as /[0-9A-Z_a-z]+/.
/\W+/    Any string of non-alphanumeric characters. This is the same as /[^\w]+/.
/abe\b/    abe followed by a word boundary (the zero-width space between alphanumeric and non-alphanumeric characters, that is, between characters matched by \w and \W); this expression will not match the abe in abecedarian.
/./    Any single character except a newline (\n).
/((.|\n)+)/    Any string of one or more characters, including \n; $1 will contain the whole string, and $2 will contain only the last character matched.
/name=([^&]*)&ident=\1(&|$)/    A string of the form name=val&ident=val, followed either by & or the end of the string; val can be made up of any characters besides & and will be placed in the special read-only memory variable $1 (see the parenthesis rule in the section on "Unquantified Atoms," above).
/(ab+)([bc])/    a followed by one or more b's (as many as possible), then either a b or a c (abb, abc, abbb, abbc, etc.). If the last character matched is c, all of the b's will be placed in $1, following the initial a, and $2 will be assigned the value c. Otherwise, the matched string must contain at least two b's, and $2 will be assigned the last matched b.
/(ab+?)([bc])/    a followed by one or more b's (as few as possible, because of the ?), then either a b or a c. In other words, this expression can match only the substrings abb or abc. After a match, the only possible value for $1 is ab, whereas $2 will be either b or c.
/<[^>]*?(>|$)|(^|\G)[^<>]*?>/    Any full tag delimited by angle brackets, or any partial tag broken by a line ending. That is, any substring that begins with < and ends with either > or the end of the string as a whole, or any substring that starts at the beginning of the string and ends with >.
/<[^>]*?>/m    Any angle-bracket-delimited tag, even one that spans many line endings within the search string.
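
As a worked instance of the back-reference entry in the list above, the following sketch tests the name/ident expression against three query strings. The query strings are hypothetical and were made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical query strings, not taken from any real form.
my @queries = (
    "name=joe&ident=joe",        # same value, end of string
    "name=joe&ident=joe&x=1",    # same value, followed by &
    "name=joe&ident=bob",        # values differ
);

my @results;
for my $q (@queries) {
    if ($q =~ /name=([^&]*)&ident=\1(&|$)/) {
        push @results, "match:$1";     # $1 holds val
    } else {
        push @results, "no match";
    }
}
print join(", ", @results), "\n";
```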

Operators That Use Regular Expressions in Perl

Perl has three operators that search strings for regular expression matches: m/pattern/, ?pattern?, and s/pattern/replacement/ (also known simply as //, ??, and s or s///). The first of these is usually written without the optional m; indeed, /pattern/ is practically synonymous with regular expression matching in Perl and other programming languages.

The ?pattern? operator is just like /pattern/, except that it matches only once between calls to the reset operator. This can be useful when you want to see only the first match in a file, for instance, but there are better ways to accomplish this, and the ?? operator may be removed from future versions of Perl.

The usage of the search-and-replace operator, s/pattern/replacement/, also closely follows that of m//, except that with s/// any part of the search string that matches the regular expression pattern is replaced by replacement (which is not a regular expression). This extra step makes the s/// operator so powerful that entire programs can be written using almost nothing else.

The following sections go into more detail about how to use the m// and s/// operators. For further examples (and a great deal of other valuable reference information), see Perl's online documentation, especially perlop.html and perlre.html. In the NTPerl distribution, these files can be found in the docs subdirectory under the main Perl directory, and in MacPerl, they can be found in the pod subfolder. In Unix installations, these files are usually kept in a directory such as /usr/local/lib/perl5/pod and may have to be converted to HTML format with the pod2html utility program, which should reside in the same directory as the Perl interpreter itself. (Try /usr/local/bin/pod2html.)

Perl's Regular Expression Matching Operator: m//

Perl's pattern matching operator m// is used as follows:

$match = (string =~ m/pattern/options);

This construct searches a string for a regular expression and assigns the return value true (1) or false ('') to $match. If string contains the regular expression pattern, as modified by the options, the value of $match will be 1; otherwise it will be ''. The =~ is called the pattern binding operator. Despite its appearance (and its association with the search-and-replace operator, s///), =~ is not some kind of fancy assignment operator but is a logical operator like == or eq (which denote numerical and string equality, respectively). Its opposite is !~, which causes the expression (string !~ /pattern/) to evaluate to true if and only if string does not match pattern.
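
The return values described above can be inspected as follows. This is a minimal sketch; the sample string is an assumption.

```perl
use strict;
use warnings;

my $string = "The quick brown fox";       # sample text (an assumption)

my $found     = ($string =~ m/quick/);    # true (1)
my $not_found = ($string =~ m/slow/);     # false (the empty string)
my $negated   = ($string !~ m/slow/);     # true, because there is no match

printf "found='%s' not_found='%s' negated='%s'\n",
    $found, $not_found, $negated;
```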

The value of the whole expression (string =~ m/pattern/) depends on both of the operands string and m/pattern/, and neither is changed as a result of the operation. If you omit the search string, Perl will search the special variable $_. You can also leave out the m, as long as you are using slashes to delimit the regular expression. Thus /pattern/ all by itself is equivalent to ($_ =~ m/pattern/). If you keep the m, you can use almost any character as the pattern delimiter, as long as it doesn't explicitly appear within the pattern. The # character is often used to delimit patterns that contain /'s, as in

print "local\n" if ($path =~ m#/usr/local/bin/#);

On the other hand, the slashes are not a problem in

$pattern = "/usr/local/bin";
print "$pattern\n" if ($path =~ m/$pattern/);

You can also use the bracketing character pairs [], (), {}, and <> as the opening and closing delimiters. Otherwise, the same character must be used to mark both the beginning and the end of the regular expression.
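
The delimiter choices described above all express the same test. The sketch below collects a label for each form that matches; the path value and the labels are invented for the illustration.

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";    # sample path (an assumption)

my @hits;
push @hits, "slashes" if $path =~ /\/usr\/local\/bin\//;   # escaped slashes
push @hits, "hash"    if $path =~ m#/usr/local/bin/#;
push @hits, "braces"  if $path =~ m{/usr/local/bin/};
push @hits, "angles"  if $path =~ m</usr/local/bin/>;

# A pattern held in a variable can contain slashes without escaping.
my $pattern = "/usr/local/bin";
push @hits, "variable" if $path =~ m/$pattern/;

# With no search string and no m, /pattern/ searches $_.
for ($path) {
    push @hits, "default" if /local/;
}

print "@hits\n";
```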

The options tell Perl how to optimize and perform the matching: g will cause the search to match as many times as possible (in other words, to perform a "global" search); i will cause the search to be case-insensitive; o will interpolate any variables in the pattern only once; m will cause the string to be searched as multiple lines, so that ^ and $ match at the beginning and end of every line; s will cause the string to be searched as a single line, so that . also matches a newline; and x enables Perl's extensions to regular expressions. The principal such extension in Perl 5 is to ignore any white space in a search pattern (an unescaped # is also treated as the start of a comment that runs to the end of the line). This can make the pattern much easier to read, but it also means that a literal white space character in the pattern string will not match itself in the search string. (The very last example in this appendix illustrates an effective use of the /x modifier along with embedded comments.)
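
A quick way to get a feel for the options is to apply them to one small string. This sketch uses a made-up three-line string, and the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = "One\ntwo\nONE\n";

# g with i: count every match, ignoring case.
my $ci_count = () = $text =~ /one/gi;

# m: ^ and $ match at the beginning and end of every line.
my @lines = $text =~ /^(\w+)$/mg;

# s: the period also matches a newline; without s it does not.
my $dot_with_s    = ($text =~ /One.two/s) ? 1 : 0;
my $dot_without_s = ($text =~ /One.two/)  ? 1 : 0;

# x: white space in the pattern is ignored.
my $spaced = ($text =~ / t w o /x) ? 1 : 0;

print "$ci_count @lines $dot_with_s $dot_without_s $spaced\n";
```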

Perl's Search-and-Replace Operator: s///

Perl's search-and-replace operator, s///, is used as follows:

$matches = (string =~ s/pattern/replacement/options);

This construct searches string for the regular expression pattern, replaces one or all of any matching substrings with replacement, and returns the number of substitutions made. If there were no matches, the s operator returns false (''). If no string is specified via the =~ or !~ operator, the special variable $_ is searched and modified. If specified, string must be an lvalue; that is, either a variable that evaluates as a scalar value or an assignment to such a variable.
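
The return value and the lvalue requirement can be seen in a few lines. This is a minimal sketch with invented sample text.

```perl
use strict;
use warnings;

my $text = "red fish, red boat";    # sample text (an assumption)

# An assignment is an lvalue, so a copy can be made and edited in one step.
(my $once = $text) =~ s/red/blue/;            # first match only

my $copy  = $text;
my $count = ($copy =~ s/red/blue/g);          # every match
my $none  = ($copy =~ s/green/blue/g);        # no match: false ('')

print "$once | $copy | count=$count | none='$none'\n";
```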

As with the m// operator, the pattern delimiter can be nearly any non-alphanumeric character instead of /, and a few such delimiters have special meanings. (You can think of the ?? operator in this way.) If the delimiter chosen is the single quote character, no variable interpolation is done on either the pattern or the replacement. Otherwise, if pattern contains a $ followed by an alphanumeric character (so that it looks like a variable rather than an end-of-string test), the variable will be interpolated into pattern at runtime. Variables in replacement will also be interpolated. (The /e modifier forces this behavior even if the delimiter is the single quote character.) If backquotes are used as delimiters, the replacement string will be executed as a shell command and its output will be used as the actual replacement text. If pattern is delimited by a pair of bracketing characters, replacement must have its own pair of delimiters, which need not be the same. Two examples of this approach are s(foo)[bar] and s <foos>/ball/.

The options are the same as for m//, except that the /g modifier causes the pattern matching operation to replace all occurrences of the pattern (in other words, to perform a global replacement), and to return the total number of replacements. There is also one additional option: the /e modifier causes the operation to evaluate the replacement string as a full-fledged Perl expression (possibly using the equivalent of an eval), as in:

$escapes = ($name =~ s/%([0-9A-Za-z][0-9A-Za-z])/pack("C", hex($1))/eg);

Here the value of the special "memory" variable $1 will be whatever has just matched the parenthesized sub-expression ([0-9A-Za-z][0-9A-Za-z]), namely, a 2-character string representing a hexadecimal number. The hex function returns the decimal equivalent of this hexadecimal number, and the pack function with the parameter "C" returns the ASCII character corresponding to this number. Thus the overall effect of this expression is to replace a URL-escaped character with the equivalent literal character. (Note that % is not a special character; it matches only itself.) Further examples of /e's usage are given below, as well as a caveat.
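
Run on a concrete value, the statement behaves as follows. This is a sketch only: the input string is hypothetical, and the character class is narrowed here to the hexadecimal digits [0-9A-Fa-f], which is an assumption rather than the form used above.

```perl
use strict;
use warnings;

my $name = "John%20Q%2E%20Public";    # hypothetical URL-escaped value

# Each %XX is replaced by the character whose code is hexadecimal XX.
my $escapes = ($name =~ s/%([0-9A-Fa-f][0-9A-Fa-f])/pack("C", hex($1))/eg);

print "$name ($escapes escapes replaced)\n";
```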

Examples of s///'s Usage

Many of the following examples are identical or similar to those on Perl's manual page (as converted to perlre.html), but here they are accompanied by explanations:

The following statement replaces all occurrences of green as a whole word in the current contents of $_ with mauve:

s/\bgreen\b/mauve/g;        # don't change wintergreen

This statement uses | instead of / as the delimiter, and replaces the first occurrence of /usr/bin with /usr/local/bin in $path:

$path =~ s|/usr/bin|/usr/local/bin|;

This one substitutes the current values of $foo and $bar in the search pattern and the replacement string, respectively, before performing the search and replacement operation:

s/Login: $foo/Login: $bar/;       # pattern computed at runtime

If $foo or $bar is not defined, it is replaced by nothing. Note that when $ appears in the interior of a search string, it loses its special meaning as an assertion that matches the end of a line or string. To match $ as a literal character, however, you have to use \$.

The following statement assigns the value of $bar to $foo and then replaces the first occurrence of this with that in $foo but not in $bar:

($foo = $bar) =~ s/this/that/;

This one uses memory variables to reverse the first two space-separated sub-strings in $_:

s/([^ ]*) *([^ ]*)/$2 $1/;  # reverse the first two fields

In the next example, the replacement string is actually a Perl expression, so you have to use the /e modifier:

s/(\d+) elf/($1 != 1 ? "$1 elves" : $&)/ge;

This statement replaces all substrings in $_ that consist of a number followed by the word elf with the same number followed by the word elves, unless the number is 1, in which case a matching substring is replaced by itself. The special variable $& always contains the string matched by the last (successful) pattern match.
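
Applied to a sample sentence, the statement above gives the following result. The sentence and the variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $text = "1 elf met 3 elf and 12 elf";    # sample sentence (an assumption)

# Work on a copy so that the original stays available for comparison.
(my $fixed = $text) =~ s/(\d+) elf/($1 != 1 ? "$1 elves" : $&)/ge;

print "$fixed\n";    # 1 elf met 3 elves and 12 elves
```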

With the /e modifier, you can also call your own subroutines (as opposed to Perl's built-in functions) within a replacement expression:

s/^=(\w+)/&myFunc($1)/ge;      # use function call

In my experience, however, this only works well in Perl 5 on a Unix platform. Both NTPerl and MacPerl are prone to abort execution with diagnostic messages such as "Out of memory" or "panic: realloc," so beware.

You can even nest the /e modifiers; the following statement will expand simple embedded variables in $_:

s/(\$\w+)/$1/eeg;
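
To see what the double evaluation does, consider this minimal sketch. The template and the package variables $greeting and $target are hypothetical. Because the second evaluation runs text from the string as Perl code, it should only be applied to text you trust.

```perl
use strict;
use warnings;

# Package variables that the template refers to (hypothetical names).
our $greeting = "hello";
our $target   = "world";

# Single quotes keep the dollar signs literal in the template itself.
my $template = 'Say $greeting to the $target.';

# First evaluation: $1 yields the text '$greeting' (or '$target').
# Second evaluation: that text is run as Perl code, yielding the value.
(my $expanded = $template) =~ s/(\$\w+)/$1/eeg;

print "$expanded\n";    # Say hello to the world.
```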

The following statement finds all the relative hyperlinks in the text stored in $html and replaces them with forms:

$html =~ s[<A\s+HREF\s*=\s*"?/(.*)"?\s*>\s*(.+?)\s*</A>]
{
<FORM ACTION=/cgi-bin/myParser.cgi/$1>
<INPUT TYPE=SUBMIT NAME=SubmitFromLink VALUE="$2">
</FORM>
}ig;

The hyperlink reference field from the HREF attribute is turned into extra path information at the end of the FORM tag's ACTION attribute, and the link's anchor text becomes the VALUE field in the INPUT element that defines a SUBMIT button. The line breaks between the bracketing delimiters {} are included in the replacement string.

This last example removes all SSI-style directives from the text stored in $html:

$html =~ s {
    <!--\#  (?# Match the opening delimiter)
    .*?     (?# Match a minimal number of characters)
    -->     (?# Match the closing delimiter)
} []gsx;

Here the /s modifier causes the search string to be treated as a single line, and the /x modifier causes any white space in the search pattern to be ignored. (Note that this is not the same thing as ignoring white space in the search string; for that, use \s*, as in the previous example.) Because /x also treats an unescaped # as the start of a comment, the # in the opening delimiter is written as \#. The expressions delimited by (?# and ) don't match anything; they are merely comments embedded in the search expression. The replacement string, delimited by [], is nothing.
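
The following sketch applies the same idea to a small document, assuming that directives have the form <!--#...-->. The sample HTML is made up, and end-of-line # comments are used in place of the (?#...) form, which the /x modifier also permits.

```perl
use strict;
use warnings;

# Hypothetical page with one directive that spans two lines.
my $html = qq{<p>Top</p>\n<!--#include\n  virtual="/footer.html" -->\n<p>End</p>\n};

$html =~ s{
    <!--\#   # opening delimiter; the hash sign is escaped because of option x
    .*?      # as few characters as possible, newlines included (option s)
    -->      # closing delimiter
}{}gsx;

print $html;
```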