PCREMATCHING(3)                                                PCREMATCHING(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE MATCHING ALGORITHMS

       This  document  describes  the  two different algorithms
       that are available in PCRE for matching a compiled regu-
       lar  expression  against  a  given  subject  string. The
       "standard"  algorithm  is  the  one  provided   by   the
       pcre_exec()  function.   This  works  in the same way as
       Perl's matching function, and provides a Perl-compatible
       matching operation.

       An    alternative   algorithm   is   provided   by   the
       pcre_dfa_exec() function; this operates in  a  different
       way,  and  is not Perl-compatible. It has advantages and
       disadvantages compared with the standard algorithm,  and
       these are described below.

       When  there  is  only  one possible way in which a given
       subject string can match a pattern, the  two  algorithms
       give the same answer. A difference arises, however, when
       there are multiple possibilities. For  example,  if  the
       pattern

         ^<.*>

       is matched against the string

         <something> <something else> <something further>

       there are three possible answers. The standard algorithm
       finds only one of them, whereas  the  alternative  algo-
       rithm finds all three.
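
       The three candidate answers can be enumerated with a short
       sketch. It uses Python's re module purely as a stand-in for
       a Perl-style backtracking matcher (an assumption made for
       illustration; it is not PCRE), and the helper name
       all_matches_at_start is hypothetical.

```python
import re

SUBJECT = "<something> <something else> <something further>"
PATTERN = r"<.*>"   # the pattern above; fullmatch() supplies the ^ anchoring


def all_matches_at_start(pattern, subject):
    """Return every prefix of subject that the whole pattern matches.

    Brute force over all prefixes. This only illustrates the set of
    possible answers; it is not how either PCRE algorithm works.
    """
    compiled = re.compile(pattern)
    return [subject[:end]
            for end in range(len(subject) + 1)
            if compiled.fullmatch(subject[:end])]


# Every possible answer, shortest first.
candidates = all_matches_at_start(PATTERN, SUBJECT)

# A backtracking matcher reports just one of the candidates.
first_found = re.match(r"^<.*>", SUBJECT).group(0)
```

       Here candidates holds all three strings, while first_found
       is only the one that the greedy quantifier reaches first.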

REGULAR EXPRESSIONS AS TREES

       The set of strings that are matched by a regular expres-
       sion can be represented as a tree structure.  An  unlim-
       ited  repetition  in the pattern makes the tree of infi-
       nite size, but it is still a tree. Matching the  pattern
       to  a given subject string (from a given starting point)
       can be thought of as a search of the  tree.   There  are
       two  ways  to  search  a  tree: depth-first and breadth-
       first, and these correspond to the  two  matching  algo-
       rithms provided by PCRE.

THE STANDARD MATCHING ALGORITHM

       In  the  terminology  of Jeffrey Friedl's book Mastering
       Regular Expressions, the standard algorithm is  an  "NFA
       algorithm". It conducts a depth-first search of the pat-
       tern tree. That is, it  proceeds  along  a  single  path
       through the tree, checking that the subject matches what
       is required. When there is  a  mismatch,  the  algorithm
       tries any alternatives at the current point, and if they
       all fail, it backs up to the previous  branch  point  in
       the  tree, and tries the next alternative branch at that
       level. This often involves backing  up  (moving  to  the
       left)  in the subject string as well. The order in which
       repetition branches  are  tried  is  controlled  by  the
       greedy or ungreedy nature of the quantifier.

       If  a  leaf  node is reached, a matching string has been
       found, and at that point the algorithm stops.  Thus,  if
       there  is  more  than one possible match, this algorithm
       returns the first one that it finds. Whether this is the
       shortest,  the  longest,  or  some  intermediate  length
       depends on the way the greedy  and  ungreedy  repetition
       quantifiers are specified in the pattern.

       Because  it ends up with a single path through the tree,
       it is relatively straightforward for this  algorithm  to
       keep  track  of  the substrings that are matched by por-
       tions of the pattern in parentheses. This provides  sup-
       port for capturing parentheses and back references.
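
       The effect of greedy and ungreedy quantifiers on which match
       is found first, and the availability of captured substrings,
       can be seen in a small sketch. Python's re module is assumed
       here as a stand-in backtracking matcher; it is not PCRE, and
       the variable names are illustrative only.

```python
import re

SUBJECT = "<something> <something else> <something further>"

# Greedy: .* first tries to consume as much as possible, then backs up.
greedy = re.match(r"^<(.*)>", SUBJECT)

# Ungreedy: .*? first tries to consume as little as possible.
lazy = re.match(r"^<(.*?)>", SUBJECT)

# Each search stops at the first leaf it reaches, and reports that match
# together with the substring captured by the parentheses.
greedy_result = (greedy.group(0), greedy.group(1))
lazy_result = (lazy.group(0), lazy.group(1))
```

       The greedy form returns the whole subject, and the ungreedy
       form returns only "<something>"; neither reports the
       intermediate possibility.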

THE ALTERNATIVE MATCHING ALGORITHM

       This  algorithm  conducts  a breadth-first search of the
       tree. Starting from the first matching point in the sub-
       ject,  it  scans  the subject string from left to right,
       once, character by character, and as it  does  this,  it
       remembers  all the paths through the tree that represent
       valid matches. In Friedl's terminology, this is  a  kind
       of  "DFA  algorithm",  though it is not implemented as a
       traditional finite  state  machine  (it  keeps  multiple
       states active simultaneously).

       The  scan  continues until either the end of the subject
       is reached, or there are no more unterminated paths.  At
       this  point,  terminated  paths  represent the different
       matching possibilities (if there are none, the match has
       failed).   Thus,  if  there  is  more  than one possible
       match, this algorithm finds all of them, and in particular,
       it finds the longest. In PCRE, the PCRE_DFA_SHORTEST option
       stops the algorithm after the first match (which is
       necessarily the shortest) has been found.

       Note  that  all  the matches that are found start at the
       same point in the subject. If the pattern

         cat(er(pillar)?)?

       is matched against the string "the caterpillar catchment",
       the result will be the three strings "cat", "cater",
       and "caterpillar" that start at the fifth
       character  of  the subject. The algorithm does not auto-
       matically move on to find matches that  start  at  later
       positions.
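
       The single left-to-right scan can be sketched with a toy
       simulation. The automaton below is hand-built (an
       assumption for illustration) to accept "cat", "cater", and
       "caterpillar"; the names TRANSITIONS, ACCEPTING, and
       breadth_first_matches are hypothetical and do not come from
       PCRE, whose implementation is not shown here.

```python
# State n means "n characters of 'caterpillar' have been consumed".
WORD = "caterpillar"
TRANSITIONS = {n: [(WORD[n], n + 1)] for n in range(len(WORD))}
ACCEPTING = {3, 5, 11}      # after "cat", "cater", "caterpillar"


def breadth_first_matches(subject, start):
    """Scan subject once from start, advancing every live path in lockstep.

    Returns the matched substrings, longest first.
    """
    active = {0}            # states of all unterminated paths
    ends = []               # offsets at which some path terminated
    pos = start
    while active:
        if active & ACCEPTING:
            ends.append(pos)
        if pos == len(subject):
            break
        ch = subject[pos]
        # Advance every active state on the same character; paths that
        # cannot consume it simply drop out of the set.
        active = {nxt
                  for state in active
                  for (want, nxt) in TRANSITIONS.get(state, [])
                  if want == ch}
        pos += 1
    return [subject[start:end] for end in reversed(ends)]


SUBJECT = "the caterpillar catchment"
matches = breadth_first_matches(SUBJECT, SUBJECT.index("cat"))
```

       For this particular automaton only one state is live at a
       time, but the set representation is what allows several
       paths to be followed at once in the general case. The scan
       never moves backwards, and it does not go on to look at the
       later "cat" in "catchment".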

       There are a number of features of PCRE regular expressions
       that are not supported, or that behave differently, when
       the alternative matching algorithm is used. They are as
       follows:

       1. Because the algorithm finds all possible matches, the
       greedy or ungreedy nature of repetition  quantifiers  is
       not   relevant.  Greedy  and  ungreedy  quantifiers  are
       treated in exactly the  same  way.  However,  possessive
       quantifiers  can  make  a  difference  when what follows
       could also match what is quantified, for  example  in  a
       pattern like this:

         ^a++\w!

       This pattern matches "aaab!" but not "aaa!", which would
       be matched by a non-possessive quantifier. Similarly, if
       an  atomic group is present, it is matched as if it were
       a standalone pattern  at  the  current  point,  and  the
       longest  match  is  then "locked in" for the rest of the
       overall pattern.

       2. When dealing with multiple  paths  through  the  tree
       simultaneously,  it is not straightforward to keep track
       of captured substrings for the different matching possi-
       bilities,  and  PCRE's  implementation of this algorithm
       does not attempt to do this. This means that no captured
       substrings are available.

       3.  Because  no substrings are captured, back references
       within the pattern are not supported, and  cause  errors
       if encountered.

       4. For the same reason, conditional expressions that use
       a back reference as the condition, or that test for
       recursion into a specific group, are not supported.

       5. Callouts are supported, but the value of the
       capture_top field is always 1, and the value of the
       capture_last field is always -1.

       6.  The \C escape sequence, which (in the standard algo-
       rithm) matches a single byte, even in UTF-8 mode, is not
       supported   because   the  alternative  algorithm  moves
       through the subject string one character at a time,  for
       all active paths through the tree.
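
       The possessive example in item 1 can be checked with a
       backtracking matcher. Python's re module (a stand-in, not
       PCRE) accepts possessive quantifier syntax only from
       version 3.11, so this sketch emulates a++ with the
       lookahead-plus-back-reference idiom (?=(a+))\1, which is
       assumed here to behave equivalently for this pattern.

```python
import re

# a++ emulated as (?=(a+))\1 : the lookahead grabs the whole run of a's
# and gives nothing back, so \w never gets a chance to match the last "a".
possessive = re.compile(r"^(?=(a+))\1\w!")

# Ordinary greedy a+ can back off by one "a" to let \w match it.
backtracking = re.compile(r"^a+\w!")

# Map each subject to (possessive matched?, non-possessive matched?).
results = {
    subject: (bool(possessive.match(subject)),
              bool(backtracking.match(subject)))
    for subject in ("aaab!", "aaa!")
}
```

       Both forms match "aaab!", but only the non-possessive form
       matches "aaa!".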

ADVANTAGES OF THE ALTERNATIVE ALGORITHM

       Using  the  alternative  matching algorithm provides the
       following advantages:

       1. All possible matches (at a single point in  the  sub-
       ject)  are  automatically  found, and in particular, the
       longest match is found. To find more than one match
       using the standard algorithm, you have to resort to
       awkward workarounds involving callouts.

       2. There is much better support  for  partial  matching.
       The  restrictions  on  the  content  of the pattern that
       apply when using  the  standard  algorithm  for  partial
       matching  do not apply to the alternative algorithm. For
       non-anchored patterns, the starting position of  a  par-
       tial match is available.

       3.  Because  the alternative algorithm scans the subject
       string just once, and never needs to  backtrack,  it  is
       possible to pass very long subject strings to the match-
       ing function in several  pieces,  checking  for  partial
       matching each time.
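
       Why a single forward scan makes piecewise matching possible
       can be sketched as follows: the only thing that has to
       survive between pieces is the set of active states. The
       class PiecewiseMatcher and its hand-built automaton
       (accepting "cat", "cater", and "caterpillar") are
       hypothetical and are not PCRE's interface; in PCRE itself
       the corresponding facility is understood to be the
       PCRE_PARTIAL and PCRE_DFA_RESTART options of
       pcre_dfa_exec().

```python
WORD = "caterpillar"
ACCEPTING = {3, 5, 11}      # after "cat", "cater", "caterpillar"


class PiecewiseMatcher:
    """Toy breadth-first matcher that can be fed the subject in pieces."""

    def __init__(self):
        self.active = {0}           # live states, carried between pieces
        self.consumed = 0           # characters matched so far
        self.match_lengths = []     # lengths of complete matches seen

    def feed(self, piece):
        """Consume one piece; return True while a longer match is possible."""
        for ch in piece:
            self.active = {s + 1 for s in self.active
                           if s < len(WORD) and WORD[s] == ch}
            if not self.active:
                break               # every path has died; stop scanning
            self.consumed += 1
            if self.active & ACCEPTING:
                self.match_lengths.append(self.consumed)
        return bool(self.active)


matcher = PiecewiseMatcher()
status = [matcher.feed(piece) for piece in ("ca", "terpil", "lar cat")]
```

       A True result after a piece plays the role of a partial
       match: more input is needed before the outcome is known.
       No earlier piece has to be kept or rescanned.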

DISADVANTAGES OF THE ALTERNATIVE ALGORITHM

       The  alternative algorithm suffers from a number of dis-
       advantages:

       1. It is substantially slower than  the  standard  algo-
       rithm.  This  is partly because it has to search for all
       possible matches, but is also because it is less suscep-
       tible to optimization.

       2.  Capturing  parentheses  and  back references are not
       supported.

       3. Although atomic groups are supported, their use  does
       not  provide  the performance advantage that it does for
       the standard algorithm.

Last updated: 24 November 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                               PCREMATCHING(3)
