PCREPERFORM(3)                                                  PCREPERFORM(3)



NAME
       PCRE - Perl-compatible regular expressions

PCRE PERFORMANCE

       Two  aspects  of performance are discussed below: memory
       usage and processing time. The way you express your pat-
       tern as a regular expression can affect both of them.

MEMORY USAGE

       Patterns  are  compiled  by PCRE into a reasonably effi-
       cient byte code, so that most simple patterns do not use
       much  memory.  However,  there  is one case where memory
       usage can be unexpectedly large.  When  a  parenthesized
       subpattern  has a quantifier with a minimum greater than
       1 and/or a limited  maximum,  the  whole  subpattern  is
       repeated in the compiled code. For example, the pattern

         (abc|def){2,4}

       is compiled as if it were

         (abc|def)(abc|def)((abc|def)(abc|def)?)?

       (Technical  aside: It is done this way so that backtrack
       points within each of the repetitions  can  be  indepen-
       dently maintained.)

       For regular expressions whose quantifiers use only small
       numbers, this is not usually a problem. However, if  the
       numbers  are large, and particularly if such repetitions
       are nested, the memory usage can  become  an  embarrass-
       ment. For example, the very simple pattern

         ((ab){1,1000}c){1,3}

       uses 51K bytes when compiled. When PCRE is compiled with
       its default internal pointer size of two bytes, the size
       limit  on a compiled pattern is 64K, and this is reached
       with the  above  pattern  if  the  outer  repetition  is
       increased  from  3  to  4.  PCRE  can be compiled to use
       larger internal pointers and thus handle larger compiled
       patterns,  but  it is better to try to rewrite your pat-
       tern to use less memory if you can.

       One way of reducing the memory usage for  such  patterns
       is to make use of PCRE's "subroutine" facility. Re-writ-
       ing the above pattern as

         ((ab)(?2){0,999}c)(?1){0,2}

       reduces the memory requirements to 18K,  and  indeed  it
       remains   under  20K  even  with  the  outer  repetition
       increased to 100. However, this pattern is  not  exactly
       equivalent,  because  the "subroutine" calls are treated
       as atomic groups into which there can be no backtracking
       if  there  is  a subsequent matching failure. Therefore,
       PCRE cannot do this  kind  of  rewriting  automatically.
       Furthermore,  there  is  a noticeable loss of speed when
       executing the modified  pattern.  Nevertheless,  if  the
       atomic  grouping  is not a problem and the loss of speed
       is acceptable, this kind of rewriting will allow you  to
       process patterns that PCRE cannot otherwise handle.

PROCESSING TIME

       Certain   items   in  regular  expression  patterns  are
       processed more efficiently than others. It is more effi-
       cient  to  use a character class like [aeiou] than a set
       of single-character alternatives such as (a|e|i|o|u). In
       general,  the  simplest  construction  that provides the
       required behaviour is usually the most  efficient.  Jef-
       frey Friedl's book contains a lot of useful general dis-
       cussion about optimizing regular expressions  for  effi-
       cient performance. This document contains a few observa-
       tions about PCRE.

       Using Unicode character properties (the \p, \P,  and  \X
       escapes)  is  slow, because PCRE has to scan a structure
       that contains data for over fifteen thousand  characters
       whenever  it  needs  a  character's property. If you can
       find an alternative pattern that does not use  character
       properties, it will probably be faster.

       When  a pattern begins with .* not in parentheses, or in
       parentheses that are not the subject of a backreference,
       and  the  PCRE_DOTALL  option  is  set,  the  pattern is
       implicitly anchored by PCRE, since it can match only  at
       the  start  of a subject string. However, if PCRE_DOTALL
       is not set, PCRE cannot make this optimization,  because
       the  .  metacharacter does not then match a newline, and
       if the subject string contains newlines, the pattern may
       match  from  the  character immediately following one of
       them instead of from the very start.  For  example,  the
       pattern

         .*second

       matches the subject "first\nand second" (where \n stands
       for a newline character), with the match starting at the
       seventh  character.  In  order  to  do this, PCRE has to
       retry the match starting after every newline in the sub-
       ject.

       If  you  are  using  such a pattern with subject strings
       that do not contain newlines, the  best  performance  is
       obtained by setting PCRE_DOTALL, or starting the pattern
       with ^.* or ^.*? to indicate  explicit  anchoring.  That
       saves PCRE from having to scan along the subject looking
       for a newline to restart at.

       Beware  of  patterns  that  contain  nested   indefinite
       repeats.  These can take a long time to run when applied
       to a string that does not match.  Consider  the  pattern
       fragment

         ^(a+)*

       This  can  match  "aaaa"  in 16 different ways, and this
       number increases very rapidly as the string gets longer.
       (The  * repeat can match 0, 1, 2, 3, or 4 times, and for
       each of those cases other than 0 or 4, the + repeats can
       match different numbers of times.) When the remainder of
       the pattern is such that the entire match  is  going  to
       fail, PCRE has in principle to try every possible varia-
       tion, and this can take an extremely long time, even for
       relatively short strings.

       An  optimization  catches  some of the more simple cases
       such as

         (a+)*b

       where a literal character follows. Before  embarking  on
       the  standard matching procedure, PCRE checks that there
       is a "b" later in the subject string, and  if  there  is
       not, it fails the match immediately. However, when there
       is no following  literal  this  optimization  cannot  be
       used.  You  can  see  the  difference  by  comparing the
       behaviour of

         (a+)*\d

       with the pattern  above.  The  former  gives  a  failure
       almost  instantly  when  applied  to a whole line of "a"
       characters, whereas the latter takes an appreciable time
       with strings longer than about 20 characters.

       In  many cases, the solution to this kind of performance
       issue is to use an atomic group or a possessive  quanti-
       fier.

Last updated: 20 September 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                                PCREPERFORM(3)
