PCRE(3)                                                                PCRE(3)



NAME
       PCRE - Perl-compatible regular expressions

INTRODUCTION

       The  PCRE  library  is a set of functions that implement
       regular expression pattern matching using the same  syn-
       tax  and semantics as Perl, with just a few differences.
       (Certain features  that  appeared  in  Python  and  PCRE
       before  they  appeared  in Perl are also available using
       the Python syntax.)

       The current implementation of PCRE (release 7.x)  corre-
       sponds  approximately  with Perl 5.10, including support
       for UTF-8 encoded strings and Unicode  general  category
       properties. However, UTF-8 and Unicode support has to be
       explicitly enabled; it is not the default.  The  Unicode
       tables correspond to Unicode release 5.0.0.

       In  addition  to  the Perl-compatible matching function,
       PCRE contains  an  alternative  matching  function  that
       matches  the  same compiled patterns in a different way.
       In certain circumstances, the alternative  function  has
       some  advantages.  For  a discussion of the two matching
       algorithms, see the pcrematching page.

       PCRE is written in C and released as a C library. A num-
       ber  of  people  have written wrappers and interfaces of
       various kinds. In particular, Google Inc.  have provided
       a  comprehensive  C++  wrapper.  This is now included as
       part of the PCRE  distribution.  The  pcrecpp  page  has
       details  of this interface. Other people's contributions
       can be found in the Contrib directory at the primary FTP
       site, which is:

       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

       Details  of  exactly  which Perl regular expression fea-
       tures are and are not supported by  PCRE  are  given  in
       separate  documents.  See the pcrepattern and pcrecompat
       pages.

       Some features of PCRE  can  be  included,  excluded,  or
       changed  when  the  library  is built. The pcre_config()
       function makes it possible  for  a  client  to  discover
       which  features  are  available. The features themselves
       are described in the pcrebuild page. Documentation about
       building PCRE for various operating systems can be found
       in the README file in the source distribution.

       The library contains a number of  undocumented  internal
       functions and data tables that are used by more than one
       of the exported external functions, but  which  are  not
       intended  for  use  by external callers. Their names all
       begin with "_pcre_", which hopefully  will  not  provoke
       any  name  clashes. In some environments, it is possible
       to control which external symbols are  exported  when  a
       shared  library is built, and in these cases the undocu-
       mented symbols are not exported.

USER DOCUMENTATION

       The user documentation for PCRE comprises  a  number  of
       different  sections.  In the "man" format, each of these
       is a separate "man page". In the HTML format, each is  a
       separate  page, linked from the index page. In the plain
       text format, all the sections are concatenated, for ease
       of searching. The sections are as follows:

         pcre              this document
         pcreapi           details of PCRE's native C API
         pcrebuild         options for building PCRE
         pcrecallout       details of the callout feature
         pcrecompat        discussion of Perl compatibility
         pcrecpp           details of the C++ wrapper
         pcregrep          description of the pcregrep command
         pcrematching      discussion of the two matching algo-
       rithms
         pcrepartial        details  of  the  partial  matching
       facility
         pcrepattern       syntax and semantics of supported
                             regular expressions
         pcreperform       discussion of performance issues
         pcreposix         the POSIX-compatible C API
         pcreprecompile     details of saving and re-using pre-
       compiled patterns
         pcresample        discussion of the sample program
         pcrestack         discussion of stack usage
         pcretest          description of the pcretest  testing
       command

       In  addition,  in the "man" and HTML formats, there is a
       short page for each  C  library  function,  listing  its
       arguments and results.

LIMITATIONS

       There  are some size limitations in PCRE but it is hoped
       that they will never in practice be relevant.

       The maximum length of a compiled pattern is 65539  (sic)
       bytes  if  PCRE  is  compiled  with the default internal
       linkage size of  2.  If  you  want  to  process  regular
       expressions  that  are  truly  enormous, you can compile
       PCRE with an internal linkage size of 3 or  4  (see  the
       README file in the source distribution and the pcrebuild
       documentation for details). In these cases the limit  is
       substantially  larger.   However, the speed of execution
       is slower.

       All values in repeating quantifiers must  be  less  than
       65536. The maximum compiled length of subpattern with an
       explicit repeat count is 30000 bytes. The maximum number
       of capturing subpatterns is 65535.

       There is no limit to the number of parenthesized subpat-
       terns, but there can be no  more  than  65535  capturing
       subpatterns.

       The  maximum length of name for a named subpattern is 32
       characters, and the maximum number of named  subpatterns
       is 10000.

       The  maximum  length  of a subject string is the largest
       positive number that an integer variable can hold.  How-
       ever, when using the traditional matching function, PCRE
       uses recursion to handle subpatterns and indefinite rep-
       etition.   This means that the available stack space may
       limit the size of a subject string that can be processed
       by  certain  patterns. For a discussion of stack issues,
       see the pcrestack documentation.


UTF-8 AND UNICODE PROPERTY SUPPORT

       From  release  3.3,  PCRE  has  had  some  support   for
       character  strings  encoded  in  the  UTF-8  format. For
       release 4.0 this was greatly extended to cover most com-
       mon  requirements, and in release 5.0 additional support
       for Unicode general category properties was added.

       In order process UTF-8 strings, you must build  PCRE  to
       include UTF-8 support in the code, and, in addition, you
       must call pcre_compile() with the PCRE_UTF8 option flag.
       When  you  do  this,  both  the  pattern and any subject
       strings that are matched against it are treated as UTF-8
       strings instead of just strings of bytes.

       If  you  compile PCRE with UTF-8 support, but do not use
       it at run time, the library will be a  bit  bigger,  but
       the  additional  run time overhead is limited to testing
       the PCRE_UTF8 flag occasionally, so should not  be  very
       big.

       If PCRE is built with Unicode character property support
       (which implies  UTF-8  support),  the  escape  sequences
       \p{..},  \P{..},  and  \X  are supported.  The available
       properties that can be tested are limited to the general
       category  properties such as Lu for an upper case letter
       or Nd for a decimal number,  the  Unicode  script  names
       such  as  Arabic  or Han, and the derived properties Any
       and L&. A full list is given in the pcrepattern documen-
       tation.  Only  the  short  names for properties are sup-
       ported. For example, \p{L} matches a  letter.  Its  Perl
       synonym,  \p{Letter}, is not supported.  Furthermore, in
       Perl, many properties  may  optionally  be  prefixed  by
       "Is",  for  compatibility  with  Perl 5.6. PCRE does not
       support this.

       The following comments apply when  PCRE  is  running  in
       UTF-8 mode:

       1.  When  you set the PCRE_UTF8 flag, the strings passed
       as patterns and subjects are  checked  for  validity  on
       entry  to  the  relevant  functions. If an invalid UTF-8
       string is passed, an error return is given. In some sit-
       uations,  you  may  already  know  that your strings are
       valid, and therefore want to skip these checks in  order
       to    improve    performance.    If    you    set    the
       PCRE_NO_UTF8_CHECK flag at compile time or at run  time,
       PCRE  assumes  that  the  pattern or subject it is given
       (respectively) contains only valid UTF-8 codes. In  this
       case,  it  does not diagnose an invalid UTF-8 string. If
       you  pass  an  invalid  UTF-8  string   to   PCRE   when
       PCRE_NO_UTF8_CHECK  is  set,  the results are undefined.
       Your program may crash.

       2. An unbraced  hexadecimal  escape  sequence  (such  as
       \xb3) matches a two-byte UTF-8 character if the value is
       greater than 127.

       3. Octal numbers up to \777 are  recognized,  and  match
       two-byte  UTF-8 characters for values greater than \177.

       4. Repeat quantifiers apply to  complete  UTF-8  charac-
       ters,  not to individual bytes, for example: \x{100}{3}.

       5. The dot metacharacter  matches  one  UTF-8  character
       instead of a single byte.

       6.  The escape sequence \C can be used to match a single
       byte in UTF-8 mode, but its use can lead to some strange
       effects.  This facility is not available in the alterna-
       tive matching function, pcre_dfa_exec().

       7. The character escapes \b, \B, \d, \D, \s, \S, \w, and
       \W  correctly test characters of any code value, but the
       characters that PCRE recognizes as  digits,  spaces,  or
       word  characters remain the same set as before, all with
       values less than 256. This remains true even  when  PCRE
       includes  Unicode property support, because to do other-
       wise would slow down PCRE in many common cases.  If  you
       really  want to test for a wider sense of, say, "digit",
       you must use Unicode property tests such as \p{Nd}.

       8. Similarly, characters  that  match  the  POSIX  named
       character classes are all low-valued characters.

       9.  Case-insensitive matching applies only to characters
       whose values are less than 128,  unless  PCRE  is  built
       with  Unicode  property support. Even when Unicode prop-
       erty support is available, PCRE still uses its own char-
       acter  tables when checking the case of low-valued char-
       acters, so as not to degrade performance.   The  Unicode
       property  information  is  used only for characters with
       higher values. Even when  Unicode  property  support  is
       available,  PCRE supports case-insensitive matching only
       when there is a one-to-one mapping  between  a  letter's
       cases.  There are a small number of many-to-one mappings
       in Unicode; these are not supported by PCRE.

AUTHOR

       Philip Hazel
       University Computing Service,
       Cambridge CB2 3QH, England.

       Putting an actual email address here seems to have  been
       a  spam  magnet,  so  I've taken it away. If you want to
       email me, use my initial and  surname,  separated  by  a
       dot, at the domain ucs.cam.ac.uk.

Last updated: 23 November 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                                       PCRE(3)
