PCREPOSIX(3)                                                      PCREPOSIX(3)



NAME
       PCRE - Perl-compatible regular expressions.

SYNOPSIS OF POSIX API

       #include <pcreposix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);

DESCRIPTION

       This  set of functions provides a POSIX-style API to the
       PCRE regular expression package. See the  pcreapi  docu-
       mentation  for a description of PCRE's native API, which
       contains much additional functionality.

       The functions described here are just wrapper  functions
       that  ultimately  call the PCRE native API. Their proto-
       types are defined in the pcreposix.h header file, and on
       Unix  systems  the library itself is called pcreposix.a,
       so can be accessed by adding -lpcreposix to the  command
       for  linking  an application that uses them. Because the
       POSIX functions call the native ones, it is also  neces-
       sary to add -lpcre.

       I  have  implemented  only those option bits that can be
       reasonably mapped to PCRE native options.  In  addition,
       the  option REG_EXTENDED is defined with the value zero.
       This has no effect, but since programs that are  written
       to  the POSIX interface often use it, this makes it eas-
       ier to slot in PCRE  as  a  replacement  library.  Other
       POSIX options are not even defined.

       When  PCRE is called via these functions, it is only the
       API that is POSIX-like in style. The syntax  and  seman-
       tics  of  the  regular  expressions themselves are still
       those of Perl, subject to the setting  of  various  PCRE
       options, as described below. "POSIX-like in style" means
       that the API approximates to the POSIX definition; it is
       not  fully  POSIX-compatible, and in multi-byte encoding
       domains it is probably even less compatible.

       The  header  for  these   functions   is   supplied   as
       pcreposix.h  to  avoid  any  potential  clash with other
       POSIX libraries.  It  can,  of  course,  be  renamed  or
       aliased as regex.h, which is the "correct" name. It pro-
       vides two structure types, regex_t for compiled internal
       forms, and regmatch_t for returning captured substrings.
       It also defines some constants whose  names  start  with
       "REG_"; these are used for setting options and identify-
       ing error codes.


COMPILING A PATTERN

       The function regcomp() is called to  compile  a  pattern
       into  an  internal  form.  The  pattern  is  a  C string
       terminated by a binary zero, and is passed in the  argu-
       ment  pattern.  The  preg  argument  is  a  pointer to a
       regex_t structure that is used as  a  base  for  storing
       information about the compiled regular expression.

       The  argument  cflags is either zero, or contains one or
       more of the bits defined by the following macros:

         REG_DOTALL

       The PCRE_DOTALL option is set when the  regular  expres-
       sion  is  passed for compilation to the native function.
       Note that REG_DOTALL is not part of the POSIX  standard.

         REG_ICASE

       The PCRE_CASELESS option is set when the regular expres-
       sion is passed for compilation to the native function.

         REG_NEWLINE

       The  PCRE_MULTILINE  option  is  set  when  the  regular
       expression is passed for compilation to the native func-
       tion. Note that this does not mimic  the  defined  POSIX
       behaviour for REG_NEWLINE (see the following section).

         REG_NOSUB

       The  PCRE_NO_AUTO_CAPTURE option is set when the regular
       expression is passed for compilation to the native func-
       tion.  In addition, when a pattern that is compiled with
       this flag is  passed  to  regexec()  for  matching,  the
       nmatch and pmatch arguments are ignored, and no captured
       strings are returned.

         REG_UTF8

       The PCRE_UTF8 option is set when the regular  expression
       is  passed  for compilation to the native function. This
       causes the pattern itself and all data strings used  for
       matching  it  to  be treated as UTF-8 strings. Note that
       REG_UTF8 is not part of the POSIX standard.

       In the absence of these flags, no options are passed  to
       the  native  function.  This means the the regex is com-
       piled with PCRE default semantics.  In  particular,  the
       way  it handles newline characters in the subject string
       is the Perl way, not the POSIX way.  Note  that  setting
       PCRE_MULTILINE  has  only  some of the effects specified
       for REG_NEWLINE. It does not affect the way newlines are
       matched  by  . (they aren't) or by a negative class such
       as [^a] (they are).

       The yield of regcomp() is zero on success, and  non-zero
       otherwise.  The  preg structure is filled in on success,
       and one member of the structure is public: re_nsub  con-
       tains the number of capturing subpatterns in the regular
       expression. Various  error  codes  are  defined  in  the
       header file.

MATCHING NEWLINE CHARACTERS

       This  area  is  not  simple, because POSIX and Perl take
       different views of things.  It is not  possible  to  get
       PCRE  to  obey  POSIX semantics, but then PCRE was never
       intended to be a POSIX engine. The following table lists
       the different possibilities for matching newline charac-
       ters in PCRE:

                                 Default   Change with

         . matches newline          no     PCRE_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE_DOLLARENDONLY
         $ matches \n in middle     no     PCRE_MULTILINE
         ^ matches \n in middle     no     PCRE_MULTILINE

       This is the equivalent table for POSIX:

                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE

       PCRE's behaviour is the  same  as  Perl's,  except  that
       there  is no equivalent for PCRE_DOLLAR_ENDONLY in Perl.
       In both PCRE and Perl, there is no way to  stop  newline
       from matching [^a].

       The  default  POSIX  newline handling can be obtained by
       setting PCRE_DOTALL and PCRE_DOLLAR_ENDONLY,  but  there
       is  no  way  to  make  PCRE  behave  exactly  as for the
       REG_NEWLINE action.

MATCHING A PATTERN

       The function regexec() is called  to  match  a  compiled
       pattern preg against a given string, which is terminated
       by a zero byte, subject to the options in eflags.  These
       can be:

         REG_NOTBOL

       The  PCRE_NOTBOL option is set when calling the underly-
       ing PCRE matching function.

         REG_NOTEOL

       The PCRE_NOTEOL option is set when calling the  underly-
       ing PCRE matching function.

       If  the pattern was compiled with the REG_NOSUB flag, no
       data about any matched strings is returned.  The  nmatch
       and pmatch arguments of regexec() are ignored.

       Otherwise,the  portion  of  the string that was matched,
       and also any captured substrings, are returned  via  the
       pmatch  argument,  which  points  to  an array of nmatch
       structures of type regmatch_t,  containing  the  members
       rm_so  and  rm_eo. These contain the offset to the first
       character of each substring and the offset to the  first
       character after the end of each substring, respectively.
       The 0th element of the vector relates to the entire por-
       tion  of  string  that  was matched; subsequent elements
       relate to  the  capturing  subpatterns  of  the  regular
       expression. Unused entries in the array have both struc-
       ture members set to -1.

       A successful match yields a zero return;  various  error
       codes   are   defined  in  the  header  file,  of  which
       REG_NOMATCH is the "expected" failure code.

ERROR MESSAGES

       The regerror() function maps a non-zero  errorcode  from
       either regcomp() or regexec() to a printable message. If
       preg is not NULL, the error should have arisen from  the
       use  of that structure. A message terminated by a binary
       zero is placed in errbuf. The  length  of  the  message,
       including the zero, is limited to errbuf_size. The yield
       of the function is the size of buffer needed to hold the
       whole message.

MEMORY USAGE

       Compiling a regular expression causes memory to be allo-
       cated and associated with the preg structure. The  func-
       tion  regfree()  frees all such memory, after which preg
       may no longer be used as a compiled expression.

AUTHOR

       Philip Hazel
       University Computing Service,
       Cambridge CB2 3QH, England.

Last updated: 16 January 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                                  PCREPOSIX(3)
