REGCOMP(3P)        POSIX Programmer's Manual        REGCOMP(3P)



PROLOG
       This  manual page is part of the POSIX Programmer's Man-
       ual.  The Linux implementation  of  this  interface  may
       differ  (consult the corresponding Linux manual page for
       details of Linux behavior), or the interface may not  be
       implemented on Linux.

NAME
       regcomp, regerror, regexec, regfree - regular expression
       matching

SYNOPSIS
       #include <regex.h>

       int regcomp(regex_t *restrict preg,
              const char *restrict pattern, int cflags);
       size_t regerror(int errcode, const regex_t *restrict preg,
              char *restrict errbuf, size_t errbuf_size);
       int regexec(const regex_t *restrict preg,
              const char *restrict string, size_t nmatch,
              regmatch_t pmatch[restrict], int eflags);
       void regfree(regex_t *preg);


DESCRIPTION
       These functions interpret  basic  and  extended  regular
       expressions  as described in the Base Definitions volume
       of IEEE Std 1003.1-2001, Chapter 9, Regular Expressions.

       The  regex_t  structure is defined in <regex.h> and con-
       tains at least the following member:

       Member Type   Member Name   Description
       size_t        re_nsub       Number of parenthesized
                                   subexpressions.

       The regmatch_t structure is  defined  in  <regex.h>  and
       contains at least the following members:

       Member Type   Member Name   Description
       regoff_t      rm_so         Byte offset from start of
                                   string to start of substring.
       regoff_t      rm_eo         Byte offset from start of
                                   string of the first character
                                   after the end of substring.

       The regcomp() function shall compile the regular expres-
       sion  contained  in the string pointed to by the pattern
       argument and place the results in the structure  pointed
       to  by  preg.  The cflags argument is the bitwise-inclu-
       sive OR of zero or more of the  following  flags,  which
       are defined in the <regex.h> header:

       REG_EXTENDED
              Use Extended Regular Expressions.

       REG_ICASE
              Ignore  case  in match. (See the Base Definitions
              volume of IEEE Std 1003.1-2001, Chapter 9,  Regu-
              lar Expressions.)

       REG_NOSUB
              Report only success/fail in regexec().

       REG_NEWLINE
              Change  the  handling of <newline>s, as described
              in the text.


       The default regular expression type  for  pattern  is  a
       Basic  Regular  Expression.  The application can specify
       Extended  Regular  Expressions  using  the  REG_EXTENDED
       cflags flag.

       If  the  REG_NOSUB flag was not set in cflags, then reg-
       comp() shall set re_nsub to the number of  parenthesized
       subexpressions  (delimited  by  "\(\)"  in basic regular
       expressions or "()"  in  extended  regular  expressions)
       found in pattern.
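
       As a minimal sketch of the compile step described above,
       the helper below compiles a pattern with caller-supplied
       cflags and reports re_nsub. The name
       count_subexpressions() is hypothetical and not part of
       <regex.h>; the -1 error return is a choice made for this
       illustration only.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of <regex.h>): compile pattern with
 * the given cflags and return re_nsub, or -1 if regcomp() fails.
 */
long count_subexpressions(const char *pattern, int cflags)
{
    regex_t re;
    long nsub;

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;          /* content of re is undefined on failure */
    nsub = (long)re.re_nsub;
    regfree(&re);           /* release memory allocated by regcomp() */
    return nsub;
}
```

       Passing 0 for cflags selects a Basic Regular Expression,
       in which unescaped parentheses are ordinary characters
       and do not count as subexpressions.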

       The  regexec()  function  compares  the  null-terminated
       string specified by string  with  the  compiled  regular
       expression  preg  initialized by a previous call to reg-
       comp().  If it finds a match, regexec() shall return  0;
       otherwise, it shall return non-zero indicating either no
       match or an error. The eflags argument is  the  bitwise-
       inclusive  OR  of  zero  or more of the following flags,
       which are defined in the <regex.h> header:

       REG_NOTBOL
              The first character of the string pointed  to  by
              string  is  not the beginning of the line. There-
              fore, the circumflex  character  (  '^'  ),  when
              taken as a special character, shall not match the
              beginning of string.

       REG_NOTEOL
              The last character of the string  pointed  to  by
              string is not the end of the line. Therefore, the
              dollar sign ( '$' ),  when  taken  as  a  special
              character, shall not match the end of string.
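
       The effect of these two flags can be observed with a
       small wrapper such as the one sketched here. The name
       match_with_eflags() is hypothetical; the sketch compiles
       a Basic Regular Expression with REG_NOSUB and treats a
       compile error as no match.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if the BRE pattern matches text when
 * regexec() is called with eflags, 0 otherwise. Compile errors are
 * treated as no match.
 */
int match_with_eflags(const char *pattern, const char *text, int eflags)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return 0;
    status = regexec(&re, text, (size_t)0, NULL, eflags);
    regfree(&re);
    return status == 0;
}
```

       For instance, the pattern "^abc" is expected to match
       "abcdef" with eflags of 0 but not with REG_NOTBOL,
       because the circumflex can then no longer match at the
       start of string.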


       If nmatch is 0 or REG_NOSUB was set in the cflags
       argument to regcomp(), then regexec() shall ignore the
       pmatch argument. Otherwise, the application shall ensure
       that the pmatch argument points to an array with at
       least nmatch elements, and regexec() shall fill in the
       elements of that array with offsets of the substrings of
       string that correspond to the parenthesized
       subexpressions of pattern: pmatch[i].rm_so shall be the
       byte offset of the beginning and pmatch[i].rm_eo shall
       be one greater than the byte offset of the end of
       substring i. (Subexpression i begins at the ith matched
       open parenthesis, counting from 1.) Offsets in pmatch[0]
       identify the substring that corresponds to the entire
       regular expression. Unused elements of pmatch up to
       pmatch[nmatch-1] shall be filled with -1. If there are
       more than nmatch subexpressions in pattern (pattern
       itself counts as a subexpression), then regexec() shall
       still do the match, but shall record only the first
       nmatch substrings.
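
       The following sketch shows one way to read the offsets
       back out of pmatch. The names capture(),
       print_captures(), and NMATCH are hypothetical and chosen
       for this illustration; the pattern is compiled as an
       Extended Regular Expression.

```c
#include <regex.h>
#include <stdio.h>

#define NMATCH 4    /* whole match plus up to three subexpressions */

/*
 * Hypothetical helper: match the ERE pattern against text and fill
 * pm[0..NMATCH-1]. Returns the status from regcomp() or regexec().
 */
int capture(const char *pattern, const char *text, regmatch_t pm[NMATCH])
{
    regex_t re;
    int status = regcomp(&re, pattern, REG_EXTENDED);

    if (status != 0)
        return status;
    status = regexec(&re, text, NMATCH, pm, 0);
    regfree(&re);
    return status;
}

/* Print each reported substring; offsets of -1 mark unused elements. */
void print_captures(const char *text, const regmatch_t pm[NMATCH])
{
    for (int i = 0; i < NMATCH; i++) {
        if (pm[i].rm_so == -1) {
            printf("pmatch[%d]: unused\n", i);
            continue;
        }
        /* rm_eo - rm_so is the length of substring i in bytes. */
        printf("pmatch[%d]: \"%.*s\"\n", i,
               (int)(pm[i].rm_eo - pm[i].rm_so), text + pm[i].rm_so);
    }
}
```

       Because rm_eo is one past the last byte of the
       substring, the length is simply rm_eo minus rm_so and no
       copy or terminator is needed to print it.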

       When  matching  a  basic or extended regular expression,
       any given parenthesized subexpression of  pattern  might
       participate in the match of several different substrings
       of string, or it might  not  match  any  substring  even
       though  the  pattern as a whole did match. The following
       rules shall be used to  determine  which  substrings  to
       report in pmatch when matching regular expressions:

        1. If subexpression i in a regular expression is not
           contained within another subexpression, and it
           participated in the match several times, then the
           byte offsets in pmatch[i] shall delimit the last
           such match.


        2. If subexpression i is not contained within another
           subexpression, and it did not participate in an
           otherwise successful match, the byte offsets in
           pmatch[i] shall be -1. A subexpression does not
           participate in the match when:

              '*' or "\{\}" appears immediately after the
              subexpression in a basic regular expression, or
              '*', '?', or "{}" appears immediately after the
              subexpression in an extended regular expression,
              and the subexpression did not match (matched 0
              times)

           or:

              '|' is used in an extended regular expression to
              select this subexpression or another, and the
              other subexpression matched.


        3. If subexpression i is contained within another
           subexpression j, and i is not contained within any
           other subexpression that is contained within j, and
           a match of subexpression j is reported in pmatch[j],
           then the match or non-match of subexpression i
           reported in pmatch[i] shall be as described in 1.
           and 2. above, but within the substring reported in
           pmatch[j] rather than the whole string. The offsets
           in pmatch[i] are still relative to the start of
           string.


        4. If subexpression i is contained in subexpression j,
           and the byte offsets in pmatch[j] are -1, then the
           byte offsets in pmatch[i] shall also be -1.


        5. If subexpression i matched a zero-length string,
           then both byte offsets in pmatch[i] shall be the
           byte offset of the character or null terminator
           immediately following the zero-length string.
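
       The rules above can be explored with a small helper such
       as the one sketched below. The name match_offsets() is
       hypothetical, and the patterns named afterwards are
       illustrative choices, not part of the standard.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern as an ERE, match it against
 * text, and store up to nmatch offset pairs in pm. Returns the status
 * from regcomp() or regexec().
 */
int match_offsets(const char *pattern, const char *text,
                  size_t nmatch, regmatch_t *pm)
{
    regex_t re;
    int status = regcomp(&re, pattern, REG_EXTENDED);

    if (status != 0)
        return status;
    status = regexec(&re, text, nmatch, pm, 0);
    regfree(&re);
    return status;
}
```

       Under rule 1, matching "(a|b)*" against "ab" should
       leave pmatch[1] delimiting the final "b". Under rule 2,
       matching "(a)|(b)" against "b" should leave pmatch[1] at
       -1 because the other alternative matched. Under rule 5,
       matching "(x*)b" against "b" should report both offsets
       of pmatch[1] as 0.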


       If, when regexec() is called, the  locale  is  different
       from  when  the  regular  expression  was  compiled, the
       result is undefined.

       If REG_NEWLINE is not set in cflags, then a <newline> in
       pattern  or string shall be treated as an ordinary char-
       acter. If REG_NEWLINE is set, then  <newline>  shall  be
       treated as an ordinary character except as follows:

        1. A  <newline>  in  string  shall  not be matched by a
           period outside a bracket expression or by  any  form
           of  a  non-matching  list  (see the Base Definitions
           volume of IEEE Std 1003.1-2001, Chapter  9,  Regular
           Expressions).


        2. A  circumflex ( '^' ) in pattern, when used to spec-
           ify expression anchoring (see the  Base  Definitions
           volume  of  IEEE Std 1003.1-2001, Section 9.3.8, BRE
           Expression Anchoring), shall match  the  zero-length
           string  immediately  after  a  <newline>  in string,
           regardless of the setting of REG_NOTBOL.


        3. A dollar sign ( '$' ) in pattern, when used to spec-
           ify  expression  anchoring,  shall  match  the zero-
           length string  immediately  before  a  <newline>  in
           string, regardless of the setting of REG_NOTEOL.
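
       A brief sketch of the REG_NEWLINE behavior follows. The
       helper name search() is hypothetical; it compiles the
       pattern with caller-supplied cflags so that the same
       pattern and string can be tried with and without
       REG_NEWLINE.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: compile pattern with cflags, search text, and
 * store the offsets of the whole match in *pm. Returns the status
 * from regcomp() or regexec().
 */
int search(const char *pattern, const char *text, int cflags,
           regmatch_t *pm)
{
    regex_t re;
    int status = regcomp(&re, pattern, cflags);

    if (status != 0)
        return status;
    status = regexec(&re, text, (size_t)1, pm, 0);
    regfree(&re);
    return status;
}
```

       With the string "one\ntwo", the pattern "^two$" should
       match only when REG_NEWLINE is set, since the anchors
       then apply at the embedded <newline>. Conversely,
       "one.two" should match only when REG_NEWLINE is not set,
       since the period may then match the <newline>.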


       The  regfree()  function  frees  any memory allocated by
       regcomp() associated with preg.

       The following constants are defined as error return val-
       ues:

       REG_NOMATCH
              regexec() failed to match.

       REG_BADPAT
              Invalid regular expression.

       REG_ECOLLATE
              Invalid collating element referenced.

       REG_ECTYPE
              Invalid character class type referenced.

       REG_EESCAPE
              Trailing '\' in pattern.

       REG_ESUBREG
              Number in "\digit" invalid or in error.

       REG_EBRACK
              "[]" imbalance.

       REG_EPAREN
              "\(\)" or "()" imbalance.

       REG_EBRACE
              "\{\}" imbalance.

       REG_BADBR
              Content  of  "\{\}" invalid: not a number, number
              too large, more than two  numbers,  first  larger
              than second.

       REG_ERANGE
              Invalid endpoint in range expression.

       REG_ESPACE
              Out of memory.

       REG_BADRPT
              '?',  '*',  or  '+' not preceded by valid regular
              expression.


       The regerror() function provides a  mapping  from  error
       codes returned by regcomp() and regexec() to unspecified
       printable strings. It generates a  string  corresponding
       to the value of the errcode argument, which the applica-
       tion shall ensure is the last non-zero value returned by
       regcomp()  or regexec() with the given value of preg. If
       errcode is not such a value, the content of  the  gener-
       ated string is unspecified.

       If preg is a null pointer, but errcode is a value
       returned by a previous call to regexec() or regcomp(),
       regerror() still generates an error string corresponding
       to the value of errcode, but it might not be as detailed
       under some implementations.

       If  the  errbuf_size argument is not 0, regerror() shall
       place the generated  string  into  the  buffer  of  size
       errbuf_size  bytes  pointed  to by errbuf. If the string
       (including the  terminating  null)  cannot  fit  in  the
       buffer,  regerror()  shall truncate the string and null-
       terminate the result.

       If errbuf_size is 0, regerror() shall ignore the  errbuf
       argument,  and  return  the size of the buffer needed to
       hold the generated string.

       If the preg argument to regexec() or regfree() is not  a
       compiled  regular  expression returned by regcomp(), the
       result is undefined. A preg is no longer  treated  as  a
       compiled   regular  expression  after  it  is  given  to
       regfree().

RETURN VALUE
       Upon successful completion, the regcomp() function shall
       return  0.  Otherwise,  it shall return an integer value
       indicating an error as described in <regex.h>,  and  the
       content of preg is undefined. If a code is returned, the
       interpretation shall be as given in <regex.h>.

       If regcomp()  detects  an  invalid  RE,  it  may  return
       REG_BADPAT, or it may return one of the error codes that
       more precisely describes the error.

       Upon successful completion, the regexec() function shall
       return  0.  Otherwise,  it  shall  return REG_NOMATCH to
       indicate no match.

       Upon  successful  completion,  the  regerror()  function
       shall  return  the  number  of  bytes needed to hold the
       entire generated string, including the null termination.
       If  the  return  value  is greater than errbuf_size, the
       string returned in the buffer pointed to by  errbuf  has
       been truncated.
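
       The truncation check described above might be wrapped as
       in this sketch. The name error_text() is hypothetical;
       it relies only on the documented return value of
       regerror().

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: write the message for errcode into buf (of
 * size bytes) and return 1 if the message did not fit in full,
 * 0 otherwise.
 */
int error_text(int errcode, const regex_t *preg, char *buf, size_t size)
{
    /* regerror() returns the size needed, including the null. */
    size_t need = regerror(errcode, preg, buf, size);

    return need > size;
}
```

       A caller that sees a return of 1 can either accept the
       shortened message or retry with a buffer of at least the
       size reported by regerror().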

       The regfree() function shall not return a value.

ERRORS
       No errors are defined.

       The following sections are informative.

EXAMPLES
              #include <regex.h>
              #include <stddef.h>     /* NULL */

              /*
               * Match string against the extended regular expression in
               * pattern, treating errors as no match.
               *
               * Return 1 for match, 0 for no match.
               */
              int
              match(const char *string, const char *pattern)
              {
                  int        status;
                  regex_t    re;

                  if (regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB) != 0) {
                      return(0);      /* Compile error: report no match. */
                  }
                  status = regexec(&re, string, (size_t) 0, NULL, 0);
                  regfree(&re);
                  if (status != 0) {
                      return(0);      /* No match, or a regexec() error. */
                  }
                  return(1);
              }

       The following demonstrates how the REG_NOTBOL flag could
       be used with regexec() to find all substrings in a  line
       that match a pattern supplied by a user. (For simplicity
       of the example, very little error checking is done.)


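              /* Declared elsewhere: regex_t re; regmatch_t pm; int error; */
              char *p = &buffer[0];   /* Start of the unsearched text. */

              (void) regcomp (&re, pattern, 0);
              /* This call to regexec() finds the first match on the line. */
              error = regexec (&re, p, 1, &pm, 0);
              while (error == 0) {  /* While matches found. */
                  /* Substring found between p + pm.rm_so and p + pm.rm_eo. */
                  if (pm.rm_eo == pm.rm_so && p[pm.rm_eo] == '\0')
                      break;        /* Empty match at end of line. */
                  /* Offsets are relative to p; skip a byte after an empty match. */
                  p += pm.rm_eo + (pm.rm_eo == pm.rm_so);
                  /* This call to regexec() finds the next match. */
                  error = regexec (&re, p, 1, &pm, REG_NOTBOL);
              }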
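
       A self-contained variant of this loop is sketched below
       as a function that counts matches. The name
       count_matches() is hypothetical. The sketch assumes
       single-byte characters when it steps past an empty
       match, and it tracks the search position with a pointer
       because the offsets returned by regexec() are relative
       to the string passed in that call.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: count the non-overlapping matches of the BRE
 * pattern in line. Returns -1 if the pattern does not compile.
 */
int count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t pm;
    const char *p = line;   /* start of the unsearched remainder */
    int eflags = 0;
    int count = 0;

    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    while (regexec(&re, p, (size_t)1, &pm, eflags) == 0) {
        count++;
        /* pm offsets are relative to p, not to line. */
        if (pm.rm_eo == pm.rm_so) {
            if (p[pm.rm_eo] == '\0')
                break;          /* empty match at end of line */
            p += pm.rm_eo + 1;  /* step one byte to make progress */
        } else {
            p += pm.rm_eo;
        }
        eflags = REG_NOTBOL;    /* p no longer starts the line */
    }
    regfree(&re);
    return count;
}
```

       Setting REG_NOTBOL on every call after the first keeps a
       pattern such as "^a" from matching again in the middle
       of the line.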

APPLICATION USAGE
       An application could use:


              regerror(code,preg,(char *)NULL,(size_t)0)

       to find out how big a buffer is needed for the generated
       string,  malloc()  a buffer to hold the string, and then
       call regerror() again to get the string.  Alternatively,
       it  could  allocate  a  fixed, static buffer that is big
       enough to hold most strings, and then  use  malloc()  to
       allocate  a  larger  buffer if it finds that this is too
       small.
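
       The first approach, sizing the buffer with an initial
       call, might look like the following sketch. The name
       regerror_alloc() is hypothetical, and the caller is
       assumed to free() the returned string.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc()ed copy of the message for
 * errcode, or NULL if allocation fails. The caller frees the result.
 */
char *regerror_alloc(int errcode, const regex_t *preg)
{
    /* With a size of 0, regerror() only reports the size needed. */
    size_t need = regerror(errcode, preg, (char *)NULL, (size_t)0);
    char *buf = malloc(need);

    if (buf != NULL)
        (void)regerror(errcode, preg, buf, need);
    return buf;
}
```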

       To match a pattern as described in the Shell and  Utili-
       ties  volume of IEEE Std 1003.1-2001, Section 2.13, Pat-
       tern Matching Notation, use the fnmatch() function.

RATIONALE
       The regexec() function must fill in all nmatch  elements
       of  pmatch,  where nmatch and pmatch are supplied by the
       application, even if some elements of pmatch do not cor-
       respond  to  subexpressions  in pattern. The application
       writer should note that there is probably no reason  for
       using a value of nmatch that is larger than
       preg->re_nsub+1.
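
       An application that does not know the pattern in advance
       can size pmatch from the compiled expression, as in this
       sketch. The name alloc_pmatch() is hypothetical.

```c
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a pmatch array with one element for
 * the whole match plus one per subexpression of the compiled preg.
 * Stores the element count in *nmatch; returns NULL if malloc() fails.
 */
regmatch_t *alloc_pmatch(const regex_t *preg, size_t *nmatch)
{
    *nmatch = preg->re_nsub + 1;
    return malloc(*nmatch * sizeof(regmatch_t));
}
```

       The returned count can be passed directly as the nmatch
       argument to regexec(), and the array is released with
       free() when it is no longer needed.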

       The REG_NEWLINE flag supports a use of RE matching  that
       is  needed  in  some  applications like text editors. In
       such applications, the user supplies an  RE  asking  the
       application  to  find  a  line  that  matches  the given
       expression. An anchor in  such  an  RE  anchors  at  the
       beginning  or  end  of any line. Such an application can
       pass  a  sequence  of   <newline>-separated   lines   to
       regexec()  as  a single long string and specify REG_NEW-
       LINE to regcomp()  to  get  the  desired  behavior.  The
       application must ensure that there are no explicit <new-
       line>s in pattern if it wants to ensure that  any  match
       occurs entirely within a single line.

       The  REG_NEWLINE flag affects the behavior of regexec(),
       but it is in the cflags parameter to regcomp() to  allow
       flexibility of implementation. Some implementations will
       want to generate  the  same  compiled  RE  in  regcomp()
       regardless  of  the  setting  of  REG_NEWLINE  and  have
       regexec() handle anchors differently based on  the  set-
       ting  of  the  flag. Other implementations will generate
       different compiled REs based on the REG_NEWLINE.

       The REG_ICASE flag supports the operations taken by  the
       grep  -i option and the historical implementations of ex
       and vi.  Including this flag will  make  it  easier  for
       application  code to be written that does the same thing
       as these utilities.

       The substrings reported in pmatch[]  are  defined  using
       offsets  from the start of the string rather than point-
       ers. Since this is a new interface, there should  be  no
       impact  on  historical  implementations or applications,
       and offsets should be just as easy to use  as  pointers.
       The  change  to  offsets  was  made to facilitate future
       extensions in which the string to be  searched  is  pre-
       sented  to  regexec() in blocks, allowing a string to be
       searched that is not all in memory at once.

       The type regoff_t is used for the elements  of  pmatch[]
       to  ensure that the application can represent either the
       largest possible  array  in  memory  (important  for  an
       application conforming to the Shell and Utilities volume
       of IEEE Std 1003.1-2001) or the  largest  possible  file
       (important  for an application using the extension where
       a file is searched in chunks).

       The standard developers rejected the inclusion of a reg-
       sub()  function  that  would be used to do substitutions
       for a matched RE. While such a routine would  be  useful
       to  some  applications,  its  utility would be much more
       limited than the matching function described here.  Both
       RE  parsing  and  substitution are possible to implement
       without support other than that required  by  the  ISO C
       standard, but matching is much more complex than substi-
       tuting.  The only difficult part of substitution,  given
       the  information  supplied  by regexec(), is finding the
       next character in a string when there can be  multi-byte
       characters.  That  is  a much larger issue, and one that
       needs a more general solution.

       The errno variable has not been used for  error  returns
       to  avoid filling the errno name space for this feature.

       The interface is defined so that the matched  substrings
       rm_so  and  rm_eo are in a separate regmatch_t structure
       instead of in regex_t. This allows a single compiled  RE
       to be used simultaneously in several contexts; in main()
       and a signal handler, perhaps, or in multiple threads of
       lightweight  processes.  (The preg argument to regexec()
       is declared with type const, so  the  implementation  is
       not permitted to use the structure to store intermediate
       results.) It also allows an application  to  request  an
       arbitrary number of substrings from an RE. The number of
       subexpressions in the RE is reported in re_nsub in preg.
       With  this  change to regexec(), consideration was given
       to dropping the REG_NOSUB flag since the  user  can  now
       specify  this  with a zero nmatch argument to regexec().
       However, keeping REG_NOSUB allows an  implementation  to
       use a different (perhaps more efficient) algorithm if it
       knows  in  regcomp()  that  no  subexpressions  need  be
       reported. The implementation is only required to fill in
       pmatch if nmatch is not zero and  if  REG_NOSUB  is  not
       specified.  Note that the size_t type, as defined in the
       ISO C standard,  is  unsigned,  so  the  description  of
       regexec()  does  not  need to address negative values of
       nmatch.

       REG_NOTBOL was added  to  allow  an  application  to  do
       repeated searches for the same pattern in a line. If the
       pattern contains  a  circumflex  character  that  should
       match  the  beginning of a line, then the pattern should
       only match when matched against  the  beginning  of  the
       line. Without the REG_NOTBOL flag, the application could
       rewrite the expression for subsequent  matches,  but  in
       the  general case this would require parsing the expres-
       sion. The need for REG_NOTEOL is not as  clear;  it  was
       added for symmetry.

       The  addition  of  the regerror() function addresses the
       historical need for conforming application  programs  to
       have  access  to  error  information more than "Function
       failed to compile/match your RE for unknown reasons".

       This interface provides for  two  different  methods  of
       dealing  with error conditions. The specific error codes
       (REG_EBRACE, for example), defined in  <regex.h>,  allow
       an  application  to  recover  from  an error if it is so
       able. Many applications, especially those that use  pat-
       terns supplied by a user, will not try to deal with spe-
       cific error cases,  but  will  just  use  regerror()  to
       obtain  a human-readable error message to present to the
       user.

       The regerror() function uses a scheme similar  to  conf-
       str()  to  deal with the problem of allocating memory to
       hold the generated string. The scheme used by strerror()
       in  the ISO C standard was considered unacceptable since
       it creates difficulties for multi-threaded applications.

       The  preg argument is provided to regerror() to allow an
       implementation to generate a  more  descriptive  message
       than  would be possible with errcode alone. An implemen-
       tation might, for example, save the character offset  of
       the  offending  character  of  the pattern in a field of
       preg, and then include that  in  the  generated  message
       string. The implementation may also ignore preg.

       A  REG_FILENAME  flag  was considered, but omitted. This
       flag caused regexec() to match patterns as described  in
       the  Shell and Utilities volume of IEEE Std 1003.1-2001,
       Section 2.13, Pattern Matching Notation instead of  REs.
       This  service is now provided by the fnmatch() function.

       Notice that there is a difference in philosophy  between
       the  ISO POSIX-2:1993  standard and IEEE Std 1003.1-2001
       in  how  to  handle  a  "bad"  regular  expression.  The
       ISO POSIX-2:1993  standard says that many bad constructs
       "produce undefined results", or that "the interpretation
       is  undefined". IEEE Std 1003.1-2001, however, says that
       the interpretation of such REs is unspecified. The  term
       "undefined"  means that the action by the application is
       an error, of similar severity to passing a  bad  pointer
       to a function.

       The  regcomp()  and  regexec() functions are required to
       accept any null-terminated string as the  pattern  argu-
       ment.  If  the meaning of the string is "undefined", the
       behavior   of    the    function    is    "unspecified".
       IEEE Std 1003.1-2001  does not specify how the functions
       will interpret the  pattern;  they  might  return  error
       codes,  or  they  might do pattern matching in some com-
       pletely unexpected way, but they should not do something
       like abort the process.

FUTURE DIRECTIONS
       None.

SEE ALSO
       fnmatch(),   glob(),   Shell  and  Utilities  volume  of
       IEEE Std 1003.1-2001,  Section  2.13,  Pattern  Matching
       Notation,      Base      Definitions      volume      of
       IEEE Std 1003.1-2001, Chapter  9,  Regular  Expressions,
       <regex.h>, <sys/types.h>

COPYRIGHT
       Portions  of  this  text are reprinted and reproduced in
       electronic form from  IEEE  Std  1003.1,  2003  Edition,
       Standard  for Information Technology -- Portable Operat-
       ing System Interface (POSIX), The Open Group Base Speci-
       fications Issue 6, Copyright (C) 2001-2003 by the Insti-
       tute of Electrical and Electronics  Engineers,  Inc  and
       The  Open Group. In the event of any discrepancy between
       this version and the original IEEE and  The  Open  Group
       Standard,  the original IEEE and The Open Group Standard
       is the referee document. The original  Standard  can  be
       obtained online at
       http://www.opengroup.org/unix/online.html .



IEEE/The Open Group           2003                  REGCOMP(3P)
