REGEX(7)           Linux Programmer's Manual           REGEX(7)



NAME
       regex - POSIX.2 regular expressions

DESCRIPTION
       Regular expressions ("RE"s), as defined in POSIX.2, come
       in two  forms:  modern  REs  (roughly  those  of  egrep;
       POSIX.2  calls  these  "extended"  REs) and obsolete REs
       (roughly those of ed(1); POSIX.2 "basic" REs).  Obsolete
       REs  mostly exist for backward compatibility in some old
       programs; they will be discussed at  the  end.   POSIX.2
       leaves  some  aspects  of  RE syntax and semantics open;
       "(!)" marks decisions on these aspects that may  not  be
       fully portable to other POSIX.2 implementations.

       A  (modern)  RE is one(!) or more non-empty(!) branches,
       separated by '|'.  It matches anything that matches  one
       of the branches.
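
       As a minimal sketch of how branches behave, the following
       uses regcomp(3) and regexec(3) with REG_EXTENDED.  The
       helper name ere_matches is hypothetical and exists only
       for illustration; it is not part of any library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): returns 1 if the extended
 * RE `pattern` matches somewhere in `text`, 0 if it does not, and -1
 * if the pattern fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_EXTENDED selects modern REs; REG_NOSUB skips offset reporting. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With this helper, ere_matches("cat|dog", "hotdog") returns
       1 because the second branch matches, while
       ere_matches("cat|dog", "bird") returns 0.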

       A  branch  is  one(!)  or more pieces, concatenated.  It
       matches a match for the first, followed by a  match  for
       the second, etc.

       A piece is an atom possibly followed by a single(!) '*',
       '+', '?', or bound.  An atom followed by '*'  matches  a
       sequence of 0 or more matches of the atom.  An atom fol-
       lowed by '+' matches a sequence of 1 or more matches  of
       the atom.  An atom followed by '?' matches a sequence of
       0 or 1 matches of the atom.

       A bound is '{' followed by an unsigned decimal  integer,
       possibly  followed  by  ',' possibly followed by another
       unsigned decimal integer, always followed by  '}'.   The
       integers  must  lie  between  0  and RE_DUP_MAX (255(!))
       inclusive, and if there are two of them, the  first  may
       not exceed the second.  An atom followed by a bound con-
       taining one integer i and no comma matches a sequence of
       exactly  i  matches  of the atom.  An atom followed by a
       bound containing one integer i and  a  comma  matches  a
       sequence of i or more matches of the atom.  An atom fol-
       lowed by a bound containing two integers i and j matches
       a  sequence  of  i  through j (inclusive) matches of the
       atom.
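
       The three forms of bound can be tried out with a sketch
       like the one below.  The helper bounded_match is a
       hypothetical name; it wraps the atom in "()" and anchors
       the RE so that the whole string must consist of the
       repeated atom.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (illustration only): builds the extended RE
 * "^(atom){bound}$" and returns 1 if it matches `text`, 0 if it does
 * not, and -1 if the RE is too long or fails to compile. */
int bounded_match(const char *atom, const char *bound, const char *text)
{
    char pattern[256];
    regex_t re;
    int n, rc;

    assert(atom != NULL && bound != NULL && text != NULL);

    n = snprintf(pattern, sizeof pattern, "^(%s){%s}$", atom, bound);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       For instance, bounded_match("ab", "2", "abab") returns 1,
       bounded_match("ab", "2,", "ababab") returns 1, and
       bounded_match("ab", "1,2", "ababab") returns 0.  A bound
       whose first integer exceeds the second, such as "3,1",
       should be rejected when the RE is compiled.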

       An atom is a regular expression enclosed in "()" (match-
       ing a match for the regular expression), an empty set of
       "()" (matching the null string)(!), a bracket expression
       (see  below),  '.'  (matching any single character), '^'
       (matching the null string at the beginning of  a  line),
       '$'  (matching  the null string at the end of a line), a
       '\' followed by one  of  the  characters  "^.[$()|*+?{\"
       (matching  that  character  taken as an ordinary charac-
       ter), a '\' followed by any other character(!)   (match-
       ing that character taken as an ordinary character, as if
       the '\' had not been present(!)), or a single  character
       with no other significance (matching that character).  A
       '{' followed by a character other than  a  digit  is  an
       ordinary character, not the beginning of a bound(!).  It
       is illegal to end an RE with '\'.
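
       The following sketch shows a few of these atoms at work:
       '.', the anchors, and a '\' escape.  The helper ere_find
       is a hypothetical name, and the return values -1 and -2
       are conventions chosen for this example only.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): returns the byte offset at
 * which the extended RE `pattern` first matches in `text`, -1 if there
 * is no match, and -2 if the pattern fails to compile. */
int ere_find(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;

    /* Ask for one regmatch_t: the span of the whole match. */
    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    return rc == 0 ? (int)m.rm_so : -1;
}
```

       Note that a '\' in the RE must be doubled inside a C string
       literal.  ere_find("a.c", "xabcx") returns 1, whereas
       ere_find("a\\.c", "xabcx") returns -1 because the escaped
       '.' matches only a literal period.  ere_find("^abc", "xabc")
       returns -1, and ere_find("abc$", "xabc") returns 1.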

       A bracket expression is a list of characters enclosed in
       "[]".  It normally matches any single character from the
       list (but see below).  If the list begins with  '^',  it
       matches  any  single  character (but see below) not from
       the rest of the list.  If two characters in the list are
       separated by '-', this is shorthand for the full range of
       characters between those two (inclusive) in the collating
       sequence; for example, "[0-9]" in ASCII matches any decimal
       digit.  It is illegal(!) for two ranges to share an
       endpoint, as in "a-c-e".  Ranges are highly
       collating-sequence-dependent, and portable programs should
       avoid relying on them.
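
       A short sketch of bracket expressions, assuming the default
       "C" locale so that "[0-9]" covers exactly the decimal
       digits.  The helper count_matches is a hypothetical name;
       it counts successive non-empty matches in a string.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): counts successive non-empty
 * matches of the extended RE `pattern` in `text`.  Returns -1 if the
 * pattern fails to compile.  Intended for simple patterns such as a
 * single bracket expression. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, text, 1, &m, 0) == 0) {
        if (m.rm_eo == m.rm_so)
            break;              /* empty match: stop rather than loop forever */
        count++;
        text += m.rm_eo;        /* resume just past the previous match */
    }

    regfree(&re);
    return count;
}
```

       Here count_matches("[0-9]", "a1b22c") and
       count_matches("[^0-9]", "a1b22c") both return 3, since the
       string holds three digits and three non-digits.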

       To  include a literal ']' in the list, make it the first
       character (following a possible '^').  To include a lit-
       eral  '-',  make  it the first or last character, or the
       second endpoint of a range.  To use a literal '-' as the
       first  endpoint  of a range, enclose it in "[." and ".]"
       to make it a collating element (see  below).   With  the
       exception  of these and some combinations using '[' (see
       next paragraphs), all other special characters,  includ-
       ing  '\',  lose  their  special  significance  within  a
       bracket expression.
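
       The placement rules for ']' and '-', and the loss of
       special meaning for '\', can be checked with a sketch such
       as this one.  The helper bracket_has is a hypothetical
       name; it takes the list that goes between "[" and "]" and
       tests a single character against it.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (illustration only): returns 1 if the bracket
 * expression "[list]" matches the single character `c`, 0 if it does
 * not, and -1 if the RE is too long or fails to compile. */
int bracket_has(const char *list, char c)
{
    char pattern[128];
    char text[2];
    regex_t re;
    int n, rc;

    assert(list != NULL && c != '\0');

    n = snprintf(pattern, sizeof pattern, "^[%s]$", list);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return -1;

    text[0] = c;
    text[1] = '\0';

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       With the list "]a-" (']' first, '-' last), bracket_has
       returns 1 for ']', 'a', and '-', and 0 for 'b'.  With the
       list "\\n" in C source, that is, the two characters '\' and
       'n', it returns 1 for '\' and for 'n' but 0 for a newline,
       since '\' is not an escape inside a bracket expression.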

       Within a bracket  expression,  a  collating  element  (a
       character,  a  multi-character sequence that collates as
       if it were a single character, or  a  collating-sequence
       name  for  either)  enclosed in "[." and ".]" stands for
       the sequence of characters of  that  collating  element.
       The sequence is a single element of the bracket
       expression's list.  A bracket expression containing a
       multi-character collating element can thus match more than
       one character; for example, if the collating sequence
       includes a "ch" collating element, then the RE
       "[[.ch.]]*c" matches the first five characters of
       "chchcc".

       Within   a   bracket  expression,  a  collating  element
       enclosed in "[="  and  "=]"  is  an  equivalence  class,
       standing  for the sequences of characters of all collat-
       ing elements equivalent to that one,  including  itself.
       (If  there  are  no other equivalent collating elements,
       the treatment is as if  the  enclosing  delimiters  were
       "[." and ".]".)  For example, if o and ^ are the members
       of an equivalence class, then "[[=o=]]", "[[=^=]]",  and
       "[o^]"  are  all  synonymous.   An equivalence class may
       not(!) be an endpoint of a range.

       Within a bracket expression, the  name  of  a  character
       class  enclosed  in "[:" and ":]" stands for the list of
       all characters belonging to that class.  Standard  char-
       acter class names are:

              alnum       digit       punct
              alpha       graph       space
              blank       lower       upper
              cntrl       print       xdigit

       These   stand  for  the  character  classes  defined  in
       wctype(3).  A locale may provide  others.   A  character
       class may not be used as an endpoint of a range.
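
       As an illustration of character classes, the sketch below
       checks whether a string looks like a C-style identifier.
       The function name is_identifier and the choice of RE are
       assumptions made for this example; the result depends on
       the current locale's definition of the classes.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): returns 1 if `s` consists of
 * a letter or underscore followed by letters, digits, or underscores,
 * 0 if it does not, and -1 if the RE fails to compile. */
int is_identifier(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);

    /* Character classes appear inside a bracket expression, hence
     * the doubled brackets. */
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       In the "C" locale, is_identifier("foo_1") returns 1 and
       is_identifier("1foo") returns 0.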

       In  the  event that an RE could match more than one sub-
       string of a given string, the RE matches the one  start-
       ing  earliest in the string.  If the RE could match more
       than one substring starting at that  point,  it  matches
       the longest.  Subexpressions also match the longest pos-
       sible substrings, subject to  the  constraint  that  the
       whole  match be as long as possible, with subexpressions
       starting earlier in the RE  taking  priority  over  ones
       starting  later.   Note that higher-level subexpressions
       thus take  priority  over  their  lower-level  component
       subexpressions.

       Match lengths are measured in characters, not collating
       elements.  A null string is considered longer than no
       match at all.  For example: "bb*" matches the three middle
       characters of "abbbc"; "(wee|week)(knights|nights)"
       matches all ten characters of "weeknights"; when "(.*).*"
       is matched against "abc", the parenthesized subexpression
       matches all three characters; and when "(a*)*" is matched
       against "bc", both the whole RE and the parenthesized
       subexpression match the null string.
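
       The earliest-then-longest rule for the whole match can be
       observed through the offsets that regexec(3) reports.  The
       helper match_span below is a hypothetical name; it reports
       only the span of the whole match, not of subexpressions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): on a match of the extended
 * RE `pattern` in `text`, stores the start and end byte offsets of the
 * whole match and returns 1.  Returns 0 on no match and -1 if the
 * pattern fails to compile. */
int match_span(const char *pattern, const char *text, int *start, int *end)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int)m.rm_so;      /* offset of the first matched byte */
    *end = (int)m.rm_eo;        /* offset just past the last matched byte */
    return 1;
}
```

       For "bb*" against "abbbc" this yields the span 1 to 4, and
       for "a|ab" against "xab" it yields 1 to 3: the longer
       branch wins even though the shorter one is listed first.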

       If case-independent matching is specified, the effect is
       much as if all case distinctions had vanished  from  the
       alphabet.  When an alphabetic character that exists in
       multiple cases appears as an ordinary character outside a
       bracket expression, it is effectively transformed into a
       bracket expression containing both cases; for example, 'x'
       becomes "[xX]".  When it appears inside a bracket
       expression, all case counterparts of it are added to the
       bracket expression, so that, for example, "[x]" becomes
       "[xX]" and "[^x]" becomes "[^xX]".

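       The effect of case-independent matching described above can
       be seen with the REG_ICASE flag of regcomp(3).  The helper
       matches_with_case is a hypothetical name used only in this
       sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): matches the extended RE
 * `pattern` against `text`, ignoring case when `ignore_case` is
 * non-zero.  Returns 1 on a match, 0 on no match, and -1 if the
 * pattern fails to compile. */
int matches_with_case(const char *pattern, const char *text, int ignore_case)
{
    regex_t re;
    int flags = REG_EXTENDED | REG_NOSUB;
    int rc;

    assert(pattern != NULL && text != NULL);

    if (ignore_case)
        flags |= REG_ICASE;

    if (regcomp(&re, pattern, flags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       matches_with_case("^x$", "X", 0) returns 0, but with the
       last argument set to 1 it returns 1.  Likewise,
       matches_with_case("^[^x]$", "X", 1) returns 0, because
       "[^x]" then behaves as "[^xX]".
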
       No particular limit is imposed on the length of  REs(!).
       Programs  intended  to be portable should not employ REs
       longer than 256 bytes, as an implementation  can  refuse
       to accept such REs and remain POSIX-compliant.

       Obsolete ("basic") regular expressions differ in several
       respects.  '|', '+', and '?' are ordinary characters, and
       there is no equivalent for their functionality.  The
       delimiters for bounds are "\{" and "\}", with '{' and '}'
       by themselves being ordinary characters.  The parentheses
       for nested subexpressions are "\(" and "\)", with '(' and
       ')' by themselves being ordinary characters.  '^' is an
       ordinary character except at the beginning of the RE or(!)
       the beginning of a parenthesized subexpression; '$' is an
       ordinary character except at the end of the RE or(!) the
       end of a parenthesized subexpression; and '*' is an
       ordinary character if it appears at the beginning of the
       RE or the beginning of a parenthesized subexpression
       (after a possible leading '^').
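
       To compare the two kinds of RE side by side, the sketch
       below compiles the same pattern either as a basic RE (no
       REG_EXTENDED flag) or as an extended one.  The helper
       re_matches is a hypothetical name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): matches `pattern` against
 * `text` as an extended RE when `extended` is non-zero, and as a basic
 * RE otherwise.  Returns 1 on a match, 0 on no match, and -1 if the
 * pattern fails to compile. */
int re_matches(const char *pattern, const char *text, int extended)
{
    regex_t re;
    int flags = REG_NOSUB;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Without REG_EXTENDED, regcomp() uses the basic syntax. */
    if (extended)
        flags |= REG_EXTENDED;

    if (regcomp(&re, pattern, flags) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

       As a basic RE, "^a+b$" matches the literal string "a+b" and
       not "aab"; as an extended RE it is the other way around.
       Similarly, the basic RE "^a\{2\}$" matches "aa", and the
       basic RE "^(a)$" matches the three-character string "(a)".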

       Finally, there is one new type of atom,  a  back  refer-
       ence: '\' followed by a non-zero decimal digit d matches
       the same sequence  of  characters  matched  by  the  dth
       parenthesized subexpression (numbering subexpressions by
       the positions of  their  opening  parentheses,  left  to
       right),  so that, for example, "\([bc]\)\1" matches "bb"
       or "cc" but not "bc".
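
       The back reference example above can be reproduced with a
       sketch like the following, which compiles the pattern as a
       basic RE.  The helper bre_find is a hypothetical name, and
       its return conventions are chosen for this example only.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): returns the byte offset at
 * which the basic RE `pattern` first matches in `text`, -1 if there is
 * no match, and -2 if the pattern fails to compile. */
int bre_find(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Flags of 0 select the basic syntax, where "\(" and "\)" group
     * and "\1" refers back to the first group. */
    if (regcomp(&re, pattern, 0) != 0)
        return -2;

    rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    return rc == 0 ? (int)m.rm_so : -1;
}
```

       With the pattern written in C source as "\\([bc]\\)\\1",
       bre_find returns 0 for "bb" and for "cc", -1 for "bc", and
       2 for "abcc".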

BUGS
       Having two kinds of REs is a botch.

       The current POSIX.2 spec says that ')'  is  an  ordinary
       character  in  the absence of an unmatched '('; this was
       an unintentional result of a wording error,  and  change
       is likely.  Avoid relying on it.

       Back references are a dreadful botch, posing major prob-
       lems for efficient implementations.  They are also some-
       what   vaguely  defined  (does  "a\(\(b\)*\2\)*d"  match
       "abbbd"?).  Avoid using them.

       POSIX.2's specification of case-independent matching is
       vague.  The "one case implies all cases" definition given
       above is the current consensus among implementors as to
       the right interpretation.

SEE ALSO
       grep(1), regex(3)

       POSIX.2, section 2.8 (Regular Expression Notation).

COLOPHON
       This page is part of release 3.01 of the Linux man-pages
       project.  A description of the project, and information
       about reporting bugs, can be found at
       http://www.kernel.org/doc/man-pages/.



                           2007-12-12                  REGEX(7)
