PCREPARTIAL(3)                                                  PCREPARTIAL(3)



NAME
       PCRE - Perl-compatible regular expressions

PARTIAL MATCHING IN PCRE

       In  normal  use  of  PCRE, if the subject string that is
       passed to pcre_exec() or pcre_dfa_exec() matches as  far
       as  it  goes,  but is too short to match the entire pat-
       tern, PCRE_ERROR_NOMATCH is returned. There are  circum-
       stances  where  it  might be helpful to distinguish this
       case from other cases in which there is no match.

       Consider, for example, an application where a  human  is
       required  to type in data for a field with specific for-
       matting requirements. An example might be a date in  the
       form ddmmmyy, defined by this pattern:

         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$

       If the application sees the  user's  keystrokes  one  by
       one,  and  can  check that what has been typed so far is
       potentially valid, it is able to raise an error as  soon
       as  a mistake is made, possibly beeping and not reflect-
       ing the character that has been  typed.  This  immediate
       feedback  is likely to be a better user interface than a
       check that is delayed until the entire string  has  been
       entered.

       PCRE  supports  the concept of partial matching by means
       of the PCRE_PARTIAL option, which can be set when  call-
       ing  pcre_exec()  or  pcre_dfa_exec(). When this flag is
       set for pcre_exec(), the return code  PCRE_ERROR_NOMATCH
       is converted into PCRE_ERROR_PARTIAL if at any time dur-
       ing the matching process the last part  of  the  subject
       string  matched  part of the pattern. Unfortunately, for
       non-anchored matching, it is not possible to obtain  the
       position  of the start of the partial match. No captured
       data is set when PCRE_ERROR_PARTIAL is returned.

       When PCRE_PARTIAL is set for pcre_dfa_exec(), the return
       code     PCRE_ERROR_NOMATCH     is     converted    into
       PCRE_ERROR_PARTIAL if the end of the subject is reached,
       there  have been no complete matches, but there is still
       at least one matching possibility. The  portion  of  the
       string  that  provided  the  partial match is set as the
       first matching string.

       Using PCRE_PARTIAL disables one of PCRE's optimizations.
       PCRE  remembers  the last literal byte in a pattern, and
       abandons matching immediately if  such  a  byte  is  not
       present  in the subject string. This optimization cannot
       be used for a subject string that might match only  par-
       tially.

RESTRICTED PATTERNS FOR PCRE_PARTIAL

       Because  of  the  way certain internal optimizations are
       implemented in the pcre_exec() function,  the  PCRE_PAR-
       TIAL  option  cannot  be  used  with all patterns. These
       restrictions do not apply when pcre_dfa_exec() is  used.
       For pcre_exec(), repeated single characters such as

         a{2,4}

       and repeated single metasequences such as

         \d+

       are  not  permitted if the maximum number of occurrences
       is greater than one.  Optional items such as \d?  (where
       the maximum is one) are permitted.  Quantifiers with any
       values are permitted after parentheses, so  the  invalid
       examples above can be coded thus:

         (a){2,4}
         (\d)+

       These  constructions  run more slowly, but for the kinds
       of application that are  envisaged  for  this  facility,
       this is not felt to be a major restriction.

       If  PCRE_PARTIAL is set for a pattern that does not con-
       form to the restrictions, pcre_exec() returns the  error
       code PCRE_ERROR_BADPARTIAL (-13).

EXAMPLE OF PARTIAL MATCHING USING PCRETEST

       If  the escape sequence \P is present in a pcretest data
       line, the PCRE_PARTIAL flag is used for the match.  Here
       is  a  run of pcretest that uses the date example quoted
       above:

           re>
       /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 25jun04\P
          0: 25jun04
          1: jun
         data> 25dec3\P
         Partial match
         data> 3ju\P
         Partial match
         data> 3juj\P
         No match
         data> j\P
         No match

       The first data string is matched completely, so pcretest
       shows the matched substrings. The remaining four strings
       do not match the complete pattern, but the first two are
       partial  matches.  The  same test, using pcre_dfa_exec()
       matching (by means of the \D escape sequence),  produces
       the following output:

           re>
       /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/
         data> 25jun04\P\D
          0: 25jun04
         data> 23dec3\P\D
         Partial match: 23dec3
         data> 3ju\P\D
         Partial match: 3ju
         data> 3juj\P\D
         No match
         data> j\P\D
         No match

       Notice that in this case the portion of the string  that
       was matched is made available.

MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()

       When    a   partial   match   has   been   found   using
       pcre_dfa_exec(), it is possible to continue the match by
       providing    additional   subject   data   and   calling
       pcre_dfa_exec() again with  the  same  compiled  regular
       expression,   this  time  setting  the  PCRE_DFA_RESTART
       option. You must also pass the  same  working  space  as
       before,  because  this  is where details of the previous
       partial match are  stored.  Here  is  an  example  using
       pcretest,  using  the  \R  escape  sequence  to  set the
       PCRE_DFA_RESTART option (\P and \D are as above):

           re>
       /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       The  first  call has "23ja" as the subject, and requests
       partial matching; the second call has "n05" as the  sub-
       ject  for  the continued (restarted) match.  Notice that
       when the match is complete, only the last part is shown;
       PCRE  does  not  retain the previously partially-matched
       string. It is up to the calling program to do that if it
       needs to.

       You  can  set PCRE_PARTIAL with PCRE_DFA_RESTART to con-
       tinue partial  matching  over  multiple  segments.  This
       facility  can  be used to pass very long subject strings
       to pcre_dfa_exec(). However, some  care  is  needed  for
       certain types of pattern.

       1.  If  the  pattern contains tests for the beginning or
       end of a line, you  need  to  pass  the  PCRE_NOTBOL  or
       PCRE_NOTEOL  options,  as  appropriate, when the subject
       string for any call does not contain  the  beginning  or
       end of a line.

       2.  If the pattern contains backward assertions (includ-
       ing \b or \B), you need to arrange for some  overlap  in
       the  subject strings to allow for this. For example, you
       could pass the subject in  chunks  that  are  500  bytes
       long,  but  in  a buffer of 700 bytes, with the starting
       offset set to 200 and the  previous  200  bytes  at  the
       start of the buffer.

       3. Matching a subject string that is split into multiple
       segments does not always produce exactly the same result
       as matching over one single long string.  The difference
       arises when there are multiple  matching  possibilities,
       because  a partial match result is given only when there
       are no completed matches in a call to fBpcre_dfa_exec().
       This  means  that as soon as the shortest match has been
       found, continuation to  a  new  subject  segment  is  no
       longer possible.  Consider this pcretest example:

           re> /dog(sbody)?/
         data> do\P\D
         Partial match: do
         data> gsb\R\P\D
          0: g
         data> dogsbody\D
          0: dogsbody
          1: dog

       The  pattern matches the words "dog" or "dogsbody". When
       the subject is presented  in  several  parts  ("do"  and
       "gsb"  being  the  first two) the match stops when "dog"
       has been found, and it is not possible to  continue.  On
       the  other  hand, if "dogsbody" is presented as a single
       string, both matches are found.

       Because of this phenomenon, it  does  not  usually  make
       sense  to  end  a pattern that is going to be matched in
       this way with a variable repeat.

       4. Patterns that contain alternatives at the  top  level
       which  do  not  all start with the same pattern item may
       not work as expected. For example,  consider  this  pat-
       tern:

         1234|3789

       If  the first part of the subject is "ABC123", a partial
       match of the first alternative is  found  at  offset  3.
       There  is  no  partial match for the second alternative,
       because such a match does not start at the same point in
       the  subject  string.  Attempting  to  continue with the
       string "789" does not yield a match because  only  those
       alternatives  that match at one point in the subject are
       remembered. The problem arises because the start of  the
       second alternative matches within the first alternative.
       There is no problem with anchored patterns  or  patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alterna-
       tives.

Last updated: 30 November 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                                PCREPARTIAL(3)
