PCRETEST(1)                                                        PCRETEST(1)



NAME
       pcretest - a program for testing Perl-compatible regular
       expressions.

SYNOPSIS

       pcretest [options] [source] [destination]

       pcretest was written as a test program for the PCRE reg-
       ular  expression library itself, but it can also be used
       for experimenting with regular expressions.  This  docu-
       ment  describes  the  features  of the test program; for
       details of the regular expressions themselves,  see  the
       pcrepattern  documentation.  For  details  of  the  PCRE
       library  function  calls  and  their  options,  see  the
       pcreapi documentation.

OPTIONS

       -b        Behave as if each regex has the /B (show byte-
                 code) modifier; the internal  form  is  output
                 after compilation.

       -C        Output the version number of the PCRE library,
                 and  all  available  information   about   the
                 optional  features that are included, and then
                 exit.

       -d        Behave as if each regex  has  the  /D  (debug)
                 modifier;  the  internal  form and information
                 about the compiled  pattern  is  output  after
                 compilation; -d is equivalent to -b -i.

       -dfa      Behave  as  if  each data line contains the \D
                 escape sequence; this causes  the  alternative
                 matching function, pcre_dfa_exec(), to be used
                 instead of the standard  pcre_exec()  function
                 (more detail is given below).

       -help     Output a brief summary of these options and then
                 exit.

       -i        Behave as if each regex has the  /I  modifier;
                 information  about  the  compiled  pattern  is
                 given after compilation.

       -m        Output the size of each compiled pattern after
                 it  has  been  compiled. This is equivalent to
                 adding /M to each regular expression. For com-
                 patibility  with earlier versions of pcretest,
                 -s is a synonym for -m.

       -o osize  Set the number of elements in the output  vec-
                 tor  that  is used when calling pcre_exec() or
                 pcre_dfa_exec() to be osize. The default value
                 is 45, which is enough for 14 capturing subex-
                 pressions  for  pcre_exec()  or  22  different
                 matches  for  pcre_dfa_exec(). The vector size
                 can be changed for individual  matching  calls
                 by  including \O in the data line (see below).

       -p        Behave as if each regex has the  /P  modifier;
                 the  POSIX  wrapper  API is used to call PCRE.
                 None of the other options has any effect  when
                 -p is set.

       -q        Do  not  output the version number of pcretest
                 at the start of execution.

       -S size   On Unix-like systems, set the size of the run-
                 time stack to size megabytes.

       -t        Run  each compile, study, and match many times
                 with a timer, and output the resulting time per
                 compile or match (in milliseconds). Do not set
                 -m with -t, because you will then get the size
                 output a zillion times, and the timing will be
                 distorted. You can control the number of iter-
                 ations  that  are used for timing by following
                 -t with a number (as a separate  item  on  the
                 command  line).  For  example, "-t 1000" would
                 iterate 1000 times. The default is to  iterate
                 500000 times.

       -tm       This  is like -t except that it times only the
                 matching  phase,  not  the  compile  or  study
                 phases.

DESCRIPTION

       If  pcretest  is  given two filename arguments, it reads
       from the first and writes to the second. If it is  given
       only  one filename argument, it reads from that file and
       writes to stdout. Otherwise, it  reads  from  stdin  and
       writes  to  stdout,  and prompts for each line of input,
       using "re>"  to  prompt  for  regular  expressions,  and
       "data>" to prompt for data lines.

       The  program  handles  any  number of sets of input on a
       single input  file.  Each  set  starts  with  a  regular
       expression,  and continues with any number of data lines
       to be matched against the pattern.

       Each data line is matched separately and  independently.
       If  you  want  to do multi-line matches, you have to use
       the \n escape sequence (or \r or \r\n,  etc.,  depending
       on  the  newline  setting)  in a single line of input to
       encode the newline sequences. There is no limit  on  the
       length  of data lines; the input buffer is automatically
       extended if it is too small.

       An empty line signals the end  of  the  data  lines,  at
       which  point a new regular expression is read. The regu-
       lar expressions are given enclosed in  any  non-alphanu-
       meric delimiters other than backslash, for example:

         /(a|bc)x+yz/

       White  space  before the initial delimiter is ignored. A
       regular expression may be continued over  several  input
       lines, in which case the newline characters are included
       within it. It  is  possible  to  include  the  delimiter
       within the pattern by escaping it, for example

         /abc\/def/

       If  you do so, the escape and the delimiter form part of
       the  pattern,  but  since  delimiters  are  always  non-
       alphanumeric,  this  does not affect its interpretation.
       If the terminating delimiter is immediately followed  by
       a backslash, for example,

         /abc/\

       then  a  backslash  is  added to the end of the pattern.
       This is done to provide  a  way  of  testing  the  error
       condition that arises if a pattern finishes with a back-
       slash, because

         /abc\/

       is interpreted as the  first  line  of  a  pattern  that
       starts  with  "abc/",  causing pcretest to read the next
       line as a continuation of the regular expression.
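
       The delimiter rules above can be sketched in  Python  as
       follows. This is only an illustration, not the parser in
       pcretest: the name split_pattern_line  is  hypothetical,
       the input is assumed to be one line with its newline al-
       ready removed, and a result of None stands for a pattern
       that continues on the next input line.

```python
def split_pattern_line(line):
    """Split one input line into (pattern, modifiers).

    Returns None when no closing delimiter is found, meaning the
    pattern would continue on the next input line.
    """
    text = line.lstrip()          # white space before the delimiter is ignored
    if not text:
        raise ValueError("empty line")
    delimiter = text[0]
    if delimiter.isalnum() or delimiter == "\\":
        raise ValueError("delimiter must be non-alphanumeric, not a backslash")

    i = 1
    while i < len(text):
        if text[i] == "\\":
            i += 2                # the escape and the next character stay in the pattern
            continue
        if text[i] == delimiter:
            break
        i += 1
    else:
        return None               # ran off the end without a closing delimiter

    pattern, rest = text[1:i], text[i + 1:]
    if rest.startswith("\\"):
        # A backslash right after the closing delimiter is appended to the pattern.
        pattern += "\\"
        rest = rest[1:]
    return pattern, rest
```

       With this sketch, the line /abc/\ gives a  pattern  that
       ends in a backslash, whereas /abc\/ gives None.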

PATTERN MODIFIERS

       A pattern may be followed by any  number  of  modifiers,
       which  are  mostly  single  characters.  Following  Perl
       usage, these are referred to below as, for example, "the
       /i  modifier",  even though the delimiter of the pattern
       need not always be a slash, and no slash  is  used  when
       writing  modifiers.  Whitespace  may  appear between the
       final pattern delimiter  and  the  first  modifier,  and
       between the modifiers themselves.

       The  /i, /m, /s, and /x modifiers set the PCRE_CASELESS,
       PCRE_MULTILINE, PCRE_DOTALL, or  PCRE_EXTENDED  options,
       respectively,  when pcre_compile() is called. These four
       modifier letters have the same  effect  as  they  do  in
       Perl. For example:

         /caseless/i

       The  following table shows additional modifiers for set-
       ting PCRE options that do not correspond to anything  in
       Perl:

         /A       PCRE_ANCHORED
         /C       PCRE_AUTO_CALLOUT
         /E       PCRE_DOLLAR_ENDONLY
         /f       PCRE_FIRSTLINE
         /J       PCRE_DUPNAMES
         /N       PCRE_NO_AUTO_CAPTURE
         /U       PCRE_UNGREEDY
         /X       PCRE_EXTRA
         /<cr>    PCRE_NEWLINE_CR
         /<lf>    PCRE_NEWLINE_LF
         /<crlf>  PCRE_NEWLINE_CRLF
         /<any>   PCRE_NEWLINE_ANY

       Those  specifying  line  ending  sequences  are  literal
       strings as shown. This example sets  multiline  matching
       with CRLF as the line ending sequence:

         /^abc/m<crlf>

       Details  of the meanings of these PCRE options are given
       in the pcreapi documentation.
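
       As an illustration, the option-setting modifiers  listed
       above  can  be held in a lookup table. The Python sketch
       below is not taken from pcretest; the names  OPTION_MOD-
       IFIERS  and  options_for are hypothetical, and modifiers
       that do not set a compile option (such as /g or /I)  are
       simply rejected.

```python
# Modifier text -> name of the PCRE compile option it sets.
OPTION_MODIFIERS = {
    "i": "PCRE_CASELESS",
    "m": "PCRE_MULTILINE",
    "s": "PCRE_DOTALL",
    "x": "PCRE_EXTENDED",
    "A": "PCRE_ANCHORED",
    "C": "PCRE_AUTO_CALLOUT",
    "E": "PCRE_DOLLAR_ENDONLY",
    "f": "PCRE_FIRSTLINE",
    "J": "PCRE_DUPNAMES",
    "N": "PCRE_NO_AUTO_CAPTURE",
    "U": "PCRE_UNGREEDY",
    "X": "PCRE_EXTRA",
    "<cr>": "PCRE_NEWLINE_CR",
    "<lf>": "PCRE_NEWLINE_LF",
    "<crlf>": "PCRE_NEWLINE_CRLF",
    "<any>": "PCRE_NEWLINE_ANY",
}


def options_for(modifiers):
    """Return the option names selected by a modifier string."""
    names = []
    i = 0
    while i < len(modifiers):
        if modifiers[i].isspace():        # white space between modifiers is allowed
            i += 1
            continue
        if modifiers[i] == "<":           # literal strings such as <crlf>
            end = modifiers.find(">", i)
            if end == -1:
                raise ValueError("unterminated newline modifier")
            token = modifiers[i:end + 1]
        else:
            token = modifiers[i]
        if token not in OPTION_MODIFIERS:
            raise ValueError("not an option-setting modifier: " + token)
        names.append(OPTION_MODIFIERS[token])
        i += len(token)
    return names
```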

   Finding all matches in a string

       Searching for all possible matches within  each  subject
       string  can be requested by the /g or /G modifier. After
       finding a match, PCRE is  called  again  to  search  the
       remainder  of the subject string. The difference between
       /g and /G is that the former uses the startoffset  argu-
       ment  to  pcre_exec()  to start searching at a new point
       within the entire string (which is in effect  what  Perl
       does),  whereas  the latter passes over a shortened sub-
       string. This makes a difference to the matching  process
       if  the  pattern  begins  with  a  lookbehind  assertion
       (including \b or \B).
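
       The difference can be reproduced outside PCRE. The  fol-
       lowing  sketch  uses Python's re module, not PCRE, as an
       analogy: the pos argument of search() plays the role  of
       startoffset,  while  slicing  the  subject mimics /G. The
       function names are hypothetical, and an empty  match  is
       simply stepped over, which is a simplification.

```python
import re


def matches_with_offset(pattern, subject):
    """Like /g: keep the whole subject and move the start offset."""
    found, pos = [], 0
    while pos <= len(subject):
        m = pattern.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Simplification: step over an empty match instead of retrying it.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def matches_with_slicing(pattern, subject):
    """Like /G: pass only the unmatched remainder each time."""
    found, rest = [], subject
    while rest:
        m = pattern.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        rest = rest[cut:]
    return found


pattern = re.compile(r"\Bi(\w\w)")
print(matches_with_offset(pattern, "Mississippi"))
print(matches_with_slicing(pattern, "Mississippi"))
```

       The two results differ because, in the  sliced  version,
       \B  cannot  see  the  character  that precedes the start
       of the remaining substring.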

       If any call to  pcre_exec()  in  a  /g  or  /G  sequence
       matches  an empty string, the next call is done with the
       PCRE_NOTEMPTY and PCRE_ANCHORED flags set  in  order  to
       search  for another, non-empty, match at the same point.
       If this second match fails, the start offset is advanced
       by  one,  and the normal match is retried. This imitates
       the way Perl handles such cases when using the /g  modi-
       fier or the split() function.

   Other modifiers

       There  are  yet  more  modifiers for controlling the way
       pcretest operates.

       The /+ modifier requests that as well as outputting  the
       substring  that  matched  the  entire  pattern, pcretest
       should in addition output the remainder of  the  subject
       string.  This is useful for tests where the subject con-
       tains multiple copies of the same substring.

       The /B modifier is a debugging feature. It requests that
       pcretest  output  a  representation of the compiled byte
       code after compilation.

       The /L modifier must be followed directly by the name of
       a locale, for example,

         /pattern/Lfr_FR

       For this reason, it must be the last modifier. The given
       locale is set, pcre_maketables() is called  to  build  a
       set of character tables for the locale, and this is then
       passed to  pcre_compile()  when  compiling  the  regular
       expression.  Without  an  /L modifier, NULL is passed as
       the tables pointer; that is,  /L  applies  only  to  the
       expression on which it appears.

       The  /I  modifier requests that pcretest output informa-
       tion about the compiled pattern (whether it is anchored,
       has a fixed first character, and so on). It does this by
       calling pcre_fullinfo() after compiling  a  pattern.  If
       the  pattern  is  studied,  the results of that are also
       output.

       The /D modifier is a  PCRE  debugging  feature,  and  is
       equivalent to /BI, that is, both the /B and the /I modi-
       fiers.

       The /F modifier causes pcretest to flip the  byte  order
       of  the  fields  in  the  compiled  pattern that contain
       2-byte and 4-byte numbers. This facility is for  testing
       the  feature  in PCRE that allows it to execute patterns
       that were compiled on a host with  a  different  endian-
       ness.  This  feature  is  not  available  when the POSIX
       interface to PCRE is being used, that is,  when  the  /P
       pattern  modifier  is  specified.  See  also the section
       about saving and reloading compiled patterns below.
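
       For readers unfamiliar with byte-order flipping, a  min-
       imal  Python  sketch  of the operation on a single 2-byte
       or 4-byte number follows. The function name  is  hypoth-
       etical, and the sketch says nothing about which fields of
       a compiled pattern are affected.

```python
def flip_byte_order(value, width):
    """Reverse the bytes of an unsigned integer stored in `width` bytes."""
    return int.from_bytes(value.to_bytes(width, "big"), "little")
```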

       The /S modifier causes pcre_study() to be  called  after
       the  expression  has been compiled, and the results used
       when the expression is matched.

       The /M modifier causes the size of the memory block used
       to hold the compiled pattern to be output.

       The  /P  modifier  causes  pcretest to call PCRE via the
       POSIX wrapper API rather than its native API. When  this
       is  done,  all other modifiers except /i, /m, and /+ are
       ignored. REG_ICASE is set if /i is present, and REG_NEW-
       LINE  is  set  if  /m  is present. The wrapper functions
       force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
       REG_NEWLINE is set.

       The  /8  modifier  causes pcretest to call PCRE with the
       PCRE_UTF8 option set. This turns on  support  for  UTF-8
       character  handling  in  PCRE, provided that it was com-
       piled with this  support  enabled.  This  modifier  also
       causes  any non-printing characters in output strings to
       be printed using the  \x{hh...}  notation  if  they  are
       valid UTF-8 sequences.

       If  the  /? modifier is used with /8, it causes pcretest
       to  call  pcre_compile()  with  the   PCRE_NO_UTF8_CHECK
       option, to suppress the checking of the string for UTF-8
       validity.

DATA LINES

       Before each data line is passed to pcre_exec(),  leading
       and  trailing  whitespace  is  removed,  and  it is then
       scanned for \ escapes. Some of these are pretty esoteric
       features,  intended  for  checking  out some of the more
       complicated features of PCRE. If you  are  just  testing
       "ordinary"  regular expressions, you probably don't need
       any of these. The following escapes are recognized:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         formfeed (\x0c)
         \n         newline (\x0a)
         \qdd       set the PCRE_MATCH_LIMIT limit to dd
                      (any number of digits)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits)
         \xhh       hexadecimal character (up to 2 hex digits)
         \x{hh...}  hexadecimal character, any number of digits
                      in UTF-8 mode
         \A         pass the PCRE_ANCHORED option to pcre_exec()
                      or pcre_dfa_exec()
         \B         pass the PCRE_NOTBOL option to pcre_exec()
                      or pcre_dfa_exec()
         \Cdd       call pcre_copy_substring() for substring dd
                      after a successful match (number less
                      than 32)
         \Cname     call pcre_copy_named_substring() for
                      substring "name" after a successful match
                      (name terminated by next non-alphanumeric
                      character)
         \C+        show the current captured substrings at
                      callout time
         \C-        do not supply a callout function
         \C!n       return 1 instead of 0 when callout number n
                      is reached
         \C!n!m     return 1 instead of 0 when callout number n
                      is reached for the mth time
         \C*n       pass the number n (may be negative) as
                      callout data; this is used as the callout
                      return value
         \D         use the pcre_dfa_exec() match function
         \F         only shortest match for pcre_dfa_exec()
         \Gdd       call pcre_get_substring() for substring dd
                      after a successful match (number less
                      than 32)
         \Gname     call pcre_get_named_substring() for
                      substring "name" after a successful match
                      (name terminated by next non-alphanumeric
                      character)
         \L         call pcre_get_substringlist() after a
                      successful match
         \M         discover the minimum MATCH_LIMIT and
                      MATCH_LIMIT_RECURSION settings
         \N         pass the PCRE_NOTEMPTY option to pcre_exec()
                      or pcre_dfa_exec()
         \Odd       set the size of the output vector passed to
                      pcre_exec() to dd (any number of digits)
         \P         pass the PCRE_PARTIAL option to pcre_exec()
                      or pcre_dfa_exec()
         \Qdd       set the PCRE_MATCH_LIMIT_RECURSION limit to
                      dd (any number of digits)
         \R         pass the PCRE_DFA_RESTART option to
                      pcre_dfa_exec()
         \S         output details of memory get/free calls
                      during matching
         \Z         pass the PCRE_NOTEOL option to pcre_exec()
                      or pcre_dfa_exec()
         \?         pass the PCRE_NO_UTF8_CHECK option to
                      pcre_exec() or pcre_dfa_exec()
         \>dd       start the match at offset dd (any number of
                      digits); this sets the startoffset
                      argument for pcre_exec() or
                      pcre_dfa_exec()
         \<cr>      pass the PCRE_NEWLINE_CR option to
                      pcre_exec() or pcre_dfa_exec()
         \<lf>      pass the PCRE_NEWLINE_LF option to
                      pcre_exec() or pcre_dfa_exec()
         \<crlf>    pass the PCRE_NEWLINE_CRLF option to
                      pcre_exec() or pcre_dfa_exec()
         \<any>     pass the PCRE_NEWLINE_ANY option to
                      pcre_exec() or pcre_dfa_exec()

       The escapes that specify line ending sequences are  lit-
       eral strings, exactly as shown. No more than one newline
       setting should be present in any data line.

       A backslash followed by anything else just  escapes  the
       anything  else.  If  the  very last character is a back-
       slash, it is ignored. This gives a  way  of  passing  an
       empty  line  as data, since a real empty line terminates
       the data input.
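
       The character escapes can be illustrated with  a  small
       Python  decoder.  This  is a sketch under stated assump-
       tions, not the code in pcretest: the function  name  is
       hypothetical, only the escapes that stand for a charac-
       ter are handled (\x{hh...} and the option-setting escapes
       are rejected), and an octal value above 255 is truncated
       to one byte, which is an arbitrary choice made here.

```python
SIMPLE_ESCAPES = {
    "a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,
    "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B,
}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_character_escapes(line):
    """Turn a data line into bytes, expanding character escapes only."""
    out = bytearray()
    i = 0
    while i < len(line):
        ch = line[i]
        i += 1
        if ch != "\\":
            out += ch.encode("utf-8")
            continue
        if i == len(line):
            break                          # a final backslash is ignored
        esc = line[i]
        i += 1
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
        elif esc in OCTAL_DIGITS:          # \nnn, up to 3 octal digits
            digits = esc
            while len(digits) < 3 and i < len(line) and line[i] in OCTAL_DIGITS:
                digits += line[i]
                i += 1
            out.append(int(digits, 8) & 0xFF)   # assumption: keep the low byte
        elif esc == "x":                   # \xhh, up to 2 hex digits
            if line[i:i + 1] == "{":
                raise NotImplementedError("\\x{hh...} is not covered here")
            digits = ""
            while len(digits) < 2 and i < len(line) and line[i] in HEX_DIGITS:
                digits += line[i]
                i += 1
            out.append(int(digits or "0", 16))
        elif esc.isalnum():
            raise NotImplementedError("option-setting escape: \\" + esc)
        else:
            out += esc.encode("utf-8")     # backslash just escapes anything else
    return bytes(out)
```

       A line consisting of a single  backslash  decodes  to  an
       empty byte string, which is how an empty data line can be
       passed.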

       If \M is present,  pcretest  calls  pcre_exec()  several
       times,  with  different  values  in  the match_limit and
       match_limit_recursion  fields  of  the  pcre_extra  data
       structure,  until  it finds the minimum numbers for each
       parameter  that  allow  pcre_exec()  to  complete.   The
       match_limit  number  is a measure of the amount of back-
       tracking that takes place, and checking it  out  can  be
       instructive.  For  most  simple  matches,  the number is
       quite small, but for patterns with very large numbers of
       matching possibilities, it can become large very quickly
       with  increasing   length   of   subject   string.   The
       match_limit_recursion  number  is  a measure of how much
       stack (or, if PCRE is compiled with NO_RECURSE, how much
       heap) memory is needed to complete the match attempt.
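
       One plausible way to find such  a  minimum  is  to  keep
       doubling  the limit until the match completes, and then
       to bisect. The Python sketch below shows that  strategy
       only;  it  is  not  necessarily the search that pcretest
       performs. The callable  succeeds_with  is  a  hypotheti-
       cal  stand-in for "call the matcher with this limit and
       report whether it completed", and it is assumed  that  a
       larger limit never turns success into failure.

```python
def find_minimum_limit(succeeds_with, ceiling=10_000_000):
    """Return the smallest limit for which succeeds_with(limit) is true."""
    low, high = 0, 1                      # low: known to fail (0 = nothing tried)
    while not succeeds_with(high):        # grow until the match completes
        if high >= ceiling:
            raise ValueError("no limit up to the ceiling lets the match complete")
        low, high = high, min(high * 2, ceiling)
    while high - low > 1:                 # bisect between failure and success
        middle = (low + high) // 2
        if succeeds_with(middle):
            high = middle
        else:
            low = middle
    return high
```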

       When  \O  is  used, the value specified may be higher or
       lower than the size set by the -o  command  line  option
       (or  defaulted  to  45);  \O applies only to the call of
       pcre_exec() for the line in which it appears.
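
       The figures quoted for the default size of  45  can  be
       reproduced  with  a  little arithmetic. The sketch below
       assumes the rule given in the pcreapi documentation that
       pcre_exec()  uses  only the first two-thirds of the vec-
       tor for offset pairs, and assumes that  pcre_dfa_exec()
       uses  one  pair  per match. The function name is hypoth-
       etical.

```python
def output_vector_capacity(osize):
    """Return (capturing subexpressions for pcre_exec, matches for pcre_dfa_exec)."""
    usable_pairs = (osize * 2 // 3) // 2    # two-thirds of the vector, 2 ints per pair
    exec_captures = max(usable_pairs - 1, 0)  # pair 0 holds the whole match
    dfa_matches = osize // 2                # one pair per match
    return exec_captures, dfa_matches


print(output_vector_capacity(45))
```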

       If the /P modifier was present on the  pattern,  causing
       the  POSIX  wrapper API to be used, the only option-set-
       ting sequences that have any effect are \B and \Z, caus-
       ing  REG_NOTBOL  and  REG_NOTEOL,  respectively,  to  be
       passed to regexec().

       The use of \x{hh...} to represent  UTF-8  characters  is
       not  dependent on the use of the /8 modifier on the pat-
       tern. It is recognized always. There may be  any  number
       of  hexadecimal  digits inside the braces. The result is
       from one to six bytes, encoded according  to  the  UTF-8
       rules.
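
       The one-to-six-byte encoding can be sketched in  Python.
       Python's  own  encoder  stops at four bytes, so the func-
       tion below applies the original UTF-8 rules (as  in  RFC
       2279)  by  hand. It is an illustration with a hypotheti-
       cal name, not the routine used by pcretest.

```python
def encode_utf8_extended(code_point):
    """Encode a value up to 0x7FFFFFFF as 1 to 6 bytes using the UTF-8 rules."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    if code_point < 0x80:
        return bytes([code_point])
    # (largest value, lead-byte marker) for 2, 3, 4, 5 and 6 byte forms
    forms = [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
             (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]
    for extra, (limit, lead) in enumerate(forms, start=1):
        if code_point <= limit:
            break
    tail = []
    for _ in range(extra):                 # peel off 6 bits per continuation byte
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    return bytes([lead | code_point] + tail[::-1])
```

       For values that are valid Unicode code points the result
       agrees with Python's built-in UTF-8 encoder.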

THE ALTERNATIVE MATCHING FUNCTION

       By  default,  pcretest  uses  the standard PCRE matching
       function, pcre_exec(), to match each  data  line.  From
       release 6.0, PCRE supports an alternative matching func-
       tion, pcre_dfa_exec(), which  operates  in  a  different
       way,  and has some restrictions. The differences between
       the two functions are described in the pcrematching doc-
       umentation.

       If  a  data  line contains the \D escape sequence, or if
       the command line contains the -dfa option, the  alterna-
       tive  matching  function is called.  This function finds
       all possible matches at a given point. If, however,  the
       \F escape sequence is present in the data line, it stops
       after the first match  is  found.  This  is  always  the
       shortest possible match.

DEFAULT OUTPUT FROM PCRETEST

       This section describes the output when the normal match-
       ing function, pcre_exec(), is being used.

       When a match succeeds, pcretest outputs the list of cap-
       tured substrings that pcre_exec() returns, starting with
       number 0 for the string that matched the whole  pattern.
       Otherwise, it outputs "No match" or "Partial match" when
       pcre_exec()      returns      PCRE_ERROR_NOMATCH      or
       PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
       negative error number. Here is an example of an interac-
       tive pcretest run.

         $ pcretest
         PCRE version 7.0 30-Nov-2006

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       If the strings contain any non-printing characters, they
       are output as \xhh escapes, or as \x{...} escapes if the
       /8  modifier  was  present on the pattern. See below for
       the definition of non-printing characters. If  the  pat-
       tern  has the /+ modifier, the output for substring 0 is
       followed by the rest of the subject string,  identified
       by "0+" like this:

           re> /cat/+
         data> cataract
          0: cat
          0+ aract

       If the pattern has the /g or /G modifier, the results of
       successive matching attempts  are  output  in  sequence,
       like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No  match"  is  output  only if the first match attempt
       fails.

       If any of the sequences \C, \G, or \L are present  in  a
       data  line  that is successfully matched, the substrings
       extracted by the convenience functions are  output  with
       C,  G,  or L after the string number instead of a colon.
       This is in addition to the normal full list. The  string
       length  (that  is,  the return from the extraction func-
       tion) is given in parentheses after each string  for  \C
       and \G.

       Note that whereas patterns can be continued over several
       lines (a plain ">" prompt is  used  for  continuations),
       data  lines may not. However newlines can be included in
       data by means of the  \n  escape  (or  \r,  \r\n,  etc.,
       depending on the newline sequence setting).

OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre_dfa_exec(),
       is used (by means of the \D escape sequence or the  -dfa
       command  line  option), the output consists of a list of
       all the matches that start at the  first  point  in  the
       subject where there is at least one match. For example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\D
          0: tangerine
          1: tang
          2: tan

       (Using  the  normal matching function on this data finds
       only "tang".) The  longest  matching  string  is  always
       given first (and numbered zero).
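
       The shape of this output can  be  imitated  with  Python's
       re module by brute force: for the first start position
       that matches at all, every possible end position is tried
       and  the  results  are listed longest first. This is only
       an emulation for simple patterns; it is  not  how  pcre_
       dfa_exec()  works  internally,  and  the function name is
       hypothetical.

```python
import re


def all_matches_at_first_point(pattern, subject):
    """List every match length at the first position where any match exists."""
    for start in range(len(subject) + 1):
        ends = [end for end in range(len(subject), start - 1, -1)
                if pattern.fullmatch(subject, start, end)]
        if ends:
            return [subject[start:end] for end in ends]   # longest first
    return []


pattern = re.compile(r"(tang|tangerine|tan)")
print(all_matches_at_first_point(pattern, "yellow tangerine"))
print(pattern.search("yellow tangerine").group(0))
```

       The last element of the list corresponds to  the  short-
       est match, which is what \F asks for.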

       If  /g is present on the pattern, the search for further
       matches resumes at the end of  the  longest  match.  For
       example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\D
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       Since  the  matching function does not support substring
       capture, the escape sequences that  are  concerned  with
       captured substrings are not relevant.

RESTARTING AFTER A PARTIAL MATCH

       When  the  alternative  matching  function has given the
       PCRE_ERROR_PARTIAL return, indicating that  the  subject
       partially matched the pattern, you can restart the match
       with additional subject data by means of the  \R  escape
       sequence. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\P\D
         Partial match: 23ja
         data> n05\R\D
          0: n05

       For further information about partial matching, see  the
       pcrepartial documentation.

CALLOUTS

       If the pattern contains any callout requests, pcretest's
       callout function is called during matching.  This  works
       with  both  matching  functions.  By default, the called
       function displays the callout number, the start and cur-
       rent  positions in the text at the callout time, and the
       next pattern item to be tested. For example, the output

         --->pqrabcdef
           0    ^  ^     \d

       indicates that callout number 0  occurred  for  a  match
       attempt  starting at the fourth character of the subject
       string, when the pointer was at the seventh character of
       the  data,  and  when the next pattern item was \d. Just
       one circumflex is output if the start and current  posi-
       tions are the same.

       Callouts  numbered 255 are assumed to be automatic call-
       outs, inserted as a result of the /C  pattern  modifier.
       In this case, instead of showing the callout number, the
       offset in the pattern, preceded by a  plus,  is  output.
       For example:

           re> /\d?[A-E]\*/C
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       The  callout function in pcretest returns zero (carry on
       matching) by default, but you can use a  \C  item  in  a
       data line (as described above) to change this.

       Inserting callouts can be helpful when using pcretest to
       check  complicated  regular  expressions.  For   further
       information about callouts, see the pcrecallout documen-
       tation.

NON-PRINTING CHARACTERS

       When pcretest is outputting text in the compiled version
       of a pattern, bytes other than 32-126 are always treated
       as non-printing characters and are  therefore  shown  as
       hex escapes.
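
       The rule for the default case can be written as a  short
       Python  function.  This is a sketch with a hypothetical
       name; the two-digit lower-case \xhh form is  an  assump-
       tion about the exact layout of the escapes.

```python
def show_bytes(data):
    """Render bytes, showing anything outside 32-126 as a hex escape."""
    parts = []
    for byte in data:
        if 32 <= byte <= 126:
            parts.append(chr(byte))
        else:
            parts.append("\\x%02x" % byte)
    return "".join(parts)
```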

       When  pcretest is outputting text that is a matched part
       of a subject string, it behaves in the same way,  unless
       a  different  locale has been set for the pattern (using
       the /L modifier). In this case, the  isprint()  function
       is used to distinguish printing and non-printing charac-
       ters.

SAVING AND RELOADING COMPILED PATTERNS

       The   facilities  described  in  this  section  are  not
       available when the POSIX interface to PCRE is being used,
       that is, when the /P pattern modifier is specified.

       When  the  POSIX  interface is not in use, you can cause
       pcretest to write a compiled pattern to a file, by  fol-
       lowing  the modifiers with > and a file name.  For exam-
       ple:

         /pattern/im >/some/file

       See the pcreprecompile documentation  for  a  discussion
       about saving and re-using compiled patterns.

       The  data  that  is  written  is binary. The first eight
       bytes are the length of the compiled pattern  data  fol-
       lowed  by  the  length  of the optional study data, each
       written as four bytes in big-endian order (most signifi-
       cant  byte first). If there is no study data (either the
       pattern was not studied, or studying did not return  any
       data),  the  second length is zero. The lengths are fol-
       lowed by an exact copy of the compiled pattern. If there
       is additional study data, this follows immediately after
       the compiled pattern. After writing the  file,  pcretest
       expects to read a new pattern.
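
       The layout just described can be read back with  a  few
       lines  of Python. The sketch below only splits the file
       contents into its two parts; it does not interpret  the
       compiled  pattern.  The function name is hypothetical.

```python
import struct


def split_saved_pattern(blob):
    """Split saved file contents into (compiled pattern, study data)."""
    if len(blob) < 8:
        raise ValueError("file too short to hold the two lengths")
    # Two 4-byte big-endian lengths: compiled pattern, then study data.
    pattern_length, study_length = struct.unpack(">II", blob[:8])
    end_of_pattern = 8 + pattern_length
    end_of_study = end_of_pattern + study_length
    if len(blob) < end_of_study:
        raise ValueError("file is truncated")
    return blob[8:end_of_pattern], blob[end_of_pattern:end_of_study]
```

       When the second length is zero, the second  element  of
       the result is an empty byte string.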

       A saved pattern can be reloaded into pcretest by specify-
       ing < and a file name instead of a pattern. The name  of
       the  file  must  not contain a < character, as otherwise
       pcretest will interpret the line as a pattern  delimited
       by < characters.  For example:

          re> </some/file
         Compiled regex loaded from /some/file
         No study data

       When  the  pattern has been loaded, pcretest proceeds to
       read data lines in the usual way.

       You can copy a file written by pcretest to  a  different
       host and reload it there, even if the new host has oppo-
       site endianness to the one on which the pattern was com-
       piled.  For  example,  you can compile on an i86 machine
       and run on a SPARC machine.

       File names for saving and reloading can be  absolute  or
       relative,  but note that the shell facility of expanding
       a file name that starts with a tilde (~) is  not  avail-
       able.

       The  ability  to  save  and  reload files in pcretest is
       intended for testing  and  experimentation.  It  is  not
       intended  for  production use because only a single pat-
       tern can be written to a file. Furthermore, there is  no
       facility  for  supplying custom character tables for use
       with a reloaded pattern. If  the  original  pattern  was
       compiled  with custom tables, an attempt to match a sub-
       ject string using a reloaded pattern is likely to  cause
       pcretest  to  crash.   Finally, if you attempt to load a
       file that is not in the correct format,  the  result  is
       undefined.

SEE ALSO

       pcre(3),  pcreapi(3),  pcrecallout(3),  pcrematching(3),
       pcrepartial(3), pcrepattern(3), pcreprecompile(3).

AUTHOR

       Philip Hazel
       University Computing Service,
       Cambridge CB2 3QH, England.

Last updated: 30 November 2006
Copyright (c) 1997-2006 University of Cambridge.



                                                                   PCRETEST(1)
