net.sf.saxon.regex.REMatcher

public class REMatcher extends Object

RE is an efficient, lightweight regular expression evaluator/matcher class. Regular expressions are pattern descriptions which enable sophisticated matching of strings. In addition to being able to match a string against a pattern, you can also extract parts of the match. This is especially useful in text parsing! Details on the syntax of regular expression patterns are given below.

To compile a regular expression (RE), you can simply construct an RE matcher object from the string specification of the pattern, like this:

  RE r = new RE("a*b");

Once you have done this, you can call either of the RE.match methods to perform matching on a String. For example:

  boolean matched = r.match("aaaab");

will cause the boolean matched to be set to true because the pattern "a*b" matches the string "aaaab".

If you were interested in the number of a's which matched the first part of our example expression, you could change the expression to "(a*)b". Then when you compiled the expression and matched it against something like "xaaaab", you would get results like this:

  RE r = new RE("(a*)b");                  // Compile expression
  boolean matched = r.match("xaaaab");     // Match against "xaaaab"

  String wholeExpr = r.getParen(0);        // wholeExpr will be 'aaaab'
  String insideParens = r.getParen(1);     // insideParens will be 'aaaa'

  int startWholeExpr = r.getParenStart(0); // startWholeExpr will be index 1
  int endWholeExpr = r.getParenEnd(0);     // endWholeExpr will be index 6
  int lenWholeExpr = r.getParenLength(0);  // lenWholeExpr will be 5

  int startInside = r.getParenStart(1);    // startInside will be index 1
  int endInside = r.getParenEnd(1);        // endInside will be index 5
  int lenInside = r.getParenLength(1);     // lenInside will be 4
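
The offsets in the example above can be checked with a small stand-alone sketch. It uses the JDK's java.util.regex package as a stand-in, not this class (REMatcher needs a compiled REProgram); the class name GroupSpanSketch and its helper methods are hypothetical. The only point is that group 0 is the whole match, group 1 is the parenthesized part, and the end index is exclusive.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupSpanSketch {

    /** Text of the given group in the first match, or null if there is no match. */
    static String groupText(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    /** {start, end, length} of the given group in the first match, or null if no match. */
    static int[] groupSpan(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        // The end index is exclusive, so the length is end - start.
        return new int[] { m.start(group), m.end(group), m.end(group) - m.start(group) };
    }

    public static void main(String[] args) {
        System.out.println(groupText("(a*)b", "xaaaab", 0));                 // aaaab
        System.out.println(groupText("(a*)b", "xaaaab", 1));                 // aaaa
        System.out.println(Arrays.toString(groupSpan("(a*)b", "xaaaab", 0))); // [1, 6, 5]
        System.out.println(Arrays.toString(groupSpan("(a*)b", "xaaaab", 1))); // [1, 5, 4]
    }
}
```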

You can also refer to the contents of a parenthesized expression within a regular expression itself. This is called a 'backreference'. The first backreference in a regular expression is denoted by \1, the second by \2 and so on. So the expression:

  ([0-9]+)=\1

will match any string of the form n=n (like 0=0 or 2=2).
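
As a rough illustration of the backreference above, the following sketch runs the same expression through java.util.regex, which is only a stand-in for this class; the \1 notation happens to behave the same way there for this simple case. The class name BackreferenceSketch and the method sameOnBothSides are hypothetical.

```java
import java.util.regex.Pattern;

public class BackreferenceSketch {
    // \1 must repeat exactly the text captured by the first group.
    static final Pattern SAME_NUMBER = Pattern.compile("([0-9]+)=\\1");

    /** True if the whole input has the form n=n. */
    static boolean sameOnBothSides(String input) {
        return SAME_NUMBER.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameOnBothSides("0=0"));   // true
        System.out.println(sameOnBothSides("12=12")); // true
        System.out.println(sameOnBothSides("12=13")); // false
    }
}
```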

The full regular expression syntax accepted by RE is as defined in the XSD 1.1 specification, modified by the XPath 2.0 or 3.0 specifications.

Line terminators

A line terminator is a one- or two-character sequence that marks the end of a line of the input character sequence. The following are recognized as line terminators:

A newline (line feed) character ('\n'),
A carriage-return character followed immediately by a newline character ("\r\n"),
A standalone carriage-return character ('\r'),
A next-line character ('\u0085'),
A line-separator character ('\u2028'), or
A paragraph-separator character ('\u2029').
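
To make the list above concrete, here is a minimal sketch that splits text on exactly these six terminators. It is written against java.util.regex purely for illustration and says nothing about how this class handles terminators internally; the class name LineTerminatorSketch and the method lines are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class LineTerminatorSketch {
    // "\r\n" comes first so that a CR LF pair counts as one terminator, not two.
    static final Pattern LINE_TERMINATOR =
            Pattern.compile("\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]");

    /** Splits text into lines on any of the recognized line terminators. */
    static List<String> lines(String text) {
        // A limit of -1 keeps trailing empty lines instead of discarding them.
        return Arrays.asList(LINE_TERMINATOR.split(text, -1));
    }

    public static void main(String[] args) {
        System.out.println(lines("a\nb\r\nc\rd"));                  // [a, b, c, d]
        System.out.println(lines("x" + (char) 0x2028 + "y").size()); // 2
    }
}
```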

RE runs programs compiled by the RECompiler class. But the RE matcher class does not include the actual regular expression compiler for reasons of efficiency. In fact, if you want to pre-compile one or more regular expressions, the 'recompile' class can be invoked from the command line to produce compiled output like this:

    // Pre-compiled regular expression "a*b"
    char[] re1Instructions =
    {
        0x007c, 0x0000, 0x001a, 0x007c, 0x0000, 0x000d, 0x0041,
        0x0001, 0x0004, 0x0061, 0x007c, 0x0000, 0x0003, 0x0047,
        0x0000, 0xfff6, 0x007c, 0x0000, 0x0003, 0x004e, 0x0000,
        0x0003, 0x0041, 0x0001, 0x0004, 0x0062, 0x0045, 0x0000,
        0x0000,
    };


    REProgram re1 = new REProgram(re1Instructions);

You can then construct a regular expression matcher (RE) object from the pre-compiled expression re1 and thus avoid the overhead of compiling the expression at runtime. If you require more dynamic regular expressions, you can construct a single RECompiler object and re-use it to compile each expression. Similarly, you can change the program run by a given matcher object at any time. However, RE and RECompiler are not thread-safe (for efficiency reasons, and because thread safety is rarely required of these classes), so you will need to construct a separate compiler or matcher object for each thread (unless you do thread synchronization yourself). Once an expression has been compiled into an REProgram object, that REProgram can be safely shared across multiple threads and RE objects.
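
The sharing rule above (one immutable compiled program, one matcher per thread) has a direct analogue in the JDK, where a compiled Pattern may be shared but each Matcher belongs to one thread. The sketch below shows that analogue only; it does not use REProgram or REMatcher, and the names SharedProgramSketch and countMatching are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

public class SharedProgramSketch {
    // Compiled once and shared by all threads (plays the role of the REProgram).
    static final Pattern SHARED = Pattern.compile("a*b");

    /** Counts the inputs that contain a match, checking each input on a pool thread. */
    static int countMatching(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String input : inputs) {
                // Each task creates its own matcher; matchers are never shared.
                Callable<Boolean> task = () -> SHARED.matcher(input).find();
                results.add(pool.submit(task));
            }
            int count = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    count++;
                }
            }
            return count;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatching(Arrays.asList("aaaab", "xyz", "b", "aaa"))); // 2
    }
}
```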

ISSUES:

Not *all* possibilities are considered for greediness when backreferences are involved (as POSIX suggests should be the case). The POSIX RE "(ac*)c*d[ac]*\1", when matched against "acdacaa" should yield a match of acdacaa where \1 is "a". This is not the case in this RE package, and actually Perl doesn't go to this extent either! Until someone actually complains about this, I'm not sure it's worth "fixing". If it ever is fixed, test #137 in RETest.txt should be updated.
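
The difference described above can be observed with any Perl-style backtracking engine. The sketch below uses java.util.regex as such an engine (an assumption made for illustration; it is not this package and not RETest.txt): it stops at the first alternative that succeeds, giving the match "acdac" with \1 equal to "ac", rather than the longer POSIX answer "acdacaa" with \1 equal to "a". The class name BacktrackingSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BacktrackingSketch {

    /** Returns {whole match, group 1} for the first match, or null if there is none. */
    static String[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] result = firstMatch("(ac*)c*d[ac]*\\1", "acdacaa");
        // The greedy (ac*) keeps "ac"; [ac]* then backtracks until \1 ("ac") fits.
        System.out.println(result[0]); // acdac
        System.out.println(result[1]); // ac
    }
}
```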

This library is based on the Apache Jakarta regex library as downloaded on 3 January 2012. Changes have been made to make the grammar and semantics conform to XSD and XPath rules; these changes are listed in source code comments in the RECompiler source code module.

Nested Class Summary

Modifier and Type, Class, Description

static class REMatcher.State

Constructor Summary

Constructor, Description

REMatcher(REProgram program)
  Construct a matcher for a pre-compiled regular expression from program (bytecode) data.

Method Summary

Modifier and Type, Method, Description

REMatcher.State captureState()

protected void clearCapturedGroupsBeyond(int pos)
  Clear any captured groups whose start position is at or beyond some specified position

UnicodeString getParen(int which)
  Gets the contents of a parenthesized subexpression after a successful match.

int getParenCount()
  Returns the number of parenthesized subexpressions available after a successful match.

final int getParenEnd(int which)
  Returns the end index of a given paren level.

final int getParenStart(int which)
  Returns the start index of a given paren level.

REProgram getProgram()
  Returns the current regular expression program in use by this matcher object.

boolean isAnchoredMatch(UnicodeString search)
  Tests whether the regex matches a string in its entirety, anchored at both ends

boolean match(String search)
  Matches the current regular expression program against a String.

boolean match(UnicodeString search, int i)
  Matches the current regular expression program against a string, starting at a given index.

protected boolean matchAt(int i, boolean anchored)
  Match the current regular expression program against the current input string, starting at index i of the input string.

UnicodeString replace(UnicodeString in, UnicodeString replacement)
  Substitutes a string for this regular expression in another string.

UnicodeString replaceWith(UnicodeString in, BiFunction<UnicodeString,UnicodeString[],UnicodeString> replacer)
  Substitutes the result of a replacer function for each match of this regular expression in another string.

void resetState(REMatcher.State state)

protected final void setParenEnd(int which, int i)
  Sets the end of a paren level

protected final void setParenStart(int which, int i)
  Sets the start of a paren level

void setProgram(REProgram program)
  Sets the current regular expression program used by this matcher object.

List<UnicodeString> split(UnicodeString s)
  Splits a string into a list of strings on regular expression boundaries.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- REMatcher
  
  public REMatcher(REProgram program)
  
  Construct a matcher for a pre-compiled regular expression from program (bytecode) data.
  Parameters:
  
  program - Compiled regular expression program
  
  See Also:
  
  RECompiler
Method Details
- setProgram
  
  public void setProgram(REProgram program)
  
  Sets the current regular expression program used by this matcher object.
  Parameters:
  
  program - Regular expression program compiled by RECompiler.
  
  See Also:
  
  RECompiler
  
  REProgram
- getProgram
  
  public REProgram getProgram()
  
  Returns the current regular expression program in use by this matcher object.
  Returns:
  
  Regular expression program
  
  See Also:
  
  setProgram(net.sf.saxon.regex.REProgram)
- getParenCount
  
  public int getParenCount()
  
  Returns the number of parenthesized subexpressions available after a successful match.
  
  Returns:
  
  Number of available parenthesized subexpressions
- getParen
  
  public UnicodeString getParen(int which)
  
  Gets the contents of a parenthesized subexpression after a successful match.
  
  Parameters:
  
  which - Nesting level of subexpression
  
  Returns:
  
  String
- getParenStart
  
  public final int getParenStart(int which)
  
  Returns the start index of a given paren level.
  
  Parameters:
  
  which - Nesting level of subexpression
  
  Returns:
  
  String index
- getParenEnd
  
  public final int getParenEnd(int which)
  
  Returns the end index of a given paren level.
  
  Parameters:
  
  which - Nesting level of subexpression
  
  Returns:
  
  String index
- setParenStart
  
  protected final void setParenStart(int which, int i)
  
  Sets the start of a paren level
  
  Parameters:
  
  which - Which paren level
  
  i - Index in input array
- setParenEnd
  
  protected final void setParenEnd(int which, int i)
  
  Sets the end of a paren level
  
  Parameters:
  
  which - Which paren level
  
  i - Index in input array
- clearCapturedGroupsBeyond
  
  protected void clearCapturedGroupsBeyond(int pos)
  
  Clear any captured groups whose start position is at or beyond some specified position
  
  Parameters:
  
  pos - the specified position
- matchAt
  
  protected boolean matchAt(int i, boolean anchored)
  
  Match the current regular expression program against the current input string, starting at index i of the input string. This method is only meant for internal use.
  
  Parameters:
  
  i - The input string index to start matching at
  
  anchored - true if the regex must match all characters up to the end of the string
  
  Returns:
  
  True if the input matched the expression
- isAnchoredMatch
  
  public boolean isAnchoredMatch(UnicodeString search)
  
  Tests whether the regex matches a string in its entirety, anchored at both ends
  
  Parameters:
  
  search - the string to be matched
  
  Returns:
  
  true if the regex matches the whole string
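
The distinction between an anchored match and an ordinary search can be sketched with java.util.regex, where matches() requires the whole input to match and find() accepts a match anywhere. This is an analogy under that assumption, not a call into this class; the names AnchoredMatchSketch, matchesWhole and matchesSomewhere are hypothetical.

```java
import java.util.regex.Pattern;

public class AnchoredMatchSketch {

    /** Anchored at both ends: the regex must account for every character. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    /** Unanchored: the regex may match any substring of the input. */
    static boolean matchesSomewhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("a*b", "aaaab"));      // true
        System.out.println(matchesWhole("a*b", "xaaaab"));     // false: leading x is not covered
        System.out.println(matchesSomewhere("a*b", "xaaaab")); // true
    }
}
```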
- match
  
  public boolean match(UnicodeString search, int i)
  
  Matches the current regular expression program against a string, starting at a given index.
  
  Parameters:
  
  search - String to match against
  
  i - Index to start searching at
  
  Returns:
  
  True if string matched
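
Searching from a given index can be pictured with the JDK's Matcher.find(int), used here only as a stand-in for this method. The sketch assumes that "starting at a given index" means the search begins there and may succeed at any later position; the names MatchFromIndexSketch and firstMatchStart are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchFromIndexSketch {

    /** Start offset of the first match at or after 'from', or -1 if there is none. */
    static int firstMatchStart(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(firstMatchStart("a*b", "abxxab", 0)); // 0
        System.out.println(firstMatchStart("a*b", "abxxab", 2)); // 4
        System.out.println(firstMatchStart("a*b", "abxx", 2));   // -1
    }
}
```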
- match
  
  public boolean match(String search)
  
  Matches the current regular expression program against a String.
  
  Parameters:
  
  search - String to match against
  
  Returns:
  
  True if string matched
- split
  
  public List<UnicodeString> split(UnicodeString s)
  
  Splits a string into a list of strings on regular expression boundaries. This function works the same way as the Perl function of the same name. Given a regular expression of "[ab]+" and a string to split of "xyzzyababbayyzabbbab123", the result would be the list of strings "[xyzzy, yyz, 123]".
  Please note that the first string in the resulting list may be an empty string. This happens when the very first character of the input string is matched by the pattern.
  
  Parameters:
  
  s - String to split on this regular expression
  
  Returns:
  
  List of strings
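
The split example and the leading-empty-string case can be reproduced with java.util.regex on Java 8 or later, which behaves the same way for these two inputs. The sketch is a stand-in and does not call this method; the names SplitSketch and split are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSketch {

    /** Splits the input wherever the regex matches. */
    static List<String> split(String regex, String input) {
        return Arrays.asList(Pattern.compile(regex).split(input));
    }

    public static void main(String[] args) {
        System.out.println(split("[ab]+", "xyzzyababbayyzabbbab123")); // [xyzzy, yyz, 123]
        // The input starts with a match, so the first element is the empty string.
        System.out.println(split("[ab]+", "abxyz"));                   // [, xyz]
    }
}
```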
- replace
  
  public UnicodeString replace(UnicodeString in, UnicodeString replacement)
  
  Substitutes a string for this regular expression in another string. This method works like the Perl function of the same name. Given a regular expression of "a*b", an input string (in) of "aaaabfooaaabgarplyaaabwackyb" and the replacement string "-", the string returned by replace would be "-foo-garply-wacky-".
  It is also possible to reference the contents of a parenthesized expression with $0, $1, ... $9. Given a regular expression of "http://[\\.\\w\\-\\?/~_@&=%]+", an input string of "visit us: http://www.apache.org!" and the replacement string "<a href=\"$0\">$0</a>", the string returned by replace would be "visit us: <a href=\"http://www.apache.org\">http://www.apache.org</a>!".
  
  Note: $0 represents the whole match.
  
  Parameters:
  
  in - String to substitute within
  
  replacement - String to substitute for matches of this regular expression
  
  Returns:
  
  The string in with zero or more occurrences of the current regular expression replaced with the replacement string (if this regular expression doesn't match at any position, the original string is returned unchanged).
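
Both replacement examples can be reproduced with the JDK's String-based replaceAll, which also understands $0 as the whole match. The sketch below is a stand-in written against java.util.regex, not a call into this method; the names ReplaceSketch and replace are hypothetical.

```java
import java.util.regex.Pattern;

public class ReplaceSketch {

    /** Replaces every match of regex in input; $0..$9 in replacement refer to groups. */
    static String replace(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Prints: -foo-garply-wacky-
        System.out.println(replace("a*b", "aaaabfooaaabgarplyaaabwackyb", "-"));

        // Prints: visit us: <a href="http://www.apache.org">http://www.apache.org</a>!
        System.out.println(replace("http://[\\.\\w\\-\\?/~_@&=%]+",
                "visit us: http://www.apache.org!",
                "<a href=\"$0\">$0</a>"));
    }
}
```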
- replaceWith
  
  public UnicodeString replaceWith(UnicodeString in, BiFunction<UnicodeString,UnicodeString[],UnicodeString> replacer)
  
  Substitutes, for each match of this regular expression in another string, the value returned by a replacer function. This is the function-based counterpart of replace(UnicodeString, UnicodeString): instead of supplying a fixed replacement string, the caller supplies a function that is invoked for each matching substring and returns its replacement.
  
  Parameters:
  
  in - String to substitute within
  
  replacer - Function to process each matching substring and return a replacement
  
  Returns:
  
  The string in with zero or more occurrences of the current regular expression replaced by the values returned by replacer (if this regular expression doesn't match at any position, the original string is returned unchanged).
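
Function-based replacement can be sketched as a loop over matches that hands each one to a callback. The version below is built on java.util.regex and plain String values. It ASSUMES that the callback receives the matched substring plus an array of captured groups (with index 0 holding the whole match); the documentation above does not say what the UnicodeString[] argument of the real replacer contains, so treat that as a guess. The names ReplaceWithSketch and replaceWith are hypothetical.

```java
import java.util.function.BiFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceWithSketch {

    /** Replaces each match with the value the replacer returns for it. */
    static String replaceWith(String regex, String in,
                              BiFunction<String, String[], String> replacer) {
        Matcher m = Pattern.compile(regex).matcher(in);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the previous match
        while (m.find()) {
            out.append(in, last, m.start()); // copy the text between matches
            // ASSUMPTION: groups[0] is the whole match, groups[1..n] the captures.
            String[] groups = new String[m.groupCount() + 1];
            for (int i = 0; i < groups.length; i++) {
                groups[i] = m.group(i);
            }
            out.append(replacer.apply(m.group(), groups));
            last = m.end();
        }
        out.append(in, last, in.length()); // copy the tail after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-case every run of a's followed by b.
        System.out.println(replaceWith("a*b", "aabfoob", (match, groups) -> match.toUpperCase()));
        // Swap a letter and the digit that follows it.
        System.out.println(replaceWith("([a-z])([0-9])", "a1 b2", (match, groups) -> groups[2] + groups[1]));
    }
}
```

Unlike a replacement string, the value returned by the callback in this sketch is inserted literally, so characters such as $ need no escaping.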
- captureState
  
  public REMatcher.State captureState()
- resetState
  
  public void resetState(REMatcher.State state)
