Class Tokenizer


  • public final class Tokenizer
    extends java.lang.Object
    Tokenizer for expressions and inputs.

    This code was originally derived from James Clark's xt, though it has been greatly modified since. See copyright notice at end of file.

    • Field Summary

      Fields 
      Modifier and Type Field Description
      boolean allowSaxonExtensions
      Flag to allow Saxon extensions
      static int BARE_NAME_STATE
      State in which a name is NOT to be merged with what comes next, for example "("
      int currentToken
      The number identifying the most recently read token
      int currentTokenStartOffset
      The position in the input expression where the current token starts
      java.lang.String currentTokenValue
      The string value of the most recently read token
      static int DEFAULT_STATE
      Initial default state of the Tokenizer
      boolean disallowUnionKeyword
      Flag to disallow "union" as a synonym for "|" when parsing XSLT 2.0 patterns
      static char FULL_WIDTH_GT  
      static char FULL_WIDTH_LT  
      java.lang.String input
      The string being parsed
      int inputOffset
      The current position within the input string
      boolean isXQuery
      Flag to indicate that this is XQuery as distinct from XPath
      int languageLevel
      XPath language level: e.g. 2.0, 3.0, or 3.1
      static char NUL  
      static int OPERATOR_STATE
      State in which the next thing to be read is an operator
      static int SEQUENCE_TYPE_STATE
      State in which the next thing to be read is a SequenceType
    • Constructor Summary

      Constructors 
      Constructor Description
      Tokenizer()  
    • Method Summary

      All Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void copyTo(Tokenizer u)
      Checkpoint the state of this tokenizer so that unbounded lookahead is possible (or, restore the state of the tokenizer from a checkpoint)
      int getColumnNumber()
      Get the column number of the current token
      int getColumnNumber(int offset)
      Return the column number corresponding to a given offset in the expression
      int getLineNumber()
      Get the line number of the current token
      int getLineNumber(int offset)
      Return the line number corresponding to a given offset in the expression
      int getState()
      Get the current tokenizer state
      void incrementLineNumber(int offset)
      Increment the line number, making a record of where in the input string the newline character occurred.
      void lookAhead()
      Look ahead by one token.
      void next()
      Get the next token from the input expression.
      char nextChar()
      Read next character directly.
      char peekChar()
      Look ahead to see what the next character will be, without changing the current state
      void setState(int state)
      Set the tokenizer into a special state
      boolean thereMightBeAnArrowAhead()
      Return true if there is a thin arrow ("->") somewhere beyond the current position.
      void tokenize(java.lang.String input, int start, int end)
      Prepare a string for tokenization.
      void treatCurrentAsOperator()
      Force the current token to be treated as an operator if possible
      void unreadChar()
      Step back one character.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • DEFAULT_STATE

        public static final int DEFAULT_STATE
        Initial default state of the Tokenizer
        See Also:
        Constant Field Values
      • BARE_NAME_STATE

        public static final int BARE_NAME_STATE
        State in which a name is NOT to be merged with what comes next, for example "("
        See Also:
        Constant Field Values
      • SEQUENCE_TYPE_STATE

        public static final int SEQUENCE_TYPE_STATE
        State in which the next thing to be read is a SequenceType
        See Also:
        Constant Field Values
      • OPERATOR_STATE

        public static final int OPERATOR_STATE
        State in which the next thing to be read is an operator
        See Also:
        Constant Field Values
      • currentToken

        public int currentToken
        The number identifying the most recently read token
      • currentTokenValue

        public java.lang.String currentTokenValue
        The string value of the most recently read token
      • currentTokenStartOffset

        public int currentTokenStartOffset
        The position in the input expression where the current token starts
      • input

        public java.lang.String input
        The string being parsed
      • inputOffset

        public int inputOffset
        The current position within the input string
      • disallowUnionKeyword

        public boolean disallowUnionKeyword
        Flag to disallow "union" as a synonym for "|" when parsing XSLT 2.0 patterns
      • isXQuery

        public boolean isXQuery
        Flag to indicate that this is XQuery as distinct from XPath
      • languageLevel

        public int languageLevel
        XPath language level: e.g. 2.0, 3.0, or 3.1
      • allowSaxonExtensions

        public boolean allowSaxonExtensions
        Flag to allow Saxon extensions
    • Constructor Detail

      • Tokenizer

        public Tokenizer()
    • Method Detail

      • getState

        public int getState()
        Get the current tokenizer state
        Returns:
        the current state
      • setState

        public void setState(int state)
        Set the tokenizer into a special state
        Parameters:
        state - the new state
      • tokenize

        public void tokenize(java.lang.String input,
                             int start,
                             int end)
                      throws XPathException
        Prepare a string for tokenization. The actual tokens are obtained by subsequent calls to next().
        Parameters:
        input - the string to be tokenized
        start - start point within the string
        end - end point within the string (last character not read): -1 means end of string
        Throws:
        XPathException - if a lexical error occurs, e.g. unmatched string quotes
      • next

        public void next()
                  throws XPathException
        Get the next token from the input expression. The method returns nothing; the token type is stored in the currentToken field and the string value of the token in currentTokenValue.
        Throws:
        XPathException - if a lexical error is detected
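        The calling protocol described for tokenize and next() can be sketched with a miniature stand-in class. This is an illustration only, not the Saxon implementation: MiniTokenizerSketch, its token codes (EOF, NAME, NUMBER, SYMBOL) and the allTokens helper are all hypothetical, and the sketch skips the lexical error handling that the real class reports through XPathException.

```java
/** Hypothetical miniature tokenizer; NOT the Saxon implementation. */
final class MiniTokenizerSketch {
    // Token codes invented for this sketch only.
    static final int EOF = 0, NAME = 1, NUMBER = 2, SYMBOL = 3;

    int currentToken = EOF;           // type of the most recently read token
    String currentTokenValue = null;  // string value of that token
    int currentTokenStartOffset = 0;  // where that token starts in the input

    private String input;
    private int inputOffset;
    private int inputLength;

    /** Prepare a string for tokenization; end == -1 means "up to the end of the string". */
    void tokenize(String input, int start, int end) {
        this.input = input;
        this.inputOffset = start;
        this.inputLength = (end == -1) ? input.length() : end;
    }

    /** Read the next token; results are left in the fields, nothing is returned. */
    void next() {
        // Skip whitespace between tokens.
        while (inputOffset < inputLength && Character.isWhitespace(input.charAt(inputOffset))) {
            inputOffset++;
        }
        currentTokenStartOffset = inputOffset;
        if (inputOffset >= inputLength) {
            currentToken = EOF;
            currentTokenValue = null;
            return;
        }
        char c = input.charAt(inputOffset);
        if (Character.isLetter(c)) {
            while (inputOffset < inputLength && Character.isLetterOrDigit(input.charAt(inputOffset))) {
                inputOffset++;
            }
            currentToken = NAME;
        } else if (Character.isDigit(c)) {
            while (inputOffset < inputLength && Character.isDigit(input.charAt(inputOffset))) {
                inputOffset++;
            }
            currentToken = NUMBER;
        } else {
            inputOffset++;  // any other character is a one-character symbol
            currentToken = SYMBOL;
        }
        currentTokenValue = input.substring(currentTokenStartOffset, inputOffset);
    }

    /** Typical driver loop: call tokenize once, then next() until EOF. */
    static java.util.List<String> allTokens(String expr) {
        MiniTokenizerSketch t = new MiniTokenizerSketch();
        t.tokenize(expr, 0, -1);
        java.util.List<String> values = new java.util.ArrayList<>();
        t.next();
        while (t.currentToken != EOF) {
            values.add(t.currentTokenValue);
            t.next();
        }
        return values;
    }
}
```

        Under these assumptions, allTokens("a + 12") yields the three values "a", "+" and "12". The point of the sketch is the shape of the loop: the caller inspects public fields after each call instead of receiving a token object.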
      • thereMightBeAnArrowAhead

        public boolean thereMightBeAnArrowAhead()
        Return true if there is a thin arrow ("->") somewhere beyond the current position. This can be used to eliminate unnecessary lookahead
        Returns:
        true if a thin arrow may be present. The result can be a false positive.
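        One way such a cheap pre-check could work is a plain substring search from the current offset. The sketch below assumes exactly that; the class name ArrowPrecheckSketch and the two-argument static signature are hypothetical, and the real method may use a different test.

```java
/** Hypothetical sketch of a cheap "is lookahead worth trying" test. */
final class ArrowPrecheckSketch {
    /**
     * Returns true if "->" occurs at or after inputOffset.
     * The search ignores lexical context, so an arrow inside a string
     * literal or comment also counts: that is the false-positive case.
     */
    static boolean thereMightBeAnArrowAhead(String input, int inputOffset) {
        return input.indexOf("->", inputOffset) >= 0;
    }
}
```

        A false result lets a parser skip the expensive speculative parse entirely; a true result only means the lookahead is worth attempting.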
      • treatCurrentAsOperator

        public void treatCurrentAsOperator()
        Force the current token to be treated as an operator if possible
      • lookAhead

        public void lookAhead()
                       throws XPathException
        Look ahead by one token. This method does the real tokenization work. The method is normally called internally, but the XQuery parser also calls it to resume normal tokenization after dealing with pseudo-XML syntax.
        Throws:
        XPathException - if a lexical error occurs
      • nextChar

        public char nextChar()
        Read next character directly. Used by the XQuery parser when parsing pseudo-XML syntax
        Returns:
        the next character from the input, or NUL at the end of the input
      • peekChar

        public char peekChar()
        Look ahead to see what the next character will be, without changing the current state
        Returns:
        the next character, or NUL at the end of the input.
      • incrementLineNumber

        public void incrementLineNumber(int offset)
        Increment the line number, making a record of where in the input string the newline character occurred.
        Parameters:
        offset - the place in the input string where the newline occurred
      • unreadChar

        public void unreadChar()
        Step back one character. If this steps back to a previous line, adjust the line number. If we have already read off the end of the input, do nothing.
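        The interplay of nextChar, peekChar and unreadChar can be illustrated with a small character cursor. This is a sketch under stated assumptions, not the Saxon code: CharCursorSketch is a hypothetical class, NUL is assumed to be the character with code 0, and "read off the end" is modelled by pushing the offset one past the input length.

```java
/** Hypothetical character cursor; NOT the Saxon implementation. */
final class CharCursorSketch {
    static final char NUL = 0;   // assumed sentinel returned at end of input

    private final String input;
    private int inputOffset = 0;
    private int lineNumber = 0;  // zero-based, as in the documented API

    CharCursorSketch(String input) {
        this.input = input;
    }

    /** Consume and return the next character, or NUL at the end of the input. */
    char nextChar() {
        if (inputOffset >= input.length()) {
            inputOffset = input.length() + 1;  // remember that we ran off the end
            return NUL;
        }
        char c = input.charAt(inputOffset++);
        if (c == '\n') {
            lineNumber++;
        }
        return c;
    }

    /** Return the next character without consuming it, or NUL at the end. */
    char peekChar() {
        return inputOffset >= input.length() ? NUL : input.charAt(inputOffset);
    }

    /** Step back one character, undoing a line increment if we cross a newline. */
    void unreadChar() {
        if (inputOffset > input.length() || inputOffset == 0) {
            return;  // already read off the end (or nothing read yet): do nothing
        }
        inputOffset--;
        if (input.charAt(inputOffset) == '\n') {
            lineNumber--;
        }
    }

    int getLineNumber() {
        return lineNumber;
    }
}
```

        The design point is that unreadChar must mirror whatever bookkeeping nextChar performs, otherwise line numbers drift after backtracking.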
      • copyTo

        public void copyTo(Tokenizer u)
        Checkpoint the state of this tokenizer so that unbounded lookahead is possible (or, restore the state of the tokenizer from a checkpoint)
        Parameters:
        u - When checkpointing, a Tokenizer used simply to hold the state so that it can be restored later. This tokenizer is not capable of active tokenizing because many of its variables are uninitialised. When restoring from a checkpoint, the original tokenizer whose state is to be restored.
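        The same method serves both directions: copying from the live tokenizer into a holder saves a checkpoint, and copying from the holder back into the live tokenizer restores it. A minimal sketch of that pattern follows; CheckpointSketch and its choice of four fields are hypothetical and much smaller than the real class's state.

```java
/** Hypothetical state holder showing the two-way use of copyTo. */
final class CheckpointSketch {
    String input;
    int inputOffset;
    int currentToken;
    String currentTokenValue;

    /** Copy this object's state into u; the direction decides save vs. restore. */
    void copyTo(CheckpointSketch u) {
        u.input = input;
        u.inputOffset = inputOffset;
        u.currentToken = currentToken;
        u.currentTokenValue = currentTokenValue;
    }

    /** Demonstration: save, speculate, then roll back. Returns the restored offset. */
    static int speculateAndRollBack(CheckpointSketch live) {
        CheckpointSketch saved = new CheckpointSketch();
        live.copyTo(saved);        // checkpoint
        live.inputOffset += 10;    // pretend to scan ahead speculatively
        live.currentTokenValue = "speculative";
        saved.copyTo(live);        // restore
        return live.inputOffset;
    }
}
```

        The holder object is never used for tokenizing; it only carries the saved values between the two copyTo calls.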
      • getLineNumber

        public int getLineNumber()
        Get the line number of the current token
        Returns:
        the line number. Line numbers reported by the tokenizer start at zero.
      • getColumnNumber

        public int getColumnNumber()
        Get the column number of the current token
        Returns:
        the column number. Column numbers reported by the tokenizer start at zero.
      • getLineNumber

        public int getLineNumber(int offset)
        Return the line number corresponding to a given offset in the expression
        Parameters:
        offset - the character offset in the expression
        Returns:
        the line number. Line and column numbers reported by the tokenizer start at zero.
      • getColumnNumber

        public int getColumnNumber(int offset)
        Return the column number corresponding to a given offset in the expression
        Parameters:
        offset - the character offset in the expression
        Returns:
        the column number. Line and column numbers reported by the tokenizer start at zero.
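        Mapping an offset to a zero-based line and column can be done from the newline positions recorded by incrementLineNumber. The sketch below shows one way to do it; LineIndexSketch and its scan helper are hypothetical, the offsets are assumed to be recorded in increasing order, and the real class may store and search them differently.

```java
/** Hypothetical offset-to-line/column index; NOT the Saxon implementation. */
final class LineIndexSketch {
    // Offsets of newline characters, assumed to be added in increasing order.
    private final java.util.List<Integer> newlineOffsets = new java.util.ArrayList<>();

    /** Record that a newline character occurred at the given offset. */
    void incrementLineNumber(int offset) {
        newlineOffsets.add(offset);
    }

    /** Zero-based line number: the count of newlines strictly before offset. */
    int getLineNumber(int offset) {
        int line = 0;
        for (int nl : newlineOffsets) {
            if (nl >= offset) {
                break;
            }
            line++;
        }
        return line;
    }

    /** Zero-based column: distance from the character after the last preceding newline. */
    int getColumnNumber(int offset) {
        int lastNewline = -1;  // behaves like a virtual newline before the input
        for (int nl : newlineOffsets) {
            if (nl >= offset) {
                break;
            }
            lastNewline = nl;
        }
        return offset - lastNewline - 1;
    }

    /** Helper for the example: record every newline in a string. */
    static LineIndexSketch scan(String input) {
        LineIndexSketch index = new LineIndexSketch();
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                index.incrementLineNumber(i);
            }
        }
        return index;
    }
}
```

        For the input "ab\ncd\ne", offset 4 (the character 'd') maps to line 1, column 1 in this sketch. Callers that report positions to users typically add 1 to both values.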