net.sf.saxon.expr.parser.Tokenizer

public final class Tokenizer extends Object

Tokenizer for XPath and XQuery expressions.

Heavily modified in Saxon 13, based on the principles in the XQuery 4.0 specification. Tokenization is independent of the syntactic context: the tokenizer no longer attempts to distinguish whether '*', for example, is a multiplication operator, a wildcard, or an occurrence indicator; it leaves that decision to the parser.

The main complication is "complex tokens": constructs such as string templates, string constructors, and XQuery direct element constructors that contain embedded expressions, always delimited by curly braces. The tokenizer recognises complex tokens by their initial characters and returns an appropriate Token object to the parser, which then reads the token's content using calls on nextChar(). When an opening brace identifying an embedded expression is encountered, the parser calls startEmbeddedExpression(), which has the effect that when the matching closing brace is encountered, an "end of expression" is signalled back to the parser.
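
The division of labour described above can be illustrated with a toy model. The following is a minimal sketch, not Saxon's implementation: the class TemplateScanDemo and its split method are hypothetical names, and brace matching is done with a simple depth counter that ignores string literals and comments inside the embedded expression.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Saxon code): split content into literal text and embedded expressions. */
final class TemplateScanDemo {

    /** Returns parts prefixed "text:" or "expr:"; throws if an opening brace is left unmatched. */
    static List<String> split(String content) {
        List<String> parts = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i++);            // content is read character by character
            if (c != '{') {
                literal.append(c);
                continue;
            }
            if (literal.length() > 0) {
                parts.add("text:" + literal);
                literal.setLength(0);
            }
            // An embedded expression starts immediately after the opening brace.
            int start = i;
            int depth = 1;
            while (i < content.length() && depth > 0) {
                char e = content.charAt(i++);
                if (e == '{') depth++;
                else if (e == '}') depth--;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("Unmatched '{' at offset " + (start - 1));
            }
            // The matching closing brace plays the role of "end of expression".
            parts.add("expr:" + content.substring(start, i - 1));
        }
        if (literal.length() > 0) {
            parts.add("text:" + literal);
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("Hello {$name}, you are {map{'a':1}?a} today"));
    }
}
```

In the real class the embedded expression is tokenized normally rather than copied as a substring; the sketch only shows where the boundaries fall.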

Field Summary

Fields

Modifier and Type

Field

Description

boolean

allowSaxonExtensions

Flag to allow Saxon extensions

static Predicate<Tokenizer>

CLOSING_CURLY

Predicate indicating that tokenization should stop when a "matching" closing curly brace is found

Token

currentToken

Token indicating end of tokenized input (not necessarily the end of the input string)

int

currentTokenStartOffset

The position in the input string where the current token starts

static Predicate<Tokenizer>

END_OF_INPUT

Predicate indicating that tokenization should stop when the end of the input is reached

String

input

The string being parsed

int

inputOffset

The current position within the input string

boolean

isXQuery

Flag to indicate that this is XQuery as distinct from XPath

int

languageLevel

XPath (or XQuery) language level: e.g. 2.0, 3.0, 3.1, 4.0 (times ten, as an integer)

static final char

NUL

Constructor Summary

Constructors

Constructor

Description

Tokenizer()

Method Summary

Modifier and Type

Method

Description

Tokenizer

checkPoint()

Construct a new tokenizer that includes a snapshot of the current state, so it can be restored later.

String

currentName()

Get the string value of the current name token

void

endEmbeddedExpression()

Indicate that we have finished parsing an embedded expression (within curly braces).

static Predicate<Tokenizer>

finishOnKeyword(String keyword)

Get a predicate to indicate that tokenization should stop when a particular keyword is encountered.

int

getColumnNumber()

Get the column number of the current token

int

getColumnNumber(int offset)

Return the column number corresponding to a given offset in the expression

int

getLineNumber()

Get the line number of the current token

int

getLineNumber(int offset)

Return the line number corresponding to a given offset in the expression

Token

getPrecedingToken()

void

incrementLineNumber(int offset)

Increment the line number, making a record of where in the input string the newline character occurred.

void

lookAhead()

Look ahead by one token.

void

next()

Get the next token from the input expression.

char

nextChar()

Read the next character directly.

Token

peekAhead()

Peek ahead at the next token

char

peekChar()

Look ahead to see what the next character will be, without changing the current state

char

peekChar2()

Look ahead to see what the next character but one will be, without changing the current state

boolean

readDirectCommentConstructor()

boolean

readDirectElementConstructor()

boolean

readDirectPIConstructor()

void

reposition(int offset)

Reposition for reading characters.

void

restart()

Restart tokenisation after, for example, a direct element constructor

void

rollbackTo(Tokenizer checkPoint)

Restore the state of this tokenizer from a snapshot

void

setFinishCondition(Predicate<Tokenizer> condition)

Set the condition that is used to decide when tokenization is complete

void

startEmbeddedExpression()

Indicate that we are starting to parse an embedded expression (enclosed in braces) within content that is being read character-by-character.

void

tokenize(String input, int start, int end)

Prepare a string for tokenization.

void

unreadChar()

Step back one character.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- NUL
  
  public static final char NUL
  See Also:
  
  Constant Field Values
- END_OF_INPUT
  
  public static Predicate<Tokenizer> END_OF_INPUT
  
  Predicate indicating that tokenization should stop when the end of the input is reached
- CLOSING_CURLY
  
  public static Predicate<Tokenizer> CLOSING_CURLY
  
  Predicate indicating that tokenization should stop when a "matching" closing curly brace is found
- currentToken
  
  public Token currentToken
  
  Token indicating end of tokenized input (not necessarily the end of the input string)
- currentTokenStartOffset
  
  public int currentTokenStartOffset
  
  The position in the input string where the current token starts
- input
  
  public String input
  
  The string being parsed
- inputOffset
  
  public int inputOffset
  
  The current position within the input string
- isXQuery
  
  public boolean isXQuery
  
  Flag to indicate that this is XQuery as distinct from XPath
- languageLevel
  
  public int languageLevel
  
  XPath (or XQuery) language level: e.g. 2.0, 3.0, 3.1, 4.0 (times ten, as an integer)
- allowSaxonExtensions
  
  public boolean allowSaxonExtensions
  
  Flag to allow Saxon extensions

Constructor Details
- Tokenizer
  
  public Tokenizer()

Method Details
- finishOnKeyword
  
  public static Predicate<Tokenizer> finishOnKeyword(String keyword)
  
  Get a predicate to indicate that tokenization should stop when a particular keyword is encountered. Used specifically in Gizmo.
  
  Parameters:
  
  keyword - the keyword that signals the end of an expression
  
  Returns:
  
  a suitable predicate
- setFinishCondition
  
  public void setFinishCondition(Predicate<Tokenizer> condition)
  
  Set the condition that is used to decide when tokenization is complete
  
  Parameters:
  
  condition - the completion condition. This condition is tested during lookAhead() processing. The call on lookAhead() first reads past all whitespace and comments, then tests the finish condition, and if the finish condition is satisfied at that point, it sets the next (pending) token to Token.EOF.
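  
  The order of operations described for the finish condition (skip whitespace, then test the predicate, then report end of input) can be sketched with a toy scanner. This is an illustration under stated assumptions, not Saxon code: FinishConditionDemo, nextWord and the naive prefix match on the keyword are all hypothetical.
  
  ```java
  import java.util.function.Predicate;
  
  /** Toy scanner (hypothetical, not Saxon code) whose stopping rule is a pluggable predicate. */
  final class FinishConditionDemo {
      /** Stop only when the input is exhausted. */
      static final Predicate<FinishConditionDemo> END_OF_INPUT =
              s -> s.offset >= s.input.length();
  
      /** Stop when the keyword starts at the current position (naive prefix match) or at end of input. */
      static Predicate<FinishConditionDemo> finishOnKeyword(String keyword) {
          return s -> s.offset >= s.input.length() || s.input.startsWith(keyword, s.offset);
      }
  
      final String input;
      int offset = 0;
      Predicate<FinishConditionDemo> finishCondition = END_OF_INPUT;
  
      FinishConditionDemo(String input) {
          this.input = input;
      }
  
      /** Returns the next whitespace-delimited word, or null once the finish condition holds. */
      String nextWord() {
          // Skip whitespace first, then test the finish condition.
          while (offset < input.length() && Character.isWhitespace(input.charAt(offset))) {
              offset++;
          }
          if (finishCondition.test(this)) {
              return null;                         // stands in for an end-of-input token
          }
          int start = offset;
          while (offset < input.length() && !Character.isWhitespace(input.charAt(offset))) {
              offset++;
          }
          return input.substring(start, offset);
      }
  }
  ```
  
  With the condition set to finishOnKeyword("return"), the toy scanner reports end of input on reaching the keyword and leaves the remaining text unread, so a caller could resume from that offset.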
- tokenize
  
  public void tokenize(String input, int start, int end) throws XPathException
  
  Prepare a string for tokenization. The actual tokens are obtained by calls on next()
  
  Parameters:
  
  input - the string to be tokenized
  
  start - start point within the string
  
  end - end point within the string (last character not read): -1 means end of string
  
  Throws:
  
  XPathException - if a lexical error occurs, e.g. unmatched string quotes
- restart
  
  public void restart() throws XPathException
  
  Restart tokenisation after, for example, a direct element constructor
  
  Throws:
  
  XPathException
- next
  
  public void next() throws XPathException
  
  Get the next token from the input expression. The type of token is returned in the currentToken variable, the string value of the token in currentTokenValue.
  
  Throws:
  
  XPathException - if a lexical error is detected
- peekAhead
  
  public Token peekAhead()
  
  Peek ahead at the next token
  
  Returns:
  
  the identifier of the token that is next in the queue.
- currentName
  
  public String currentName()
  
  Get the string value of the current name token
  
  Returns:
  
  the string value of the current token, assuming it is a name.
  
  Throws:
  
  ClassCastException - if the current token is not a NameToken
- lookAhead
  
  public void lookAhead() throws XPathException
  
  Look ahead by one token. This method does the real tokenization work. The method is normally called internally, but the XQuery parser also calls it to resume normal tokenization after dealing with pseudo-XML syntax.
  
  Throws:
  
  XPathException - if a lexical error occurs
- startEmbeddedExpression
  
  public void startEmbeddedExpression() throws XPathException
  
  Indicate that we are starting to parse an embedded expression (enclosed in braces) within content that is being read character-by-character. The current position must be immediately after an opening brace. The current tokenization status is saved on a stack, and a new tokenization is started at the current position, with the termination condition set to be the matching closing brace.
  
  Throws:
  
  XPathException - if, for example, a malformed comment is found
- endEmbeddedExpression
  
  public void endEmbeddedExpression() throws XPathException
  
  Indicate that we have finished parsing an embedded expression (within curly braces). The input position must be the closing curly brace, and it is advanced to the next following character. The tokenization state is reset from the saved stack.
  
  Throws:
  
  XPathException
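  
  The save-and-restore behaviour of startEmbeddedExpression() and endEmbeddedExpression() can be pictured with a small stack-based model. This is a sketch only: EmbeddedExpressionDemo is a hypothetical class, the saved state is reduced to a single boolean, and the preconditions are enforced with IllegalStateException, which is an assumption rather than documented behaviour.
  
  ```java
  import java.util.ArrayDeque;
  import java.util.Deque;
  
  /** Toy model (hypothetical, not Saxon code) of saving and restoring scanner state on a stack. */
  final class EmbeddedExpressionDemo {
      final String input;
      int offset;
      boolean stopAtClosingBrace = false;           // stands in for the current finish condition
      private final Deque<Boolean> saved = new ArrayDeque<>();
  
      EmbeddedExpressionDemo(String input, int offset) {
          this.input = input;
          this.offset = offset;
      }
  
      /** Precondition: the current position is immediately after an opening brace. */
      void startEmbeddedExpression() {
          if (offset == 0 || input.charAt(offset - 1) != '{') {
              throw new IllegalStateException("Not positioned after '{'");
          }
          saved.push(stopAtClosingBrace);           // save the outer state
          stopAtClosingBrace = true;                // the inner scan ends at the matching '}'
      }
  
      /** Precondition: the current position is at the closing brace. */
      void endEmbeddedExpression() {
          if (offset >= input.length() || input.charAt(offset) != '}') {
              throw new IllegalStateException("Not positioned at '}'");
          }
          offset++;                                 // advance past the closing brace
          stopAtClosingBrace = saved.pop();         // restore the outer state
      }
  }
  ```
  
  Using a stack rather than a single saved value is what lets embedded expressions nest, for example an element constructor inside an enclosed expression inside another element constructor.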
- checkPoint
  
  public Tokenizer checkPoint()
  
  Construct a new tokenizer that includes a snapshot of the current state, so it can be restored later. This mechanism is used to achieve a limited backtracking capability and is not fully general.
  
  Returns:
  
  a snapshot copy of this tokenizer
- rollbackTo
  
  public void rollbackTo(Tokenizer checkPoint)
  
  Restore the state of this tokenizer from a snapshot
  
  Parameters:
  
  checkPoint - the snapshot copy made using the checkPoint() mechanism.
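  
  The checkPoint() and rollbackTo() pair supports speculative parsing: take a snapshot, try one interpretation, and restore the snapshot if it fails. The sketch below shows the pattern on a toy scanner. CheckpointDemo and tryConsume are hypothetical, and the state copied here (input, offset, line number) is only an assumption about what such a snapshot might contain.
  
  ```java
  /** Toy scanner (hypothetical, not Saxon code) showing snapshot-based backtracking. */
  final class CheckpointDemo {
      String input;
      int offset;
      int lineNumber;
  
      CheckpointDemo(String input) {
          this.input = input;
      }
  
      /** Returns a new object holding a copy of the current state. */
      CheckpointDemo checkPoint() {
          CheckpointDemo copy = new CheckpointDemo(input);
          copy.offset = offset;
          copy.lineNumber = lineNumber;
          return copy;
      }
  
      /** Copies the saved state back into this scanner. */
      void rollbackTo(CheckpointDemo snapshot) {
          input = snapshot.input;
          offset = snapshot.offset;
          lineNumber = snapshot.lineNumber;
      }
  
      /** Speculatively consumes a keyword; on a mismatch the position is restored. */
      boolean tryConsume(String keyword) {
          CheckpointDemo snapshot = checkPoint();
          for (int i = 0; i < keyword.length(); i++) {
              if (offset >= input.length() || input.charAt(offset) != keyword.charAt(i)) {
                  rollbackTo(snapshot);
                  return false;
              }
              offset++;
          }
          return true;
      }
  }
  ```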
- reposition
  
  public void reposition(int offset)
  
  Reposition for reading characters. Needs care!
- nextChar
  
  public char nextChar()
  
  Read the next character directly. Used by the XQuery parser when parsing pseudo-XML syntax, and also when processing string templates
  
  Returns:
  
  the next character from the input, or NUL at the end of the input
- peekChar
  
  public char peekChar()
  
  Look ahead to see what the next character will be, without changing the current state
  
  Returns:
  
  the next character, or NUL at the end of the input.
- peekChar2
  
  public char peekChar2()
  
  Look ahead to see what the next character but one will be, without changing the current state
  
  Returns:
  
  the next character but one, or NUL at the end of the input.
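  
  The three character-level methods (nextChar, peekChar, peekChar2) differ only in how far they look and whether they advance. A minimal sketch of that contract follows. CharReaderDemo is a hypothetical class, and the sentinel value used for NUL here ('\0') is an assumption, since the constant's value is not shown on this page.
  
  ```java
  /** Toy character reader (hypothetical, not Saxon code) with one- and two-character lookahead. */
  final class CharReaderDemo {
      static final char NUL = '\0';   // assumed sentinel meaning "no more input"
      final String input;
      int offset = 0;
  
      CharReaderDemo(String input) {
          this.input = input;
      }
  
      /** Returns the character at the current position and advances, or NUL at the end. */
      char nextChar() {
          return offset < input.length() ? input.charAt(offset++) : NUL;
      }
  
      /** Returns the character at the current position without advancing, or NUL at the end. */
      char peekChar() {
          return offset < input.length() ? input.charAt(offset) : NUL;
      }
  
      /** Returns the character after the next one without advancing, or NUL if there is none. */
      char peekChar2() {
          return offset + 1 < input.length() ? input.charAt(offset + 1) : NUL;
      }
  }
  ```
  
  Two characters of lookahead are enough for a caller to tell apart openers that share a first character, such as distinguishing what follows a '<' before committing to read it.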
- incrementLineNumber
  
  public void incrementLineNumber(int offset)
  
  Increment the line number, making a record of where in the input string the newline character occurred.
  
  Parameters:
  
  offset - the place in the input string where the newline occurred
- unreadChar
  
  public void unreadChar()
  
  Step back one character. If this steps back to a previous line, adjust the line number. If we have already read off the end of the input, do nothing.
- getLineNumber
  
  public int getLineNumber()
  
  Get the line number of the current token
  
  Returns:
  
  the line number. Line numbers reported by the tokenizer start at zero.
- getColumnNumber
  
  public int getColumnNumber()
  
  Get the column number of the current token
  
  Returns:
  
  the column number. Column numbers reported by the tokenizer start at zero.
- getLineNumber
  
  public int getLineNumber(int offset)
  
  Return the line number corresponding to a given offset in the expression
  
  Parameters:
  
  offset - the offset in the expression, as a character position within the input string
  
  Returns:
  
  the line number. Line and column numbers reported by the tokenizer start at zero.
- getColumnNumber
  
  public int getColumnNumber(int offset)
  
  Return the column number corresponding to a given offset in the expression
  
  Parameters:
  
  offset - the offset in the expression, as a character position within the input string
  
  Returns:
  
  the column number. Line and column numbers reported by the tokenizer start at zero.
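  
  Zero-based line and column numbers can be derived from an offset once the positions of the newline characters are known, which is the record that incrementLineNumber(int) is described as keeping. The following sketch shows one way to do the lookup. LineIndexDemo is hypothetical, it scans the input up front instead of recording newlines incrementally, and its treatment of an offset that points at a newline character itself is an arbitrary choice, not documented behaviour.
  
  ```java
  import java.util.ArrayList;
  import java.util.List;
  
  /** Toy position tracker (hypothetical, not Saxon code): zero-based line and column from an offset. */
  final class LineIndexDemo {
      private final List<Integer> newlineOffsets = new ArrayList<>();
  
      /** Records the offset of every newline in the input, in increasing order. */
      LineIndexDemo(String input) {
          for (int i = 0; i < input.length(); i++) {
              if (input.charAt(i) == '\n') {
                  newlineOffsets.add(i);
              }
          }
      }
  
      /** Zero-based line: the number of recorded newlines strictly before the offset. */
      int getLineNumber(int offset) {
          int line = 0;
          while (line < newlineOffsets.size() && newlineOffsets.get(line) < offset) {
              line++;
          }
          return line;
      }
  
      /** Zero-based column: the distance from the character that follows the preceding newline. */
      int getColumnNumber(int offset) {
          int line = getLineNumber(offset);
          return line == 0 ? offset : offset - newlineOffsets.get(line - 1) - 1;
      }
  }
  ```
  
  Because both values start at zero, a caller producing messages for people would typically add one to each before display.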
- readDirectPIConstructor
  
  public boolean readDirectPIConstructor()
- getPrecedingToken
  
  public Token getPrecedingToken()
- readDirectCommentConstructor
  
  public boolean readDirectCommentConstructor()
- readDirectElementConstructor
  
  public boolean readDirectElementConstructor()
