Class Tokenizer
Heavily modified in Saxon 13, based on the principles in the XQuery 4.0 specification.
Tokenization is independent of the syntactic context: the tokenizer no longer attempts to
distinguish whether '*', for example, is a multiplication operator, a wildcard,
or an occurrence indicator; it leaves that decision to the parser.
The main complication is "complex tokens": constructs such as string templates, string
constructors, and XQuery direct element constructors that contain embedded expressions,
always delimited by curly braces. The tokenizer recognises complex tokens by their initial
characters, and returns an appropriate Token object to the parser, which then starts
reading its content using calls on nextChar(). When an open brace
identifying an embedded expression is encountered, the parser calls
startEmbeddedExpression(), which has the effect that when the matching
closing brace is encountered, an "end of expression" is signalled back to the parser.
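The division of labour described above can be pictured with a toy model. This is only a sketch under stated assumptions, not Saxon's implementation: the class TemplateSketch, its parse() and nextToken() methods, and the "text:"/"expr:" labels are hypothetical, and nested braces are not handled. It shows a parser reading content character by character, switching to token mode at an opening brace, and resuming character mode after the matching closing brace.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model only: hypothetical names, not the Saxon Tokenizer API. */
final class TemplateSketch {
    static final char NUL = '\0';   // assumed end-of-input sentinel
    private final String input;
    private int pos = 0;
    private String mode = "chars";
    // Saved modes, one entry per embedded expression currently open
    private final Deque<String> saved = new ArrayDeque<>();

    TemplateSketch(String input) { this.input = input; }

    char nextChar() { return pos < input.length() ? input.charAt(pos++) : NUL; }

    /** Called just after an opening brace: save state, switch to token mode. */
    void startEmbeddedExpression() { saved.push(mode); mode = "tokens"; }

    /** Called at the closing brace: step over it and restore the saved state. */
    void endEmbeddedExpression() {
        if (pos >= input.length() || input.charAt(pos) != '}') {
            throw new IllegalStateException("expected '}' at offset " + pos);
        }
        pos++;
        mode = saved.pop();
    }

    /** Next whitespace-separated token, or null at the closing brace or end of input. */
    String nextToken() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        if (pos >= input.length() || input.charAt(pos) == '}') return null;
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != '}'
                && !Character.isWhitespace(input.charAt(pos))) pos++;
        return input.substring(start, pos);
    }

    /** The "parser": splits a template into literal text and embedded expressions. */
    static List<String> parse(String template) {
        TemplateSketch t = new TemplateSketch(template);
        List<String> parts = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        char c;
        while ((c = t.nextChar()) != NUL) {
            if (c != '{') { text.append(c); continue; }
            if (text.length() > 0) { parts.add("text:" + text); text.setLength(0); }
            t.startEmbeddedExpression();
            StringBuilder expr = new StringBuilder();
            String tok;
            while ((tok = t.nextToken()) != null) {
                if (expr.length() > 0) expr.append(' ');
                expr.append(tok);
            }
            t.endEmbeddedExpression();
            parts.add("expr:" + expr);
        }
        if (text.length() > 0) parts.add("text:" + text);
        return parts;
    }
}
```

In this model, TemplateSketch.parse("Hi {$a + 1}!") yields a literal part, an expression part, and a trailing literal part.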
-
Field Summary
Fields (Modifier and Type, Field: Description)
boolean allowSaxonExtensions: Flag to allow Saxon extensions
CLOSING_CURLY: Predicate indicating that tokenization should stop when a "matching" closing curly brace is found
currentToken: Token indicating end of tokenized input (not necessarily the end of the input string)
int currentTokenStartOffset: The position in the input string where the current token starts
END_OF_INPUT: Predicate indicating that tokenization should stop when the end of the input is reached
input: The string being parsed
int inputOffset: The current position within the input string
boolean isXQuery: Flag to indicate that this is XQuery as distinct from XPath
int languageLevel: XPath (or XQuery) language level: e.g. 2.0, 3.0, 3.1, 4.0 (times ten, as an integer)
static final char NUL -
Constructor Summary
Constructors
Tokenizer() -
Method Summary
Methods (Modifier and Type, Method: Description)
checkPoint(): Construct a new tokenizer that includes a snapshot of the current state, so it can be restored later.
currentName(): Get the string value of the current name token
void endEmbeddedExpression(): Indicate that we have finished parsing an embedded expression (within curly braces).
finishOnKeyword(String keyword): Get a predicate to indicate that tokenization should stop when a particular keyword is encountered.
int getColumnNumber(): Get the column number of the current token
int getColumnNumber(int offset): Return the column number corresponding to a given offset in the expression
int getLineNumber(): Get the line number of the current token
int getLineNumber(int offset): Return the line number corresponding to a given offset in the expression
getPrecedingToken
void incrementLineNumber(int offset): Increment the line number, making a record of where in the input string the newline character occurred.
void lookAhead(): Look ahead by one token.
void next(): Get the next token from the input expression.
char nextChar(): Read the next character directly.
peekAhead(): Peek ahead at the next token
char peekChar(): Look ahead to see what the next character will be, without changing the current state
char peekChar2(): Look ahead to see what the next character but one will be, without changing the current state
boolean readDirectCommentConstructor()
boolean readDirectElementConstructor()
boolean readDirectPIConstructor()
void reposition(int offset): Reposition for reading characters.
void restart(): Restart tokenisation after, for example, a direct element constructor
void rollbackTo(Tokenizer checkPoint): Restore the state of this tokenizer from a snapshot
void setFinishCondition(Predicate<Tokenizer> condition): Set the condition that is used to decide when tokenization is complete
void startEmbeddedExpression(): Indicate that we are starting to parse an embedded expression (enclosed in braces) within content that is being read character-by-character.
void tokenize(input, start, end): Prepare a string for tokenization.
void unreadChar(): Step back one character.
-
Field Details
-
NUL
public static final char NUL
-
END_OF_INPUT
Predicate indicating that tokenization should stop when the end of the input is reached -
CLOSING_CURLY
Predicate indicating that tokenization should stop when a "matching" closing curly brace is found -
currentToken
Token indicating end of tokenized input (not necessarily the end of the input string) -
currentTokenStartOffset
public int currentTokenStartOffset
The position in the input string where the current token starts -
input
The string being parsed -
inputOffset
public int inputOffset
The current position within the input string -
isXQuery
public boolean isXQuery
Flag to indicate that this is XQuery as distinct from XPath -
languageLevel
public int languageLevel
XPath (or XQuery) language level: e.g. 2.0, 3.0, 3.1, 4.0 (times ten, as an integer) -
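The "times ten" encoding can be illustrated with a small worked example. The helper class LanguageLevelSketch and its encode/decode methods are hypothetical and are not part of the Tokenizer class; the sketch assumes a single-digit minor version.

```java
/** Hypothetical helper illustrating the "times ten" encoding; not part of Saxon. */
final class LanguageLevelSketch {
    /** Convert a version string such as "3.1" to its integer form, here 31. */
    static int encode(String version) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        // Assumption: the minor version, if present, is a single digit
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major * 10 + minor;
    }

    /** Convert the integer form back to a dotted version string. */
    static String decode(int level) {
        return (level / 10) + "." + (level % 10);
    }
}
```

With this encoding, a comparison such as languageLevel >= 40 tests for 4.0 or later using plain integer arithmetic.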
allowSaxonExtensions
public boolean allowSaxonExtensions
Flag to allow Saxon extensions
-
-
Constructor Details
-
Tokenizer
public Tokenizer()
-
-
Method Details
-
finishOnKeyword
Get a predicate to indicate that tokenization should stop when a particular keyword is encountered. Used specifically in Gizmo.
Parameters:
keyword - the keyword that signals the end of an expression
Returns:
a suitable predicate
-
setFinishCondition
Set the condition that is used to decide when tokenization is complete
Parameters:
condition - the completion condition. This condition is tested during lookAhead() processing. The call on lookAhead() first reads past all whitespace and comments, then tests the finish condition, and if the finish condition is satisfied at that point, it sets the next (pending) token to Token.EOF.
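The order of operations described for lookAhead() (skip whitespace, test the finish condition, then report end of input if it holds) can be sketched in a self-contained way. This is an illustrative model only: the class FinishConditionSketch, the EOF string marker, and the tokensUntil() helper are hypothetical, the model splits tokens on whitespace, and it does not handle comments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of a pluggable finish condition; not the Saxon Tokenizer. */
final class FinishConditionSketch {
    static final String EOF = "<eof>";   // stand-in for an end-of-input token
    final String input;
    int inputOffset = 0;
    // Default: never finish early (exhausted input is checked separately below)
    private Predicate<FinishConditionSketch> finishCondition = t -> false;

    FinishConditionSketch(String input) { this.input = input; }

    void setFinishCondition(Predicate<FinishConditionSketch> condition) {
        this.finishCondition = condition;
    }

    /** Predicate that holds when the given keyword starts at the current offset. */
    static Predicate<FinishConditionSketch> finishOnKeyword(String keyword) {
        return t -> {
            if (!t.input.startsWith(keyword, t.inputOffset)) return false;
            int end = t.inputOffset + keyword.length();
            // Require a word boundary so that "with" does not match "without"
            return end >= t.input.length() || !Character.isLetterOrDigit(t.input.charAt(end));
        };
    }

    /** Skip whitespace, test the finish condition, then read one token. */
    String lookAhead() {
        while (inputOffset < input.length()
                && Character.isWhitespace(input.charAt(inputOffset))) inputOffset++;
        if (inputOffset >= input.length() || finishCondition.test(this)) return EOF;
        int start = inputOffset;
        while (inputOffset < input.length()
                && !Character.isWhitespace(input.charAt(inputOffset))) inputOffset++;
        return input.substring(start, inputOffset);
    }

    /** Collect tokens until the keyword (or the end of the input) is reached. */
    static List<String> tokensUntil(String input, String keyword) {
        FinishConditionSketch t = new FinishConditionSketch(input);
        t.setFinishCondition(finishOnKeyword(keyword));
        List<String> out = new ArrayList<>();
        String tok;
        while (!(tok = t.lookAhead()).equals(EOF)) out.add(tok);
        return out;
    }
}
```

Because the condition is tested before a token is read, the offset is left at the start of the terminating keyword, so a caller can continue processing from there.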
-
tokenize
Prepare a string for tokenization. The actual tokens are obtained by calls on next()
Parameters:
input - the string to be tokenized
start - start point within the string
end - end point within the string (last character not read): -1 means end of string
Throws:
XPathException - if a lexical error occurs, e.g. unmatched string quotes
-
restart
Restart tokenisation after, for example, a direct element constructor
Throws:
XPathException
-
next
Get the next token from the input expression. The type of token is returned in the currentToken variable, and the string value of the token in currentTokenValue.
Throws:
XPathException - if a lexical error is detected
-
peekAhead
Peek ahead at the next token
Returns:
the identifier of the token that is next in the queue.
-
currentName
Get the string value of the current name token
Returns:
the string value of the current token, assuming it is a name.
Throws:
ClassCastException - if the current token is not a NameToken
-
lookAhead
Look ahead by one token. This method does the real tokenization work. The method is normally called internally, but the XQuery parser also calls it to resume normal tokenization after dealing with pseudo-XML syntax.
Throws:
XPathException - if a lexical error occurs
-
startEmbeddedExpression
Indicate that we are starting to parse an embedded expression (enclosed in braces) within content that is being read character-by-character. The current position must be immediately after an opening brace. The current tokenization status is saved on a stack, and a new tokenization is started at the current position, with the termination condition set to be the matching closing brace.
Throws:
XPathException - if, for example, a malformed comment is found
-
endEmbeddedExpression
Indicate that we have finished parsing an embedded expression (within curly braces). The input position must be the closing curly brace, and it is advanced to the next following character. The tokenization state is reset from the saved stack.
Throws:
XPathException
-
checkPoint
Construct a new tokenizer that includes a snapshot of the current state, so it can be restored later. This mechanism is used to achieve a limited backtracking capability and is not fully general.
Returns:
a snapshot copy of this tokenizer
-
rollbackTo
Restore the state of this tokenizer from a snapshot
Parameters:
checkPoint - the snapshot copy made using the checkPoint() mechanism.
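The snapshot-and-restore idea behind checkPoint() and rollbackTo() can be sketched as follows. This is a minimal model under stated assumptions: CheckpointSketch is a hypothetical class that tracks only an offset and a current token string and splits tokens on whitespace, whereas the real tokenizer carries more state.

```java
/** Toy model of limited backtracking via snapshots; not the Saxon Tokenizer. */
final class CheckpointSketch {
    final String input;
    int inputOffset = 0;
    String currentToken = null;   // null means no token (start or end of input)

    CheckpointSketch(String input) { this.input = input; }

    /** Copy the mutable state into a new object that can be kept for later. */
    CheckpointSketch checkPoint() {
        CheckpointSketch copy = new CheckpointSketch(input);
        copy.inputOffset = inputOffset;
        copy.currentToken = currentToken;
        return copy;
    }

    /** Restore the mutable state from a snapshot made earlier. */
    void rollbackTo(CheckpointSketch checkPoint) {
        inputOffset = checkPoint.inputOffset;
        currentToken = checkPoint.currentToken;
    }

    /** Read the next whitespace-separated token into currentToken. */
    void next() {
        while (inputOffset < input.length()
                && Character.isWhitespace(input.charAt(inputOffset))) inputOffset++;
        if (inputOffset >= input.length()) { currentToken = null; return; }
        int start = inputOffset;
        while (inputOffset < input.length()
                && !Character.isWhitespace(input.charAt(inputOffset))) inputOffset++;
        currentToken = input.substring(start, inputOffset);
    }
}
```

A caller would typically take a snapshot, read ahead speculatively, and call rollbackTo() if the speculative parse does not work out; the next call on next() then re-reads the same tokens.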
-
reposition
public void reposition(int offset)
Reposition for reading characters. Needs care! -
nextChar
public char nextChar()
Read the next character directly. Used by the XQuery parser when parsing pseudo-XML syntax, and also when processing string templates.
Returns:
the next character from the input, or NUL at the end of the input
-
peekChar
public char peekChar()
Look ahead to see what the next character will be, without changing the current state
Returns:
the next character, or NUL at the end of the input.
-
peekChar2
public char peekChar2()
Look ahead to see what the next character but one will be, without changing the current state
Returns:
the next character but one, or NUL at the end of the input.
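The relationship between nextChar(), peekChar() and peekChar2() can be sketched with a small stand-alone reader. The class CharReaderSketch is hypothetical, and the value '\0' for NUL is an assumption made for the example; the actual constant value is not shown on this page.

```java
/** Toy character reader with one- and two-character lookahead; not the Saxon Tokenizer. */
final class CharReaderSketch {
    static final char NUL = '\0';   // assumed sentinel value for "no more input"
    private final String input;
    private int inputOffset = 0;

    CharReaderSketch(String input) { this.input = input; }

    /** Consume and return the next character, or NUL at the end of the input. */
    char nextChar() {
        return inputOffset < input.length() ? input.charAt(inputOffset++) : NUL;
    }

    /** Return the next character without consuming it. */
    char peekChar() {
        return inputOffset < input.length() ? input.charAt(inputOffset) : NUL;
    }

    /** Return the next character but one without consuming anything. */
    char peekChar2() {
        return inputOffset + 1 < input.length() ? input.charAt(inputOffset + 1) : NUL;
    }

    /** Example use: having just read '<', check whether "!-" follows. */
    boolean looksLikeCommentStart() {
        return peekChar() == '!' && peekChar2() == '-';
    }
}
```

Two characters of lookahead are enough in this model to tell a comment opener from an ordinary element start tag before committing to either branch.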
-
incrementLineNumber
public void incrementLineNumber(int offset)
Increment the line number, making a record of where in the input string the newline character occurred.
Parameters:
offset - the place in the input string where the newline occurred
-
unreadChar
public void unreadChar()
Step back one character. If this steps back to a previous line, adjust the line number. If we have already read off the end of the input, do nothing. -
getLineNumber
public int getLineNumber()
Get the line number of the current token
Returns:
the line number. Line numbers reported by the tokenizer start at zero.
-
getColumnNumber
public int getColumnNumber()
Get the column number of the current token
Returns:
the column number. Column numbers reported by the tokenizer start at zero.
-
getLineNumber
public int getLineNumber(int offset)
Return the line number corresponding to a given offset in the expression
Parameters:
offset - the byte offset in the expression
Returns:
the line number. Line and column numbers reported by the tokenizer start at zero.
-
getColumnNumber
public int getColumnNumber(int offset)
Return the column number corresponding to a given offset in the expression
Parameters:
offset - the byte offset in the expression
Returns:
the column number. Line and column numbers reported by the tokenizer start at zero.
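One way to derive zero-based line and column numbers from an offset is to record where each newline occurred and count the recorded offsets that precede the position of interest. The sketch below assumes that approach; LineIndexSketch and its scan() helper are hypothetical, and the page does not state how the tokenizer stores this information internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy line/column index built from recorded newline offsets; not the Saxon Tokenizer. */
final class LineIndexSketch {
    // Offsets of newline characters, assumed to be recorded in ascending order
    private final List<Integer> newlineOffsets = new ArrayList<>();

    /** Record that a newline character occurred at the given offset. */
    void incrementLineNumber(int offset) { newlineOffsets.add(offset); }

    /** Zero-based line number: the count of newlines before the offset. */
    int getLineNumber(int offset) {
        int line = 0;
        for (int nl : newlineOffsets) {
            if (nl < offset) line++; else break;
        }
        return line;
    }

    /** Zero-based column number: distance from the character after the last newline. */
    int getColumnNumber(int offset) {
        int lastNewline = -1;
        for (int nl : newlineOffsets) {
            if (nl < offset) lastNewline = nl; else break;
        }
        return offset - lastNewline - 1;
    }

    /** Build an index by scanning a whole input string for newlines. */
    static LineIndexSketch scan(String input) {
        LineIndexSketch index = new LineIndexSketch();
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') index.incrementLineNumber(i);
        }
        return index;
    }
}
```

For the input "ab\ncd\ne", the character 'd' at offset 4 is on line 1 at column 1 in this model, and 'e' at offset 6 is on line 2 at column 0. A linear scan keeps the sketch short; a binary search over the recorded offsets would be the natural refinement for long inputs.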
-
readDirectPIConstructor
public boolean readDirectPIConstructor() -
getPrecedingToken
-
readDirectCommentConstructor
public boolean readDirectCommentConstructor() -
readDirectElementConstructor
public boolean readDirectElementConstructor()
-