Saxonica.com

Pull Processing

Saxon 8.3 contains some new classes to support a pull pipeline. At present this should be regarded as preliminary and experimental: it offers some new ways of supplying input to Saxon and reading results from Saxon, but as yet plays no significant role within the product architecture. Interfaces are likely to change.

A new interface, PullProvider, is included. This interface is modelled on the XMLStreamReader interface that forms part of StAX, but modified to use Saxon concepts such as NamePools and SequenceIterators. It allows a caller to read an XML document by a sequence of calls on the method next(): each such call advances the position of a cursor and makes information available about the current context. Typically, next() reports that it has read the start of an element, a text node, a comment, the end of an element node, and so on. Attributes and namespaces are not reported as events, but information about them is available to the caller immediately after the START_ELEMENT event is reported.
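The cursor model described above can be sketched with the standard StAX XMLStreamReader that ships with the JDK, on which PullProvider is modelled. This is only an illustration of the reading pattern, not Saxon's own interface; the class name PullCursorDemo and the method events are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: uses JDK StAX, not Saxon's PullProvider.
public class PullCursorDemo {

    /** Reads a document through repeated next() calls and logs each event. */
    public static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as a single CHARACTERS event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                switch (reader.next()) { // each call advances the cursor by one event
                    case XMLStreamConstants.START_ELEMENT: {
                        StringBuilder sb = new StringBuilder("start:" + reader.getLocalName());
                        // Attributes are not events; they are queried on the current element.
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            sb.append(" @").append(reader.getAttributeLocalName(i))
                              .append('=').append(reader.getAttributeValue(i));
                        }
                        log.add(sb.toString());
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS:
                        log.add("text:" + reader.getText());
                        break;
                    case XMLStreamConstants.COMMENT:
                        log.add("comment:" + reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        log.add("end:" + reader.getLocalName());
                        break;
                    default:
                        break; // other event types are ignored in this sketch
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        events("<doc id=\"1\"><!--note-->hello</doc>").forEach(System.out::println);
    }
}
```

Note that the attribute id is read while the cursor is positioned on the start of doc; it never appears as an event of its own.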

The PullProvider can in fact read any XPath sequence, containing nodes and atomic values. When a node is encountered, the client can "drill down" to get the events within the subtree rooted at that node. (Alternatively, the client can skip the node and move on.) It is not possible to navigate in arbitrary directions from the node, because the node may have no real existence in memory: this is a streaming interface.
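The choice between drilling down and skipping can be illustrated, by analogy only, with the JDK's StAX reader. StAX has no notion of atomic values or of a dedicated skip operation, so the sketch below skips a subtree by counting nesting depth; the names SkipSubtreeDemo, skipSubtree and visited are hypothetical and say nothing about how PullProvider exposes this.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: an analogy using JDK StAX, not Saxon's API.
public class SkipSubtreeDemo {

    /** With the cursor on a START_ELEMENT, consumes events up to the matching END_ELEMENT. */
    static void skipSubtree(XMLStreamReader reader) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    /** Lists the elements visited, never descending into elements named skipName. */
    public static List<String> visited(String xml, String skipName) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                    if (reader.getLocalName().equals(skipName)) {
                        skipSubtree(reader); // move on without looking inside
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(visited("<a><b><c/><d/></b><e/></a>", "b"));
    }
}
```

Because the reader only moves forward, the skipped content cannot be revisited afterwards, which mirrors the streaming restriction described above.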

A class PullSource is available that wraps a PullProvider as a JAXP Source object. This allows any PullProvider to be supplied as input to a transformation or query.
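Standard JAXP offers a comparable wrapper, StAXSource, which presents a StAX XMLStreamReader as a Source. The sketch below uses it to feed a pull reader into an identity transformation; it shows the general pattern only and makes no claim about the constructor or methods of Saxon's PullSource. The class name PullAsSourceDemo and the method identity are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical demo class: uses JAXP's StAXSource, not Saxon's PullSource.
public class PullAsSourceDemo {

    /** Runs an identity transformation whose input is supplied by a pull reader. */
    public static String identity(String xml) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // A Transformer created without a stylesheet copies its input to its output.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // The pull reader is wrapped as a Source, so the transformer drives the reading.
            transformer.transform(new StAXSource(reader), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(identity("<a><b>hi</b></a>"));
    }
}
```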

It is possible to obtain a PullProvider that reads the contents of a Saxon tree, starting at a given node. There are two variants of this: TreeWalker, which can handle any tree (that is, any implementation of the NodeInfo interface), and TinyTreeWalker, which is optimized for the TinyTree.

It is possible to bridge between Saxon's pull and push interfaces using a PullPushCopier. This reads events from a PullProvider and sends equivalent events to a Receiver.
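The general shape of such a pull-to-push bridge can be sketched with JDK classes alone: a loop pulls events from a StAX XMLStreamReader and pushes equivalent calls to a SAX ContentHandler. This is an analogy, not the PullPushCopier implementation (Saxon's Receiver is a different interface), and it is simplified: namespace prefix mappings, comments and processing instructions are not forwarded. The names PullToPushDemo, copy and trace are hypothetical.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: bridges JDK StAX (pull) to SAX (push).
public class PullToPushDemo {

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    /** Pulls events from the reader and pushes equivalent events to the handler. */
    public static void copy(XMLStreamReader in, ContentHandler out)
            throws XMLStreamException, SAXException {
        out.startDocument();
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    AttributesImpl atts = new AttributesImpl();
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        // Simplification: the local name doubles as the qualified name.
                        atts.addAttribute(orEmpty(in.getAttributeNamespace(i)),
                                in.getAttributeLocalName(i), in.getAttributeLocalName(i),
                                "CDATA", in.getAttributeValue(i));
                    }
                    out.startElement(orEmpty(in.getNamespaceURI()),
                            in.getLocalName(), in.getLocalName(), atts);
                    break;
                }
                case XMLStreamConstants.CHARACTERS: {
                    char[] text = in.getText().toCharArray();
                    out.characters(text, 0, text.length);
                    break;
                }
                case XMLStreamConstants.END_ELEMENT:
                    out.endElement(orEmpty(in.getNamespaceURI()),
                            in.getLocalName(), in.getLocalName());
                    break;
                default:
                    break; // other event types are dropped in this sketch
            }
        }
        out.endDocument();
    }

    /** Copies a document into a push handler that records what it receives. */
    public static String trace(String xml) {
        StringBuilder sb = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                sb.append('<').append(qName).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                sb.append("</").append(qName).append('>');
            }
        };
        try {
            copy(XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml)),
                    recorder);
        } catch (XMLStreamException | SAXException e) {
            throw new IllegalArgumentException("copy failed", e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace("<a><b>hi</b></a>"));
    }
}
```

The copier owns the control loop: the pull side is passive until asked for the next event, and the push side is passive until called, so a single loop is enough to connect them.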

A PullProvider is available that interfaces to a StAX pull-parser. This class is called StaxBridge. It has been tested with pull parsers from BEA and Sun. Both these parsers are currently early releases and have been found to be rather buggy: no doubt they will improve in subsequent versions.

The StaxBridge class is the only class in Saxon that depends on the presence of the StAX API. For this reason, it is not bundled as part of the general saxon8.jar file. Instead, it is included for the time being in the samples directory. There is no dependency on any particular StAX parser: it will pick up whatever parser is on the classpath, or whichever parser is selected using the relevant Java system properties.

A class PullFilter is available that simply joins two PullProviders end-to-end. This can be subclassed (in the same way as the XMLFilterImpl class in SAX) to provide a wide variety of components that analyze or modify the event stream. This allows pull pipelines to be built in very much the same way as Saxon's existing push pipelines.
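The filter-by-subclassing pattern has a direct counterpart in standard StAX, where StreamReaderDelegate plays the pass-through role. The sketch below chains two such filters into a small pull pipeline. It is an analogy built on JDK classes, not Saxon's PullFilter, and the names PullFilterDemo, CommentStripper, NameUpperCaser and summarize are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.util.StreamReaderDelegate;

// Hypothetical demo class: filters built on JDK StAX, not Saxon's PullFilter.
public class PullFilterDemo {

    /** Passes every event through except comments, which are silently dropped. */
    static class CommentStripper extends StreamReaderDelegate {
        CommentStripper(XMLStreamReader base) {
            super(base);
        }
        @Override
        public int next() throws XMLStreamException {
            int event = super.next();
            // A comment is always followed by another event, at the latest END_DOCUMENT.
            while (event == XMLStreamConstants.COMMENT) {
                event = super.next();
            }
            return event;
        }
    }

    /** Leaves the event sequence alone but reports element names in upper case. */
    static class NameUpperCaser extends StreamReaderDelegate {
        NameUpperCaser(XMLStreamReader base) {
            super(base);
        }
        @Override
        public String getLocalName() {
            return super.getLocalName().toUpperCase(Locale.ROOT);
        }
    }

    /** Reads the document through the two-stage pipeline and logs what arrives. */
    public static List<String> summarize(String xml) {
        List<String> log = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader base = factory.createXMLStreamReader(new StringReader(xml));
            // Pipeline: parser -> CommentStripper -> NameUpperCaser -> this loop.
            XMLStreamReader reader = new NameUpperCaser(new CommentStripper(base));
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        log.add("start:" + reader.getLocalName());
                        break;
                    case XMLStreamConstants.COMMENT:
                        log.add("comment:" + reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        log.add("end:" + reader.getLocalName());
                        break;
                    default:
                        break;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        summarize("<a><!--x--><b/></a>").forEach(System.out::println);
    }
}
```

Each filter overrides only the methods it cares about and inherits pass-through behaviour for the rest, which is what makes the stages composable.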

An eventual aim of this work is to enable tree-construction expressions to be evaluated in pull mode. This will allow lazy evaluation of trees, in the same way that Saxon currently makes heavy use of lazy evaluation of sequences. For example, given a construct such as <e a="{$x}"/> (in either XSLT or XQuery), Saxon would be able to return to the caller a sequence of events (in this case, just a start-element and end-element event) without ever building a tree in memory. This is similar to what happens today using the push pipeline when writing a final result tree, especially in XSLT. For XQuery, however, where it is more common to construct many small intermediate trees, being able to switch between pull and push processing for such expressions offers considerable advantages.
