Document Projection

Document Projection is a mechanism that analyzes a query to determine what parts of a document it can potentially access, and then while building a tree to represent the document, leaves out those parts of the tree that cannot make any difference to the result of the query.

Document projection is available only in Saxon-SA.

In this release, document projection is an option on the XQuery command line interface: set -projection:on. It is currently used only when explicitly requested. The command line option affects both the primary source document supplied on the command line and any calls on the doc() function within the body of the query.

For feedback on the impact of document projection in terms of reducing the size of the source document in memory, use the -t option on the command line, which shows for each document loaded how many nodes from the input document were retained and how many discarded.

The more complex the query, the less likely it is that Saxon will be able to analyze it to determine the subset of the document required. If precise analysis is not possible, document projection has no effect. Currently Saxon makes no attempt to analyze accesses made within user-defined functions. Also, of course, Saxon cannot analyze the expectations of external (Java) functions called from the query.

Currently document projection is supported only for XQuery, and it works only when a document is parsed and loaded for the purpose of executing a single query. It is possible, however, to use the mechanism to filter source documents manually if the required subset of the document is known.

To achieve this, create a query that selects the required parts of the document, and compile it to an XQueryExpression. The query does not have to do anything useful: the only requirement is that the result of the query on the subset document must be the same as the result on the original document. The easiest approach is a plain path expression that starts with a call on doc() with a literal URI and then selects downwards. Then call the getPathMap() method on the XQueryExpression to obtain a net.sf.saxon.expr.PathMap object. Call the method getRootForDocument() on the PathMap object, supplying the document URI used in the call to doc(). This returns a PathMapRoot, which can in turn be passed as an argument to configuration.makeDocumentProjector(). This returns a ProxyReceiver which acts as an event filter; you can register this filter with an AugmentedSource and supply this to any interface that builds a source document. For example:


// Compile a query whose only purpose is to describe the nodes to retain.
Configuration config = new EnterpriseConfiguration();
StaticQueryContext sqc = new StaticQueryContext(config);
XQueryExpression exp = sqc.compileQuery(
    "doc('file:///c:/sample.xml')//chapter/title");
// Derive the projection filter for the document named in the doc() call.
PathMap map = exp.getPathMap();
ProxyReceiver filter = config.makeDocumentProjector(
    map.getRootForDocument("file:///c:/sample.xml"));
// Attach the filter to the source, then build the projected tree.
StreamSource ss = new StreamSource("file:///c:/sample.xml");
AugmentedSource as = AugmentedSource.makeAugmentedSource(ss);
as.addFilter(filter);
DocumentInfo doc = config.buildDocument(as);

Of course, when document projection is used manually like this, it is entirely the user's responsibility to ensure that the selected part of the document contains all the nodes required.

If the query supplied as input to the path map selects nodes, then Saxon assumes that the application will need access to the entire subtree rooted at these nodes, but that it will not attempt to navigate upwards from these nodes. On the other hand, nodes that are atomized (for example in a filter) will be retained without their descendants, except as needed to compute the filter.
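
The retention rule for selected nodes can be illustrated without Saxon at all. The following is a minimal sketch using only the JDK's SAX parser; it is not Saxon's implementation and does not use the PathMap API. The class name ProjectionSketch and its methods are hypothetical, invented for this illustration. It mimics the effect of a path such as //chapter/title: each selected element is kept with its entire subtree, its ancestors are kept only as an empty skeleton, and everything else is dropped. It also counts retained and discarded element nodes (elements only, an assumption made to keep the sketch short).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Illustration only: a hand-written projection filter, not Saxon's implementation. */
public class ProjectionSketch extends DefaultHandler {

    /** An element that is currently open, with its start tag held back until needed. */
    private static final class Open {
        final String name;
        final String startTag;
        boolean written;
        Open(String name, String startTag) { this.name = name; this.startTag = startTag; }
    }

    private final String parentName;              // hypothetical example: "chapter"
    private final String childName;               // hypothetical example: "title"
    private final StringBuilder out = new StringBuilder();
    private final Deque<Open> open = new ArrayDeque<>();
    private int insideSelected;                   // > 0 while inside a selected subtree
    private int retained;
    private int discarded;

    private ProjectionSketch(String parentName, String childName) {
        this.parentName = parentName;
        this.childName = childName;
    }

    /** Parses the XML text and keeps only parentName/childName subtrees plus their ancestors. */
    public static ProjectionSketch project(String xml, String parentName, String childName)
            throws Exception {
        ProjectionSketch handler = new ProjectionSketch(parentName, childName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public String xml() { return out.toString(); }
    public int retained() { return retained; }
    public int discarded() { return discarded; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Open parent = open.peek();
        boolean selected = insideSelected > 0
                || (qName.equals(childName) && parent != null && parent.name.equals(parentName));
        StringBuilder tag = new StringBuilder("<").append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            tag.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        open.push(new Open(qName, tag.append('>').toString()));
        if (selected) {
            insideSelected++;
            // Emit any held-back start tags, outermost ancestor first.
            for (Iterator<Open> it = open.descendingIterator(); it.hasNext(); ) {
                Open o = it.next();
                if (!o.written) { out.append(o.startTag); o.written = true; }
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        Open e = open.pop();
        if (e.written) { out.append("</").append(e.name).append('>'); retained++; }
        else { discarded++; }
        if (insideSelected > 0) insideSelected--;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text is kept only inside a selected subtree.
        if (insideSelected > 0) out.append(escape(new String(ch, start, length)));
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

Calling ProjectionSketch.project(text, "chapter", "title") on a book with a preface and several chapters would keep the book and chapter elements as a skeleton around each title, and drop the preface and the chapter bodies. Because ancestors are written only when a selected descendant appears, branches with no selected nodes cost nothing in the output.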