saxonica.com

The PTree File Format

Saxon-SA supports a file format called the PTree (persistent tree). This is a binary representation of an XML document. The PTree file is generally about the same size as the original document (perhaps 10% smaller), but it typically loads in about half the time. Storing a document as a PTree can therefore give a useful performance improvement when the same source document is used repeatedly as the input to many queries or transformations. Another benefit of the PTree is that it retains any type information that is present, which means that the document does not need to be validated against its schema each time it is loaded. (The schema, however, must be loaded whenever the document is loaded.)

Two commands are available for converting XML documents into PTree files and vice versa. To create a PTree, use:

java com.saxonica.ptree.PTreeWriter source.xml result.ptree

The option -strip causes all whitespace-only text nodes to be stripped in the process, which will often give a useful saving in space and therefore in loading time.

To convert a PTree back to an XML document, use:

java com.saxonica.ptree.PTreeReader source.ptree result.xml

A query or transformation can be applied directly to a PTree by specifying the -p option on the command line for com.saxonica.Transform or com.saxonica.Query. This option causes a different URIResolver, the PTreeURIResolver, to be used in place of the standard URIResolver. The PTreeURIResolver treats any URI ending in the extension .ptree as identifying a file in PTree format. This extends to files loaded using the doc() or document() functions: if the file extension is .ptree, the file is assumed to be in PTree format.
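The extension-based dispatch described above can be pictured with a small sketch written against the standard JAXP URIResolver interface. This is not the source of PTreeURIResolver: the class name ExtensionDispatchResolver, the helper hasPTreeExtension, and the placeholder handler are all hypothetical, and the exact matching rule (for example, whether the comparison is case-sensitive) is an assumption.

```java
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: routes URIs ending in ".ptree" to a dedicated handler
// and leaves every other URI to the processor's standard resolution.
class ExtensionDispatchResolver implements URIResolver {

    // Stand-in for whatever component actually knows how to load a PTree.
    private final URIResolver ptreeHandler;

    ExtensionDispatchResolver(URIResolver ptreeHandler) {
        this.ptreeHandler = ptreeHandler;
    }

    // Assumption: a plain, case-sensitive suffix test on the URI string.
    static boolean hasPTreeExtension(String href) {
        return href != null && href.endsWith(".ptree");
    }

    @Override
    public Source resolve(String href, String base) throws TransformerException {
        if (hasPTreeExtension(href)) {
            return ptreeHandler.resolve(href, base);
        }
        // In JAXP, returning null asks the processor to resolve the URI itself.
        return null;
    }

    public static void main(String[] args) throws TransformerException {
        // Placeholder handler for demonstration only; it does not read PTree data.
        URIResolver placeholder = (href, base) -> new StreamSource(href);
        ExtensionDispatchResolver resolver = new ExtensionDispatchResolver(placeholder);

        System.out.println(resolver.resolve("books.ptree", null) != null); // prints true
        System.out.println(resolver.resolve("books.xml", null) != null);   // prints false
    }
}
```

Because the decision depends only on the URI, the same rule applies whether the URI names the principal input or a document requested through doc() or document().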

The result of a query or transformation can be serialized as a PTree file by specifying saxon:ptree as the output method, where the namespace prefix saxon represents the URI http://saxon.sf.net/.

The PTree format is designed to allow future Saxon releases to read files created using older releases. The converse may not always be true: it might sometimes be impossible for release N to read a PTree file created using release N+1.

The PTree format does not retain the base URI of the original file: when a PTree is loaded, the base URI is taken as the URI of that file, not the original XML file. The PTree is a serialization of the XPath data model, so information that isn't present in the data model will not be present in the PTree: for example, it will have no DTD and no entity references or CDATA sections.

References to unparsed entities are not currently retained in a PTree.
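The point that CDATA sections are not part of the data model can be illustrated without Saxon. The sketch below uses the JDK's DOM parser as a stand-in (it does not read or write PTree files, and the class name DataModelDemo is hypothetical) to show that a CDATA section and an escaped character yield the same string value, which is all a data-model serialization such as the PTree has to work with.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; uses the JDK DOM parser, not Saxon or the PTree classes.
class DataModelDemo {

    // Parses the XML string and returns the string value of the document element.
    static String stringValue(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Merge CDATA sections into ordinary text nodes, as a data model
            // without CDATA nodes would see them.
            factory.setCoalescing(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String withCdata = "<a><![CDATA[x < y]]></a>";
        String withEscape = "<a>x &lt; y</a>";

        // Both forms carry the same text, so the distinction is lost once parsed.
        System.out.println(stringValue(withCdata).equals(stringValue(withEscape))); // prints true
    }
}
```

A document converted to a PTree and back may therefore differ lexically from the original while representing the same data model instance.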
