Saxonica.com

Binary XML: the PTree file

Saxon-SA 8.5 allows an XML document to be saved on disk in a format referred to as a PTree. This is a binary format designed for speed of loading. A document in PTree format takes about the same amount of disk space as the original source XML, but takes about half as long to load into memory. The saving is greater when the document contains type information, because this is retained in the PTree without the need to revalidate.

Two new commands are available, com.saxonica.ptree.PTreeWriter and com.saxonica.ptree.PTreeReader, which convert XML documents into PTrees and vice versa.

A PTree can be supplied as the input to a transformation or query using the class PTreeSource, which implements the JAXP Source interface.
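Because PTreeSource implements the JAXP Source interface, code written against Source does not need to know whether its input is a PTree or ordinary XML. The following is a minimal sketch of that idea using only standard JAXP calls. The class and method names (JaxpSourceSketch, transform, fromString) are invented for illustration, and a StreamSource stands in for the input because the PTreeSource constructor is not described here. Accepting a PTreeSource would presumably also require Saxon-SA to be the TransformerFactory in use, which is an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of Saxon.
final class JaxpSourceSketch {

    // Runs a stylesheet over any JAXP Source and returns the serialized result.
    // The input parameter is typed as Source, so any implementation of that
    // interface can be passed in, provided the factory in use understands it.
    static String transform(Source stylesheet, Source input) {
        try {
            Transformer transformer =
                TransformerFactory.newInstance().newTransformer(stylesheet);
            StringWriter out = new StringWriter();
            transformer.transform(input, new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    // Stand-in input: wraps an XML string as a StreamSource.
    static Source fromString(String xml) {
        return new StreamSource(new StringReader(xml));
    }
}
```

In a Saxon-SA application, the second argument to transform would be the PTreeSource instead of the StreamSource used here.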

A new command-line option is available on the commands com.saxonica.Transform and com.saxonica.Query. The option -p causes a URIResolver to be used that recognizes the file extension .ptree as representing a Saxon PTree. This option implicitly switches on the -u option, meaning that the source file name is interpreted as a URI.

The PTreeURIResolver, as well as recognizing the .ptree file extension, also recognizes query parameters at the end of a URI. In particular it recognizes the parameters validation=strict, validation=lax, and validation=strip, which control how a source document is schema-validated. For example, doc('source.xml?validation=lax') loads a source document with lax validation. This mechanism allows different validation to be applied to different source documents loaded by a single query or transformation.
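To make the URI convention concrete, the following sketch shows one way a resolver could separate the document location from its query parameters and test for the .ptree extension. It is not the PTreeURIResolver implementation, which is not shown here. The class and method names are invented, and the use of & to separate multiple parameters is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not Saxon's PTreeURIResolver.
final class UriQueryParams {

    // Returns the URI without its query part, e.g. "source.xml".
    static String stripQuery(String uri) {
        int q = uri.indexOf('?');
        return q < 0 ? uri : uri.substring(0, q);
    }

    // Returns the query parameters in order of appearance.
    // Assumption: parameters are separated by '&', as in HTTP URLs.
    static Map<String, String> queryParams(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : uri.substring(q + 1).split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");            // parameter without a value
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    // True if the location part of the URI ends in ".ptree".
    static boolean isPTree(String uri) {
        return stripQuery(uri).endsWith(".ptree");
    }
}
```

For the URI in the example above, queryParams("source.xml?validation=lax") yields a single entry mapping validation to lax, and stripQuery returns source.xml.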

The result of a query or transformation can be serialized as a PTree by specifying saxon:ptree as the serialization method. From the command line, use the parameter !method={http://saxon.sf.net/}ptree.
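The command-line value {http://saxon.sf.net/}ptree is the expanded name of saxon:ptree, written as the namespace URI in braces followed by the local name. The sketch below builds the same string with the standard javax.xml.namespace.QName class and places it in a set of JAXP output properties. The class and method names are invented for illustration, and it is an assumption that Saxon-SA honours this value when it is supplied through JAXP output properties rather than on the command line.

```java
import java.util.Properties;
import javax.xml.namespace.QName;
import javax.xml.transform.OutputKeys;

// Hypothetical helper class, not part of Saxon.
final class PTreeOutputMethod {

    // Namespace URI taken from the command-line parameter shown above.
    static final String SAXON_NS = "http://saxon.sf.net/";

    // Builds "{namespace-uri}local-name" from its two parts.
    static String expandedName(String namespaceUri, String localName) {
        return new QName(namespaceUri, localName).toString();
    }

    // Output properties selecting saxon:ptree as the serialization method.
    static Properties ptreeOutputProperties() {
        Properties props = new Properties();
        props.setProperty(OutputKeys.METHOD, expandedName(SAXON_NS, "ptree"));
        return props;
    }
}
```

A JAXP application could pass the result to Transformer.setOutputProperties before running the transformation.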

The PTree format has been designed so that one Saxon release should normally be able to read PTree files created by an earlier release. It may not always be possible, however, to read PTrees created using a later Saxon release. The PTree is not dependent on any particular NamePool, and can be freely moved between different machines just as source XML can. It is a binary format, so there is no dependency on any particular character encoding or machine architecture. PTree files are not designed to be read or written directly by user applications, nor are they designed to provide an interchange format between Saxon and other products: the internal format is therefore not published.

When a PTree contains type information, the schema that defines those types must also be loaded. This doesn't happen automatically. At present, there is no way of storing a compiled schema on disk, so this will generally involve rebuilding the schema from its source representation. It is the user's responsibility to ensure that the loaded schema is consistent with the schema that was used to validate the original XML document.
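Since consistency of the schema is left to the user, one possible safeguard is to record a fingerprint of the schema source when the PTree is written and compare it before the PTree is loaded. The following sketch illustrates that approach with a SHA-256 digest. It is a user-side convention assumed here for illustration, not a Saxon feature, and the class and method names are invented.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, not part of Saxon.
final class SchemaFingerprint {

    // Returns the SHA-256 digest of the schema source as lowercase hex.
    static String sha256Hex(byte[] schemaSource) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(schemaSource);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // True if the schema source now on disk matches the recorded fingerprint.
    static boolean sameSchema(String recordedFingerprint, byte[] currentSchemaSource) {
        return recordedFingerprint.equals(sha256Hex(currentSchemaSource));
    }
}
```

A byte-for-byte comparison is stricter than necessary, since a change to a comment or to whitespace alters the digest, and it covers only the schema document that is hashed, not any documents it imports or includes.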

For more information see PTree Files.
