Handling of source documents

When the source document is supplied as a pre-built tree (in any format), and Saxon strips whitespace text nodes as requested by the stylesheet, the space stripping now takes account of any xml:space attributes present in the tree. Specifically, whitespace text nodes are preserved if xml:space="preserve" is specified. This check can be expensive, but is required for conformance. For that reason, when supplying pre-built trees as input (whether as DOM, JDOM, or XOM trees, or as native Saxon trees) it is best not to use xsl:strip-space in the stylesheet.
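
The rule above can be illustrated with a minimal sketch that uses only the JDK's DOM API. It is not Saxon's implementation: the class and method names are hypothetical, and it assumes the stylesheet asks for whitespace to be stripped from every element (the equivalent of xsl:strip-space elements="*"). It also shows where the cost comes from, since each whitespace text node requires a walk up its ancestors.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Saxon.
class XmlSpaceCheck {

    // True if the nearest ancestor carrying xml:space says "preserve".
    static boolean isPreserved(Node textNode) {
        for (Node n = textNode.getParentNode(); n instanceof Element; n = n.getParentNode()) {
            String value = ((Element) n).getAttribute("xml:space");
            if (value.equals("preserve")) return true;
            if (value.equals("default")) return false;
        }
        return false;
    }

    // Counts whitespace-only text nodes that would survive stripping,
    // assuming stripping has been requested for all elements.
    static int countSurvivingWhitespace(Node node) {
        int count = 0;
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                if (c.getNodeValue().trim().isEmpty() && isPreserved(c)) count++;
            } else {
                count += countSurvivingWhitespace(c);
            }
        }
        return count;
    }

    // Parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><a> </a><b xml:space=\"preserve\"> </b></doc>");
        // The whitespace in <a> is stripped; the whitespace in <b> is kept.
        System.out.println(countSurvivingWhitespace(doc));
    }
}
```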

When the source document is supplied as a DOM or JDOM tree, multiple adjacent text and CDATA nodes are now mapped to a single text node in the XPath model. If the XPath text node is passed to a Java extension function, the extension function sees the first node in the underlying sequence. This change has not yet been made for XOM trees.
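
As an illustration of this mapping, the following sketch groups each run of adjacent DOM text and CDATA children into one string, which is how the XPath model views them. It uses only the JDK's DOM API; the class and method names are hypothetical and the code is not taken from Saxon's DOM wrapper.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Saxon.
class AdjacentTextNodes {

    // Returns one string per XPath text node: each run of adjacent
    // text/CDATA children of the parent is concatenated.
    static List<String> xpathTextNodes(Element parent) {
        List<String> result = new ArrayList<>();
        StringBuilder run = null;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                if (run == null) run = new StringBuilder();
                run.append(c.getNodeValue());
            } else if (run != null) {
                // Any other node kind ends the current run.
                result.add(run.toString());
                run = null;
            }
        }
        if (run != null) result.add(run.toString());
        return result;
    }

    // Parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<p>one<![CDATA[ & ]]>two<br/>three</p>");
        Element p = doc.getDocumentElement();
        System.out.println("DOM children: " + p.getChildNodes().getLength());
        System.out.println("XPath text nodes: " + xpathTextNodes(p));
    }
}
```

An extension function that receives the first of these XPath text nodes would, per the paragraph above, see only the first DOM node of the run, so it would need to walk the following siblings itself to recover the full string value.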

Saxon accepts URIs of the form "document.xml#id" where "id" is the value of an attribute defined in the DTD as being of type ID. It now also accepts such URIs where the fragment identifier is the value of an xml:id attribute.
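
The xml:id case can be sketched as follows using the JDK's DOM API. This is an illustration only, with hypothetical class and method names; it does not show how Saxon resolves the URI, and it deliberately ignores DTD-declared ID attributes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Saxon.
class XmlIdFragment {

    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Depth-first search for the first element whose xml:id equals id.
    static Element findByXmlId(Node node, String id) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                Element e = (Element) c;
                if (id.equals(e.getAttributeNS(XML_NS, "id"))) return e;
                Element hit = findByXmlId(e, id);
                if (hit != null) return hit;
            }
        }
        return null;
    }

    // Selects the element named by the fragment identifier of the URI,
    // or the document element if the URI has no fragment.
    static Element resolveFragment(Document doc, String uri) {
        int hash = uri.indexOf('#');
        if (hash < 0) return doc.getDocumentElement();
        return findByXmlId(doc, uri.substring(hash + 1));
    }

    // Parses an XML string into a namespace-aware DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><item xml:id=\"a1\">x</item><item xml:id=\"a2\">y</item></doc>");
        Element e = resolveFragment(doc, "document.xml#a2");
        System.out.println(e.getTextContent());
    }
}
```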

Where a stylesheet is embedded in a source document, or a schema is embedded within a stylesheet, the base URI of the embedded document was previously taken as being the same as the base URI of the containing document. It is now taken as the base URI of the relevant element. This means that the xml:base attribute is taken into account.
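
The practical effect can be seen in a small sketch that computes an element's base URI by applying each ancestor's xml:base attribute, outermost first, to the base URI of the containing document. The class name, method name, and example URIs are hypothetical, and the code uses only java.net.URI and the JDK's DOM API rather than anything from Saxon.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Saxon.
class ElementBaseUri {

    static final String XML_NS = "http://www.w3.org/XML/1998/namespace";

    // Base URI of an element: the document base URI with every
    // xml:base on the element and its ancestors resolved in order.
    static URI baseUriOf(Element element, URI documentBase) {
        Deque<String> bases = new ArrayDeque<>();
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String base = ((Element) n).getAttributeNS(XML_NS, "base");
            if (!base.isEmpty()) bases.push(base); // outermost ends up first
        }
        URI result = documentBase;
        for (String base : bases) result = result.resolve(base);
        return result;
    }

    // Parses an XML string into a namespace-aware DOM tree.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><section xml:base=\"styles/\"><sheet/></section></doc>");
        Element sheet = (Element) doc.getElementsByTagName("sheet").item(0);
        URI docBase = URI.create("http://example.com/docs/main.xml");
        // Old behaviour: relative references resolve against the document.
        System.out.println(docBase.resolve("common.xsl"));
        // New behaviour: they resolve against the element's own base URI.
        System.out.println(baseUriOf(sheet, docBase).resolve("common.xsl"));
    }
}
```

Here the sheet element stands in for an embedded stylesheet or schema; a relative reference inside it, such as an import, would resolve differently under the two rules.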

In a previous release, following a change in the W3C specifications, Saxon was changed so that DTD-based types such as ID and IDREF did not set the type annotation on the attribute node. An unintended consequence of this change was that the idref() function stopped working when an attribute was defined in the DTD as being of type IDREF or IDREFS. This has now been fixed. Doing so required some changes to the data model. The is-id and is-idref properties defined in the W3C data model are not reflected directly in the Saxon implementation, but the information is now available in a slightly different way. The method getTypeAnnotation(), when applied to an attribute node, may now return a value that contains the fingerprint code for the type xs:ID, xs:IDREF, or xs:IDREFS, together with a high bit (NodeInfo.IS_DTD_TYPE) indicating that the type is DTD-derived rather than schema-derived. When this bit is set, the value should be treated as being untyped atomic, but the type annotation returned indicates whether the is-id or is-idref properties are present. The same change applies to the type code passed with attributes in the Receiver and PullProvider interfaces.
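
Code that inspects type annotations therefore needs to mask out the flag bit before comparing fingerprints. The following sketch shows the general shape of such a check. All of the constant values are placeholders chosen for illustration: the real value of NodeInfo.IS_DTD_TYPE and the real fingerprints of xs:ID, xs:IDREF, and xs:IDREFS should be taken from the Saxon classes, and the class and method names here are hypothetical.

```java
// Hypothetical helper, not part of Saxon. Constant values are placeholders.
class DtdTypeAnnotation {

    // Stand-in for NodeInfo.IS_DTD_TYPE (assumed to be a single high bit).
    static final int IS_DTD_TYPE = 1 << 30;

    // Stand-ins for the fingerprints of xs:ID, xs:IDREF and xs:IDREFS.
    static final int FP_ID = 101;
    static final int FP_IDREF = 102;
    static final int FP_IDREFS = 103;

    // True if the annotation came from a DTD rather than a schema.
    static boolean isDtdDerived(int annotation) {
        return (annotation & IS_DTD_TYPE) != 0;
    }

    // The type fingerprint with the DTD flag removed.
    static int fingerprint(int annotation) {
        return annotation & ~IS_DTD_TYPE;
    }

    // Corresponds to the is-id property of the W3C data model.
    static boolean isId(int annotation) {
        return fingerprint(annotation) == FP_ID;
    }

    // Corresponds to the is-idref property of the W3C data model.
    static boolean isIdref(int annotation) {
        int fp = fingerprint(annotation);
        return fp == FP_IDREF || fp == FP_IDREFS;
    }

    // DTD-derived values are to be treated as untyped atomic.
    static boolean treatAsUntypedAtomic(int annotation) {
        return isDtdDerived(annotation);
    }

    public static void main(String[] args) {
        int fromDtd = FP_IDREF | IS_DTD_TYPE; // attribute declared IDREF in a DTD
        int fromSchema = FP_IDREF;            // attribute validated as xs:IDREF
        System.out.println(isIdref(fromDtd) + " " + treatAsUntypedAtomic(fromDtd));
        System.out.println(isIdref(fromSchema) + " " + treatAsUntypedAtomic(fromSchema));
    }
}
```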

PTree files

Saxon-SA 8.5 allows an XML document to be saved on disk in a format referred to as a PTree. This is a binary format designed for speed of loading. A document in PTree format takes about the same amount of disk space as the original source XML, but takes about half as long to load into memory. The saving is greater when the document contains type information, because this is retained in the PTree without the need to revalidate.

Two new commands are available, com.saxonica.ptree.PTreeWriter and com.saxonica.ptree.PTreeReader, to convert XML documents into PTrees and vice versa.

A PTree can be supplied as the input to a transformation or query using the class PTreeSource, which implements the JAXP Source interface.

A new command-line option is available on the commands com.saxonica.Transform and com.saxonica.Query. The option -p causes a URIResolver to be used that recognizes the file extension .ptree as representing a Saxon PTree. This option implicitly switches on the -u option, meaning that the source file name is interpreted as a URI. The PTreeURIResolver, as well as recognizing the .ptree file extension, also recognizes query parameters at the end of a URI. In particular, it recognizes the parameters validation=strict, validation=lax, and validation=strip, which control how a source document is schema-validated. For example, doc('source.xml?validation=lax') loads a source document with lax validation. This option allows different validation to be applied to different source documents loaded by a single query or transformation.
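
The kind of URI inspection described above can be sketched with java.net.URI alone. This is not the PTreeURIResolver source: the class and method names are hypothetical, and the choice to ignore unrecognized validation values is an assumption made for the sketch.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Saxon.
class SourceUriOptions {

    static final List<String> MODES = Arrays.asList("strict", "lax", "strip");

    // Returns "strict", "lax" or "strip" if the URI carries a recognized
    // validation parameter, otherwise null.
    static String validationMode(String uri) {
        String query = URI.create(uri).getQuery();
        if (query == null) return null;
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("validation") && MODES.contains(kv[1])) {
                return kv[1];
            }
        }
        return null;
    }

    // True if the path part of the URI has the .ptree extension.
    static boolean isPTree(String uri) {
        String path = URI.create(uri).getPath();
        return path != null && path.endsWith(".ptree");
    }

    public static void main(String[] args) {
        System.out.println(validationMode("source.xml?validation=lax"));
        System.out.println(isPTree("data.ptree?validation=strip"));
    }
}
```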

The result of a query or transformation can be serialized as a PTree by specifying saxon:ptree as the serialization method. From the command line, use the parameter !method={http://saxon.sf.net/}ptree.

The PTree format has been designed so that one Saxon release should normally be able to read PTree files created by an earlier release. It may not always be possible, however, to read PTrees created using a later Saxon release. The PTree is not dependent on any particular NamePool, and can be freely moved between different machines just as source XML can. It is a binary format, so there is no dependency on any particular character encoding or machine architecture. PTree files are not designed to be read or written directly by user applications, nor are they designed to provide an interchange format between Saxon and other products: the internal format is therefore not published.

When a PTree contains type information, the schema that defines those types must also be loaded. This doesn't happen automatically. At present, there is no way of storing a compiled schema on disk, so this will generally involve rebuilding the schema from its source representation. It is the user's responsibility to ensure that the loaded schema is consistent with the schema that was used to validate the original XML document.

For more information see PTree Files.