Building a source document from lexical XML

The conversion of lexical XML to a tree in memory is called parsing, and is performed by a software component called an XML parser. Saxon does not include its own XML parser; instead, it provides interfaces that invoke XML parsers supplied by third parties. Platforms such as Java and .NET typically include a built-in XML parser, which Saxon uses by default.
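As a rough illustration of what parsing means here, the following sketch uses only the parser built into the JDK, without Saxon, to turn a lexical XML string into an in-memory tree (a DOM tree in this case, not a Saxon tree). The class name ParseSketch and the method rootElementName are hypothetical names chosen for this example. The factory used is the JAXP DocumentBuilderFactory, which is unrelated to Saxon's own DocumentBuilder class despite the similar name.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class: parses lexical XML with the JDK's default parser.
class ParseSketch {

    /** Parses the supplied lexical XML and returns the local name of the root element. */
    static String rootElementName(String lexicalXml) {
        try {
            // JAXP factory for the platform's built-in XML parser (not Saxon's DocumentBuilder).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Parsing converts the character stream into a tree held in memory.
            Document tree = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(lexicalXml)));
            return tree.getDocumentElement().getLocalName();
        } catch (Exception e) {
            // Malformed XML and configuration problems are both reported this way.
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootElementName("<doc><item>1</item></doc>")); // prints "doc"
    }
}
```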

A source document can be built using the DocumentBuilder class (PyDocumentBuilder in Python), which is created using the factory method newDocumentBuilder (NewDocumentBuilder() in C#, and new_document_builder() in Python) on the Processor object (SaxonProcessor in C++ and PHP, PySaxonProcessor in Python). Various options for document building are available as methods on the DocumentBuilder, for example options to perform schema or DTD validation, to strip whitespace, to expand XInclude directives, and also to choose the tree implementation model to be used.

On Java, the different ways of supplying input to the DocumentBuilder are represented using the JAXP Source object. This is a JAXP interface designed as an abstraction of various kinds of XML source, including:

  • StreamSource, which represents lexical XML held in a file or input stream.
  • SAXSource, which represents a source of SAX events.
  • DOMSource, which represents an already-parsed XML document held in a DOM tree.
  • StAXSource, which represents a class that responds to requests for StAX (pull-parser) events.
  • ActiveSource, a Saxon extension of the JAXP Source interface, allowing you to define your own kind of input source: all you need to do is implement the method deliver(), which delivers the contents of the resource to a Saxon Receiver.
  • NodeInfo, which represents a node in an XDM tree; it implements ActiveSource.

In addition, the s9api XdmNode class has an asSource() method, so it is always possible to supply an existing Saxon tree as the source for any of these interfaces.
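The point of the Source abstraction is that a consumer written against the Source interface accepts any of these kinds of input. The following sketch shows this using only JDK classes: instead of a Saxon DocumentBuilder, it uses the JAXP identity Transformer as a stand-in consumer. The class name SourceKindsSketch and its helper methods are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical example class: one consumer, three kinds of JAXP Source.
class SourceKindsSketch {

    /** Stand-in consumer: serializes whatever Source it is given using an identity transformation. */
    static String consume(Source source) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Source could not be processed", e);
        }
    }

    /** Lexical XML supplied as a character stream. */
    static Source streamSource(String xml) {
        return new StreamSource(new StringReader(xml));
    }

    /** A source of SAX events; with no XMLReader supplied, the consumer chooses a default parser. */
    static Source saxSource(String xml) {
        return new SAXSource(new InputSource(new StringReader(xml)));
    }

    /** An already-parsed document held in a DOM tree. */
    static Source domSource(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document tree = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return new DOMSource(tree);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<doc><item>1</item></doc>";
        // The same consumer handles all three kinds of Source.
        System.out.println(consume(streamSource(xml)));
        System.out.println(consume(saxSource(xml)));
        System.out.println(consume(domSource(xml)));
    }
}
```

In a Saxon application the same Source objects would typically be passed to the DocumentBuilder rather than to a Transformer; the sketch avoids Saxon classes only so that it runs on a plain JDK.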

On .NET, the DocumentBuilder has an overloaded Build method allowing input to be supplied from sources of the following kinds:

  • Stream, containing lexical XML as a stream of bytes.
  • TextReader, containing lexical XML as a stream of characters.
  • Uri, which can be dereferenced to fetch a stream of bytes.
  • XmlNode, Microsoft's DOM implementation.
  • XmlReader, an XML parser, primed with a source of input.

On C++, Python and PHP, source XML documents can be parsed from a file, a URI, or a lexical string using methods on the DocumentBuilder:

  • In C++, DocumentBuilder provides the methods parseXmlFromFile(), parseXmlFromUri(), and parseXmlFromString().
  • In Python, PyDocumentBuilder provides the method parse_xml(), which takes different keyword arguments to supply the input from a file, a URI, or a lexical string.
  • In PHP, Saxon\DocumentBuilder provides the methods parseXmlFromFile(), parseXmlFromUri(), and parseXmlFromString().

It is also possible to parse XML documents directly from the SaxonProcessor without creating a DocumentBuilder, using methods with the same names as those above. However, this allows only simple parsing; none of the additional parsing options available with DocumentBuilder can be used.

All the documents processed in a single transformation or query must be loaded using the same Configuration. However, it is possible to copy a document from one Configuration into another by passing the TreeInfo at the root of the existing document as the Source argument to the buildDocumentTree() method of the new Configuration.

With SaxonC, the only way to copy a document from one Configuration into another is to serialize it and parse it again.