JAXP source types

When a user application invokes SaxonJ via the Java API, a source document is supplied as an instance of the JAXP Source interface (javax.xml.transform.Source). This is true whether invoking an XSLT transformation, an XQuery query, or a free-standing XPath expression. Source is essentially a marker interface, so the Source that is supplied must be of a kind that Saxon recognizes.

SaxonJ recognizes all three kinds of Source defined in JAXP: a StreamSource, a SAXSource, and a DOMSource.

  • When using a StreamSource, note:

    • A StreamSource that wraps an InputStream or Reader can only be used once: it is consumed by use. However, a StreamSource that wraps a File or URI can be used multiple times.
    • Whoever creates an InputStream or Reader is responsible for closing it after use. This means that if Saxon creates an InputStream from a supplied File or URI, it will close that InputStream after use; but if the InputStream is created by the calling application, then the calling application is responsible for closing it. (On some operating systems it is important not to leave unclosed streams lying around.)
    • If the StreamSource wraps an InputStream or Reader, then the base URI of the document is taken from the systemId property of the StreamSource. If this is not set, then the base URI is unknown, which may cause constructs that require a known base URI to fail.
  • When using a SAXSource, note:

    • If no XMLReader is supplied, Saxon will allocate one, based on settings in the Configuration.
    • Processing of the contained InputSource is entirely the responsibility of the XML parser; Saxon is not involved in this.
    • Saxon will modify properties of the supplied XMLReader: it will set the ContentHandler and LexicalHandler so that it can receive the output of parsing, and it will set the ErrorHandler so it can handle parsing errors.
    • Saxon makes no attempt to ensure that processing of a SAXSource or its underlying XMLReader is thread-safe. The same XMLReader should not be used concurrently in multiple threads.
  • When using a DOMSource, note:

    • The DOM is not thread-safe, even when used in read-only mode. Saxon therefore synchronizes all its access to DOM methods. However, that's no protection if there are application threads accessing the DOM that aren't using Saxon.
    • Saxon can only handle a DOM that is namespace-aware. If you are building the DOM using JAXP interfaces, be sure to call setNamespaceAware(true) on the DocumentBuilderFactory (this is not the default!). Saxon cannot reliably detect whether the DOM is namespace-aware (it gives a warning for some common problems, but not all) and in general, the results of using a non-namespace-aware DOM are unpredictable.
    • If the DOM is created programmatically (rather than being built by parsing lexical XML), then the DOM APIs perform very little checking: for example it is possible to have elements and attributes with invalid names. Saxon makes no attempt to check for such conditions, and may produce unpredictable results.
    • The base URI of the document is taken from the systemId property of the DOMSource. If this is not set, then the base URI is unknown, which may cause constructs that require a known base URI to fail.
    • Saxon's native TinyTree model is faster than DOM by a factor of 5 to 10 in typical XPath searches. Don't use the DOM with Saxon unless you have a very good reason.
    • From Saxon 9.8, Saxon-EE uses a new mechanism for processing DOM trees, called the Domino model. This involves creating an index of all the nodes in the DOM, providing for faster navigation. Saxon-PE and Saxon-HE continue to use the DOM NodeWrapper model, where DOM methods are used to navigate the tree. A transformation using the Domino model is still slower than one using Saxon's native TinyTree, but only by a factor of two; it also uses a lot more memory.
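
The notes above on base URIs and namespace-aware DOMs can be illustrated with a minimal sketch that uses only standard JAXP classes from the JDK. The class name SourceSetup, its helper methods, and the example URIs are hypothetical names chosen for illustration; nothing here calls a Saxon API, and the resulting Source objects would then be passed to whichever Saxon interface is in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SourceSetup {

    public static final String XML = "<p:doc xmlns:p=\"urn:example\"/>";

    // A StreamSource over a Reader carries no location of its own,
    // so the systemId (used as the base URI) is set explicitly.
    // The wrapped Reader can be consumed only once.
    public static StreamSource readerSource(String xml, String systemId) {
        StreamSource source = new StreamSource(new StringReader(xml));
        source.setSystemId(systemId);
        return source;
    }

    // Builds a DOM and wraps it in a DOMSource with an explicit systemId.
    // The namespaceAware flag is a parameter only so that both settings
    // can be compared; real code should always pass true.
    public static DOMSource domSource(String xml, String systemId, boolean namespaceAware)
            throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(namespaceAware); // the JAXP default is false
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return new DOMSource(doc, systemId);
    }

    public static void main(String[] args) throws Exception {
        String base = "file:///data/example.xml"; // hypothetical base URI

        StreamSource stream = readerSource(XML, base);
        System.out.println("StreamSource systemId: " + stream.getSystemId());

        for (boolean aware : new boolean[] {true, false}) {
            DOMSource dom = domSource(XML, base, aware);
            Element root = ((Document) dom.getNode()).getDocumentElement();
            System.out.println("namespaceAware=" + aware
                    + " localName=" + root.getLocalName()
                    + " namespaceURI=" + root.getNamespaceURI());
        }
    }
}
```

With the JDK's built-in parser, the non-namespace-aware DOM reports no local name or namespace URI for the root element, which is the kind of tree the notes above warn against supplying.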

Other kinds of Source that are recognized by most Saxon interfaces are:

  • TreeInfo: Saxon's TreeInfo holds information about a document (or more generally any tree of nodes), and can be used directly as a Source of a transformation.
  • NodeInfo: Saxon's NodeInfo represents a node in a tree, and can be used directly as a Source of a transformation.
  • StaxSource: allows a pull parser to be used.
  • PullSource: Saxon's internal pull interface.
  • EventSource: similar to an XMLReader, but with a much simpler interface; an EventSource has a send() method that sends a stream of events to a Saxon Receiver.
  • SaplingDocument: a sapling tree constructed using the sapling construction interface can be used anywhere (within Saxon) that a Source is expected.

Saxon also accepts input from an XMLStreamReader (javax.xml.stream.XMLStreamReader), that is, a StAX pull parser as defined in JSR 173. This is achieved by creating an instance of net.sf.saxon.pull.StaxBridge, supplying the XMLStreamReader using the setXMLStreamReader() method, and wrapping the StaxBridge object in an instance of net.sf.saxon.pull.PullSource, which implements the JAXP Source interface and can be used in any Saxon method that expects a Source. Saxon has been validated with two StAX parsers: the Zephyr parser from Sun (which is supplied as standard with JDK 1.6), and the open-source Woodstox parser from Tatu Saloranta. In Saxonica's experience, Woodstox is the more reliable of the two. However, there is no immediate benefit in using a pull parser to supply Saxon input rather than a push parser; the main use case for using an XMLStreamReader is when the data is supplied from some source other than parsing of lexical XML.
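
As a rough illustration of supplying input from an XMLStreamReader, the sketch below wraps a reader in the standard JAXP javax.xml.transform.stax.StAXSource and copies it through an identity Transformer. It deliberately does not use the StaxBridge and PullSource classes described above, so that it depends on nothing beyond the JDK. It assumes that the "StaxSource" entry in the list of recognized sources corresponds to this JAXP class, and the class name StaxInput and the method copyThroughStax are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

public class StaxInput {

    // Parses the XML with a StAX pull parser and feeds the resulting
    // XMLStreamReader to an identity transformation as a JAXP Source.
    public static String copyThroughStax(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            // The reader must be positioned at START_DOCUMENT or START_ELEMENT,
            // which is the case for a freshly created reader.
            StAXSource source = new StAXSource(reader);

            // TransformerFactory.newInstance() returns whichever JAXP
            // implementation is configured on the classpath.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            identity.transform(source, new StreamResult(out));
            return out.toString();
        } finally {
            // The application created the reader, so the application closes it.
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(copyThroughStax("<greeting>hello</greeting>"));
    }
}
```

In a real application the XMLStreamReader would more usefully come from something other than a lexical XML parser, since that is the case where a pull interface has a clear advantage.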

Nodes in Saxon's implementation of the XPath data model are represented by the interface NodeInfo. A NodeInfo is itself a Source, which means that any method in the API that requires a source object will accept any implementation of NodeInfo. As discussed in the next section, implementations of NodeInfo are available to wrap Axiom, DOM, DOM4J, JDOM2, or XOM nodes, and in all cases these wrapper objects can be used wherever a Source is required.

Saxon also provides a class net.sf.saxon.lib.AugmentedSource which implements the Source interface. This class encapsulates one of the standard Source objects, and allows additional processing options to be specified. These options include whitespace handling, schema and DTD validation, XInclude processing, error handling, choice of XML parser, and choice of Saxon tree model.

Saxon allows additional Source types to be supported by registering a SourceResolver with the Configuration object. The task of a SourceResolver is to convert a Source that Saxon does not recognize into a Source that it does recognize. For example, this may be done by building the document tree in memory and returning the NodeInfo object representing the root of the tree.
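
The conversion step that a SourceResolver performs can be sketched without reference to the Saxon API. In the example below, MapSource is a hypothetical application-defined Source, and the resolve() method is a stand-in for the resolver logic, not the signature of Saxon's SourceResolver interface. It returns a DOMSource only because that needs nothing beyond the JDK; a real resolver would more likely build a Saxon tree and return its NodeInfo, as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class MapSourceDemo {

    /** Hypothetical custom Source: a set of key/value pairs, not XML text. */
    public static final class MapSource implements Source {
        private final Map<String, String> entries;
        private String systemId;

        public MapSource(Map<String, String> entries) {
            this.entries = entries;
        }

        public Map<String, String> entries() {
            return entries;
        }

        @Override
        public void setSystemId(String systemId) {
            this.systemId = systemId;
        }

        @Override
        public String getSystemId() {
            return systemId;
        }
    }

    // Stand-in for resolver logic: converts a MapSource into a DOMSource
    // and passes any other Source through unchanged.
    public static Source resolve(Source source) throws Exception {
        if (!(source instanceof MapSource)) {
            return source;
        }
        MapSource mapSource = (MapSource) source;

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();

        // createElementNS keeps the programmatically built tree namespace-aware.
        Element root = doc.createElementNS(null, "map");
        doc.appendChild(root);
        for (Map.Entry<String, String> e : mapSource.entries().entrySet()) {
            Element entry = doc.createElementNS(null, "entry");
            entry.setAttribute("key", e.getKey());
            entry.setTextContent(e.getValue());
            root.appendChild(entry);
        }
        // Carry the systemId across so that the base URI is not lost.
        return new DOMSource(doc, mapSource.getSystemId());
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> data = new LinkedHashMap<>();
        data.put("colour", "red");
        MapSource custom = new MapSource(data);
        custom.setSystemId("urn:example:map"); // hypothetical identifier

        Source resolved = resolve(custom);
        System.out.println(resolved.getClass().getSimpleName()
                + " with systemId " + resolved.getSystemId());
    }
}
```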