XML parsing in SaxonCS

XML parsing in SaxonCS is delegated to the System.Xml parser supplied with the .NET platform.

The Microsoft parser has some limitations:

  • It does not notify ID or IDREF values from the DTD, or expand attributes with fixed or default values, unless DTD validation is requested. This can be requested using the -v option on the command line, or via the API. (See the DtdValidation property of the DocumentBuilder class.)

  • It rejects documents that specify <?xml version="1.1"?>, or that use namespace undeclarations in the form xmlns:p="".

  • It does not notify unparsed entities to Saxon. The XSLT functions unparsed-entity-uri() and unparsed-entity-public-id() will therefore not work.

  • It does not notify changes of base URI (for example, at entity boundaries). In principle, Saxon could interrogate the parser to determine the base URI of each element as it is delivered. Currently this is not done, so in the absence of xml:base attributes, all elements in a document will have the same base URI, regardless of external entity boundaries.
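
As a minimal sketch of what DTD validation changes at the System.Xml level, the following reads a document whose internal DTD subset supplies a default attribute value. It uses System.Xml only and does not call Saxon; the sample document and variable names are illustrative assumptions. With validation enabled on the reader, the defaulted attribute should be reported on the element.

```csharp
using System;
using System.IO;
using System.Xml;

// Illustrative document: the internal DTD subset declares a default for @status.
const string xml =
    "<!DOCTYPE doc [\n" +
    "  <!ELEMENT doc EMPTY>\n" +
    "  <!ATTLIST doc status CDATA \"draft\">\n" +
    "]>\n" +
    "<doc/>";

var settings = new XmlReaderSettings
{
    DtdProcessing = DtdProcessing.Parse,   // DTDs are prohibited by default in XmlReaderSettings
    ValidationType = ValidationType.DTD    // request DTD validation
};

string status;
using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
{
    reader.MoveToContent();                      // skip the DOCTYPE and position on <doc>
    status = reader.GetAttribute("status");      // value supplied by the DTD, not the instance
}

Console.WriteLine(status);
```

When parsing through Saxon rather than directly, the equivalent switch is the DtdValidation property of DocumentBuilder mentioned above.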

There are several ways the System.Xml parser can be used:

  • In interfaces where a source XML document is provided by supplying a URI, or a Stream, or a TextReader, Saxon will invoke the Microsoft parser to parse the content. In this case the parser operates as a stream-based (pull-mode) parser, notifying parsing events to Saxon as they occur. Saxon may build a tree representation of the document internally, or it may process the data in streamed mode. This is generally more efficient than supplying a DOM.

  • Some interfaces also allow you to supply input in the form of an XmlReader. This allows you to control the settings and options applied to the XmlReader. Some settings may produce behavior that is not conformant with the W3C specifications (for example, switching character checking off), and which could potentially cause Saxon to behave unpredictably.

  • You can also use the System.Xml parser to construct an in-memory DOM tree (represented by an XmlDocument or XmlNode). Some Saxon interfaces accept input in the form of an XmlDocument or XmlNode. There are two ways Saxon can handle a DOM:

    • The DOM can be copied to an internal Saxon tree structure.

      Note that there's no point constructing a DOM just so Saxon can rebuild it: it's better to let Saxon parse the XML and build its own tree. However, this option is useful if you are using a DOM for other reasons, for example if your application has other parts that are DOM-based.

    • The DOM can be wrapped as a Saxon tree. This avoids the cost and memory overhead of copying the tree, but the result is slower to navigate.

    Note: SaxonCS does not currently offer the Domino model, which is a hybrid between copying and wrapping: it uses the DOM as supplied, but adds indexes for fast searching.
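
The input routes described above all start from ordinary System.Xml objects. The following sketch prepares two of them: an XmlReader with explicit settings, consumed here in pull mode to show the event stream, and an XmlDocument holding the same content as a DOM. It stops short of calling any Saxon interface, and the sample data and variable names are illustrative assumptions.

```csharp
using System;
using System.IO;
using System.Xml;

const string xml = "<order id=\"42\"><item>tea</item><item>milk</item></order>";

// Route 1: an XmlReader whose settings the caller controls.
// CheckCharacters is left on so the input stays conformant with the XML specification.
var settings = new XmlReaderSettings
{
    CheckCharacters = true,
    DtdProcessing = DtdProcessing.Parse
};

int elementCount = 0;
using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
{
    // Pull-mode parsing: each Read() call advances to the next parsing event.
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element)
        {
            elementCount++;
        }
    }
}

// Route 2: an in-memory DOM built by the same parser.
var dom = new XmlDocument();
dom.LoadXml(xml);
string firstItem = dom.DocumentElement.FirstChild.InnerText;

Console.WriteLine(elementCount);
Console.WriteLine(firstItem);
```

In a real application the reader would be handed to Saxon unread, since an XmlReader can only be consumed once. A DOM is worth building only if other parts of the application need it.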

To use URI catalogs and locally-cached documents with the Microsoft parser, install the NuGet package Org.XmlResolver and nominate its resolver as your XmlResolver before invoking the parser.
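
The general pattern of nominating a resolver can be sketched without the package. The example below is a stand-in, not the Org.XmlResolver API: it uses the XmlPreloadedResolver class that ships with .NET to serve a DTD from memory instead of fetching it from the network. The URI and DTD content are hypothetical. A catalog-based resolver would be assigned to the same XmlResolver setting.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Resolvers;

// Hypothetical system identifier and locally cached DTD content.
var dtdUri = new Uri("http://example.org/doc.dtd");
const string dtdText = "<!ELEMENT doc EMPTY>\n<!ATTLIST doc lang CDATA \"en\">";

// Stand-in for a catalog resolver: maps the URI to the cached bytes.
var resolver = new XmlPreloadedResolver();
resolver.Add(dtdUri, Encoding.UTF8.GetBytes(dtdText));

var settings = new XmlReaderSettings
{
    DtdProcessing = DtdProcessing.Parse,
    ValidationType = ValidationType.DTD,
    XmlResolver = resolver               // nominate the resolver before parsing
};

const string xml = "<!DOCTYPE doc SYSTEM \"http://example.org/doc.dtd\"><doc/>";

string lang;
using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
{
    reader.MoveToContent();              // the external DTD is fetched through the resolver
    lang = reader.GetAttribute("lang");  // default value declared in the cached DTD
}

Console.WriteLine(lang);
```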

No other XML parser is currently supported, although it is in principle possible to plug in a third-party parser, provided it is capable of delivering a Saxon XdmNode or a DOM XmlNode.

HTML parsing

The saxon:parse-html() extension function is available. It parses the supplied HTML content using HTML Agility Pack, returning a Saxon wrapper around the resulting HtmlDocument node, from which XPath navigation is possible.

Normal HTML elements (such as <p> and <div>) are delivered as element nodes whose local name is in lowercase and whose namespace URI is the XHTML namespace.