XML parsing in SaxonCS

XML parsing in SaxonCS is delegated to the System.Xml parser supplied with the .NET platform.

The Microsoft parser has some limitations:

It does not notify ID or IDREF values from the DTD, or expand attributes with fixed or default values, unless DTD validation is requested. This can be requested using the -dtd:on option on the command line, or via the API. (See the DtdValidation property of the DocumentBuilder class.)
It rejects documents that specify <?xml version="1.1"?>, or that use namespace undeclarations in the form xmlns:p="".
It does not notify unparsed entities to Saxon. The XSLT functions unparsed-entity-uri() and unparsed-entity-public-id() will therefore not work with the standard Saxon tree models.

A DOM tree built using the Microsoft parser (an XmlDocument instance) does, however, contain information about unparsed entities, provided that DTD processing is enabled. From Saxon 13, the unparsed-entity-uri() and unparsed-entity-public-id() functions therefore work correctly provided that the Saxon XDM tree is built as a wrapper over an XmlDocument. Note that this carries a significant performance penalty compared with Saxon's native tree implementation.
It does not notify changes of base URI (for example, at entity boundaries). In principle, Saxon could interrogate the parser to determine the base URI of each element as it is delivered. Currently this is not done, so in the absence of xml:base attributes, all elements in a document will have the same base URI, regardless of external entity boundaries.
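
As an illustration of the first limitation, the following is a minimal sketch of requesting DTD validation through the API so that ID values and defaulted attributes become visible to Saxon. It assumes the Saxon.Api namespace, a parameterless Processor constructor (a SaxonCS license is assumed to be discoverable), and that DocumentBuilder exposes a BaseUri property and a Build(TextReader) overload; the sample document and the example.com base URI are hypothetical. Check the names against the API documentation for your release.

```csharp
using System;
using System.IO;
using Saxon.Api;

// Hypothetical sample: the internal DTD subset declares an ID attribute
// and an attribute with a default value.
const string xml = @"<!DOCTYPE doc [
  <!ELEMENT doc (item*)>
  <!ELEMENT item (#PCDATA)>
  <!ATTLIST item id     ID    #REQUIRED
                 status CDATA ""draft"">
]>
<doc><item id=""a1"">First</item></doc>";

var processor = new Processor();
DocumentBuilder builder = processor.NewDocumentBuilder();

// Request DTD validation (the DtdValidation property named above).
builder.DtdValidation = true;

// Assumed to be needed when parsing from a TextReader; hypothetical URI.
builder.BaseUri = new Uri("http://example.com/doc.xml");

XdmNode doc = builder.Build(new StringReader(xml));

// With validation on, id() should be able to find the element, and the
// defaulted status attribute should be present on it.
XPathCompiler xpath = processor.NewXPathCompiler();
XdmValue status = xpath.Evaluate("string(id('a1')/@status)", doc);
Console.WriteLine(status);
```

Running the same code with DtdValidation left at its default is a simple way to observe the limitation described above.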

There are several ways the System.Xml parser can be used:

In interfaces where a source XML document is provided by supplying a URI, a Stream, or a TextReader, Saxon invokes the Microsoft parser to parse the content. In this case the parser operates as a stream-based (pull-mode) parser, notifying parsing events to Saxon as they occur. Saxon may build a tree representation of the document internally, or it may process the data in streamed mode. This is generally more efficient than supplying a DOM.
Some interfaces also allow you to supply input in the form of an XmlReader. This allows you to control the settings and options applied to the XmlReader. Some settings may produce behavior that is not conformant with the W3C specifications (for example, switching character checking off), and which could potentially cause Saxon to behave unpredictably.
You can also use the System.Xml parser to construct an in-memory DOM tree (represented by an XmlDocument or XmlNode). Some Saxon interfaces accept input in the form of an XmlDocument or XmlNode. There are two ways Saxon can handle a DOM:
- The DOM can be copied to an internal Saxon tree structure.
  
  Note that there's no point constructing a DOM just so Saxon can rebuild it: it's better to let Saxon parse the XML and build its own tree. However, this option is useful if you are using a DOM for other reasons, for example if your application has other parts that are DOM-based.
- The DOM can be wrapped as a Saxon tree. This avoids the cost and memory overhead of copying the tree, but the result is slower to navigate.
- (SaxonCS does not currently offer the Domino model, which is a hybrid between copying and wrapping: Domino on SaxonJ uses the DOM as supplied, but adds indexes for fast searching.)
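
The options above can be sketched as follows. This is an illustrative outline rather than a definitive recipe: it assumes that DocumentBuilder offers Build(XmlReader), Build(XmlNode), and Wrap(XmlDocument) methods, that XdmNode has an OuterXml property, and that a parameterless Processor constructor is available. The sample document and base URI are hypothetical. The System.Xml calls are standard .NET.

```csharp
using System;
using System.IO;
using System.Xml;
using Saxon.Api;

const string xml = "<doc><item>First</item></doc>";   // hypothetical input
const string baseUri = "http://example.com/doc.xml";  // hypothetical base URI

var processor = new Processor();
DocumentBuilder builder = processor.NewDocumentBuilder();
builder.BaseUri = new Uri(baseUri);

// Option 1: supply an XmlReader so that the parser settings are under
// the caller's control. Non-conformant settings (for example
// CheckCharacters = false) are best avoided.
var settings = new XmlReaderSettings
{
    DtdProcessing = DtdProcessing.Parse,  // the .NET default is Prohibit
    CheckCharacters = true
};
using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings, baseUri))
{
    XdmNode fromReader = builder.Build(reader);
    Console.WriteLine(fromReader.OuterXml);
}

// Option 2: start from a DOM that the application already holds.
var dom = new XmlDocument();
dom.LoadXml(xml);

// 2a. Copy the DOM into a native Saxon tree: costs time and memory up
//     front, but navigation is fast afterwards.
XdmNode copied = builder.Build(dom);

// 2b. Wrap the DOM: no copying, but slower to navigate.
XdmNode wrapped = builder.Wrap(dom);

Console.WriteLine(copied.OuterXml);
Console.WriteLine(wrapped.OuterXml);
```

Copying tends to suit cases where the document is queried many times; wrapping may be preferable for a large DOM that is queried only once.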

No other XML parser is currently supported, although it is in principle possible to plug in a third-party parser, provided it is capable of delivering a Saxon XdmNode or a DOM XmlNode.

SaxonCS automatically uses the XmlResolver library, which resolves a number of well-known W3C URIs to local copies of the relevant resources, and which can be configured to use a catalog file defining local locations for other URIs. Saxon configures the Microsoft XML parser to retrieve resources via this library.

HTML parsing

Parsing of HTML5 documents can be achieved by calling the XPath 4.0 function fn:parse-html(). Since SaxonCS 12, Saxon parses the supplied HTML content using AngleSharp, returning a Saxon wrapper around the resulting AngleSharp.Dom.IDocument node, from which XPath navigation is possible.

SaxonCS 11 used HtmlAgilityPack as the underlying parser. In addition to the change of underlying technology, the function has been much more thoroughly tested since SaxonCS 12: conformance with HTML5 has been checked by comparing the results of nearly 1400 test cases with other implementations. About 30 of these tests return different results across SaxonJ and SaxonCS; these are being investigated. The differences are almost exclusively concerned with the way in which the parser repairs erroneous HTML.

Normal HTML elements (such as <p> and <div>) are delivered as element nodes in the XHTML namespace, with the local name in lower case.