XML parsing in SaxonCS
XML parsing in SaxonCS is delegated to the
System.Xml parser supplied with the .NET platform.
The Microsoft parser has some limitations:
It does not notify ID or IDREF values from the DTD, or expand attributes with fixed or default values, unless DTD validation is requested. This can be requested using the
-v option on the command line, or via the API. (See the
DtdValidation property of the DocumentBuilder class.)
It rejects documents that specify
<?xml version="1.1"?>, or that use namespace undeclarations in the form
It does not notify changes of base URI (for example, at entity boundaries). In principle, Saxon could interrogate the parser to determine the base URI of each element as it is delivered. Currently this is not done, so in the absence of
xml:base attributes, all elements in a document will have the same base URI, regardless of external entity boundaries.
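As a minimal sketch of requesting DTD validation through the API, the following assumes the Saxon.Api namespace, a parameterless Processor constructor, and that DocumentBuilder exposes BaseUri and a Build overload taking a TextReader. The DTD, the document content, and the base URI are invented for illustration. With DtdValidation switched on, the defaulted status attribute and the ID attribute would be expected to reach Saxon.

```csharp
using System;
using System.IO;
using Saxon.Api;

class DtdValidationSketch
{
    static void Main()
    {
        // Illustrative document: the internal DTD subset declares an ID
        // attribute and an attribute with a default value.
        const string xml =
            "<!DOCTYPE doc [\n" +
            "  <!ELEMENT doc EMPTY>\n" +
            "  <!ATTLIST doc id ID #REQUIRED status CDATA \"draft\">\n" +
            "]>\n" +
            "<doc id=\"d1\"/>";

        var processor = new Processor();
        DocumentBuilder builder = processor.NewDocumentBuilder();

        // Ask the System.Xml parser to validate against the DTD, so that
        // ID values and defaulted attributes are notified to Saxon.
        builder.DtdValidation = true;

        // Hypothetical base URI; assumed to be needed when the input is a reader.
        builder.BaseUri = new Uri("http://example.com/doc.xml");

        XdmNode doc = builder.Build(new StringReader(xml));

        // Serialize the resulting tree to inspect which attributes were delivered.
        Console.WriteLine(doc);
    }
}
```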
There are several ways the
System.Xml parser can be used:
In interfaces where a source XML document is provided by supplying a URI, or a
Stream, or a
TextReader, Saxon will invoke the Microsoft parser to parse the content. In this case the parser operates as a stream-based (pull-mode) parser, notifying parsing events to Saxon as they occur. Saxon may build a tree representation of the document internally, or it may process the data in streamed mode. This is generally more efficient than supplying a DOM.
Some interfaces also allow you to supply input in the form of an
XmlReader. This allows you to control the settings and options applied to the
XmlReader. Some settings may produce behavior that is not conformant with the W3C specifications (for example, switching character checking off), and which could potentially cause Saxon to behave unpredictably.
You can also use the
System.Xml parser to construct an in-memory DOM tree (represented by an
XmlNode). Some Saxon interfaces accept input in the form of an
XmlNode. There are two ways Saxon can handle a DOM:
The DOM can be copied to an internal Saxon tree structure.
Note that there's no point constructing a DOM just so Saxon can rebuild it: it's better to let Saxon parse the XML and build its own tree. However, this option is useful if you are using a DOM for other reasons, for example if your application has other parts that are DOM-based.
The DOM can be wrapped as a Saxon tree. This avoids the cost and memory overhead of copying the tree, but the result is slower to navigate.
(SaxonCS does not currently offer the Domino model, which is a hybrid between copying and wrapping: Domino on SaxonJ uses the DOM as supplied, but adds indexes for fast searching.)
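The ways of supplying input described above can be sketched as follows. This is an illustration under assumptions rather than a definitive recipe: it assumes that DocumentBuilder offers Build overloads for TextReader, XmlReader and XmlNode, plus a Wrap method taking an XmlDocument. The sample document and base URI are invented.

```csharp
using System;
using System.IO;
using System.Xml;
using Saxon.Api;

class InputOptionsSketch
{
    static void Main()
    {
        const string xml = "<order id=\"42\"><item>widget</item></order>";
        var baseUri = new Uri("http://example.com/order.xml"); // hypothetical

        var processor = new Processor();
        DocumentBuilder builder = processor.NewDocumentBuilder();
        builder.BaseUri = baseUri;

        // 1. Supply a TextReader: Saxon invokes the System.Xml parser itself.
        XdmNode fromText = builder.Build(new StringReader(xml));

        // 2. Supply an XmlReader: the caller controls the parser settings.
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            CheckCharacters = true // leaving this on keeps the input W3C-conformant
        };
        XdmNode fromReader;
        using (XmlReader reader =
               XmlReader.Create(new StringReader(xml), settings, baseUri.ToString()))
        {
            fromReader = builder.Build(reader);
        }

        // 3. Supply an existing DOM, either copied or wrapped.
        var dom = new XmlDocument();
        dom.LoadXml(xml);
        XdmNode copied = builder.Build(dom);  // copied into a Saxon tree
        XdmNode wrapped = builder.Wrap(dom);  // wrapped in place (assumed method)

        Console.WriteLine(fromText);
        Console.WriteLine(fromReader);
        Console.WriteLine(copied);
        Console.WriteLine(wrapped);
    }
}
```

If the application does not already hold a DOM, the first form is normally the one to prefer, since it avoids building the document twice.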
To use URI catalogs and locally-cached documents with the Microsoft parser,
download nuget package
and nominate this as your
XmlResolver before invoking the parser.
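The general mechanism of nominating an XmlResolver can be sketched with System.Xml alone. The class below is a hypothetical stand-in, not the API of the NuGet package mentioned above: it redirects selected URIs to local files using a simple dictionary rather than a real catalog, and the URI and file path are invented.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical resolver: serves known URIs from local copies and
// falls back to normal retrieval for everything else.
class LocalCacheResolver : XmlUrlResolver
{
    private readonly Dictionary<string, string> localCopies;

    public LocalCacheResolver(Dictionary<string, string> localCopies)
    {
        this.localCopies = localCopies;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (localCopies.TryGetValue(absoluteUri.AbsoluteUri, out string path))
        {
            return File.OpenRead(path); // the parser reads the cached copy
        }
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

class ResolverSketch
{
    static XmlReader OpenWithCache(string documentUri)
    {
        var resolver = new LocalCacheResolver(new Dictionary<string, string>
        {
            // Invented mapping from a remote DTD to a local file.
            ["http://example.com/dtd/doc.dtd"] = "cache/doc.dtd"
        });

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            XmlResolver = resolver // nominated before the parser is invoked
        };
        return XmlReader.Create(documentUri, settings);
    }
}
```

The resulting XmlReader can then be supplied to any Saxon interface that accepts an XmlReader.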
No other XML parser is currently supported, although it is in principle possible to plug in a third-party
parser provided it is capable of delivering a Saxon
XdmNode or a DOM.
Parsing of HTML5 documents can be achieved by calling either of the functions
saxon:parse-html() or (if 4.0 extensions are enabled)
fn:parse-html(). Both forms are identical. The function has changed substantially
between SaxonCS 11 and SaxonCS 12; whereas SaxonCS 11 used
HtmlAgilityPack as the underlying
parser, SaxonCS 12 uses
AngleSharp, which conforms much more closely to the standard HTML5 parsing algorithm.
Saxon parses the supplied HTML content using
AngleSharp, returning a Saxon wrapper around the resulting
AngleSharp.Dom.IDocument node, from which XPath navigation is possible.
Normal HTML elements (such as
<div>, etc.) are delivered
as element nodes whose local name is in lower-case, with the XHTML namespace URI.
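A minimal sketch of calling the function from C# follows. It assumes the XPathCompiler class with DeclareNamespace and Evaluate methods, and that the saxon prefix is bound to http://saxon.sf.net/. The HTML fragment and the h prefix are invented for illustration. The upper-case DIV in the input is selected as h:div because element names are delivered in lower case in the XHTML namespace.

```csharp
using System;
using Saxon.Api;

class ParseHtmlSketch
{
    static void Main()
    {
        var processor = new Processor();
        XPathCompiler xpath = processor.NewXPathCompiler();

        // Bind the Saxon extension namespace and the XHTML namespace.
        xpath.DeclareNamespace("saxon", "http://saxon.sf.net/");
        xpath.DeclareNamespace("h", "http://www.w3.org/1999/xhtml");

        // Parse an HTML fragment and navigate the result with XPath.
        XdmValue divs = xpath.Evaluate(
            "saxon:parse-html('<DIV class=\"note\">Hello</DIV>')//h:div",
            null); // no context item is needed for this expression

        foreach (XdmItem item in divs)
        {
            Console.WriteLine(item);
        }
    }
}
```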