XML parsing in SaxonCS
XML parsing in SaxonCS is delegated to
the System.Xml parser supplied with the .NET platform.
The Microsoft parser has some limitations:
It does not notify ID or IDREF values from the DTD, or expand attributes with fixed or default values, unless DTD validation is requested. This can be requested using the -v option on the command line, or via the API. (See the DtdValidation property of the DocumentBuilder class.)
It rejects documents that specify <?xml version="1.1"?>, or that use namespace undeclarations in the form xmlns:p="".
It does not notify unparsed entities to Saxon. The XSLT functions unparsed-entity-uri() and unparsed-entity-public-id() will therefore not work with the standard Saxon tree models.
A DOM tree built using the Microsoft parser (an XmlDocument instance) does however contain information about unparsed entities, provided that DTD processing is enabled. From Saxon 13, the unparsed-entity-uri and unparsed-entity-public-id functions therefore work correctly provided that the Saxon XDM tree is built as a wrapper over an XmlDocument. Note that this carries a significant performance penalty compared with Saxon's native tree implementation.
It does not notify changes of base URI (for example, at entity boundaries). In principle, Saxon could interrogate the parser to determine the base URI of each element as it is delivered. Currently this is not done, so in the absence of xml:base attributes, all elements in a document will have the same base URI, regardless of external entity boundaries.
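As an illustration of the point about unparsed entities, the following sketch uses only System.Xml (no Saxon API is involved) to show that an XmlDocument loaded with DTD processing enabled retains the unparsed entity declarations. The class name, method name, and sample document are hypothetical and exist only for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class UnparsedEntitySketch
{
    // Hypothetical sample: one unparsed entity declared in the internal DTD subset.
    public const string Sample =
        "<!DOCTYPE doc [" +
        " <!ELEMENT doc EMPTY>" +
        " <!NOTATION png SYSTEM 'image/png'>" +
        " <!ENTITY logo SYSTEM 'logo.png' NDATA png>" +
        "]><doc/>";

    // Returns a map from entity name to system identifier for every
    // unparsed entity (one declared with NDATA) found in the DOM.
    public static Dictionary<string, string> List(string xml)
    {
        // DTD processing must be enabled explicitly; the default for
        // XmlReaderSettings is to prohibit DTDs.
        var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse };

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            doc.Load(reader);
        }

        var result = new Dictionary<string, string>();
        if (doc.DocumentType == null)
        {
            return result;
        }

        foreach (XmlNode node in doc.DocumentType.Entities)
        {
            // NotationName is null for parsed entities, so this keeps
            // only the unparsed ones.
            if (node is XmlEntity entity && entity.NotationName != null)
            {
                result[entity.Name] = entity.SystemId;
            }
        }
        return result;
    }
}
```

Calling UnparsedEntitySketch.List(UnparsedEntitySketch.Sample) inspects the DOM directly. Whether Saxon exposes the same information depends on the tree being built as a wrapper over the XmlDocument, as described above.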
There are several ways the System.Xml parser can be used:
In interfaces where a source XML document is provided by supplying a URI, or a Stream, or a TextReader, Saxon will invoke the Microsoft parser to parse the content. In this case the parser operates as a stream-based (pull-mode) parser, notifying parsing events to Saxon as they occur. Saxon may build a tree representation of the document internally, or it may process the data in streamed mode. This is generally more efficient than supplying a DOM.
Some interfaces also allow you to supply input in the form of an XmlReader. This allows you to control the settings and options applied to the XmlReader. Some settings may produce behavior that is not conformant with the W3C specifications (for example, switching character checking off), and which could potentially cause Saxon to behave unpredictably.
You can also use the System.Xml parser to construct an in-memory DOM tree (represented by an XmlDocument or XmlNode). Some Saxon interfaces accept input in the form of an XmlDocument or XmlNode. There are two ways Saxon can handle a DOM:
The DOM can be copied to an internal Saxon tree structure.
Note that there's no point constructing a DOM just so Saxon can rebuild it: it's better to let Saxon parse the XML and build its own tree. However, this option is useful if you are using a DOM for other reasons, for example if your application has other parts that are DOM-based.
The DOM can be wrapped as a Saxon tree. This avoids the cost and memory overhead of copying the tree, but the result is slower to navigate.
(SaxonCS does not currently offer the Domino model, which is a hybrid between copying and wrapping: Domino on SaxonJ uses the DOM as supplied, but adds indexes for fast searching.)
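For the XmlReader option, a minimal sketch of preparing a reader with explicit settings might look like the following. It uses only System.Xml; the class and method names and the sample document are hypothetical, and the step of handing the reader to a Saxon interface is deliberately not shown because it depends on which interface is used.

```csharp
using System.IO;
using System.Xml;

public static class ValidatingReaderSketch
{
    // Hypothetical sample: the DTD declares a default value for the status attribute.
    public const string Sample =
        "<!DOCTYPE doc [" +
        " <!ELEMENT doc EMPTY>" +
        " <!ATTLIST doc status CDATA 'draft'>" +
        "]><doc/>";

    // Creates an XmlReader that parses the DTD and validates against it.
    // CheckCharacters is left at its default (true), so character checking
    // stays conformant with the XML specification.
    public static XmlReader Create(TextReader input)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse,
            ValidationType = ValidationType.DTD
        };
        return XmlReader.Create(input, settings);
    }

    // Reads the status attribute of the root element, which is present
    // only because the validating reader supplies the DTD default.
    public static string StatusOfRoot(string xml)
    {
        using (XmlReader reader = Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.GetAttribute("status");
        }
    }
}
```

A reader created this way could then be supplied to a Saxon interface that accepts an XmlReader. Keeping the settings in one place makes it easier to see which options depart from the defaults.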
No other XML parser is currently supported, although it is in principle possible to plug in a third-party
parser provided it is capable of delivering a Saxon XdmNode or a DOM XmlNode.
SaxonCS automatically uses the XmlResolver library, which resolves a number of well-known W3C URIs to local copies of the relevant resources, and which can be configured to use a catalog file defining local locations for other URIs. Saxon configures the Microsoft XML parser to retrieve resources via this library.
HTML parsing
Parsing of HTML5 documents can be achieved by calling the XPath 4.0 function
fn:parse-html(). Since SaxonCS 12,
Saxon parses the supplied HTML content using AngleSharp, returning a Saxon wrapper around the resulting
AngleSharp.Dom.IDocument node, from which XPath navigation is possible.
Normal HTML elements (such as <p> and <div>) are delivered
as element nodes whose local name is in lower case, in the XHTML namespace.