Controlling Parsing of Source Documents
Saxon does not include its own XML parser. By default:
On the Java platform, the default SAX parser provided as part of the JDK is used. With the Sun/Oracle JDK, this is a variant of the Apache Xerces parser customized by Sun.
On the .NET platform, Saxon includes a copy of the Apache Xerces parser cross-compiled to run on .NET.
An error reported by the XML parser is generally fatal. It is not possible to process ill-formed XML.
There are several ways you can cause a different XML parser to be used:
The -x and -y options on the command line can be used to specify the class name of a SAX parser, which Saxon will load in preference to the default SAX parser. The -x option is used for source XML documents, the -y option for schemas and stylesheets. The equivalent options can be set programmatically or by using the configuration file.
By default Saxon uses the SAXParserFactory mechanism to load a parser. This can be configured by setting the system property javax.xml.parsers.SAXParserFactory, by means of the file lib/jaxp.properties in the JRE directory, or by adding another parser to the
The source for parsing can be supplied in the form of a SAXSource object, which has an XMLReader property containing the parser instance to be used.
On .NET, the configuration option PREFER_JAXP_PARSER can be set to false, in which case Saxon will use the Microsoft XML parser instead of the Apache parser. (This parser is not used by default because it does not notify ID attributes to the application, which means the XPath id() and idref() functions do not work.)
For a document read using the doc() or document() functions, the parser (XMLReader) to be used can be specified using the query parameter ?parser=full.class.name in the document URI, but only if the StandardURIResolver is used, and the feature is enabled by calling Configuration.setParameterizedURIResolver() or by setting
Transform command lines. For example, parser=org.ccil.cowan.tagsoup.Parser causes John Cowan's TagSoup parser for HTML to be used.
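As an illustration of the SAXSource route described above, the following is a minimal sketch that uses only the standard JAXP API. The class and method names (SaxSourceExample, sourceWithParser, identityTransform) are hypothetical, and the sketch obtains its XMLReader from the default SAXParserFactory; any other XMLReader implementation could be substituted at that point. TransformerFactory.newInstance() returns whichever JAXP transformer is configured, which is Saxon only if Saxon is on the classpath and selected.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

/** Hypothetical helper; the names here are illustrative and not part of Saxon. */
final class SaxSourceExample {

    /** Wraps the XML text in a SAXSource that carries its own parser instance. */
    static SAXSource sourceWithParser(String xml) {
        try {
            // Assumption: the default JAXP parser is good enough for the sketch.
            // A different XMLReader (for example an HTML parser) could be created here instead.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            return new SAXSource(reader, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not create the XML parser", e);
        }
    }

    /** Runs an identity transformation over the source and returns the serialized result. */
    static String identityTransform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // The transformer parses the input with the XMLReader held by the SAXSource.
            transformer.transform(sourceWithParser(xml), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Because the caller owns the parser instance in this arrangement, any features or handlers set on the XMLReader before it is wrapped remain the caller's responsibility.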
Saxonica traditionally recommended use of the Xerces parser from Apache in preference to the version bundled in the JDK, which was known to have some serious bugs. However, there is some evidence that the version bundled in Java 8 is more reliable.
By default, Saxon invokes the parser in non-validating mode (that is, without requesting DTD
validation). Note, however, that the parser still needs to read the DTD if one is present,
because it may contain entity definitions that need to be expanded. DTD validation can be
requested using -dtd:on on the command line, or equivalent API or configuration options.
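The difference between the two modes can be seen with a plain JAXP parser, independently of Saxon. The sketch below is only an illustration: DtdModesExample and its members are hypothetical names, and SAXParserFactory.setValidating() stands in for whatever Saxon does internally when DTD validation is requested. In both modes the entity declared in the DTD is expanded; only the validating parse reports that the document breaks its content model.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demonstration class; not part of Saxon. */
final class DtdModesExample {

    // Internal DTD subset: <doc> may contain text only, and declares one entity.
    static final String DTD =
          "<!DOCTYPE doc [\n"
        + "  <!ELEMENT doc (#PCDATA)>\n"
        + "  <!ENTITY product \"Saxon\">\n"
        + "]>\n";

    static final String VALID = DTD + "<doc>&product;</doc>";
    // Well-formed, but <extra/> is not allowed by the DTD.
    static final String INVALID = DTD + "<doc><extra/>&product;</doc>";

    /** Parses the document, collects validity errors, and returns the character content. */
    static String parse(String xml, boolean validate, List<String> errors) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void error(SAXParseException e) {
                // Validity errors are recoverable; record them and carry on.
                errors.add(e.getMessage());
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validate);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }
}
```

Calling parse(INVALID, false, errors) leaves the error list empty even though the DTD was read, whereas parse(INVALID, true, errors) fills it.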
Saxon is issued with local copies of commonly-used W3C DTDs such as the XHTML, SVG, and
MathML DTDs. When Saxon itself instantiates the XML parser, it will use an
EntityResolver that causes these local copies of DTDs to be used rather than
fetching public copies from the web (the W3C servers are increasingly failing to serve these
requests as the volume of traffic is too high). It is possible to override this using the
configuration option ENTITY_RESOLVER_CLASS, which can be set to the name of a class that
implements EntityResolver, or to the empty string to indicate that no
EntityResolver should be used. Saxon will not add this
EntityResolver in cases where the XML parser instance is supplied by the caller
as part of a
SAXSource object. It will add it to a parser obtained as an instance
of the class specified using the
-x or -y command line options,
unless either the use of the
EntityResolver is suppressed using the
ENTITY_RESOLVER_CLASS configuration option, or the instantiated parser already
has an EntityResolver registered.
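The idea of serving DTDs from local copies can be sketched with the standard SAX EntityResolver interface. This is not Saxon's own resolver: LocalDtdResolver, LocalDtdExample, the public identifier, and the example.invalid URL are all hypothetical, and the "local copy" is held in a string purely to keep the example self-contained. A caller who supplies their own parser in a SAXSource could register a resolver of this kind on it.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical resolver that maps public identifiers to locally held DTD text. */
final class LocalDtdResolver implements EntityResolver {
    private final Map<String, String> localCopies;

    LocalDtdResolver(Map<String, String> localCopies) {
        this.localCopies = localCopies;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String dtdText = (publicId == null) ? null : localCopies.get(publicId);
        if (dtdText == null) {
            return null; // unknown entity: let the parser apply its default behaviour
        }
        InputSource source = new InputSource(new StringReader(dtdText));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }
}

/** Hypothetical demonstration class; not part of Saxon. */
final class LocalDtdExample {
    static final String PUBLIC_ID = "-//EXAMPLE//DTD Demo 1.0//EN";
    static final String LOCAL_DTD =
          "<!ELEMENT doc (#PCDATA)>\n"
        + "<!ENTITY greeting \"hello\">";
    // The system identifier points at a host that does not exist.
    static final String DOCUMENT =
          "<!DOCTYPE doc PUBLIC \"" + PUBLIC_ID + "\" \"http://example.invalid/demo.dtd\">\n"
        + "<doc>&greeting;</doc>";

    /** Parses DOCUMENT with the local resolver installed and returns its text content. */
    static String parseWithLocalDtd() {
        StringBuilder text = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setEntityResolver(
                new LocalDtdResolver(Collections.singletonMap(PUBLIC_ID, LOCAL_DTD)));
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }
            });
            reader.parse(new InputSource(new StringReader(DOCUMENT)));
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return text.toString();
    }
}
```

Returning null for identifiers the resolver does not recognise keeps the parser's normal behaviour for everything else, so the local copies act as a cache and not as a whitelist.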
Saxon never asks the XML parser to perform schema validation. If schema validation is
required it should be requested using the command line options
-val:strict or -val:lax, or their API equivalents. Saxon will then use its own schema
processor to validate the document as it emerges from the XML parser. Schema processing is
done in parallel with parsing, by use of a SAX-like pipeline.
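The pipeline arrangement can be pictured with the JDK's own javax.xml.validation API, which is used here only as an analogy: it is not Saxon's schema processor, and ValidationPipelineExample, validateWhileParsing, and the one-element schema are hypothetical. The point of the sketch is the wiring: the parser is left non-validating, and a separate validating stage sits between the parser and the downstream consumer, checking events as they stream past.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demonstration class; not part of Saxon. */
final class ValidationPipelineExample {

    // Assumed schema for the sketch: a single element holding an integer.
    static final String XSD =
          "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"count\" type=\"xs:integer\"/>"
        + "</xs:schema>";

    /** Parses the document through a validating stage and returns the validity errors. */
    static List<String> validateWhileParsing(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));

            // Middle stage of the pipeline: validates SAX events as they pass through.
            ValidatorHandler validator = schema.newValidatorHandler();
            validator.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            // Final stage: a do-nothing consumer standing in for a tree builder.
            validator.setContentHandler(new DefaultHandler());

            // First stage: an ordinary, non-validating, namespace-aware parser.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(validator);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Validation pipeline failed", e);
        }
        return errors;
    }
}
```

With this wiring, validateWhileParsing("<count>42</count>") yields an empty list, while a non-integer value such as "<count>many</count>" produces at least one error without the document being parsed a second time.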