Parses HTML supplied as a string.
parse-html($html as xs:string) ➔ document-node()
The HTML content as a string
Requires Saxon-PE or Saxon-EE.
Notes on the Saxon implementation
Available since Saxon 9.2.
This function takes a single argument, a string containing the source text of an HTML document. It returns the document node (root node) that results from parsing this text using the TagSoup parser.
On the Java platform, the TagSoup jar file must be on the classpath. It may be downloaded from https://mvnrepository.com/artifact/org.ccil.cowan.tagsoup/tagsoup/1.2.
On the SaxonCS platform, the HTML is parsed using HtmlAgilityPack, which is registered as a dependency and will normally be installed automatically by NuGet.
This function is useful where an HTML document is embedded inside another document as text, for example in a CDATA section. It can also be used in conjunction with the unparsed-text() function to read HTML from filestore. Note that the base URI of the document is not retained in this case.
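The embedded-HTML case can be sketched in XQuery as follows. This is a minimal illustration, not a definitive recipe: the wrapper element names (item, body) and the sample HTML are hypothetical, and the saxon prefix is assumed to be bound to the Saxon extension namespace http://saxon.sf.net/.

```xquery
declare namespace saxon = "http://saxon.sf.net/";

(: Hypothetical wrapper document: the HTML travels as plain text
   inside a CDATA section, so it is not parsed as XML. :)
let $wrapper :=
  <item>
    <body><![CDATA[<html><body><p>Hello<br>world</p></body></html>]]></body>
  </item>

(: string() yields the raw HTML text; parse-html turns it into a document node. :)
return saxon:parse-html(string($wrapper/body))
```

To read HTML from filestore instead, the argument could be replaced by a call such as unparsed-text("page.html"), where page.html is a placeholder file name.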
Because different parsers are used, and there is no standard mapping from HTML to the XDM model, there are some differences depending on the platform:
- On both platforms, HTML elements will have lower-case local names and will be in the XHTML namespace.
- On Java, there will be an additional namespace declaration binding the prefix html to the XHTML namespace.
- On Java, default attributes are expanded: for example, an element written as <br/> will appear in the XDM model with an additional defaulted attribute.
- On SaxonCS, defaulted attribute values are not included in the XDM model.
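Given these differences, a query that should behave the same on both platforms can select elements by lower-case name in the XHTML namespace and avoid relying on defaulted attributes. The sketch below assumes the standard XHTML namespace URI http://www.w3.org/1999/xhtml; the prefix h and the sample HTML are illustrative choices only.

```xquery
declare namespace saxon = "http://saxon.sf.net/";
(: Assumption: the standard XHTML namespace URI; the prefix h is arbitrary. :)
declare namespace h = "http://www.w3.org/1999/xhtml";

let $doc := saxon:parse-html(
  "<html><body><p>One<br>two</p><a href='x.html'>link</a></body></html>")
return (
  (: Portable: lower-case names in the XHTML namespace, and an attribute
     that is explicitly written in the source. :)
  $doc//h:a/@href/string(),

  (: Platform-dependent: br has no attributes in the source, so any
     attributes counted here would be defaulted ones added by the parser. :)
  count($doc//h:br/@*)
)
```

According to the notes above, the second expression may differ between Java and SaxonCS, so portable code should not depend on its result.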