Writing input filters

SaxonJ generally takes its input from a JAXP SAXSource object, which represents a sequence of SAX events as output by an XML parser. These events are sent to the internal class ReceivingContentHandler, which converts them to a slightly different form and passes them to a Saxon Receiver. In a typical scenario, the events then pass through a pipeline of Receivers, each of which modifies the events in some way.
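To make "a sequence of SAX events" concrete, the following sketch pulls the events out of a SAXSource by hand and records them. It uses only the JDK's own parser. The class name SaxEventTrace and the recording handler are illustrative and are not part of Saxon; in Saxon, ReceivingContentHandler occupies roughly the position taken here by the recorder.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: SaxEventTrace is a hypothetical name, not a Saxon class.
class SaxEventTrace {

    /** Drives the parser held by a SAXSource and returns a readable list of the events seen. */
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();

        // The consumer of the events; it records only three kinds of event.
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("startElement " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                events.add("characters " + new String(ch, start, length));
            }
        };

        // A SAXSource is just a pairing of an XMLReader (the parser) with an InputSource.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();
        SAXSource source = new SAXSource(parser, new InputSource(new StringReader(xml)));

        // The consumer registers its handler and asks the reader to parse.
        source.getXMLReader().setContentHandler(recorder);
        source.getXMLReader().parse(source.getInputSource());
        return events;
    }

    public static void main(String[] args) throws Exception {
        trace("<a><b>hi</b></a>").forEach(System.out::println);
    }
}
```

Anything that implements XMLReader can stand in for the parser in this arrangement, which is what makes SAX-level filtering possible.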

Examples of the steps on this pipeline include:

  • A whitespace stripper, responsible for removing whitespace as directed by the xsl:strip-space and xsl:preserve-space declarations.
  • A schema validator, responsible for performing schema validation (which not only validates the input against the schema, but also adds type annotations and expands default values for absent attributes).
  • An annotation stripper, responsible for removing type annotations as directed by the input-type-annotations="strip" attribute in a stylesheet.

At the end of this pipeline, the events are typically passed to one of:

  • A tree builder, which builds a tree of nodes, ready for query or transformation.
  • A streaming XSLT transformation.
  • A serializer (to implement an identity transformation).

It is possible to add a user-written filter to the input pipeline. This might be used, for example, to:

  • Rename elements or attributes, perhaps changing their namespace.
  • Add or remove elements or attributes.
  • Strip comments or processing instructions.
  • Expand processing instructions (for example, a processing instruction might contain a SQL query to access a database).
  • Perform a complete XSLT transformation, streamed or unstreamed.

A filter can be inserted either to process SAX events, before they are converted to Receiver events, or to process Receiver events after the conversion.

To filter events at the SAX level, the techniques include:

  • Generate the transformation as an XMLFilter using the newXMLFilter() method of the SAXTransformerFactory. This works with XSLT only. A drawback of this approach is that the standard JAXP facilities offer no way to supply parameters to the transformation. As a workaround, cast the XMLFilter to net.sf.saxon.jaxp.FilterImpl and call its getTransformer() method, which returns a Transformer object offering the usual setParameter() method.

  • Generate the transformation as a SAX ContentHandler (a TransformerHandler) using the newTransformerHandler() method of the SAXTransformerFactory. The pipeline stages after the transformation can be added by giving the transformation a SAXResult as its destination. Again, this works with XSLT only.

  • Implement the pipeline step before the transformation or query as an XMLFilter, and use this as the XMLReader part of a SAXSource, pretending to be an XML parser. This technique works with both XSLT and XQuery, and it can even be used from the command line, by nominating the XMLFilter as the source parser using the -x option.
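The third technique can be sketched with JDK classes alone. The example below is a minimal illustration, not Saxon code: RenamingFilterDemo and RenamingFilter are hypothetical names, the renaming of para to p is an arbitrary choice, and the JDK's identity Transformer stands in for the Saxon transformation or query that would normally consume the SAXSource.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Illustrative only: these class names are hypothetical and not part of Saxon.
class RenamingFilterDemo {

    /** Renames every no-namespace <para> element to <p>; all other events pass through unchanged. */
    static class RenamingFilter extends XMLFilterImpl {
        RenamingFilter(XMLReader parent) {
            super(parent);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (uri.isEmpty() && "para".equals(localName)) {
                super.startElement(uri, "p", "p", atts);
            } else {
                super.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (uri.isEmpty() && "para".equals(localName)) {
                super.endElement(uri, "p", "p");
            } else {
                super.endElement(uri, localName, qName);
            }
        }
    }

    /** Runs the XML through the filter and serializes the filtered events. */
    static String transform(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader parser = factory.newSAXParser().getXMLReader();

        // The filter wraps the real parser and is itself presented as the XMLReader.
        XMLReader filter = new RenamingFilter(parser);
        SAXSource source = new SAXSource(filter, new InputSource(new StringReader(xml)));

        // Stand-in consumer: an identity transformation that serializes what it receives.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(source, new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<doc><para>Hello</para></doc>"));
    }
}
```

Extending XMLFilterImpl keeps the filter short, because every event that is not overridden is forwarded to the next handler unchanged. Note that XMLFilterImpl handles only ContentHandler events; comments are reported through the separate SAX LexicalHandler interface and are not seen by a filter written this way.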

To insert a filter for Receiver events, it is usual to implement the filter by extending the class ProxyReceiver, overriding only the methods for those events that need to be changed. The filter can be injected into the pipeline by supplying the document in the form of an AugmentedSource: a typical example would be:

    AugmentedSource as = AugmentedSource.makeAugmentedSource(new StreamSource(...));
    as.addFilter(receiver -> new MyFilter(receiver));
    documentBuilder.build(as);

Here MyFilter is typically a class that extends ProxyReceiver by overriding some of its methods: for example, you might override the comment() method to do nothing, which has the effect of stripping comments from the source document.

Filters inserted into the pipeline in this way are applied after any system-defined filters such as the schema validator.