Saxon: Anatomy of an XSLT processor

Michael H. Kay

This article describes how an XSLT processor, in this case the author's open-source Saxon, actually works. Although several open-source XSLT implementations exist, no one, as far as we know, has published a description of how they work. This article is intended to fill that gap. It describes the internal workings of Saxon, and shows how this processor addresses XSLT optimization. It also shows how much more work remains to be done. This article assumes that you already know what XSLT is and how it works.

I hope this article serves a number of purposes. First, I hope it will give style sheet authors a feel for what kind of optimizations they can expect an XSLT processor to take care of, and by implication, some of the constructs that are not currently being optimized. Of course, the details of such optimizations vary from one processor to another and from one release to another, but I'm hoping that reading this account will give you a much better feel for the work that's going on behind the scenes. Second, it describes what I believe is the current state of the art in XSLT technology (I don't think Saxon is fundamentally more or less advanced than other XSLT processors in this respect), and describes areas where I think there is scope for further development of techniques. I hope this description might stimulate further work in this area by researchers with experience in compiler and database optimization.

Finally (last and also least), this article is intended to be a starting point for anyone who wants to study the Saxon source code. It isn't written as a tour of the code, and it doesn't assume that you want to go into that level of depth. But if you are interested in getting a higher-level overview than you can get by diving into the JavaDoc specs or the source code itself, you'll probably find this useful.

A couple of caveats: the version I describe is Saxon 6.1, and I describe a functional breakdown of the code that doesn't always map cleanly to the package and module structure. For example, this article describes the compiler and interpreter as separate functional components. But in the actual code, the module that handles the <xsl:choose> instruction, for example, contains both compile-time and run-time code to support this instruction. In case you do want to use this article as a guide to the code, I've included occasional references to package and module names so you know where to look.

First I'll describe the design of the Saxon product. Saxon is an XSLT processor. That is, it is a program that takes an XML document and a style sheet as input and produces a result document as output.

Saxon includes a copy of the open-source AElfred XML parser originally written by David Megginson, although it can be used with any other XML parser that implements the Java SAX interface. Saxon also includes a serializer that converts the result tree to XML, HTML, or plain text. The serializer is not technically part of the XSLT processor, but it is essential for practical use.

Saxon implements the TrAX (transformation API for XML) interface defined as part of the JAXP 1.1 Java extensions. You don't need to know about this interface to appreciate this article, but understanding the architecture of TrAX would help you to understand the way Saxon is structured.

Saxon architecture: The main components of the Saxon software are shown in Figure 1.

Saxon Architecture

Saxon Architecture

The tree constructor creates a tree representation of a source XML document. It is used to process both the source document and the style sheet. There are two parts to this:

The interface between the parser and the builder is the Java SAX2 API. Although this SAX API has been only informally standardized, it is implemented by half a dozen freely available XML parsers, allowing Saxon to be used with any of these parsers interchangeably. In between the parser and the tree builder sits a component which I call the Stripper (I couldn't resist the name): The Stripper performs the function of removing whitespace text nodes before they are added to the tree, according to the <xsl:preserve-space> and <xsl:strip-space> directives in the style sheet (module com.icl.saxon.Stripper). The Stripper is a good example of a SAX filter, a piece of code that takes a stream of SAX events as input and produces another stream of SAX events as output. At a more macroscopic level, an entire Saxon transformation can also be manipulated as a SAX filter. This approach makes it very easy to split up a complex transformation into a series of simple transformations arranged in a pipeline.

The tree navigator, as the name suggests, allows applications to select nodes from the tree by navigating through the hierarchy. The tree representation constructed by the builder component is proprietary to Saxon. This is an area where Saxon differs from some other XSLT processors: some of them use a general-purpose DOM model as their internal tree. The advantage of using the DOM is that the tree can then be produced by third-party software. Trees constructed for a different purpose can be supplied directly as input to a transformation, and equally, the output of a transformation can be used directly by DOM-based applications.

In Saxon I took the view that the interoperability offered by using the DOM comes at too high a cost. First, the DOM tree model differs subtly from the XPath model needed by an XSLT processor, and this difference imposes run-time costs in mapping one model to the other. For example, a DOM tree can contain information that the XPath model does not require, such as entity nodes. Second, DOM trees can be updated in place, whereas the XSLT processing model means that trees are written only sequentially. Designing a tree model that can be written only sequentially allows efficiencies to be achieved. For example, each node can contain a sequence number that makes it easy to sort nodes in their sequential document order, a frequent XSLT requirement. Finally, DOM implementations generally include a lot of synchronization code to make multithreaded access safe. Because the XSLT processing model is "write-once, read-many," the synchronization logic can be much simpler, leading to faster navigation of the tree.

Actually, as you will see, Saxon offers two different tree implementations, each with its own builder and navigation classes (packages com.icl.saxon.tree and com.icl.saxon.tinytree). The two implementations offer different performance trade-offs.

The style sheet compiler analyses the style sheet prior to execution. It does not produce executable code; it produces a decorated-tree representation of the style sheet in which all XPath expressions are validated and parsed, all cross-references are resolved, stack-frame slots are pre-allocated, and so on. The style sheet compiler thereby performs the important function of constructing a decision tree to use at execution time to find the right template rule to process each input node; it would be grossly inefficient to try matching each node against each possible pattern. The decorated tree then comes into play at transformation time to drive the style sheet processing. (The compiler is distributed across the classes in the com.icl.saxon.style package, especially the methods prepareAttributes(), preprocess(), and validate()).

At one stage Saxon did actually include a style sheet compiler that produced executable Java code. However, it handled only a subset of the XSLT language, and as the technology developed, the performance gains achieved by full compilation were dwindling. Eventually I abandoned that approach as the development complexity grew while the performance benefits declined. There is currently no full XSLT compiler on the market. Sun has produced an alpha release of a compiler called XSLTC which looks promising (see Resources), though it is still at an early stage of development.

The decorated tree produced by Saxon's style sheet compiler (rooted at class com.icl.saxon.style.XSLStyleSheet) cannot be saved to disk, because reading the tree back into memory takes longer than recompiling the original (largely because of its increased size). You can reuse the tree so long as it remains in memory. The tree is wrapped in an object called the PreparedStyleSheet, which implements the javax.xml.transform.Templates interface in JAXP 1.1. It is quite common in a server environment to use the same style sheet repeatedly to transform many different source documents. To allow this, the compiled style sheet is strictly read-only at execution time, allowing it to be used in multiple execution threads simultaneously.

The core of the Saxon processor is the style sheet interpreter (class com.icl.saxon.Controller, which implements the javax.xml.transform.Transformer interface in JAXP 1.1). This interpreter uses the decorated style sheet tree to drive processing. Following the processing model of the language, it first locates the template rule for processing the root node of the input tree. Then it evaluates that template rule (or it's "instantiated," in the jargon of the standard).

The style sheet tree uses different Java classes to represent each XSL instruction type. For example, consider the instruction:


<xsl:if test="parent::section">
    <h3><xsl:value-of select="../@title"/></h3>
</xsl:if>

The effect of this code fragment is to output an <h3> element to the result tree if the current node on the source tree has a parent element of type <section>; the text content of the generated <h3> node is the value of the title attribute of the parent section.

This code fragment is represented on the decorated style sheet tree by the structure shown in Figure 2.

The decorated style sheet tree

Figure 2. The decorated style sheet tree

Elements in the style sheet map directly to nodes on the tree, as shown in Table 1. All the Java objects that represent elements are subclasses of com.icl.saxon.style.StyleElement, which is a subclass of com.icl.saxon.tree.ElementImpl, the default implementation of an element node in the Saxon tree structure. The two XPath expressions are represented by com.icl.saxon.expr.Expression objects referenced from the nodes of the tree.

Table 1. Style sheet elements and their corresponding Java classes

Element or expression... ...represented by an object in this Java class
(subclasses of com.icl.saxon.style.StyleElement)
<xsl:if> com.icl.saxon.style.XSLIf
<h3> (output, not instruction) com.icl.saxon.style.LiteralResultElement
<xsl:value-of> com.icl.saxon.style.XSLValueOf.
XPath expressions com.icl.saxon.expr.Expression

Executing the <xsl:if> instruction causes the process() method of the corresponding XSLIf object to be executed. This method accesses the test Expression object, which has a method, evaluateAsBoolean(). evaluateAsBoolean() is used to evaluate the expression to return a Boolean result. (This is an optimization: it would be possible to use a straightforward evaluate() call and then convert the result to a Boolean, as described in the specification. But knowing that a Boolean is wanted often enables faster evaluation. For example, when the actual value or the expression is a node-set, the final Boolean result is known as soon as a single member of the node-set is found).

If the result of evaluateAsBoolean() is true, the process() method calls the process() method of all the child nodes of the XSLIf node on the style sheet tree. If the result not true, it simply exits.

Similarly, the process() method for a LiteralResultElement copies the element to the result tree and processes the children of the LiteralResultElement, while the process() method of the XSLValueOf object evaluates the select expression as a string, and copies the result as a text node to the result tree.

So the key components of the style sheet interpreter are:

The XPath interpreter (package com.icl.saxon.expr) closely follows the Interpreter design pattern, one of the 23 classic patterns for object-oriented software described by Gamma, Helm, Johnson, and Vlissides. Each construct in the XPath grammar has a corresponding Java class. For example, the UnionExpr construct (written as A|B, and representing the union of two node sets) is implemented by the class com.icl.saxon.expr.UnionExpression. The XPath expression parser (module com.icl.saxon.expr.ExpressionParser), which is normally executed when the style sheet is compiled, generates a data structure that directly reflects the parse tree of the expression. To evaluate the expression, each class in this structure has an evaluate() method, which is responsible for returning its value. In the case of the UnionExpression class, the evaluate() method evaluates the two operands, checks that the result is in both cases a node-set, and then forms the union using a sort-merge strategy.

As in the design pattern described by Gamma et al., the evaluate() method takes a Context parameter. The Context object encapsulates all contextual information needed to evaluate the expression.

This includes:

The XPath interpreter also includes optimization features that extend the basic Interpreter pattern of Gamma et al.

For example:

The XSLT language gives the processor great freedom to evaluate expressions in any order it chooses, because of the absence of side effects. The general policy in Saxon is that scalar values (strings, numbers, Booleans) are evaluated as early as possible, while node-set values are evaluated as late as possible. Evaluating scalar values early enables optimization by doing things only once. Delaying the evaluation of node-set values saves memory, by avoiding holding large lists in memory unnecessarily. It can also save time, if it turns out (as it often does) that the only thing done with the node-set is to test whether it is empty, or to get the value of its first element.

Finally, the outputter component (class com.icl.saxon.output.Outputter) is used to control the output process. Saxon's result tree is not normally materialized in memory -- because its nodes are always written in document order, they can be serialized as soon as they are output to the result tree. In practice the transformation does not have a single result tree but a changing stack of result trees, because XSL instructions such as <xsl:message> and <xsl:attribute> effectively redirect the output to an internal destination, while <xsl:variable> constructs a result-tree fragment which is actually a separate tree in its own right. The interpreter code for these elements calls the outputter to switch to a new destination and subsequently to revert to the original destination.

External output is written to a file using a serializer. Logically this takes the result tree as input and turns it into a flat file document. In practice, as you have seen, the result tree is not materialized in memory, so the serializer is handed the nodes of the tree one at a time in document order. This stream of nodes is presented using a SAX2-like interface (com.icl.saxon.Emitter): it differs from SAX2 in the details of how names and namespaces are represented. As defined in the XSLT Recommendation, there are separate serializers for XML, HTML, and plain text output. Saxon also allows the tree to be supplied to user-written code for further processing, or to be fed as input to another style sheet. This allows a you to achieve a multiphase transformation by applying several style sheets in sequence.

Performance
Good performance is necessarily a driving factor in the design of Saxon, second only to conformance with the XSLT specification. This is partly because it is critical to users, but also because in a world where there are several free XSLT processors available, performance will tend to be the main distinguishing feature.

This section discusses some of the factors that affect the performance of an XSLT processor, as well as the strategies Saxon uses to improve speed in each case.

Java language issues
It is often said that Java is slow. There is some justification in this, but the statement needs to be carefully qualified.

Many people imagine Java is slow because it generates interpreted bytecode rather than native code. This used to be true, but not any longer with today's just-in-time compilers. Raw code execution speed is usually almost as good as -- sometimes better than -- the equivalent code written in a compiled language such as C.

Where Java can have a problem is with memory allocation. Unlike C and C++, Java takes care of memory itself, using a garbage collector to free unwanted objects. This brings great convenience to the programmer, but it is easy to create programs that are profligate with memory: they thrash due to excessive use of virtual memory, or they place great strain on the garbage collector due to the frequency with which objects are allocated and released.

Some coding techniques minimize the memory-allocation problems. For example the use of StringBuffer objects rather than Strings, use of pools of reusable objects, and so on. Diagnostic tools can help the programmer determine when to use those techniques. Getting the code fast does require a lot of tuning, but that is arguably still much easier than using a language such as C++, in which you must manage all the memory allocation manually.

XSLT processing brings a particular challenge to Java in the implementation of the tree structure. Java imposes considerable red-tape overhead in the size of each object (up to 32 bytes, depending on the Java VM used). This often yields a tree structure in memory many times larger than the source XML file. For example, the empty element <a/> (four bytes in the source file) could expand to an Element object for the node, a Name object for its name, a String object referenced by the Name object, an empty AttributeList node, an empty NamespaceList node, plus numerous 64-bit object references to link these objects with each other and with the parent, sibling, and child nodes in the tree. A nave implementation could easily generate 200 bytes of tree storage from these four bytes of source. Given that some users are trying to process XML documents whose raw size is 100MB or more, the consequences are predictable and generally fatal.

This is one reason Saxon went down the route of having its own tree structure. By removing the requirement to implement the full DOM interface, I was able to eliminate some data from the tree. Removing the requirement to support update is particularly useful. For example, Saxon uses a different class for elements that have no attributes, knowing that if an element has no attributes to start with, it will never acquire any later. Another technique Saxon uses is to optimize the storage of the common situation of an element that contains a single child text node, for example <b>text</b>.

The XPath tree model, as described in the W3C specification, includes nodes for attributes and namespaces. Because these nodes are rarely accessed in the course of a transformation, Saxon constructs these nodes on demand rather than having them permanently take up space on the tree. (This is the Flyweight design pattern of Gamma et al.)

The latest release of Saxon has gone one step further: using a tree implementation in which the nodes are not represented by Java objects at all. Instead, all the information in the tree is represented as arrays of integers. All nodes are created as transient (or flyweight) objects, constructed on demand as references into these arrays and discarded when they are no longer needed. This tree implementation (package com.icl.saxon.tinytree) takes up far less memory and is quicker to build, at the cost of slightly slower tree navigation. On balance, it appears to perform better than the standard tree, and I therefore provide it as the default.

The standard utility classes such as Hashtable and Vector also affect Java program performance. Developers find it tempting to use these convenient classes liberally throughout an application. However, there is a price to pay. Partly because the classes usually do more than you actually need, they impose more overhead than a class designed for one purpose only. They are also designed to handle a worst-case situation in terms of multithreading. If you know that a data structure will not be accessed by multiple threads simultaneously, you can spare yourself the synchronization costs by designing your own objects rather than using these off-the-shelf classes. Replacing Vectors by arrays often pays dividends, the only downside being that you need to handle expansion of the array manually whereas Vectors are self managing.

Location path evaluation
The most characteristic kind of XPath expression (the one from which XPath gets its name) is the location path. Location paths are used to select nodes in the source tree. A location path essentially consists of an origin and a number of steps. It is similar to a UNIX filename or a URL, except that each step selects a set of nodes rather than a single node. For example ./chapter/section selects all the <section> children of all the <chapter> children of the current node. The origin identifies the start point for navigating the tree: it might be the current node, the root node of the source tree, the root node of a different tree, or a set of nodes located by value using a key. Each step in the location path navigates from one node to a related set of nodes. Each step is defined in terms of a navigation axis (the child axis being the default): For example the ancestor axis selects all ancestor nodes, the following-sibling axis selects all following siblings of the origin node, the child axis selects its children. As well as specifying an axis, each step may specify the type of node required (such as elements, attributes, or text nodes), the name of the required nodes, and predicates that the nodes must satisfy (for example, child text nodes whose value begins with B).

Devising an execution strategy for a location path is equivalent to the problem of optimizing a relational query, though the theory is currently much less advanced, and most of the techniques used differ little from the nave strategy of doing the navigation exactly the way it is described in the specification. For example, although it is possible in a style sheet to specify keys that must be built to support associative access (rather like the CREATE INDEX statement in SQL), Saxon currently uses these indexes only to support queries that reference them explicitly (by using the key() function) and never to optimize a query that uses straightforward predicates.

The optimization techniques currently used in Saxon for location paths include:

Style sheet compilation
I have already described how the first thing Saxon does is to "compile" the style sheet into a decorated tree for efficient execution subsequently. This offers a great opportunity to do things once only rather than doing them each time the relevant instructions are executed.

Some of the tasks done during the style sheet compilation phase are as follows:

In other cases, doing things at compile time is less feasible, but savings can be made by avoiding repeated execution at run time. An example is the format-number() function, which takes as one of its arguments a pattern describing the output format required for a decimal number. Considerable savings are possible by detecting the common case where the format pattern is the same as on the previous execution. The only tricky aspect of such optimizations is to keep the memory of previous executions in a place associated with the current thread: it cannot be kept on the style sheet tree itself, as that needs to be thread safe.

Pattern matching for template rules

The pattern matching operation is potentially expensive, so it is vital to focus the search intelligently. The style sheet compiler therefore constructs a decision tree which is used at run time to decide which template rule to apply to a given node.

I'm using the term decision tree here loosely. This section describes the actual data structures and algorithms in a little more detail. (See modules com.icl.saxon.RuleManager and com.icl.saxon.Mode in the source code.)

When the <xsl:apply-templates/> instruction is applied to a node, a template rule must be selected for that node. If there is no matching rule, Saxon uses a built-in rule. If there is more than one rule, the processor resorts to an algorithm from the XSLT specification for deciding which rule takes precedence. This algorithm is based on user-allocated priorities, system-allocated priorities, the precedence relationships established when one style sheet imports another, and -- if all else fails -- the relative order of rules in the source style sheet.

In a typical style sheet, most template rules match element nodes in the tree. Rules to match other nodes, such as text nodes, attributes, and comments, are comparatively rare. Also, most template rules supply the name of the element they must match in full. Rules for unnamed nodes are allowed but not often used (for example, *[string-length(.) > 10], which matches any element with more than 10 characters of text content).

Saxon's strategy is therefore to separate rules into two kinds: specific rules, where the node type and name are explicitly specified in the pattern, and general rules, where they aren't. The data structure for the decision tree contains a hash table for the specific rules, keyed on the node type and node name, in which each entry is a list of rules sorted by decreasing priority; plus a single list for all the general rules, again in priority order. To find the pattern for a particular node, the processor makes two searches: one for the highest-priority specific rule for the relevant node type and name that matches the node, and one for the highest-priority general rule that matches the node. Whichever of these has highest priority is then chosen.

For a multipart pattern such as chapter/title, the algorithm used is recursive: The match is true if the node being tested matches title and if its parent node matches chapter (module com.icl.saxon.pattern.LocationPathPattern). This simple approach can't be used for patterns that use positional predicates; for example chapter/para[last()], which only matches a para element if it is the last one in a chapter. Matching these positional patterns is potentially very expensive, so it's worth handling the common case of a pattern like para[1] specially.

Numbering

Numbering the nodes on the tree (using the <xsl:number/> instruction) poses a particular optimization challenge. This is because each execution of <xsl:number/> works independently to assign a number to the current node, the number being defined by a complex algorithm using various attributes on the <xsl:number/> instruction. Nothing inherent in the algorithm says that if the last node was numbered 19, the next one will be numbered 20, yet in most common cases that is indeed the case. It is important to detect those common sequential cases. Otherwise the numbering of a large node-set will have O(n2) performance, which is what happens if the numbering algorithm as specified in the XSLT Recommendation is applied to each node independently.

Saxon achieves this optimization for a small number of common cases, where most of the attributes to the numbering algorithm are defaulted. Specifically, it remembers the most recent result of an <xsl:number/> instruction, and if certain rather complex but frequently satisfied conditions are true, it knows that it can number a node by adding one to this remembered number.

Finally
I have tried in this article to give an overview of the internals of the Saxon XSLT processor, and in particular of some of the techniques it uses to improve the speed of transformation. In the 18 months or so since I released the first early versions of Saxon, performance has improved by a factor of 20 (or more, in the case of runs that were thrashing for lack of memory).

It's unlikely that the next 18 months will see a similar improvement. However, there is still plenty of scope, especially for constructs like <xsl:number/>. To take another example, Saxon has not even started to explore to possibilities opened by parallel execution, something the language makes a highly attractive option.

Perhaps the biggest research challenge is to write an XSLT processor that can operate without building the source tree in memory. Many people would welcome such a development, but it certainly isn't an easy thing to do.