Sorting and collations

Different countries (or languages) have different rules for sorting strings into alphabetical order. For example, in German "Ä" comes between "A" and "B", while in Swedish it comes after "Z". (The real rules are considerably more complicated than this: for example, diacritical marks are typically ignored unless the words being compared are otherwise identical.)
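As a rough illustration of locale-dependent ordering, the sketch below sorts the same four strings with the JDK's own java.text.Collator for two locales. This is not Saxon's implementation; the class name LocaleSortSketch and the word list are invented for the example, and the result for Swedish depends on the locale data shipped with the JDK in use.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: sorts a fixed word list with a locale-specific collator.
class LocaleSortSketch {
    // "\u00C4" is the letter A with diaeresis, written as an escape
    // so the source does not depend on the file encoding.
    static final List<String> WORDS = List.of("Z", "\u00C4", "B", "A");

    // Returns a copy of WORDS sorted with the JDK collator for the given locale.
    static List<String> sortFor(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(WORDS);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println("German:  " + sortFor(Locale.GERMAN));
        System.out.println("Swedish: " + sortFor(Locale.forLanguageTag("sv")));
    }
}
```

With the German collator the accented letter lands between "A" and "B". On a JDK that includes Swedish collation data, the Swedish collator is expected to place it after "Z" instead.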

In addition, a pair of strings such as ("ALPHA", "alpha") or ("Jäger", "Jaeger") may or may not be considered equal when compared. Here the rules depend less on the language involved and more on the requirements of the application.
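The effect of application requirements on equality can be sketched with the JDK's java.text.Collator, whose "strength" setting controls which differences count. This is an analogy using the standard Java library, not Saxon's collation machinery; the class name StrengthSketch and the helper matches are invented for the example.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class: the same two strings match or differ
// depending on the comparison strength the application chooses.
class StrengthSketch {
    // Returns true if the two strings compare as equal at the given strength.
    static boolean matches(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences.
        System.out.println(matches("ALPHA", "alpha", Collator.PRIMARY));  // true
        // TERTIARY strength treats case differences as significant.
        System.out.println(matches("ALPHA", "alpha", Collator.TERTIARY)); // false
    }
}
```

Treating "Jäger" and "Jaeger" as equal would need tailored rules that go beyond a strength setting, so that case is not shown here.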

All operations in XPath, XSLT, and XQuery that depend on ordering strings therefore allow a collation to be specified. A collation is simply a rule for deciding whether two strings are equal, and if not, which one sorts first. Collations are identified using a URI.

A collation URI may be used as an argument to many of the standard functions, as an attribute of various XSLT instructions (xsl:sort, xsl:for-each-group, and xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.

In Saxon the default collation, unless it is changed as described below, is the "codepoint" collation. This collates strings based on the integer values assigned by Unicode to each character: for example "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63). This generally gives good results for artificial strings such as part numbers, vehicle registration marks, and file names, but it is inadequate for natural language text. The codepoint collation may be requested explicitly using the URI http://www.w3.org/2005/xpath-functions/collation/codepoint.
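The codepoint comparison described above can be sketched in plain Java as follows. This is a minimal illustration of the idea, not Saxon's code; the class name CodepointCollationSketch and its methods are invented for the example.

```java
import java.util.Arrays;

// Hypothetical demo class: compares strings by their Unicode codepoints.
class CodepointCollationSketch {
    // Returns the Unicode codepoints of a string as an int array.
    static int[] codepoints(String s) {
        return s.codePoints().toArray();
    }

    // Compares two strings codepoint by codepoint; if one string is a
    // prefix of the other, the shorter string sorts first.
    static int compare(String a, String b) {
        int[] x = codepoints(a);
        int[] y = codepoints(b);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(codepoints("ah!"))); // [97, 104, 33]
        System.out.println(Arrays.toString(codepoints("ah?"))); // [97, 104, 63]
        System.out.println(compare("ah!", "ah?") < 0);          // true: "ah!" sorts first
    }
}
```

The sketch iterates over codepoints explicitly because Java's String.compareTo compares UTF-16 code units, which can give a different order for characters outside the Basic Multilingual Plane.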

The default collation may be changed for a portion of an XSLT stylesheet by use of an [xsl:]default-collation attribute on an enclosing element; and it can be changed for an XQuery module using the declare default collation declaration in the XQuery prolog. It can also be changed using the Saxon API, for example XsltCompiler.declareDefaultCollation(X); (SaxonJ) or XsltCompiler.DefaultCollationName = X; (SaxonCS).

Two more kinds of collation are defined in the W3C language specifications, and are recognized in all versions of Saxon (though there may be differences in the details of the output):

Further details on the Unicode Collation Algorithm are supplied below.

Saxon also allows a collation to be supplied programmatically:

When the xsl:sort instruction is used without an explicit collation, Saxon attempts to construct a collation using the supplied values of attributes such as lang and case-order.

For backwards compatibility reasons the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined by W3C UCA collation URIs.