Sorting and collations

Different countries (or languages) have different rules for sorting strings into alphabetical order. For example, in German "Ä" comes between "A" and "B", while in Swedish it comes after "Z". (The actual rules are considerably more complicated than this, because diacritical marks are ignored unless the words are otherwise identical, letter for letter.)
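As a rough illustration of language-dependent ordering, the following Java sketch builds two toy collations with the JDK's java.text.RuleBasedCollator. The rule strings are invented for this example and cover only the letters A, Ä, B, and Z; they are not the real German or Swedish rules, and the class and method names are hypothetical.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

public class ToyLanguageOrder {

    // Toy rules (an assumption for illustration only). "<" introduces the next
    // letter in primary order; "," marks a case-level (tertiary) difference.
    // \u00e4 is a-umlaut and \u00c4 is A-umlaut.
    static final String GERMAN_LIKE = "< a, A < \u00e4, \u00c4 < b, B < z, Z";
    static final String SWEDISH_LIKE = "< a, A < b, B < z, Z < \u00e4, \u00c4";

    // Returns a sorted copy of the input, ordered by the supplied rule string.
    static List<String> sortWith(String rules, List<String> input) {
        try {
            RuleBasedCollator collator = new RuleBasedCollator(rules);
            List<String> copy = new ArrayList<>(input);
            copy.sort(collator);
            return copy;
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        List<String> letters = List.of("Z", "\u00c4", "B", "A");
        System.out.println(sortWith(GERMAN_LIKE, letters));
        System.out.println(sortWith(SWEDISH_LIKE, letters));
    }
}
```

With these toy rules the first list places Ä between A and B, and the second places it after Z. Real collations also handle the many characters that these rules leave out.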

In addition, pairs of strings such as ("ALPHA", "alpha") or ("Jäger", "Jaeger") may or may not be considered equal when compared. Here the rules depend less on the language involved and more on the requirements of the application.
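The effect of application requirements on equality can be sketched with the JDK's java.text.Collator, whose strength setting decides whether case differences count. This is only an analogy using the JDK API, not Saxon's own mechanism; the class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class EqualityStrength {

    // Returns true if the two strings compare as equal under a JDK collator
    // configured with the given strength (Collator.PRIMARY, SECONDARY, TERTIARY).
    static boolean matches(String a, String b, int strength) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        return collator.equals(a, b);
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences; TERTIARY does not.
        System.out.println(matches("ALPHA", "alpha", Collator.PRIMARY));
        System.out.println(matches("ALPHA", "alpha", Collator.TERTIARY));
    }
}
```

The same pair of strings matches at primary strength and fails to match at tertiary strength, so the choice is a matter of what the application needs.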

All operations in XPath, XSLT, and XQuery that depend on comparing or ordering strings therefore allow a collation to be specified. A collation is simply a rule for deciding whether two strings are equal, and if not, which one sorts first. Collations are identified using a URI.

A collation URI may be used as an argument to many of the standard functions, as an attribute of various XSLT instructions (xsl:sort, xsl:for-each-group, and xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.

In Saxon the default collation, unless explicitly changed, is always the "codepoint" collation. This collates strings based on the integer values assigned by Unicode to each character: for example "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63). This generally gives good results for artificial strings such as part numbers, vehicle registration marks, and file names, but it's inadequate for natural language text. The codepoint collation may be requested explicitly using the URI http://www.w3.org/2005/xpath-functions/collation/codepoint.
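The codepoint comparison described above can be sketched in a few lines of Java. This is a minimal illustration of the idea, not Saxon's implementation; the class name is hypothetical, and Arrays.compare requires Java 9 or later.

```java
import java.util.Arrays;

public class CodepointOrder {

    // Compares two strings codepoint by codepoint. Working on codepoints rather
    // than UTF-16 code units keeps characters outside the Basic Multilingual
    // Plane in their Unicode order.
    static int compare(String a, String b) {
        return Arrays.compare(a.codePoints().toArray(), b.codePoints().toArray());
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString("ah!".codePoints().toArray())); // [97, 104, 33]
        System.out.println(Arrays.toString("ah?".codePoints().toArray())); // [97, 104, 63]
        System.out.println(compare("ah!", "ah?") < 0);     // true: 33 < 63
        System.out.println(compare("Zebra", "apple") < 0); // true: 'Z' (90) < 'a' (97)
    }
}
```

The last line shows why this ordering is a poor fit for natural language: every upper-case ASCII letter sorts before every lower-case one.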

The default collation may be changed for a portion of an XSLT stylesheet by use of an [xsl:]default-collation attribute on an enclosing element; and it can be changed for an XQuery module using the declare default collation declaration in the XQuery prolog. It can also be changed using the Saxon API, for example XsltCompiler.declareDefaultCollation(X); (SaxonJ) or XsltCompiler.DefaultCollationName = X; (SaxonCS).

Two more kinds of collation are defined in the W3C language specifications, and are recognized in all versions of Saxon (though there may be differences in the details of the output):

The ASCII case-blind collation (URI http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive) uses codepoint comparison for most characters, but treats ASCII lower-case letters as equal to their upper-case equivalents. This collation is defined by the HTML5 standard, and it's an efficient way of matching ASCII keywords such as "ascending" and "descending", but it's not suitable for more general text. It's intended only for use in equality comparisons, not for sorting.
The Unicode Collation Algorithm (UCA) represents a family of collations, with the exact rules depending on parameters that are supplied. A typical URI for a UCA collation would be http://www.w3.org/2013/collation/UCA?lang=sv;strength=primary where the parameters after the "?" indicate the detailed rules to be applied.
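The matching rule of the ASCII case-blind collation described above can be sketched as follows: only the letters A to Z are folded to lower case, and everything else is compared by codepoint. This is an illustrative sketch with hypothetical names, not the code Saxon uses.

```java
import java.util.Arrays;

public class AsciiCaseBlind {

    // Folds ASCII upper-case letters to lower case; all other codepoints,
    // including accented letters, are returned unchanged.
    static int fold(int codepoint) {
        return (codepoint >= 'A' && codepoint <= 'Z') ? codepoint + ('a' - 'A') : codepoint;
    }

    // Equality test only: this collation is not intended for sorting.
    static boolean matches(String a, String b) {
        int[] x = a.codePoints().map(AsciiCaseBlind::fold).toArray();
        int[] y = b.codePoints().map(AsciiCaseBlind::fold).toArray();
        return Arrays.equals(x, y);
    }

    public static void main(String[] args) {
        System.out.println(matches("ASCENDING", "ascending")); // true
        System.out.println(matches("\u00c4", "\u00e4"));       // false: non-ASCII letters are not folded
    }
}
```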

Further details on the Unicode Collation Algorithm are supplied below.
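To make the structure of a UCA collation URI concrete, the sketch below splits the part after the "?" into name/value pairs, assuming semicolon-separated parameters as in the example URI above. It only illustrates the URI layout; the class name is hypothetical, and it is not how Saxon itself processes the URI.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UcaUriParams {

    // Extracts the name=value parameters that follow the "?" in a collation URI.
    // Parameters without an "=" sign are skipped in this simplified sketch.
    static Map<String, String> parse(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0 || q == uri.length() - 1) {
            return params; // no query part
        }
        for (String pair : uri.substring(q + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String uri = "http://www.w3.org/2013/collation/UCA?lang=sv;strength=primary";
        System.out.println(parse(uri)); // {lang=sv, strength=primary}
    }
}
```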

Saxon also allows a collation to be supplied programmatically:

On SaxonJ, use the method Processor.declareCollation() or Configuration.registerCollation().
On SaxonCS, use the method Processor.DeclareCollation().

When the xsl:sort instruction is used without an explicit collation, Saxon attempts to construct a collation using the supplied values of attributes such as lang and case-order.

For backwards compatibility, the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined for W3C UCA collation URIs. In SaxonJ these collations are implemented using a RuleBasedCollator supplied by the JDK; they do not use ICU collations. Similarly, in SaxonCS, they are implemented using the standard facilities of the .NET platform.