Collation

Collations used for comparing strings can be specified by means of a URI. A collation URI may be used as an argument to many of the standard functions, as an attribute of various XSLT instructions (xsl:sort, xsl:for-each-group, xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.

Saxon provides a range of mechanisms for binding collation URIs. The language specifications simply say that collations used in sorting and in string-comparison functions are identified by a URI, and leave it up to the implementation how these URIs are defined.

There are some predefined collations that cannot be changed. Specifically:

The Unicode Codepoint Collation defined in the W3C specifications (see http://www.w3.org/2005/xpath-functions/collation/codepoint). This collates strings based on the integer values assigned by Unicode to each character. For example, "ah!" sorts before "ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63).
The "HTML ASCII case-blind collation", http://www.w3.org/2005/xpath-functions/collation/html-ascii-case-insensitive. This is designed to mimic the HTML5 rules for matching many names and keywords, whereby case distinctions are ignored for the English letters (A-Z, a-z) but not for accented or non-English letters.

Saxon implements this effectively by converting upper-case letters A-Z to their lower-case equivalents, and then using the codepoint collation.
The family of collations implementing the Unicode Collation Algorithm, http://www.w3.org/2013/collation/UCA?keyword=value;...

Saxon-PE and Saxon-EE implement UCA collations by use of the ICU-J open source library, which supports a large range of languages and all the parameters defined in the UCA specification. Saxon-HE has a simpler implementation which relies on the collation support available in the built-in Java library; the languages this supports depend on the particular Java installation, and not all parameters are supported. For this reason, Saxon-HE supports UCA collations only if a fallback implementation is allowed (that is, if the option fallback=no is not present in the collation URI).
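The first two predefined collations are simple enough to sketch in plain Java. The following is an illustration of the behaviour described above, not Saxon's own code: the class name `PredefinedCollationSketch` and its members are hypothetical, and only the Java standard library is used.

```java
import java.util.Comparator;

final class PredefinedCollationSketch {

    // Compare by Unicode codepoint rather than by UTF-16 code unit, so that
    // characters outside the Basic Multilingual Plane are ordered correctly.
    static final Comparator<String> CODEPOINT = (a, b) -> {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    // Fold only ASCII A-Z to a-z; every other character is left untouched.
    static String foldAscii(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    // Case-blind comparison for ASCII letters: fold, then compare by codepoint.
    static final Comparator<String> HTML_ASCII_CASE_BLIND =
            (a, b) -> CODEPOINT.compare(foldAscii(a), foldAscii(b));

    public static void main(String[] args) {
        // '!' is codepoint 33 and '?' is 63, so "ah!" sorts first.
        System.out.println(CODEPOINT.compare("ah!", "ah?") < 0);
        // ASCII case differences are ignored.
        System.out.println(HTML_ASCII_CASE_BLIND.compare("DIV", "div") == 0);
        // Accented letters are not folded, so these remain distinct.
        System.out.println(HTML_ASCII_CASE_BLIND.compare("\u00C9", "\u00E9") == 0);
    }
}
```

The last comparison reports the two strings as different, because only the letters A-Z take part in case folding.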

For backwards compatibility reasons the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined by W3C UCA collation URIs.

The keywords defined by W3C are: fallback, lang, version, strength, maxVariable, alternate, backwards, normalization, caseLevel, caseFirst, numeric, reorder. The values for these parameters and their meaning can be found at https://www.w3.org/TR/xslt-30/#uca-collations.
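As an illustration of the URI syntax, the following sketch assembles a UCA collation URI from keyword/value pairs, joined with semicolons after the question mark. The helper `ucaUri` and the class `UcaUriSketch` are hypothetical names, the chosen keywords and values are only examples, and no URI escaping is attempted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class UcaUriSketch {
    static final String UCA_BASE = "http://www.w3.org/2013/collation/UCA";

    // Append keyword=value pairs, separated by ';', to the base URI.
    static String ucaUri(Map<String, String> params) {
        if (params.isEmpty()) {
            return UCA_BASE;
        }
        return UCA_BASE + "?" + params.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("lang", "de");
        params.put("strength", "primary");
        params.put("fallback", "yes");
        // prints http://www.w3.org/2013/collation/UCA?lang=de;strength=primary;fallback=yes
        System.out.println(ucaUri(params));
    }
}
```

The resulting string can be used wherever a collation URI is accepted, for example as the collation argument of a string-comparison function.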

Whether the W3C URI http://www.w3.org/2013/collation/UCA or the Saxon URI http://saxon.sf.net/collation is used, Saxon accepts a number of collation parameters additional to those defined by W3C, as follows:

keyword	values	effect
class	fully-qualified Java class name of a class that implements `java.util.Comparator`.	This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as `contains()` and `starts-with()`, this class must also be a `java.text.RuleBasedCollator`. This approach allows a user-defined collation to be implemented in Java. This option is also available on the .NET platform, but the class must implement the Java interface `java.util.Comparator`.
rules	details of the ordering required, using the syntax of the Java `RuleBasedCollator`	This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of `RuleBasedCollator`.
ignore-case	yes \| no	Indicates whether the case of letters should be ignored: equivalent to `strength=secondary`.
ignore-modifiers	yes \| no	Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Equivalent to `strength=primary`.
ignore-symbols	yes \| no	Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default.
ignore-width	yes \| no	Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary.
decomposition	none \| standard \| full	Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform.
alphanumeric	yes \| no \| codepoint	If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as `contains()` and `substring-before()`.
case-order	upper-first \| lower-first	Indicates whether upper case letters collate before or after lower case letters.
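The behaviour described for the alphanumeric parameter can be sketched as follows. This is a minimal illustration of the splitting and pairwise comparison, assuming that a string with fewer parts sorts first when all shared parts are equal; it is not Saxon's implementation, and the class `AlphanumericSketch` and its methods are hypothetical names.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class AlphanumericSketch {

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Split into alternating parts: even positions hold alphabetic parts
    // (possibly zero-length), odd positions hold runs of ASCII digits.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) {
                i++;
            }
            parts.add(s.substring(start, i)); // alphabetic part, may be ""
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) {
                i++;
            }
            if (i > start) {
                parts.add(s.substring(start, i)); // numeric part
            }
        }
        return parts;
    }

    // Compare parts pairwise: alphabetic parts with the supplied comparator,
    // numeric parts by numeric value.
    static Comparator<String> alphanumeric(Comparator<String> alpha) {
        return (a, b) -> {
            List<String> pa = split(a);
            List<String> pb = split(b);
            int n = Math.min(pa.size(), pb.size());
            for (int k = 0; k < n; k++) {
                int c = (k % 2 == 0)
                        ? alpha.compare(pa.get(k), pb.get(k))
                        : new BigInteger(pa.get(k)).compareTo(new BigInteger(pb.get(k)));
                if (c != 0) {
                    return c;
                }
            }
            // Assumption: the string with fewer parts sorts first.
            return Integer.compare(pa.size(), pb.size());
        };
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("AD1066", "AD985", "AD43");
        names.sort(alphanumeric(Comparator.<String>naturalOrder()));
        System.out.println(names); // prints [AD43, AD985, AD1066]
    }
}
```

A plain string comparison would place AD1066 before AD985, because the character "1" precedes "9"; comparing the numeric parts by value gives the natural order.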

If you want to use your own URIs to define collations, there are two ways of doing this:

You can use the Saxon configuration file to define collations: see The collations element.
You can register a collation with the Saxon Configuration using the s9api method Processor.declareCollation() or the underlying method Configuration.registerCollation().

In either case, the collation is supplied in the form of an implementation of the interface net.sf.saxon.lib.StringCollator. You must supply methods compareStrings() and comparesEqual() for comparing strings for ordering or equality, and a method getCollationKey() which returns a collation key for any string. If you want your collation to be used in calls of fn:contains(), fn:starts-with(), fn:ends-with(), fn:substring-before(), or fn:substring-after(), then it must also implement the interface net.sf.saxon.lib.SubstringMatcher.
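The relationship between the three operations can be sketched using only the JDK. The class below is a standalone illustration built on `java.text.RuleBasedCollator`; it borrows the method names mentioned above but does not implement the Saxon interface, whose exact parameter and return types should be taken from the Javadoc for your release. The class name `CollatorContractSketch` and the rule string are assumptions made for the example.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class CollatorContractSketch {
    private final RuleBasedCollator collator;

    CollatorContractSketch(String rules, int strength) {
        RuleBasedCollator c;
        try {
            c = new RuleBasedCollator(rules);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Invalid collation rules", e);
        }
        c.setStrength(strength);
        this.collator = c;
    }

    // Ordering comparison: negative, zero, or positive.
    int compareStrings(String a, String b) {
        return collator.compare(a, b);
    }

    // Equality must agree with the ordering comparison.
    boolean comparesEqual(String a, String b) {
        return collator.compare(a, b) == 0;
    }

    // Strings that compare equal must yield equal collation keys.
    CollationKey getCollationKey(String s) {
        return collator.getCollationKey(s);
    }

    public static void main(String[] args) {
        // In RuleBasedCollator syntax, '<' is a primary difference and
        // ',' a tertiary (case) difference.
        String rules = "< a,A < b,B < c,C";
        CollatorContractSketch primary = new CollatorContractSketch(rules, Collator.PRIMARY);
        System.out.println(primary.comparesEqual("abc", "ABC"));
        System.out.println(primary.getCollationKey("abc")
                .compareTo(primary.getCollationKey("ABC")) == 0);
        CollatorContractSketch tertiary = new CollatorContractSketch(rules, Collator.TERTIARY);
        System.out.println(tertiary.comparesEqual("abc", "ABC"));
    }
}
```

The point of the sketch is the consistency requirement: whatever two strings the comparison treats as equal must also produce matching collation keys, since keys are typically used for hashing and grouping.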

The set of collation URIs known to Saxon is defined in the Configuration. In earlier releases, it was possible to define collation URIs with narrower scope, for example a single query or stylesheet. Some legacy APIs reflect this capability, but they are generally deprecated and either have no effect, or have a Configuration-wide effect.

The choice of default collation, however, is local to a query or stylesheet (in XSLT, it can even vary between different parts of a stylesheet). The default collation must always be one of the collations defined in the Configuration.