Collation

Collations used for comparing strings can be specified by means of a URI. A collation URI may be used as an argument to many of the standard functions, as an attribute of various XSLT instructions (xsl:sort, xsl:for-each-group, xsl:merge-key), and in the order by clause of a FLWOR expression in XQuery.

Saxon provides a range of mechanisms for binding collation URIs. The language specifications simply say that collations used in sorting and in string-comparison functions are identified by a URI, and leave it up to the implementation how these URIs are defined.

There are some predefined collations that cannot be changed. Specifically:

For backwards compatibility reasons, the standard collation resolver in Saxon also accepts URIs in the form http://saxon.sf.net/collation followed by query parameters; the query parameters that are recognized are the same as those defined by W3C for UCA collation URIs.

The keywords defined by W3C are: fallback, lang, version, strength, maxVariable, alternate, backwards, normalization, caseLevel, caseFirst, numeric, reorder. The values for these parameters and their meaning can be found at https://www.w3.org/TR/xslt-30/#uca-collations.
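As a rough illustration of the URI form, the sketch below assembles a collation URI from a base URI and a set of keyword/value pairs, using the semicolon separator that the W3C specification defines for UCA collation URIs. The class name CollationUriSketch and the build() helper are hypothetical and are not part of Saxon; the helper does no validation of keywords or values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class CollationUriSketch {
    // Base URI of the W3C UCA collation family.
    static final String UCA_BASE = "http://www.w3.org/2013/collation/UCA";

    /**
     * Hypothetical helper: appends keyword=value pairs to the base URI,
     * separated by semicolons. Keywords and values are not validated.
     */
    static String build(String base, Map<String, String> params) {
        if (params.isEmpty()) {
            return base;
        }
        StringJoiner query = new StringJoiner(";");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return base + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("lang", "de");
        params.put("strength", "secondary");
        System.out.println(build(UCA_BASE, params));
    }
}
```

The resulting string can then be supplied wherever a collation URI is expected, for example as the collation argument of a function or the collation attribute of xsl:sort.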

Whether the W3C URI http://www.w3.org/2013/collation/UCA or the Saxon URI http://saxon.sf.net/collation is used, Saxon accepts a number of collation parameters additional to those defined by W3C, as follows:

| keyword | values | effect |
| --- | --- | --- |
| class | fully-qualified Java class name of a class that implements java.util.Comparator | This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as contains() and starts-with(), this class must also be a java.text.RuleBasedCollator. This approach allows a user-defined collation to be implemented in Java. This option is also available on the .NET platform, but the class must implement the Java interface java.util.Comparator. |
| rules | details of the ordering required, using the syntax of the Java RuleBasedCollator | This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of RuleBasedCollator. |
| ignore-case | yes \| no | Indicates whether the case of letters should be ignored: equivalent to strength=secondary. |
| ignore-modifiers | yes \| no | Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Equivalent to strength=primary. |
| ignore-symbols | yes \| no | Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default. |
| ignore-width | yes \| no | Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary. |
| decomposition | none \| standard \| full | Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform. |
| alphanumeric | yes \| no \| codepoint | If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as contains() and substring-before(). |
| case-order | upper-first \| lower-first | Indicates whether upper case letters collate before or after lower case letters. |
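The alphanumeric option is the most algorithmic of these parameters, so a small sketch may help. The code below is not Saxon's implementation; it is a JDK-only illustration of the splitting and pairwise comparison described for alphanumeric=yes. The class name AlphanumericSketch is hypothetical, and the rule used when one string runs out of parts first (the shorter sequence sorts first) is an assumption rather than documented behaviour.

```java
import java.math.BigInteger;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class AlphanumericSketch implements Comparator<String> {
    // Comparator used for the alphabetic parts (for example a java.text.Collator).
    private final Comparator<? super String> alpha;

    AlphanumericSketch(Comparator<? super String> alpha) {
        this.alpha = alpha;
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /**
     * Splits a string into alternating parts: even positions hold alphabetic
     * parts (possibly empty), odd positions hold runs of ASCII digits.
     */
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i)); // alphabetic part, may be zero-length
            if (i < s.length()) {
                start = i;
                while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
                parts.add(s.substring(start, i)); // numeric part, never empty
            }
        }
        return parts;
    }

    @Override
    public int compare(String a, String b) {
        List<String> pa = split(a);
        List<String> pb = split(b);
        int n = Math.min(pa.size(), pb.size());
        for (int i = 0; i < n; i++) {
            int c = (i % 2 == 0)
                    ? alpha.compare(pa.get(i), pb.get(i))
                    : new BigInteger(pa.get(i)).compareTo(new BigInteger(pb.get(i)));
            if (c != 0) return c;
        }
        // Assumption: if all shared parts are equal, the string with fewer parts sorts first.
        return Integer.compare(pa.size(), pb.size());
    }

    public static void main(String[] args) {
        List<String> dates = new ArrayList<>(Arrays.asList("AD1066", "AD985", "AD43"));
        dates.sort(new AlphanumericSketch(Collator.getInstance(Locale.ENGLISH)));
        System.out.println(dates);
    }
}
```

Because numeric parts are compared by value, leading zeros do not affect the ordering in this sketch, which also hints at why such a collation is not suited to substring-matching functions.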

If you want to use your own URIs to define collations, there are two ways of doing this:

In either case, the collation is supplied in the form of an implementation of the interface net.sf.saxon.lib.StringCollator. You must supply methods compareStrings() and comparesEqual() for comparing strings for ordering or equality, and a method getCollationKey() which returns a collation key for any string. If you want your collation to be used in calls of fn:contains(), fn:starts-with(), fn:ends-with(), fn:substring-before(), or fn:substring-after(), then it must also implement the interface net.sf.saxon.lib.SubstringMatcher.