Unicode Collation Algorithm

XSL Transformations (XSLT) Version 3.0 supports the use of the Unicode Collation Algorithm (UCA) for comparing strings in a variety of locales, with extensive parametric control. The feature is requested by using a collation URI that consists of the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part. The collation can be specified as a collation attribute on an xsl:sort instruction, as an xsl:default-collation attribute within an XSLT tree, or as the $collation argument of a 'comparison' function such as fn:compare() or fn:deep-equal().

The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
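As an illustration of the query format only, the following Python sketch splits a UCA collation URI into its keyword=value pairs and builds one back from a dictionary. The function names parse_uca_uri and build_uca_uri are hypothetical and are not part of Saxon or of the specification. Tolerating empty segments and raising ValueError on a malformed pair are choices made for this sketch, not documented Saxon behaviour.

```python
UCA_BASE = "http://www.w3.org/2013/collation/UCA"


def parse_uca_uri(uri):
    """Split a UCA collation URI into a dict of keyword -> value.

    Returns None when the URI does not use the UCA scheme and path.
    """
    base, _, query = uri.partition("?")
    if base != UCA_BASE:
        return None
    params = {}
    for pair in query.split(";"):
        if not pair:
            # Assumption: an empty query or a stray semicolon is tolerated.
            continue
        keyword, sep, value = pair.partition("=")
        if not sep:
            # Assumption: this sketch rejects anything that is not keyword=value.
            raise ValueError(f"not a keyword=value pair: {pair!r}")
        params[keyword] = value
    return params


def build_uca_uri(params):
    """Build a UCA collation URI from a dict of keyword -> value."""
    if not params:
        return UCA_BASE
    query = ";".join(f"{keyword}={value}" for keyword, value in params.items())
    return f"{UCA_BASE}?{query}"


if __name__ == "__main__":
    uri = build_uca_uri({"reorder": "digit,space", "strength": "secondary"})
    print(uri)
    print(parse_uca_uri(uri))
```

Note that the comma inside reorder=digit,space belongs to the value; only the semicolon separates pairs.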

Full UCA support is provided only in Saxon-PE/EE, from version 9.6 onwards. Saxon-HE uses the fallback behaviour described below.

Saxon-PE/EE uses ICU (International Components for Unicode) to support UCA. More detailed information is available from the ICU site.

The ICU features require a sizeable library (about 7 MB), which may be supplied either in the main JAR file or as a separate JAR. The separate JAR can be either the 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site. If the ICU features have not been loaded in Saxon-PE/EE, the fallback behaviour described below is used.

The parameters and their behaviour in the Saxon-PE/EE implementation are as follows:

fallback
    Values: yes | no
    Default: yes
    Notes: fallback=no will raise errors in the case of unknown parameter keywords or values. Otherwise erroneous parameters are ignored.

lang
    Values: any value allowed for xml:lang, for example en-US for US English, or sr-Cyrl-ME for Cyrillic-script Serbian in Montenegro
    Default: From the locale
    Notes: The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the lang parameter into three possible subcomponents: language-country-variant. The locale used may affect the default values of other parameters; see backwards for an example. For a list of locales supported in this implementation see UCA-supported locales.

version
    Values: string
    Default: 6.2.0.0
    Notes: The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error.

strength
    Values: primary | secondary | tertiary | quaternary | identical, or 1 | 2 | 3 | 4 | 5 as synonyms
    Default: tertiary, but see notes
    Notes: Default strength may be altered by a specific locale.

alternate
    Values: non-ignorable | shifted | blanked
    Default: non-ignorable
    Notes: Behaviour with parameter alternate=blanked is indeterminate.

backwards
    Values: yes | no
    Default: no, but see notes
    Notes: This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example, lang=fr-CA implies a default of backwards=yes, whereas lang=fr defaults to backwards=no.

normalization
    Values: yes | no
    Default: no
    Notes: normalization=yes has not been tested. See ICU documentation for further details.

caseLevel
    Values: yes | no
    Default: no
    Notes: As specification

caseFirst
    Values: upper | lower
    Default: See notes
    Notes: The default is to ignore case preferences.

numeric
    Values: yes | no
    Default: no
    Notes: As specification

reorder
    Values: a comma-separated sequence of reorder codes, where a reorder code is one of space, punct, symbol, currency, digit, or a four-letter script code
    Notes: As specification
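The rule given for the version parameter can be illustrated with a small Python sketch. It is not Saxon's code: the names version_key and resolve_version are hypothetical, and padding shorter version strings with zeros (so that 6.2 compares equal to 6.2.0.0) is an assumption made here to keep the comparison well defined.

```python
SUPPORTED_VERSION = "6.2.0.0"  # the default version listed above


def version_key(version):
    """Turn 'major.minor.update...' into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


def resolve_version(requested, supported=SUPPORTED_VERSION):
    """Return the version actually used for a requested UCA version.

    Requests at or below the supported version are served with the
    supported version; a higher request raises ValueError.
    """
    req = version_key(requested)
    sup = version_key(supported)
    # Assumption: missing trailing components count as zero.
    width = max(len(req), len(sup))
    req += (0,) * (width - len(req))
    sup += (0,) * (width - len(sup))
    if req > sup:
        raise ValueError(
            f"UCA version {requested} is higher than supported version {supported}"
        )
    return supported


if __name__ == "__main__":
    print(resolve_version("5.1"))
    print(resolve_version("6.2"))
```

Comparing integer tuples rather than strings matters here: as strings, "6.10" would sort before "6.2", whereas as version numbers it is the higher of the two.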

UCA Fallback Behaviour

The specification supports fallback behaviour for cases where UCA is not implemented, or only partially implemented. If the query contains the parameter fallback=no and UCA is not supported (as is the case with Saxon-HE, or when the ICU features are not loaded), the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt.

For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation as described in Implementing a collating sequence, with the following re-mapping of parameters:

lang
    Values: language/locale code
    Effect: Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation.

strength
    Values: primary | secondary | tertiary | identical
    Effect: Used as stated.
    Values: 1 | 2 | 3
    Effect: Remapped to primary | secondary | tertiary respectively.
    Values: quaternary | 4 | 5
    Effect: Remapped to identical.

caseFirst
    Values: upper | lower
    Effect: Remapped to case-order=upper-first and case-order=lower-first respectively.

numeric
    Values: yes | no
    Effect: Remapped to alphanumeric=yes and alphanumeric=no respectively.

version, alternate, backwards, normalization, caseLevel, reorder
    Values: -
    Effect: All ignored.
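The re-mapping above can be summarised as a Python sketch operating on a dictionary of query parameters. This only illustrates the listed rules and is not Saxon's implementation: the function name remap_for_fallback is hypothetical, and using LookupError for the 'unknown collation' error and silently dropping unrecognised keywords or values are assumptions of the sketch.

```python
STRENGTH_REMAP = {
    "primary": "primary", "secondary": "secondary",
    "tertiary": "tertiary", "identical": "identical",
    "1": "primary", "2": "secondary", "3": "tertiary",
    "quaternary": "identical", "4": "identical", "5": "identical",
}
CASE_FIRST_REMAP = {"upper": "upper-first", "lower": "lower-first"}
IGNORED = {"version", "alternate", "backwards",
           "normalization", "caseLevel", "reorder"}


def remap_for_fallback(params):
    """Map UCA query parameters to the fallback collation parameters."""
    if params.get("fallback", "yes") == "no":
        # UCA is unavailable and the caller has refused fallback.
        raise LookupError("unknown collation")
    remapped = {}
    for keyword, value in params.items():
        if keyword in IGNORED:
            continue
        if keyword == "lang":
            remapped["lang"] = value
        elif keyword == "strength" and value in STRENGTH_REMAP:
            remapped["strength"] = STRENGTH_REMAP[value]
        elif keyword == "caseFirst" and value in CASE_FIRST_REMAP:
            remapped["case-order"] = CASE_FIRST_REMAP[value]
        elif keyword == "numeric" and value in ("yes", "no"):
            remapped["alphanumeric"] = value
        # Assumption: anything else (including fallback=yes itself) is dropped.
    return remapped


if __name__ == "__main__":
    print(remap_for_fallback(
        {"lang": "fr", "strength": "4", "numeric": "yes", "reorder": "digit"}
    ))
```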