Unicode Collation Algorithm

This section provides more detail on Saxon's support for the Unicode Collation Algorithm.

The Unicode Collation Algorithm is implemented using different libraries on different platforms, with differing levels of conformance.

This feature is requested by using a collation URI that consists of the scheme and path http://www.w3.org/2013/collation/UCA followed by an optional query part. The collation can be specified as a collation attribute on an xsl:sort instruction, as an xsl:default-collation attribute within an XSLT tree, or as the $collation argument of a comparison function such as fn:compare() or fn:deep-equal().

The query is a semicolon-separated sequence of zero or more keyword=value pairs, e.g. ...UCA?reorder=digit,space;strength=secondary. Full details of the query format and parameters can be found in The Unicode Collation Algorithm section of the specification. This section discusses the Saxon implementation.
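The query format described above can be illustrated with a small helper that assembles and splits such a URI. This is only a sketch: the class UcaCollationUri and its methods build and parse are hypothetical names, not part of Saxon, and the code performs no URI escaping or validation of keywords.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not a Saxon API): assembles a UCA collation URI
// from keyword=value pairs separated by semicolons.
final class UcaCollationUri {
    static final String BASE = "http://www.w3.org/2013/collation/UCA";

    static String build(Map<String, String> params) {
        if (params.isEmpty()) {
            return BASE; // the query part is optional
        }
        StringJoiner query = new StringJoiner(";", "?", "");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return BASE + query;
    }

    // Splits the query part back into keyword=value pairs, preserving order.
    static Map<String, String> parse(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : uri.substring(q + 1).split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // Pairs without a keyword or '=' are skipped in this sketch.
        }
        return params;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("reorder", "digit,space");
        params.put("strength", "secondary");
        // Prints http://www.w3.org/2013/collation/UCA?reorder=digit,space;strength=secondary
        System.out.println(build(params));
    }
}
```

The resulting string is what would be supplied as the collation attribute or $collation argument.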

Saxon supports the W3C-defined parameters as follows:

keyword

values

default

Notes

fallback

yes | no

yes

With fallback=no, unknown parameter keywords or values raise an error. Otherwise, erroneous parameters are ignored.

lang

any value allowed for xml:lang, for example en-US for US English, or sr-Cyrl-ME for Serbian in Cyrillic script as used in Montenegro

From the locale

The implementation uses an appropriate collation for the requested locale from the ICU environment, splitting the lang parameter into three possible subcomponents: language-country-variant. The locale used may affect the default values of other parameters; see backwards for an example.

For a list of locales supported in this implementation, see UCA-supported locales.

version

string

6.2.0.0

The version of the UCA to be used. Interpreted as an ascending sequence of major.minor.update... version numbers. Requests for versions less than or equal to the current supported version are processed with the current version. Requests for a higher version raise an error.

strength

primary | secondary | tertiary | quaternary | identical, or 1 | 2 | 3 | 4 | 5 as synonyms

tertiary, but see notes

Default strength may be altered by a specific locale.

maxVariable

space | punct | symbol | currency

punct

Determines which characters are considered as "noise" for the purposes of the alternate parameter. The default value punct causes whitespace and punctuation to be treated as noise characters. (Note that this includes characters that are obviously punctuation, like full-stop, comma, and parentheses, while excluding symbols such as the plus sign, equals sign, and copyright sign. However, - (hyphen), #, &, %, and * are classed as punctuation.)

alternate

non-ignorable | shifted | blanked

non-ignorable

This (poorly named) property controls the handling of "noise" characters such as spaces and punctuation. More specifically, it controls the handling of characters up to the value of maxVariable. For example, if maxVariable=punct then it affects the handling of whitespace and punctuation, while if maxVariable=currency then it also affects the handling of currency symbols. The value non-ignorable causes noise characters to be treated as first-class characters in their own right. The value shifted indicates that noise characters are treated as a quaternary distinction between strings (less significant than differences in accents or case), while blanked indicates that they are used only to distinguish strings that would otherwise be considered identical. The value blanked is not supported directly in the ICU library; if requested, it is handled by requesting alternate=shifted with strength=tertiary.

backwards

yes | no

no, but see notes

This is principally used for backwards-order comparison of (French) accents at secondary strength, and the default may be set by the locale used. For example, lang=fr-CA implies a default of backwards=yes, whereas lang=fr defaults to backwards=no.

normalization

yes | no

no

normalization=yes has not been tested. See the ICU documentation for further details.

caseLevel

yes | no

no

As specified by W3C.

caseFirst

upper | lower

See notes

The default is to ignore case preferences.

numeric

yes | no

no

As specified by W3C.

reorder

a comma-separated sequence of reorder codes, where a reorder code is one of space, punct, symbol, currency, digit, or a four-letter script code

As specified by W3C. Saxon testing revealed a bug in the ICU library, which has been reported but had not been fixed at the time of writing.
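The version rule in the table above (a request at or below the supported version is served by the supported version, and a higher request is an error) can be sketched as follows. This is an illustration, not Saxon's code: the class UcaVersionCheck, its method names, the use of IllegalArgumentException, and the treatment of missing trailing components as zero are all assumptions.

```java
// Illustrative sketch only; names and exception type are assumptions.
final class UcaVersionCheck {
    // Default version taken from the table above.
    static final String SUPPORTED = "6.2.0.0";

    // Compares dotted version strings component by component.
    // Missing trailing components are treated as zero (an assumption).
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        int n = Math.max(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    // Returns the version actually used, or throws if the request is newer.
    static String resolve(String requested) {
        if (compareVersions(requested, SUPPORTED) > 0) {
            throw new IllegalArgumentException(
                "Requested UCA version " + requested
                + " is higher than supported version " + SUPPORTED);
        }
        return SUPPORTED;
    }

    public static void main(String[] args) {
        System.out.println(resolve("5.1")); // served by the supported version
        try {
            resolve("7.0");
        } catch (IllegalArgumentException e) {
            System.out.println("Error: " + e.getMessage());
        }
    }
}
```

Comparing the components numerically rather than as strings matters here: 6.10 is a higher version than 6.2, although it sorts lower as text.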

Additional Saxon parameters

In addition to the standard parameters, Saxon supports some further parameters of its own:

keyword

values

effect

class

fully-qualified Java class name of a class that implements java.util.Comparator.

This parameter should not be combined with any other parameter. An instance of the requested class is created, and is used to perform the comparisons. Note that if the collation is to be used in functions such as contains() and starts-with(), this class must also be a java.text.RuleBasedCollator. This approach allows a user-defined collation to be implemented in Java. This option is also available on the .NET platform, but the class must implement the Java interface java.util.Comparator.

rules

details of the ordering required, using the syntax of the Java RuleBasedCollator

This defines exactly how individual characters are collated. (It's not very convenient to specify this as part of a URI, but the option is provided for completeness.) This option is also available on the .NET platform, and if used will select a collation provided using the OpenJDK implementation of RuleBasedCollator.

ignore-case

yes | no

Indicates whether the case of letters should be ignored. Setting ignore-case=yes is equivalent to strength=secondary.

ignore-modifiers

yes | no

Indicates whether non-spacing combining characters (such as accents and diacritical marks) are considered significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the sorted sequence. Setting ignore-modifiers=yes is equivalent to strength=primary.

ignore-symbols

yes | no

Indicates whether symbols such as whitespace characters and punctuation marks are to be ignored. This option currently has no effect on the Java platform, where such symbols are in most cases ignored by default.

ignore-width

yes | no

Indicates whether characters that differ only in width should be considered equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary.

decomposition

none | standard | full

Indicates how the collator handles Unicode composed characters. See the JDK documentation for details. This option is ignored on the .NET platform.

alphanumeric

yes | no | codepoint

If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that, for example, AD985 collates before AD1066. (This is sometimes called natural sorting.) The value "codepoint" requests alphanumeric collation with the "alpha" parts being collated by Unicode codepoint, rather than by the default collation for the Locale. This may give better results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with functions such as contains() and substring-before().

case-order

upper-first | lower-first

Indicates whether upper case letters collate before or after lower case letters.
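The splitting and pairwise comparison described for the alphanumeric parameter can be sketched as a java.util.Comparator. This is not Saxon's implementation. The class name AlphanumericComparator, the use of BigInteger for numeric parts, and the rule that a string with fewer parts sorts first when all shared parts are equal are assumptions. Comparator.naturalOrder() stands in for the collation implied by the other query parameters.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of "natural sorting"; not Saxon's implementation.
final class AlphanumericComparator implements Comparator<String> {
    private final Comparator<String> alphaComparator;

    AlphanumericComparator(Comparator<String> alphaComparator) {
        this.alphaComparator = alphaComparator;
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Splits into alternating parts: even indexes are alphabetic (possibly
    // zero-length at the start), odd indexes are runs of ASCII digits.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i)); // alphabetic part, may be ""
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
            if (i > start) parts.add(s.substring(start, i)); // numeric part
        }
        return parts;
    }

    @Override
    public int compare(String a, String b) {
        List<String> x = split(a);
        List<String> y = split(b);
        int n = Math.min(x.size(), y.size());
        for (int i = 0; i < n; i++) {
            int c = (i % 2 == 0)
                ? alphaComparator.compare(x.get(i), y.get(i))
                : new BigInteger(x.get(i)).compareTo(new BigInteger(y.get(i)));
            if (c != 0) return c;
        }
        // Assumption: the string with fewer parts sorts first.
        return Integer.compare(x.size(), y.size());
    }

    public static void main(String[] args) {
        String[] ids = {"AD1066", "AD985"};
        Arrays.sort(ids, new AlphanumericComparator(Comparator.naturalOrder()));
        System.out.println(Arrays.toString(ids)); // [AD985, AD1066]
    }
}
```

A plain string comparison would place AD1066 first, because the character 1 precedes 9. Comparing the numeric parts by value gives the order described above.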

UCA fallback behaviour

The specification supports fallback behaviour for the case where the UCA is not implemented, or is only partially implemented. If the query contains the parameter fallback=no and the UCA is not supported (as is the case with Saxon-HE, or when ICU features are not loaded), then the request raises an 'unknown collation' error. If fallback is yes or absent, the implementation makes a best-effort attempt.

For Saxon-HE, or in the absence of ICU, this involves building a tailored collation based on the Java library/Unicode implementation, with the following re-mapping of parameters:

keyword(s)

values

effect

lang

language/locale code

Use an appropriate collation for the requested locale from the Java environment, if available, else codepoint collation.

strength

primary | secondary | tertiary | identical

used as stated.

1 | 2 | 3

remapped to primary | secondary | tertiary respectively.

quaternary | 4 | 5

remapped to identical.

caseFirst

upper | lower

remapped to case-order=upper-first and case-order=lower-first respectively.

numeric

yes | no

remapped to alphanumeric=yes and alphanumeric=no respectively.

version, alternate, backwards, normalization, caseLevel, reorder

-

All ignored.
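The remapping in the table above can be summarised in code. The following is a sketch under stated assumptions, not Saxon's implementation. The class UcaFallbackRemap and its methods are hypothetical, IllegalStateException stands in for whatever error Saxon actually raises, and keywords not listed in the table are simply dropped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not Saxon's implementation.
final class UcaFallbackRemap {

    // Maps W3C UCA parameters to the Saxon parameters used when ICU is absent.
    static Map<String, String> remap(Map<String, String> uca) {
        if ("no".equals(uca.get("fallback"))) {
            // Exception type is an assumption; the text only names the error.
            throw new IllegalStateException("unknown collation");
        }
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : uca.entrySet()) {
            String value = e.getValue();
            switch (e.getKey()) {
                case "lang":
                    out.put("lang", value);
                    break;
                case "strength":
                    out.put("strength", remapStrength(value));
                    break;
                case "caseFirst":
                    if (value.equals("upper") || value.equals("lower")) {
                        out.put("case-order", value + "-first");
                    }
                    break;
                case "numeric":
                    if (value.equals("yes") || value.equals("no")) {
                        out.put("alphanumeric", value);
                    }
                    break;
                default:
                    // version, alternate, backwards, normalization,
                    // caseLevel and reorder are ignored.
                    break;
            }
        }
        return out;
    }

    static String remapStrength(String value) {
        switch (value) {
            case "1": return "primary";
            case "2": return "secondary";
            case "3": return "tertiary";
            case "quaternary":
            case "4":
            case "5": return "identical";
            default: return value; // primary, secondary, tertiary, identical
        }
    }

    public static void main(String[] args) {
        Map<String, String> uca = new LinkedHashMap<>();
        uca.put("lang", "fr");
        uca.put("strength", "4");
        uca.put("caseFirst", "upper");
        uca.put("backwards", "yes");
        System.out.println(remap(uca));
    }
}
```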