Numbers and dates from ICU

International Components for Unicode (ICU) provides extensive facilities for localized numbering and date formatting, which Saxon-PE and Saxon-EE have supported since version 9.6.

The ICU features require a sizeable (~7 MB) library. It may be supplied either within the main JAR file or as a separate JAR; the separate JAR can be either the 'minimised' version in the Saxonica distribution or a complete ICU4J JAR downloaded from the ICU site.

If the ICU features have not been loaded within Saxon-PE/EE, support for numbering and dates is still provided for Danish, Dutch, Flemish, French (and Belgian French), German, Italian and Swedish, as detailed in the table of Numberings for selected languages.

Numbering

ICU supports a large set of language-specific rulesets for supporting different forms of spelled-out numbering and digit-ordinal treatment. For example:

| Language | Number | Ruleset | XPath | Result |
|---|---|---|---|---|
| en (English) | 1123 | %spellout-cardinal | format-integer(1123,'Ww','en-x-sc') | One Thousand One Hundred Twenty Three |
| en-US (English, US) | 242 | %spellout-ordinal-verbose | format-integer(242,'Ww','en-US-x-sov') | Two Hundred and Forty Second |
| cy (Welsh) | 132 | %spellout-cardinal-feminine | format-integer(132,'Ww','cy-x-scf') | Un Cant Tri Deg Dwy |
| es-PT (Portuguese) | 242 | %spellout-ordinal-feminine | format-integer(242,'Ww','es-PT-x-sof') | Ducentésima Cuadragésima Segunda |
| en (English) | 1123 | | format-integer(1123,'1;o','en') | 1123rd |
| en-US (English, US) | 242 | | format-integer(242,'1;o','en-US') | 242nd |
| cy (Welsh) | 132 | | format-integer(132,'1;o','cy') | 132. |
| es-PT (Portuguese) | 242 | | format-integer(242,'1;o','es-PT') | 242.º |

In all cases, the numbering scheme names and their behaviour appear to be taken from the base language, with no regional variation: for example, es-PT and es-419 would produce the same results.

Explicitly named numbering schemes

The simplest way to invoke a specific ICU numbering scheme (or spellout ruleset) is to name it explicitly. In format-integer() this can be done within the parentheses in the format picture, for example:

format-integer(@value,'1;o(%spellout-ordinal-verbose)',$lang)

When this approach is used, it does not matter whether "o" (for ordinal) or "c" (for cardinal) is specified; the chosen spellout rules take precedence.

Similarly, with xsl:number, you can specify ordinal="%spellout-ordinal-verbose". Again, it does not matter whether the numbering scheme is cardinal, ordinal, or something else: as noted in the XSLT 3.0 specification, the attribute name ordinal is a misnomer.

If a spellout name is chosen that does not exist for the chosen language, Saxon attempts to fall back to a default scheme; it does not report an error.

See Supported ICU numbering schemes for a full list of the names of numbering schemes available in different languages.

Alternative ways to specify numbering schemes

To avoid such a direct dependency on ICU spellout names, and for compatibility with earlier Saxon and XSLT versions, alternative mechanisms are available, as described below.

To invoke one of these schemes, an IETF BCP47 standard private tag can be appended to the language tag, with format -x-code. With a very small number of exceptions (to avoid clashes) these codes are encoded as the sequence of first letters of each word, the result being all lower case: for example %spellout-ordinal-verbose with American English may be invoked using language code en-US-x-sov. Further examples of use are shown in the table above. These private tag codes are recognised for 'word' numbering purposes on both format-integer() and xsl:number language arguments.

A full list of the scheme codes and their support in a given language is given in Supported ICU numbering schemes.
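The first-letter rule for private tag codes can be illustrated with a short Python sketch. This is not Saxon code: the function names `private_tag_code` and `with_scheme` are hypothetical, and the small number of exceptions mentioned above (made to avoid clashes) is deliberately not modelled, so the list of supported schemes remains the authority for the actual codes.

```python
def private_tag_code(ruleset: str) -> str:
    """Abbreviate an ICU ruleset name to the first letter of each word.

    Assumption: words are separated by hyphens and the leading '%' is
    not part of the first word. Exceptions to the rule are not handled.
    """
    words = ruleset.lstrip("%").split("-")
    return "".join(word[0] for word in words if word).lower()


def with_scheme(language: str, ruleset: str) -> str:
    """Append the abbreviated code to a language tag as a private-use subtag."""
    return f"{language}-x-{private_tag_code(ruleset)}"


if __name__ == "__main__":
    # The ruleset and language pairs below are taken from the table above.
    print(with_scheme("en-US", "%spellout-ordinal-verbose"))
    print(with_scheme("cy", "%spellout-cardinal-feminine"))
```

The resulting string is what would be passed as the language argument of format-integer() or as the lang attribute of xsl:number.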

In the absence of such a private tag, the following strategies are adopted:

Cardinal spellout
When directed to generate a cardinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-cardinal-verbose, spellout-cardinal, spellout-cardinal-native, spellout-cardinal-neuter, spellout-cardinal-feminine, spellout-cardinal-masculine. It appears that within ICU all languages contain at least one of these schemes, but if not, any scheme whose name matches the regular expression ^%spellout-cardinal is used (choosing the first provided for the locale).
Ordinal spellout
When directed to generate an ordinal number using the 'w' patterns, the first of the following schemes is used, if found: spellout-ordinal-verbose, spellout-ordinal, spellout-ordinal-native, spellout-ordinal-neuter, spellout-ordinal-feminine, spellout-ordinal-masculine. In the absence of any of these schemes, any scheme whose name matches the regular expression ^%spellout-ordinal is used (choosing the first provided for the locale). In the case of there being no ordinal scheme available for the locale (or a language that does not have ordinals) the default cardinal scheme is used.
Ordinal digits
When producing an ordinal digit suffix (e.g. 13.º in Spanish), the 'digit-ordinal' ruleset is used by default - for those cases where there are specialist forms (e.g. Catalan and Spanish), the private tag must be set to get the specialist behaviour.
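The selection order for spelled-out schemes described above can be sketched in Python as follows. This is a model of the documented behaviour only, not Saxon's implementation: `choose_ruleset` is a hypothetical name, and the sketch assumes the locale's ruleset names are available as an ordered list carrying ICU's leading '%'.

```python
import re

# Preference order from the text: -verbose, plain, -native, -neuter,
# -feminine, -masculine.
SUFFIXES = ["-verbose", "", "-native", "-neuter", "-feminine", "-masculine"]


def choose_ruleset(available, ordinal=False):
    """Pick a spellout ruleset name from an ordered list of names.

    For ordinals, the ordinal family is tried first and the cardinal
    family is used as the fallback when no ordinal scheme exists.
    Returns None if nothing matches (an assumption of this sketch).
    """
    kinds = ["ordinal", "cardinal"] if ordinal else ["cardinal"]
    for kind in kinds:
        base = f"%spellout-{kind}"
        # 1. Try the preferred names in order.
        for suffix in SUFFIXES:
            if base + suffix in available:
                return base + suffix
        # 2. Otherwise take the first name matching ^%spellout-<kind>.
        pattern = re.compile("^" + re.escape(base))
        for name in available:
            if pattern.search(name):
                return name
    return None


if __name__ == "__main__":
    # Hypothetical list of rulesets for some locale.
    names = ["%spellout-numbering", "%spellout-cardinal-masculine",
             "%spellout-cardinal-feminine"]
    print(choose_ruleset(names))                # feminine precedes masculine
    print(choose_ruleset(names, ordinal=True))  # no ordinal: cardinal fallback
```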

English numbering

ICU renders 22 in words as "twenty-two", whereas Saxon has traditionally output "twenty two", with a space rather than hyphen separator. By default, for compatibility, the ICU result for all English schemes (cardinal and ordinal) is modified to use space as the separator. This can be modified by adding the extension hyphen or nohyphen to the language code: for example en-x-hyphen produces "twenty-two" while en-x-nohyphen produces "twenty two". This may be combined with other modifiers, for example en-x-sov-hyphen gives hyphenated verbose output.

The default use of a '-verbose' scheme means that spellout of 118 yields "one hundred and eighteen", following the British usage rather than "one hundred eighteen", which is the (US) shortened form.
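As an illustration of the separator handling described above, the following Python sketch post-processes an ICU result according to the modifiers in the language tag. It is a model under stated assumptions rather than Saxon's code: `apply_english_separator` is a hypothetical name, modifiers are assumed to be the hyphen-separated subtags after "-x-", and the behaviour when both hyphen and nohyphen are present is not specified in the text and is not addressed here.

```python
def apply_english_separator(icu_text: str, language: str) -> str:
    """Apply the default 'space instead of hyphen' rule for English.

    The ICU hyphens are kept only when the 'hyphen' modifier appears
    among the private-use subtags; non-English tags are left untouched.
    """
    main, _, private = language.partition("-x-")
    modifiers = private.lower().split("-") if private else []
    if main.lower().split("-")[0] != "en":
        return icu_text  # the compatibility rule is described for English only
    if "hyphen" in modifiers:
        return icu_text
    return icu_text.replace("-", " ")


if __name__ == "__main__":
    for tag in ("en", "en-x-hyphen", "en-x-nohyphen", "en-x-sov-hyphen"):
        print(tag, "->", apply_english_separator("twenty-two", tag))
```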

Using format-picture options within spelled-out numbering

format-number() and format-integer() can support further implementation-dependent control of numbering through parameters attached to the cardinal (c) and ordinal (o) modifiers. For example, format-integer(1,'Ww;o(-er)','de') is the recommended way to produce "Erster".

In Saxon-PE/EE, there are two forms of such parameterisation supported for spelled-out (i.e. W|w) formats:

For German (lang="de"), ICU at one time did not provide case/gender-variable ordinal word numbering (i.e. only "Erste" and not "Erster"). The Saxon implementation therefore supports the -suffix ordinal option described above, which replaces the trailing 'e' on the generated ordinal. Thus format-integer(1,'w;o(-en)','de') produces "ersten".
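The suffix replacement described for German can be modelled with a few lines of Python. This is only a sketch of the documented effect: `apply_ordinal_suffix` is a hypothetical name, and leaving a word unchanged when it does not end in 'e' is an assumption of this sketch, not documented Saxon behaviour.

```python
def apply_ordinal_suffix(ordinal_word: str, option: str) -> str:
    """Replace the trailing 'e' of a German ordinal with the given suffix.

    `option` is the text inside o(...), for example '-en' or '-er'.
    """
    suffix = option.lstrip("-")
    if ordinal_word.endswith("e"):
        return ordinal_word[:-1] + suffix
    # Assumption: words without a trailing 'e' are returned unchanged.
    return ordinal_word


if __name__ == "__main__":
    print(apply_ordinal_suffix("erste", "-en"))
    print(apply_ordinal_suffix("Erste", "-er"))
```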

Dates

ICU also provides facilities for localized date formatting, principally for names of days of the week and months, though a variety of calendars and epoch naming facilities are also available. In Saxon-PE/EE naming of months and days (through picture fields on format-date()) is localised through the local language in scope. These appear to be all based on the base language, with no regional variations.

As required by the specification, when the requested language locale is not implemented, the result of format-date() or format-dateTime() is prefixed with "[Language: default language code]".

Title case

Protocols for title case of sequences of words can differ markedly between languages, with many keeping strictly to lower-case throughout, and a very few (such as Dutch with 'iJ') having very specialist rules. In Saxon, when title case is requested (e.g. Ww in format-integer() or [MNn] in format-date()) the Saxon implementation follows these rules:

Note that this approach can cause problems: for example, title case could be forced on a language that would otherwise only ever use a uniform case. If you discover issues in a language you are using, please let us know.