<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet href="../make-menu.xsl" type="text/xsl"?><html>
   <head>
      <this-is section="extensibility" page="collation" subpage=""/>
      <!--
           Generated at 2011-12-09T20:47:22.916Z--><title>Saxonica: XSLT and XQuery Processing: Implementing a collating sequence</title>
      <meta name="coverage" content="Worldwide"/>
      <meta name="copyright" content="Copyright Saxonica Ltd"/>
      <meta name="title"
            content="Saxonica: XSLT and XQuery Processing: Implementing a collating sequence"/>
      <meta name="robots" content="noindex,nofollow"/>
      <link rel="stylesheet" href="../saxondocs.css" type="text/css"/>
   </head>
   <body class="main">
      <h1>Implementing a collating sequence</h1>
      <p>Collations used for comparing strings can be specified by means of a URI. A collation URI may
be used as an argument to many of the <a class="bodylink" href="../functions/intro.xml">standard functions</a>, and 
also as an attribute of <code>xsl:sort</code> in XSLT, and in the <code>order by</code>
clause of a FLWOR expression in XQuery.</p>
      <p>Saxon provides a range of mechanisms for binding collation URIs. The language specifications simply say
that collations used in sorting and in string-comparison functions are identified by a URI, and leave it up to 
the implementation how these URIs are defined.</p>
      <p>There is one predefined collation that cannot be changed. This is the Unicode Codepoint Collation defined in the
W3C specifications, identified by the URI <code>http://www.w3.org/2005/xpath-functions/collation/codepoint</code>.
This collates strings based on the integer values assigned by Unicode to each character; for example, "ah!" sorts before
"ah?" because the Unicode codepoints for "ah!" are (97, 104, 33) while the codepoints for "ah?" are (97, 104, 63).</p>
      <p>You can use the Saxon configuration file to define collations: 
see <a class="bodylink"
            href="../configuration/configuration-file/config-collations.xml">The collations element</a>.</p>
      <p>In addition, by default, Saxon allows a collation URI to take the form
<code>http://saxon.sf.net/collation?keyword=value;keyword=value;...</code>. The query parameters
in the URI can be separated either by ampersands or by semicolons, but semicolons are usually more
convenient, because an ampersand must be escaped as <code>&amp;amp;</code> when the URI is written in an XML document such as a stylesheet.</p>
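      <p>As a rough illustration of the URI format, the following Java sketch splits the query part of such a URI
into keyword/value pairs, accepting either separator. The class <code>CollationUriParams</code> and its
<code>parse()</code> method are hypothetical names introduced for this example only; they are not part of Saxon, and
Saxon's own parsing may differ in detail.</p>
      <pre><![CDATA[
```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Saxon): splits a collation URI into keyword/value pairs. */
class CollationUriParams {

    static Map<String, String> parse(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = uri.indexOf('?');
        if (q < 0 || q == uri.length() - 1) {
            return params; // the URI has no query part
        }
        // Parameters may be separated by semicolons or by ampersands.
        for (String pair : uri.substring(q + 1).split("[;&]")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Both forms yield the same keyword/value pairs.
        System.out.println(parse("http://saxon.sf.net/collation?lang=en-US;strength=primary"));
        System.out.println(parse("http://saxon.sf.net/collation?lang=en-US&strength=primary"));
    }
}
```
]]></pre>
      <p>The sketch ignores any parameter that has no <code>=</code> sign; that is a simplification made for this example.</p>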
      <p>The same keywords are available on the Java and .NET platforms, but because of differences in 
collation support between the two platforms, they may interact in slightly different ways. The same
collation URI may produce different sort orders on the two platforms. (One noteworthy difference is
that the Java collations treat spaces as significant, whereas the .NET collations do not.)</p>
      <p>The keywords available in such a collation URI are the same as in the configuration file, and are as follows:</p>
      <table>
         <thead>
            <tr>
               <td content="para">
                  <p>
                  <b>keyword</b>
               </p>
               </td>
               <td content="para">
                  <p>
                  <b>values</b>
               </p>
               </td>
               <td content="para">
                  <p>
                  <b>effect</b>
               </p>
               </td>
            </tr>
         </thead>
         <tbody>
            <tr>
               <td content="para">
                  <p>class</p>
               </td>
               <td content="para">
                  <p>fully-qualified Java class name of a class that
implements <code>java.util.Comparator</code>.</p>
               </td>
               <td content="para">
                  <p>This parameter should not be combined with any other parameter.
An instance of the requested class is created, and is used to perform
the comparisons. Note that if the collation is to be used
in functions such as <code>contains()</code> and <code>starts-with()</code>, this class must also be a
<code>java.text.RuleBasedCollator</code>. This approach allows a user-defined collation
to be implemented in Java. This option is also available on the .NET platform, but the class must implement
the Java interface <code>java.util.Comparator</code>.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>rules</p>
               </td>
               <td content="para">
                  <p>details of the ordering required, using the syntax of the Java 
<code>RuleBasedCollator</code>
               </p>
               </td>
               <td content="para">
                  <p>This defines exactly how individual characters are collated. (It's not very
convenient to specify this as part of a URI, but the option is provided for completeness.)
This option is also available on the .NET platform, and if used will select a collation
provided using the OpenJDK implementation of <code>RuleBasedCollator</code>.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>lang</p>
               </td>
               <td content="para">
                  <p>any value allowed for <code>xml:lang</code>, for example <code>en-US</code> for US English</p>
               </td>
               <td content="para">
                  <p>This is used to find the collation appropriate to a Java locale or .NET culture. 
The collation may be further tailored using the parameters described below.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>ignore-case</p>
               </td>
               <td content="para">
                  <p>yes, no</p>
               </td>
               <td content="para">
                  <p>Indicates whether upper and lower case letters are considered
equivalent. Note that even when ignore-case is set to "no", case is less significant than
the actual letter value, so that "XPath" and "Xpath" will appear next to each other in the
sorted sequence. On the Java platform, setting ignore-case sets the collation strength to secondary.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>ignore-modifiers</p>
               </td>
               <td content="para">
                  <p>yes, no</p>
               </td>
               <td content="para">
                  <p>Indicates whether non-spacing combining characters 
(such as accents and diacritical marks) are considered
significant. Note that even when ignore-modifiers is set to "no", modifiers are less significant than
the actual letter value, so that "Hofen" and "Höfen" will appear next to each other in the
sorted sequence. On the Java platform, setting ignore-modifiers sets the collation strength to primary.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>ignore-symbols</p>
               </td>
               <td content="para">
                  <p>yes, no</p>
               </td>
               <td content="para">
                  <p>Indicates whether symbols such as whitespace characters and punctuation
marks are to be ignored. This option currently has no effect on the Java platform, where
such symbols are in most cases ignored by default.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>ignore-width</p>
               </td>
               <td content="para">
                  <p>yes, no</p>
               </td>
               <td content="para">
                  <p>Indicates whether characters that differ only in width should be considered
equivalent. On the Java platform, setting ignore-width sets the collation strength to tertiary.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>strength</p>
               </td>
               <td content="para">
                  <p>primary, secondary, tertiary, or identical</p>
               </td>
               <td content="para">
                  <p>Indicates the differences that are considered significant when comparing
two strings. A/B is a primary difference; a/ä is a secondary difference;
A/a is a tertiary difference (though this varies by language). So
if strength=primary then A=a and a=ä are both true; with strength=secondary 
then A=a is true but a=ä is false; with strength=tertiary
then A=a is false. This option should not be combined with the ignore-XXX options. The setting "primary" is
equivalent to ignoring case, modifiers, and width; "secondary" is equivalent to ignoring
case and width; "tertiary" ignores width only; and "identical" ignores nothing.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>decomposition</p>
               </td>
               <td content="para">
                  <p>none, standard, full</p>
               </td>
               <td content="para">
                  <p>Indicates how the collator handles Unicode composed characters. See
the JDK documentation for details. This option is ignored on the .NET platform.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>alphanumeric</p>
               </td>
               <td content="para">
                  <p>yes, no, codepoint</p>
               </td>
               <td content="para">
                  <p>If set to yes, the string is split into a sequence of alphabetic and numeric parts (a numeric
part is any consecutive sequence of ASCII digits; anything else is considered alphabetic). Each numeric part is considered
to be preceded by an alphabetic part even if it is zero-length. The parts are then compared pairwise: alphabetic parts
using the collation implied by the other query parameters, numeric parts using their numeric value. The result is that,
for example, AD985 collates before AD1066. (This is sometimes called natural sorting.)
                  The value <code>codepoint</code> requests alphanumeric collation with the
"alpha" parts being collated by Unicode codepoint, rather than by the default collation for the locale. This may give better
results in the case of strings that contain spaces. Note that an alphanumeric collation cannot be used in conjunction with
functions such as <code>contains()</code> and <code>substring-before()</code>.</p>
               </td>
            </tr>
            <tr>
               <td content="para">
                  <p>case-order</p>
               </td>
               <td content="para">
                  <p>upper-first, lower-first</p>
               </td>
               <td content="para">
                  <p>Indicates whether upper case letters collate before or after lower case
letters.</p>
               </td>
            </tr>
         </tbody>
      </table>
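      <p>The <code>alphanumeric</code> and <code>strength</code> keywords described in the table can be illustrated with
a small Java sketch. It splits each string into alternating alphabetic and numeric parts, compares the alphabetic parts
with a JDK <code>java.text.Collator</code> (standing in here for "the collation implied by the other query parameters"),
and compares the numeric parts by value. The class name <code>NaturalOrder</code> and its <code>split()</code> method are
assumptions made for this example; this is not Saxon's implementation.</p>
      <pre><![CDATA[
```java
import java.math.BigInteger;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Illustrative natural-sort comparator; a sketch only, not Saxon's implementation. */
class NaturalOrder implements Comparator<String> {

    // Comparator applied to the alphabetic parts (for example a java.text.Collator).
    private final Comparator<? super String> alphaOrder;

    NaturalOrder(Comparator<? super String> alphaOrder) {
        this.alphaOrder = alphaOrder;
    }

    private static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /** Splits into parts at even positions (alphabetic, possibly empty) and odd positions (numeric). */
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            int start = i;
            while (i < s.length() && !isAsciiDigit(s.charAt(i))) i++;
            parts.add(s.substring(start, i));                  // alphabetic part, may be zero-length
            start = i;
            while (i < s.length() && isAsciiDigit(s.charAt(i))) i++;
            if (i > start) parts.add(s.substring(start, i));   // numeric part
        }
        return parts;
    }

    @Override
    public int compare(String a, String b) {
        List<String> x = split(a);
        List<String> y = split(b);
        int n = Math.min(x.size(), y.size());
        for (int i = 0; i < n; i++) {
            int c = (i % 2 == 0)
                    ? alphaOrder.compare(x.get(i), y.get(i))
                    : new BigInteger(x.get(i)).compareTo(new BigInteger(y.get(i)));
            if (c != 0) return c;
        }
        // All shared parts are equal: the string with fewer parts sorts first.
        return Integer.compare(x.size(), y.size());
    }

    public static void main(String[] args) {
        Collator primary = Collator.getInstance(Locale.ENGLISH);
        primary.setStrength(Collator.PRIMARY); // ignore case and accent differences
        String[] dates = {"AD1066", "AD985", "ad43"};
        Arrays.sort(dates, new NaturalOrder(primary));
        System.out.println(Arrays.toString(dates));
    }
}
```
]]></pre>
      <p>With a plain string comparison, "AD1066" would sort before "AD985" because the character "1" precedes "9";
comparing the numeric parts by value reverses that. Changing the strength of the collator passed to the constructor changes
only how the alphabetic parts are compared.</p>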
      <p>This format of URI, <code>http://saxon.sf.net/collation?keyword=value;keyword=value;...</code>,
is handled by Saxon's default <a class="bodylink" href="../javadoc/net/sf/saxon/lib/CollationURIResolver.html"><code>CollationURIResolver</code></a>. It is possible to replace or supplement
this mechanism by registering a user-written <code>CollationURIResolver</code>. This must be an implementation
of the Java interface <code>net.sf.saxon.lib.CollationURIResolver</code>, which only requires a single method, 
<code>resolve()</code>, to be implemented. The result of the method is in general a Java <code>Comparator</code>,
though if the collation is to be used in functions such as <code>contains()</code> which match parts of a string
rather than the whole string, then the result must also be an instance of either <code>java.text.RuleBasedCollator</code>,
or of the Saxon interface <code>net.sf.saxon.sort.SubstringMatcher</code>.</p>
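      <p>To show the kind of object such a resolver might return, the following sketch builds a
<code>java.text.RuleBasedCollator</code> from a small rule string using only the JDK. Because <code>Collator</code>
implements <code>java.util.Comparator</code>, the result satisfies both conditions mentioned above. The class name
<code>ReverseAbc</code> and the rule string are invented for this example, and the code deliberately does not show the
Saxon <code>resolve()</code> method itself, whose exact signature should be taken from the Javadoc.</p>
      <pre><![CDATA[
```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Comparator;

/** Illustrative only: a JDK RuleBasedCollator that sorts c before b before a. */
class ReverseAbc {

    // RuleBasedCollator rule syntax: "<" introduces a primary difference.
    static final String RULES = "< c < b < a";

    static RuleBasedCollator create() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            // The rule string is a constant, so a parse failure is a programming error.
            throw new IllegalStateException("Invalid collation rules: " + RULES, e);
        }
    }

    public static void main(String[] args) {
        // A Collator is a Comparator<Object>, so it can be used wherever a Comparator is expected.
        Comparator<Object> collation = create();
        String[] words = {"a", "c", "b"};
        Arrays.sort(words, collation);
        System.out.println(Arrays.toString(words));
    }
}
```
]]></pre>
      <p>The same rule syntax is what the <code>rules</code> keyword of a collation URI expects.</p>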
      <p>In the Java API, a user-written <code>CollationURIResolver</code> is registered with the <a class="bodylink" href="../javadoc/net/sf/saxon/Configuration.html"><code>Configuration</code></a> object,
either directly or in the case of XSLT by using the JAXP <code>setAttribute()</code> method on the
<code>TransformerFactory</code> (the relevant property name is <a class="bodylink"
            href="../javadoc/net/sf/saxon/lib/FeatureKeys.html#COLLATION_URI_RESOLVER"><code>FeatureKeys.COLLATION_URI_RESOLVER</code></a>).
This applies to all stylesheets and queries compiled and executed under that configuration.</p>
      <p>It is also possible to register a collation (for example as an instance of the Java class <code>Collator</code>
or <code>Comparator</code>) with the <code>Configuration</code>. Such explicitly registered collations (together with
those registered via the configuration file) are used before calling the <code>CollationURIResolver</code>.
In addition, the APIs provided for executing XPath and XQuery expressions allow named collations to
be registered by the calling application, as part of the static context.</p>
      <p>At present there are no equivalent facilities in the .NET API (other than the use of the configuration file), 
though it is possible to manipulate collations by dropping down into the Java interface.</p>
      <table width="100%">
         <tr>
            <td>
               <p align="right"><a class="nav" href="localizing.xml">Next</a></p>
            </td>
         </tr>
      </table>
   </body>
</html>
