XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition

XSLT 2.0 and XPath 2.0 share similar mechanisms for dealing with collations, because they are needed not only in sorting but also in defining what operators such as eq mean, and in functions such as distinct-values(). The assumption behind the design is that many computing environments (for example the Windows operating system, the Java virtual machine, or the Oracle database platform) already include extensive mechanisms for defining and customizing collations, and that XSLT processors will be written to take advantage of these. As a result, sort order cannot be expected to be identical from one implementation to another.

The basic model is that a collation (a set of rules for determining string ordering) is identified by a URI. Like a namespace URI, this is an abstract identifier, not necessarily the location of a document somewhere on the Web. The form of the URI, and its meaning, is entirely up to the implementation. There is a proposal (RFC 4790) for IANA (the Internet Assigned Numbers Authority) to set up a register of collation names, but even if this comes to fruition, it will still be up to the implementation to decide whether to support these registered collations or not. Until such time, the best you can do to achieve interoperability is pass the collation URI to the stylesheet as a parameter; the API can then sort out the logic for choosing different collations according to which processor you are using.

The Unicode consortium has published an algorithm for collating strings called the Unicode Collation Algorithm (see http://www.unicode.org/unicode/reports/tr10/index.html). Although the XSLT specification refers to this document, it doesn't say that implementations have to support it. In practice, many of the facilities available in platforms such as Windows and Java are closely based on this algorithm. The Unicode Collation Algorithm is not itself a collation, because it can be parameterized. Rather, it is a framework for defining a collation with the particular properties that you are looking for.

You can specify the URI of the collation to be used in the collation attribute of the <xsl:sort> element. This is an attribute value template, so its value can contain an expression in curly braces, which lets you use a collation that has been passed to the stylesheet as a parameter.
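
A minimal sketch of this technique follows. It assumes an input document of the form <names><name>...</name></names>, and the parameter name collation-uri is made up for the example; neither comes from the original text. The parameter defaults to the Unicode codepoint collation, so the stylesheet still runs when the caller supplies no value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter name. The calling application can override it
       with whatever collation URI the processor in use understands. -->
  <xsl:param name="collation-uri"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Hypothetical input structure: a names element containing name children -->
  <xsl:template match="names">
    <sorted>
      <xsl:for-each select="name">
        <!-- The curly braces mark an attribute value template, so the
             parameter is evaluated at run time to give the collation URI -->
        <xsl:sort select="." collation="{$collation-uri}"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </sorted>
  </xsl:template>

</xsl:stylesheet>
```

How the parameter value is supplied (command-line option, API call) depends on the processor, which is why the choice of URI is best left to the calling application.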

There is one collation URI that every implementation is required to support, called the Unicode codepoint collation (not to be confused with the Unicode Collation Algorithm mentioned earlier). This is selected using the URI

http://www.w3.org/2005/xpath-functions/collation/codepoint

Under the codepoint collation, strings are simply compared using the numeric code values of the characters in the string: if two characters have the same Unicode codepoint they are equal, and if one has a numerically lower Unicode codepoint, then it comes first. This isn't a very sophisticated or user-friendly algorithm, but it has the advantage of being fast. If you are sorting strings that use a limited alphabet, for example part numbers, then it is probably perfectly adequate.
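
To illustrate equality under the codepoint collation, the sketch below passes the collation URI as the second argument of distinct-values(). The variable name and the literal strings are invented for the example. Because a and A have different codepoints, they are treated as different values, while the repeated a is removed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Illustrative variable holding the codepoint collation URI -->
  <xsl:variable name="codepoint"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <xsl:template match="/">
    <values>
      <!-- Two of the three strings survive: 'a' and 'A' differ by codepoint -->
      <xsl:for-each select="distinct-values(('a', 'A', 'a'), $codepoint)">
        <value><xsl:value-of select="."/></value>
      </xsl:for-each>
    </values>
  </xsl:template>

</xsl:stylesheet>
```

The order in which distinct-values() returns its results is left to the implementation, so only the number of values should be relied on here.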

Codepoint collation is subtly different from string comparison in languages such as Java, which compares strings as sequences of 16-bit UTF-16 code units. Java represents Greek Zero Sign (x1018A) as a surrogate pair (xD800, xDD8A), and therefore sorts it before Wavy Overline (xFE4B). In XSLT, Wavy Overline comes first because its codepoint is lower.
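
The comparison described above can be checked with the three-argument form of the compare() function. This is a small sketch; the variable name is invented for the example, and the two characters are written as XML character references.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Illustrative variable holding the codepoint collation URI -->
  <xsl:variable name="codepoint"
      select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <xsl:template match="/">
    <!-- compare() returns -1, 0, or 1. Greek Zero Sign (x1018A) has the
         higher codepoint, so it sorts after Wavy Overline (xFE4B) and
         the result here is 1. -->
    <result>
      <xsl:value-of select="compare('&#x1018A;', '&#xFE4B;', $codepoint)"/>
    </result>
  </xsl:template>

</xsl:stylesheet>
```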

If you specify a collation that the implementation doesn't recognize, then it raises an error. However, the word “recognize” is deliberately vague. An implementation could choose to recognize every possible collation URI that you might throw at it, and never raise this error at all. More probably, an implementation might decide to use parameterized URIs (for example, allowing a component such as language=fr to select the target language), and it's then an implementation decision whether to “recognize” a URI that contains invalid or missing parameters.

If you don't specify the collation attribute on <xsl:sort>, you can provide a hint as to what kind of collation you want by specifying the lang and/or case-order attributes. These are retained from XSLT 1.0, which didn't support explicit collation URIs, but they are still available for use in 2.0.

  • The lang attribute specifies the language whose collation rules are to be used (this might be the language of the data, or the language of the target user). Its value is specified in the same way as the standard xml:lang attribute defined in the XML specification; for example, lang="en-US" refers to U.S. English and lang="fr-CA" refers to Canadian French.
  • Knowing the language doesn't help you decide whether upper-case or lower-case letters should come first (every dictionary in the world has its own rules on this), so XSLT makes this a separate attribute, case-order. Generally, case order will be used only to decide the ordering of two words that compare equal if case is ignored. For example, in German, where an initial upper-case letter can change the meaning of a word, some dictionaries list the adjective drall (meaning plump or buxom) before the unrelated noun Drall (a swerve, twist, or bias), while others reverse the order. Specifying case-order="lower-first" would place drall immediately before Drall, while case-order="upper-first" would have Drall immediately followed by drall.
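
As a sketch of how these two attributes might be combined, the stylesheet below sorts German words with lower-case forms first. The input structure (<words><word>...</word></words>) is an assumption made for the example. Because lang and case-order are only hints, the exact ordering still depends on the collation the processor chooses.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical input structure: a words element containing word children -->
  <xsl:template match="words">
    <words>
      <xsl:for-each select="word">
        <!-- No collation attribute is given, so lang and case-order
             tell the processor what kind of collation is wanted -->
        <xsl:sort select="." lang="de" case-order="lower-first"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </words>
  </xsl:template>

</xsl:stylesheet>
```

Changing the value to case-order="upper-first" requests the opposite treatment for words that differ only in case.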
