XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition (64 page)


The first stages in whitespace handling are the job of the XML parser and are done long before the XSLT processor gets to see the data. Remember that these apply both to source documents and to stylesheets:

  • End-of-line appearing in the textual content of an element is always normalized to a single newline character (x0A). This eliminates the differences between line endings on Unix, Windows, and Macintosh systems. XML 1.1 introduces additional rules to normalize the line endings found on IBM mainframes.
  • The XML parser will normalize attribute values. A tab or newline will always be replaced by a single space, unless it is written as a character reference such as &#x9; or &#xA;. For some types of attribute (anything except type CDATA), a validating XML parser will also remove leading and trailing whitespace, and normalize other sequences of whitespace to a single space character.
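
The two parser rules above can be observed with any conforming XML parser. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree (which sits on the expat parser); the element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Element content using a Windows line ending (CR LF) and an old
# Macintosh line ending (CR). The parser reports both as x0A.
content_doc = b"<a>one\r\ntwo\rthree</a>"
content = ET.fromstring(content_doc).text

# The same attribute value written twice: once with a literal tab,
# once with the tab as a character reference.
attr_doc = b'<a literal="x\ty" reference="x&#x9;y"/>'
attrs = ET.fromstring(attr_doc)

print(repr(content))                  # line endings as seen by the application
print(repr(attrs.get("literal")))     # literal tab has become a space
print(repr(attrs.get("reference")))   # referenced tab survives
```

Because expat is a non-validating parser, this sketch does not show the extra trimming and collapsing that a validating parser applies to non-CDATA attributes.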

This attribute normalization can be significant when the attribute in question is an XPath expression in the stylesheet. For example, suppose you want to test whether a string value contains a newline character. You can write this by passing the newline to the contains() function as a character reference, inside the attribute that holds the expression.

It's important to use the character reference &#xA; here, rather than a real newline, because a newline character would be converted to a space by the XML parser, and the expression would then actually test whether the supplied string contains a space.
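
To see what the XSLT processor would actually receive in each case, the stylesheet fragment can be parsed and the attribute inspected. This is only a sketch: the xsl:if instruction, the variable $s, and the helper name parsed_test_attribute are assumptions chosen for illustration, not taken from the original example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet fragment; {nl} is replaced by the newline
# in whichever form we want to try.
TEMPLATE = (
    '<xsl:if xmlns:xsl="http://www.w3.org/1999/XSL/Transform" '
    'test="contains($s, \'{nl}\')"/>'
)

def parsed_test_attribute(newline_text):
    """Return the test attribute as it looks after XML parsing."""
    element = ET.fromstring(TEMPLATE.format(nl=newline_text))
    return element.get("test")

with_reference = parsed_test_attribute("&#xA;")  # character reference
with_literal = parsed_test_attribute("\n")       # real newline in the source

print(repr(with_reference))  # the XPath string literal holds a newline
print(repr(with_literal))    # the XPath string literal holds a space
```

Only the first form leaves a newline inside the XPath string literal; the second silently turns the test into a search for a space.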

What this means in practice is that if you want to be specific about whitespace characters, write them as character references; if you just want to use them as separators and padding, use the whitespace characters directly.

The XSLT specification assumes that the XML parser will hand over all whitespace text nodes to the XSLT processor. However, the input to the XSLT processor is technically a tree, and the XSLT specification claims no control over how this tree is built. If you use Microsoft's MSXML, or Altova's XSLT processor, then the default action of the parser while building the tree is to remove whitespace text nodes. If you want the parser to behave the way that the XSLT specification expects, you must set configuration options to make this happen; see the vendors' documentation for details.

Once the XML parser has done its work, further manipulation of whitespace may be done by the schema processor. This is more likely to affect source documents than stylesheets, because there is little point in putting a stylesheet through a schema processor. For each simple data type, XML Schema defines whitespace handling (the so-called whitespace facet) as one of three options:

  • Preserve: All whitespace characters in the value are preserved. This option is used for the data type xs:string.
  • Replace: Each newline, carriage return, and tab character is replaced by a single-space character. This option is used for the data type xs:normalizedString and for types derived from it that do not tighten the facet (xs:token, for example, is derived from xs:normalizedString but uses collapse).
  • Collapse: Leading and trailing whitespace is removed, and any internal sequence of whitespace characters is replaced by a single space. This option is used for all other data types (including those where internal whitespace is not actually allowed).
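
The three options can be expressed as a small function. This is a sketch of the rules as described above, not the code of any schema processor; the function name apply_whitespace_facet is made up for illustration.

```python
import re

def apply_whitespace_facet(value, facet):
    """Normalize a lexical value according to a whitespace facet option."""
    if facet == "preserve":
        return value
    # "replace": every tab, newline, and carriage return becomes one space.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    if facet == "replace":
        return replaced
    if facet == "collapse":
        # After replacement, squeeze runs of spaces and trim both ends.
        return re.sub(r" +", " ", replaced).strip(" ")
    raise ValueError("unknown facet option: " + facet)

raw = "  12\t\n 34  "
for option in ("preserve", "replace", "collapse"):
    print(option, repr(apply_whitespace_facet(raw, option)))
```

Note that collapse is defined as replace followed by squeezing and trimming, which is why the function computes the replaced form first.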

When source documents are processed using a schema, the XDM rules say that for attributes, and for elements with simple content (that is, elements that can't have child elements), the typed value of the element or attribute is the value after whitespace normalization has been done according to the XML Schema rules for the particular data type. The string value of an element or attribute may either be the value as originally written or the value obtained by converting the typed value back to a string; implementations are allowed to choose either approach. In the latter case, insignificant leading and trailing whitespace may be lost. However, the string() function itself is almost the only thing that depends on the string value of a node; most expressions use the typed value.

Finally, the XSLT processor applies some processing of its own. By this time entity and character references have been expanded, so there is no difference between a space written as a space and one written as &#x20;. The processing has two steps:

  • Adjacent text nodes are merged into a single text node (normalized in the terminology of the DOM).
  • Then, if a text node consists entirely of whitespace, it is removed (or stripped) from the tree if the containing element is listed in an <xsl:strip-space> definition in the stylesheet. The detailed rules are more complex than this, and also take into account the presence of the xml:space attribute in the source document; see the <xsl:strip-space> element on page 492 in Chapter 6 for details.
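
The two steps can be modelled on a DOM tree. The sketch below uses Python's xml.dom.minidom and is a deliberately simplified model: it matches element names exactly and ignores xml:space, <xsl:preserve-space>, and wildcard name tests, all of which a real XSLT processor takes into account. The function name strip_whitespace_text and the sample document are made up for illustration.

```python
from xml.dom import minidom

def strip_whitespace_text(element, strip_names):
    """Remove whitespace-only text children of elements named in strip_names."""
    for child in list(element.childNodes):
        if child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_text(child, strip_names)
        elif (child.nodeType == child.TEXT_NODE
              and element.tagName in strip_names
              and child.data.strip(" \t\r\n") == ""):
            element.removeChild(child)

doc = minidom.parseString("<list>\n  <item> a </item>\n  <item> </item>\n</list>")
doc.normalize()  # step 1: merge adjacent text nodes
strip_whitespace_text(doc.documentElement, {"list"})  # step 2: strip
result = doc.documentElement.toxml()
print(result)
```

Only "list" is named as a stripping element here, so the indentation between the items disappears while the whitespace-only content of the second item is kept.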
