XSLT 2.0 and XPath 2.0 Programmer's Reference, 4th Edition (125 page)

While the <xsl:matching-substring> element is being evaluated, the captured groups found during the regular expression match are available using the regex-group() function. This takes an integer argument, which is the number of the captured group that is required. If there is no corresponding subexpression in the regular expression, or if that subexpression didn't match anything, the result is a zero-length string.
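
As a small illustration of the zero-length result, here is a minimal sketch (the input string '12' and the <size> element are invented for the example). The second group is optional and takes no part in the match, so regex-group(2) returns a zero-length string and the unit attribute comes out empty.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any source document; the input string is hard-coded. -->
  <xsl:template match="/">
    <xsl:analyze-string select="'12'" regex="^([0-9]+)(px)?$">
      <xsl:matching-substring>
        <!-- Group 1 holds the digits; group 2 matched nothing here. -->
        <size value="{regex-group(1)}" unit="{regex-group(2)}"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```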

Usage and Examples

Many tasks that require regex processing can be accomplished using the three functions in the core function library (see Chapter 13) that use regular expressions: matches(), replace(), and tokenize(). These are used as follows:

Function      Purpose
matches()     Tests whether a string matches a given regular expression
replace()     Replaces the parts of a string that match a given regular expression with a different string
tokenize()    Splits a string into a sequence of substrings, by finding occurrences of a separator that matches a given regular expression
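
To make the three functions concrete, here is a minimal sketch that calls each one on a literal string. The sample strings and the wrapper element names are invented for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any source document; all inputs are literals. -->
  <xsl:template match="/">
    <results>
      <!-- matches(): true, because the string contains a run of digits -->
      <matches>
        <xsl:value-of select="matches('order 4711', '[0-9]+')"/>
      </matches>
      <!-- replace(): every hyphen becomes a slash, giving 2008/03/13 -->
      <replace>
        <xsl:value-of select="replace('2008-03-13', '-', '/')"/>
      </replace>
      <!-- tokenize(): splits on a comma followed by optional whitespace -->
      <xsl:for-each select="tokenize('red, green,blue', ',\s*')">
        <token><xsl:value-of select="."/></token>
      </xsl:for-each>
    </results>
  </xsl:template>
</xsl:stylesheet>
```

Note that matches() succeeds if the regex matches any substring, so the first call returns true even though the string also contains letters and a space.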

There are many ways to use these functions in an XSLT stylesheet. For example, you might write a template rule that matches customers with a customer number in the form 999-AAAA-99 (this might be the only way, for example, that you can recognize customers acquired as a result of a corporate takeover). Write this as:

<xsl:template match="customer[matches(cust-nr, '^[0-9]{3}-[A-Z]{4}-[0-9]{2}$')]">

There is no need to double the curly braces in this example. The match attribute of <xsl:template> is not an attribute value template, so curly braces have no special significance.
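
By contrast, the regex attribute of <xsl:analyze-string> is an attribute value template, so quantifiers written there need doubled braces. The following sketch shows both cases side by side. It assumes a source document containing <customer> elements with a single <cust-nr> child; the <legacy-customer> output element and its attribute names are invented for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- match is a pattern, not an attribute value template: single braces. -->
  <xsl:template match="customer[matches(cust-nr, '^[0-9]{3}-[A-Z]{4}-[0-9]{2}$')]">
    <!-- regex is an attribute value template: braces must be doubled. -->
    <xsl:analyze-string select="cust-nr"
        regex="^([0-9]{{3}})-([A-Z]{{4}})-([0-9]{{2}})$">
      <xsl:matching-substring>
        <legacy-customer prefix="{regex-group(1)}"
                         code="{regex-group(2)}"
                         suffix="{regex-group(3)}"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```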

The <xsl:analyze-string> instruction is more powerful (but also more complex) than any of these three functions. In particular, none of the three XPath functions can produce new elements or other nodes. The <xsl:analyze-string> instruction can do so, which makes it very useful when you want to find a non-XML structure in the source text (for example, the comma-separated list of numbers mentioned earlier) and convert it into an XML representation (a sequence of elements, say). This is sometimes called up-conversion.
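
A minimal sketch of up-conversion, assuming a hard-coded comma-separated list (the sample values and the <numbers> and <number> element names are invented for the example):

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any source document; the input string is hard-coded. -->
  <xsl:template match="/">
    <numbers>
      <xsl:analyze-string select="'12, 7, 1024'" regex="[0-9]+">
        <!-- Each run of digits becomes an element. The separators are
             dropped because there is no xsl:non-matching-substring child. -->
        <xsl:matching-substring>
          <number><xsl:value-of select="."/></number>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </numbers>
  </xsl:template>
</xsl:stylesheet>
```

The result is a <numbers> element with three <number> children, something that tokenize() alone cannot construct because it returns only strings.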

There are two main ways of using <xsl:analyze-string>, which I will describe as single-match and multiple-match applications. I shall give an example of each.

A Single-Match Example

In the single-match use of <xsl:analyze-string>, a regex is supplied that is designed to match the entire input string. The purpose is to extract and process the various parts of the string using the captured groups. This is all done within the <xsl:matching-substring> child element, which is only invoked once. The <xsl:non-matching-substring> element is used only to define error handling, to deal with the case where the input doesn't match the expected format.

For example, suppose you want to display a date as 13th March 2008, with the ordinal suffix "th" set as a superscript. To achieve this, you need to generate the output 13<sup>th</sup> March 2008 (or rather, text nodes and element nodes corresponding to this serial XML representation). You can achieve the basic date formatting using the format-date() function described in Chapter 13, but to add the markup you need to post-process the output of this function.

Here is the code (for the full stylesheet see single-match.xsl in the download archive):

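<xsl:analyze-string
             select="format-date(current-date(), '[D1o]#[MNn]#[Y]')"
             regex="^([0-9]+)([a-z]+)#([A-Z][a-z]+)#(.*)$">
   <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <sup>
       <xsl:value-of select="regex-group(2)"/>
      </sup>
       <xsl:text> </xsl:text>
      <xsl:value-of select="concat(regex-group(3), ' ', regex-group(4))"/>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
      <xsl:value-of select="."/>
   </xsl:non-matching-substring>
</xsl:analyze-string>
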
Note that the regex is anchored (it starts with ^ and ends with $) to force it to match the whole input string. Unlike regular expressions used in the pattern facet in XML Schema, a regex used in the <xsl:analyze-string> instruction is not implicitly anchored.
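
The effect of anchoring can be seen in a minimal sketch that runs the same input through an unanchored and an anchored regex. The input string 'abc123def' and the element names are invented for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any source document; the input string is hard-coded. -->
  <xsl:template match="/">
    <anchoring>
      <!-- Unanchored: only the digits match; the text on either side
           is passed to xsl:non-matching-substring. -->
      <unanchored>
        <xsl:analyze-string select="'abc123def'" regex="[0-9]+">
          <xsl:matching-substring>
            <match><xsl:value-of select="."/></match>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <rest><xsl:value-of select="."/></rest>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </unanchored>
      <!-- Anchored: the regex must cover the whole string, so nothing
           matches and the entire input is a non-matching substring. -->
      <anchored>
        <xsl:analyze-string select="'abc123def'" regex="^[0-9]+$">
          <xsl:matching-substring>
            <match><xsl:value-of select="."/></match>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <rest><xsl:value-of select="."/></rest>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </anchored>
    </anchoring>
  </xsl:template>
</xsl:stylesheet>
```

The unanchored case produces <rest>abc</rest>, <match>123</match>, and <rest>def</rest>, while the anchored case produces a single <rest>abc123def</rest>.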

In this example I chose, in the <xsl:non-matching-substring> element, to output the whole date as returned by format-date(), without any markup. This error might occur, for example, because the stylesheet is being run in a locale that uses an unexpected representation of ordinal numbers. The alternative would be to call <xsl:message> to report an error and perhaps terminate.
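
That alternative might look roughly like the following sketch. It reuses the select and regex attributes of the example; the message wording is invented, and terminate="yes" is only one possible choice (omitting it would report the problem and carry on).

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Runs against any source document; the input is the current date. -->
  <xsl:template match="/">
    <xsl:analyze-string
        select="format-date(current-date(), '[D1o]#[MNn]#[Y]')"
        regex="^([0-9]+)([a-z]+)#([A-Z][a-z]+)#(.*)$">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
        <sup><xsl:value-of select="regex-group(2)"/></sup>
        <xsl:text> </xsl:text>
        <xsl:value-of select="concat(regex-group(3), ' ', regex-group(4))"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Report the unexpected input and stop the transformation. -->
        <xsl:message terminate="yes">
          <xsl:text>Unexpected date format: </xsl:text>
          <xsl:value-of select="."/>
        </xsl:message>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```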

A Multiple-Match Example

In a multiple-match application, you supply a regular expression that will match the input string repeatedly, breaking it into a sequence of substrings. There are two main ways you can design this:

1. Match the parts of the string that you are interested in. For example, the regex [0-9]+ will match any sequence of consecutive digits, and pass it to the <xsl:matching-substring> element to be processed. The characters that separate groups of digits are passed to the <xsl:non-matching-substring> element, if there is one (you might choose to ignore them completely).

There is a variant of this approach that is useful where there are no separators as such. For example, you might be dealing with a format such as the one used for ISO 8601 durations, which look like this: P12H30M10S, with the requirement to split out the components 12H, 30M, and 10S. The regex [0-9]+[A-Z] will achieve this, passing each component to the <xsl:matching-substring> element in turn.

2. Match the separators between the parts of the string that you are interested in. For example, if the string uses a comma as a separator, the regex ,\s* will match any comma followed optionally by spaces. The fields that appear between the commas will be passed, one at a time, to the <xsl:non-matching-substring> element, while the separators (if you want to look at them at all) are passed to the <xsl:matching-substring> element.
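
The two designs can be sketched side by side as follows. The duration string and regexes are the ones discussed above; the comma-separated sample string and all output element names are invented for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Runs against any source document; both inputs are hard-coded. -->
  <xsl:template match="/">
    <examples>
      <!-- Design 1 (variant without separators): match each component.
           The leading P is a non-matching substring and is ignored. -->
      <duration>
        <xsl:analyze-string select="'P12H30M10S'" regex="[0-9]+[A-Z]">
          <xsl:matching-substring>
            <component><xsl:value-of select="."/></component>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </duration>
      <!-- Design 2: match the separators. The fields between them
           arrive as non-matching substrings. -->
      <fields>
        <xsl:analyze-string select="'red, green,blue'" regex=",\s*">
          <xsl:non-matching-substring>
            <field><xsl:value-of select="."/></field>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </fields>
    </examples>
  </xsl:template>
</xsl:stylesheet>
```

The first instruction yields the components 12H, 30M, and 10S; the second yields the fields red, green, and blue. Which design is more convenient depends on whether the interesting parts or the separators are easier to describe with a regex.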
