<?xml version="1.0"?><!DOCTYPE article SYSTEM "dtd/ideadb.dtd">
<?xml-stylesheet href="docbook-css-0.4/driver.css" type="text/css"?>
<article>
<title>Up-conversion using XSLT 2.0</title>
<articleinfo>
<author><firstname>Michael</firstname><surname>Kay</surname>
<affiliation>
<jobtitle>Director</jobtitle>
<orgname>Saxonica Limited <ulink url="http://www.saxonica.com/">http://www.saxonica.com/</ulink></orgname>
<address>
<city>Reading</city>
<country>United Kingdom</country>
<email>mike@saxonica.com</email>
</address></affiliation>
<personblurb><para>Michael Kay is the editor of the XSLT 2.0 specification, the developer
of the Saxon XSLT and XQuery processor, and the author of the books 
<emphasis role="italic">XSLT 2.0 Programmer's Reference</emphasis> and
<emphasis role="italic">XPath 2.0 Programmer's Reference</emphasis> from Wrox Press.</para>

<para>He founded Saxonica Limited earlier this year to continue the development of
the Saxon software and to provide consultancy services to XSLT and XQuery users.</para>
</personblurb>
</author>
<keywordset>
<keyword>XSLT</keyword>
</keywordset>
</articleinfo>
<abstract>

<para>XSLT 2.0 provides a wide range of new features, many of which make
light work of tasks that are notoriously difficult in XSLT 1.0, such as grouping
and string manipulation. This paper attempts to show how these facilities not only make
coding easier, but also extend the scope of the language, making it possible
to tackle problems that were quite outside the range of XSLT 1.0.
</para>
     

<para>The paper presents a case study of a multi-phase transformation
that takes data from a legacy ASCII-based interchange format and converts it to
XML based on a standardized vocabulary.
The transformations illustrate the power of new features including regular
expression handling, grouping, recursive functions, and schema-aware processing.
</para>

<para>The conclusion of the paper is that these new facilities - notably 
regular expression handling and grouping - take XSLT into new territory, where 
languages such as Perl previously reigned supreme. XSLT 1.0 works best where 
all the structure in a document is already identified by markup. XSLT 2.0 will 
also be able to handle many situations where the structure is implicit in the text, 
or in markup designed for presentation purposes rather than to capture the information 
semantics. It thus becomes a powerful tool for "up-conversion" applications. 
These facilities work well in conjunction with schema-aware processing, 
where the aim of the exercise is to create XML that conforms to a target schema.
</para>
</abstract>
<section><title>Introduction</title>
<para>
XSLT 1.0 became a W3C Recommendation in November 1999; it has attracted at least twenty
 implementations and a very sizeable user base. It is used mainly for two distinct
 applications: rendering of XML documents by converting them into a presentation-oriented
 vocabulary (usually HTML, sometimes XSL-FO, XHTML, or SVG); and conversion of data-oriented
 XML messages, either into a different vocabulary, or to a different document using the same
 vocabulary, but with different information content. Within these two categories there
 are some highly creative and innovative applications, a notable example being
 Schematron, which uses XSLT transformations to apply structural and semantic validation
 rules to a document.</para>
 <para>
 Although XSLT 1.0 is designed to transform source XML trees into result XML trees, it
 also includes three serialization methods, allowing the result tree to be output either as
 lexical XML, HTML, or text. This enables a wide range of applications in which the output
 is in textual form: I have seen XSLT stylesheets that generated Java programs, SQL code,
 comma-separated-values files, and EDI messages.
</para>
<para>However, this ability to generate multiple output formats is not mirrored on the input
side. XSLT 1.0 has very little capability to take anything other than XML as its input. There are
ways around this: for example in the first edition of my book 
<emphasis role="italics">XSLT Programmer's Reference</emphasis> I showed how one could write
a parser for a non-XML format such as the GEDCOM 5.5 format used for genealogical data, and by
making this parser implement the SAX interface supported by many XSLT processors, one could 
present the parsed input data to the XSLT 1.0 processor as if it came from an XML parser.
However, this is really only a minor improvement on what can be achieved by writing a GEDCOM-to-XML
converter as a standalone application.</para>
<para>XSLT 2.0, as I will show in this paper, greatly extends the ability of XSLT to process any
textual input, without the need to write conversion code in Java or another procedural programming
language. It therefore enables XSLT to be used not only for XML-to-XML and XML-to-text applications,
but also for text-to-XML conversions. More generally, it allows XSLT 2.0 to be used for up-conversion.
</para>
<para>In the broadcasting industry, the term <emphasis role="italics">upconversion</emphasis> 
(usually without a hyphen) is used
to mean the conversion of a low-frequency video format to an equivalent high-frequency format.
In the SGML and XML world, the word refers to the generation of a format with detailed markup
from a format with less-detailed or no markup, where it is necessary to generate the additional
markup by recognizing structural patterns that are implicit in the textual content itself.
By extension the term is also used for converting non-SGML or non-XML markup into SGML/XML: this
usage is justified, of course, on the basis that SGML/XML is obviously on a higher plane than
any alternative markup language!</para>

<para>I will start this paper with a survey of the new features in XSLT 2.0 that make it easier
to write up-conversion transforms (it really doesn't make much sense to call them stylesheets
any more, but I will slip into that usage occasionally). I will then present a case study of
a particular up-conversion. I will use the example I mentioned earlier, conversion of GEDCOM
genealogical data: but this time, the entire job will be done in XSLT 2.0, with no need to write
preprocessing software in a procedural language.</para>

</section>
<section>
<title>Up-Conversion Facilities in XSLT 2.0</title>

<para>In this section I will describe how four of the new features in XSLT 2.0 can be used
to assist in writing up-conversion applications. The four features discussed are:</para>

<itemizedlist>
<listitem><para>The unparsed-text() function</para></listitem>
<listitem><para>Regular expression processing</para></listitem>
<listitem><para>Grouping facilities</para></listitem>
<listitem><para>Schema-aware processing</para></listitem>
</itemizedlist>

<para>The descriptions here are brief introductions to these facilities: for full information,
see the W3C specifications of XSLT 2.0 <xref linkend="xslt20"/> and XPath 2.0 
<xref linkend="xpath20"/>, or my books 
<emphasis role="italic">XSLT 2.0 Programmer's Reference</emphasis> <xref linkend="xslt20pr"/> and
<emphasis role="italic">XPath 2.0 Programmer's Reference</emphasis> <xref linkend="xpath20pr"/>. </para>



<section><title>The unparsed-text() function</title>

<para>In order to handle non-XML input, the first thing a stylesheet needs to be able to do is
to read it. For this purpose, XSLT 2.0 provides the unparsed-text() function. This takes a URI as its
first argument, and loads the text of the resource found at that URI. The result is a character
string - that is, a value of type <code>xs:string</code>, where "xs" is the XML Schema namespace. 
The type system of XSLT 2.0 is based on the types defined in the XML Schema specification.</para>

<para>In fact, it was already possible in XSLT 1.0 to provide a stylesheet with non-XML input,
 in the form of a string-valued stylesheet parameter (parameters can be declared using a global
 &lt;xsl:param&gt; element). However, this imposes constraints: for example, it is difficult to 
 handle a variable number of such inputs. Allowing URI-addressable resources to be accessed 
 directly makes the job much easier.</para>

<para>Character encoding is of course a problem. The unparsed-text() function allows a second
parameter to specify the character encoding explicitly, or it can be guessed from external 
information - the XSLT 2.0 spec refers to the algorithms and heuristics defined in the 
XInclude specification for this purpose. In practice, if the file is an arbitrary file in operating
system filestore with no associated metadata, guessing its encoding is sometimes going to give 
wrong answers. Sadly, there is no easy solution to this difficulty.</para>

<para>The fact that the result of the unparsed-text() function must be an <code>xs:string</code> imposes a
constraint: the only characters allowed in the file are those permitted in XML documents. This
same constraint also applies to any text output produced by a stylesheet. It means that XSLT
is now capable of reading textual input and writing textual output, but it cannot be used to
handle binary input or binary output, unless these are first translated into some textual
representation.</para>
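
<para>As a minimal sketch of how URI-addressable input copes with a variable number of
resources, the following stylesheet loads every file named in a parameter. The parameter
name <code>uris</code>, the variable name <code>texts</code>, the entry template <code>main</code>,
and the choice of UTF-8 are assumptions made for illustration; they are not taken from the case study.
How a sequence-valued parameter is supplied depends on the processor's invocation interface.</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical parameter: zero or more URIs supplied by the caller -->
  <xsl:param name="uris" as="xs:string*" required="yes"/>

  <!-- One string per resource; the encoding is stated explicitly rather than guessed -->
  <xsl:variable name="texts" as="xs:string*"
                select="for $u in $uris return unparsed-text($u, 'utf-8')"/>

  <!-- Report how many characters were read from each resource -->
  <xsl:template name="main">
    <files>
      <xsl:for-each select="$texts">
        <file length="{string-length(.)}"/>
      </xsl:for-each>
    </files>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>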
</section>
<section><title>Regular expression processing</title>

<para>XSLT 1.0 has been much criticized for its rather primitive text-handling capabilities:
the function library provided for string handling in XPath 1.0 is designed very much on
"reduced instruction set computing" principles - you can achieve pretty well anything, but 
the complexity of the programming needed even for some quite simple tasks can be daunting. In 
particular, for many users (whether or not they have a programming background), writing string
manipulation routines in terms of recursive templates can present a big conceptual barrier.</para>

<para>I don't know the history of the decisions that brought this situation about. I have always
thought the statement at the start of the XSLT 1.0 specification, to the effect that XSLT is not
a general-purpose programming language, was very suggestive: committees don't put a statement
like that in a specification unless there has been a vigorous debate on the matter, and the fact that the 
statement is there means there must have been a strong "keep it simple" camp on the working group
who won the debate. Which is probably a good thing, given the length of time the world has been
waiting for an XQuery recommendation.</para>

<para>But the fact is, there is a large class of applications for which the text processing capability
in XSLT 1.0 is woefully inadequate - and this includes most up-conversion applications. XSLT 1.0 is very
good at performing structural transformations - that is, at rearranging the nodes in a tree. It is much
less good at manipulating the textual content of those nodes. By definition, up-conversion applications
are those where the input doesn't have explicit structure, but rather has structure that is implicit
in the text, and therefore they need good text processing capability.</para>

<para>Users of Perl and similar languages have long been accustomed to the power of regular
expressions (regexes). In fact, they are so powerful they can become addictive: whereas programmers from
other disciplines might turn to regular expressions as a last resort, there are Perl programmers
who see almost any problem as an opportunity for creativity in their use of regexes.</para>

<para>XPath 2.0 offers three functions in its standard function library that perform regular expression
processing. Specifically:</para>

<itemizedlist>
<listitem><para><emphasis role="bold">matches()</emphasis>: returns a boolean value
indicating whether a particular string matches a regular expression. </para></listitem>
<listitem><para><emphasis role="bold">replace()</emphasis>: replaces those substrings
within a given string that match a regular expression, with a replacement string. </para></listitem>
<listitem><para><emphasis role="bold">tokenize()</emphasis>: breaks a string into a sequence
of substrings, based on finding delimiters or separators that match a given regular expression.</para></listitem>
</itemizedlist>
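
<para>A rough illustration of the three functions follows. It is written as a stylesheet with a named
entry template (<code>main</code>, an assumed name) so that it needs no source document, and the sample
strings are invented for the purpose:</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- matches(): does the line start with a level number followed by a space? -->
    <xsl:value-of select="matches('2 DATE 11 OCT 1951', '^[0-9]+ ')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- replace(): delete every square bracket, giving 'See 2' -->
    <xsl:value-of select="replace('See [2]', '[\[\]]', '')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- tokenize(): split on a comma and any following whitespace,
         then rejoin the three tokens with '|' to make them visible -->
    <xsl:value-of select="tokenize('red, green,blue', ',\s*')" separator="|"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>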

<para>Conspicuously missing from this list is any function that allows markup to be inserted into a string.
It can be done somewhat laboriously by combining the different functions together, but using these
three functions alone to translate <code>See [2]</code> into <code>See &lt;ref&gt;2&lt;/ref&gt;</code> is
painfully hard work. The reason for the omission is that it's hard to meet the requirement with a simple
function.</para>

<para>The XSLT/XQuery/XPath programming model, despite the fact that it owes a great deal to
functional programming theory, does not support higher-order functions. That is, functions are not
first-class objects and cannot be supplied as arguments to other functions. This greatly limits the
power of what can be achieved with a function library alone. All higher-order capabilities in the
three languages are instead achieved by means of higher-order operators, custom syntax, or XSLT instructions.
An example is the XPath <code>for</code> expression, which in a pure functional language would be expressed
as a higher-order <code>map</code> or <code>apply</code> operator taking a sequence as its first argument
and a function (to be applied to each member of the sequence) as its second argument; another example
is the construct <code>SEQ[P]</code> which is essentially a higher-order <code>filter</code> function
that takes a sequence as its first argument and a predicate as its second.</para>
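
<para>To make the analogy concrete, here is a small sketch of the two constructs. The variable
names, the entry template <code>main</code>, and the sample values are invented for illustration:</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:variable name="words" as="xs:string*" select="('see', 'ref', 'note')"/>

  <!-- The 'for' expression plays the role of map:
       upper-case() is applied to each item in turn -->
  <xsl:variable name="mapped" as="xs:string*"
                select="for $w in $words return upper-case($w)"/>

  <!-- SEQ[P] plays the role of filter:
       only items that satisfy the predicate are kept -->
  <xsl:variable name="filtered" as="xs:string*"
                select="$words[string-length(.) gt 3]"/>

  <!-- Sequences in attribute value templates are space-separated -->
  <xsl:template name="main">
    <result mapped="{$mapped}" filtered="{$filtered}"/>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>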

<para>So the XSLT solution to this problem is an instruction, <code>xsl:analyze-string</code>, that
logically takes four arguments: the string to be analyzed, a regex, an instruction to
be executed to process substrings that match the regular expression, and an instruction to be
executed to process substrings that don't match. The earlier example that turns 
<code>See [2]</code> into <code>See &lt;ref&gt;2&lt;/ref&gt;</code> can then be coded as follows:</para>

<programlisting><![CDATA[<xsl:analyze-string select="$input" regex="\[.*?\]">
  <xsl:matching-substring>
    <ref><xsl:value-of select="translate(.,'[]', '')"/></ref>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="."/>
  </xsl:non-matching-substring>
</xsl:analyze-string>]]></programlisting>

<para>Those who are comfortable with regular expressions will have little difficulty following
what <code>regex="\[.*?\]"</code> does: <code>\[</code> matches an opening square bracket, <code>.*</code> matches any sequence of 
characters, the <code>?</code> is a modifier indicating that the <code>.*</code> should match the shortest possible sequence
of characters consistent with the regex as a whole succeeding, and the <code>\]</code> matches a closing square
bracket.</para>

<para>The semantics of <code>xsl:analyze-string</code> are that the input string is scanned from left
to right looking for substrings that match the regex. Substrings that don't match the regex are passed
(as the context item, ".") to the <code>xsl:non-matching-substring</code> instruction, which in this case copies
them unchanged, while substrings that do match the regex are passed to <code>xsl:matching-substring</code>,
which in this example wraps the substring in a <code>ref</code> element, using the (XSLT 1.0) translate()
function to drop the delimiting square brackets. (Regex devotees will find a different way of doing this,
but the old translate() function suits me fine.)</para>
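
<para>One such alternative, sketched here rather than taken from any published stylesheet, is to put the
bracketed text in a capturing group and retrieve it with the XSLT 2.0 <code>regex-group()</code> function.
The wrapper element <code>para</code>, the entry template <code>main</code>, and the default value given
to <code>$input</code> are assumptions for illustration:</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:param name="input" as="xs:string" select="'See [2]'"/>

  <xsl:template name="main">
    <para>
      <!-- Group 1 captures the text between the brackets, not the brackets themselves -->
      <xsl:analyze-string select="$input" regex="\[(.*?)\]">
        <xsl:matching-substring>
          <ref><xsl:value-of select="regex-group(1)"/></ref>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </para>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>

<para>Because the <code>regex</code> attribute is an attribute value template, any curly braces needed in
a regular expression (for example in a quantifier) have to be doubled.</para>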

<para>There is no equivalent facility to <code>xsl:analyze-string</code> in XQuery. In the latest
release (version 8.1) of Saxon I have introduced an extension to support higher-order functions, and
have used this to provide an extension function <code>saxon:analyze-string</code> 
<xref linkend="saxon81"/> that takes as its
arguments the string to be processed, the regex, and two functions to be applied to the matching and
non-matching substrings respectively. It's not quite as convenient to use as the XSLT 2.0 construct, but
it demonstrates that if higher-order functions were available in the language, there would be a lot
less need for custom syntax to solve such problems.</para>
</section>

<section><title>Grouping facilities</title>

<para>Grouping problems probably form the largest category of tricky-to-solve problems faced by
XSLT 1.0 users. I classify any problem as a grouping problem if it requires the addition of an extra
layer of hierarchy in the result tree that is not present in the source tree. Grouping problems
fall essentially into two categories: those that group elements having matching data values, and those
that group elements based on their position in a sequence (for example, a <code>heading</code> element
followed by all the <code>para</code> elements up to the next <code>heading</code>).</para>

<para>XSLT 1.0 offers no inbuilt support for solving grouping problems, and neither does XQuery 1.0.
The standard solution for value-based grouping in XSLT 1.0 is a technique using keys, which was 
invented by Steve Muench of
Oracle and is therefore known as Muenchian grouping: its best description is that by Jeni Tennison at 
<xref linkend="jeni-grouping"/>.
(Steve never published it himself: he first described it in a personal email to me, and I announced
his discovery to the world. I am very pleased that he got the credit he deserved, which is unusual
in our industry.) For positional grouping, a number of techniques are possible, generally 
involving recursive processing using the following-sibling axis. (Unfortunately neither keys nor the
following-sibling axis are available in XQuery, so XQuery users are going to struggle with this one.)</para>

<para>XSLT 2.0 offers a new instruction, <code>xsl:for-each-group</code>, to perform grouping. It provides
four ways to define the grouping criterion: simple value-based grouping (the most common requirement) 
can be achieved by defining an expression to compute the grouping key, while the other 
three options define various kinds of 
positional grouping criteria. The body of the <code>xsl:for-each-group</code> instruction is then
executed once for each group of nodes identified.</para>

<para>To take a simple example, the following code takes a flat list of <code>author</code> elements,
and groups them so that authors with the same affiliation appear as children of an <code>affiliation</code>
element:</para>

<programlisting><![CDATA[<xsl:for-each-group select="author" group-by="affiliation">
  <affiliation name="{current-grouping-key()}">
    <xsl:copy-of select="current-group()"/>
  </affiliation>
</xsl:for-each-group>]]></programlisting>
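
<para>For the positional case mentioned earlier (a <code>heading</code> followed by its <code>para</code>
elements), a sketch using <code>group-starting-with</code> might look like this. The container element
<code>body</code> and the result elements <code>section</code> and <code>title</code> are hypothetical names
chosen for the illustration:</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="body">
    <xsl:copy>
      <!-- Each heading child starts a new group;
           the elements that follow it belong to that group -->
      <xsl:for-each-group select="*" group-starting-with="heading">
        <section>
          <!-- Inside xsl:for-each-group the context item is
               the first item of the current group -->
          <title><xsl:value-of select="."/></title>
          <xsl:copy-of select="current-group()[position() gt 1]"/>
        </section>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>

<para>Any elements that precede the first <code>heading</code> form a group of their own, and this sketch
would wrongly treat the first of them as a title, so a production stylesheet would need to handle that
leading group separately.</para>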

<para>What is the relevance of this to up-conversion, the subject of this paper? The answer is that
up-conversion involves detection of implicit structure, and replacement of the implicit structure
by explicit markup. This is exactly what grouping facilities are doing. This time, the implicit structure
is not found by parsing the text, but by looking for patterns in the existing markup. This will become
very clear in my case study, presented in the second half of this paper.</para>

<para>Like <code>xsl:analyze-string</code>, the <code>xsl:for-each-group</code> instruction is essentially
syntactic sugar for a higher-order function. This time you can think of it (specifically the variant for
value-based grouping) as a function whose arguments are the sequence to be grouped, a function to 
calculate the grouping key, and a function to be evaluated once for each group of items in the input
sequence. So that XQuery users can take advantage of the grouping facilities in Saxon, I have again
provided a higher-order extension function in Saxon 8.1 that provides this capability: its name is
<code>saxon:for-each-group()</code> <xref linkend="saxon81"/>. As with <code>analyze-string</code>, it is slightly clumsier
to use than the custom syntax provided in XSLT 2.0, but again shows how much more power there would be
in the language if higher-order functions were a standard feature.</para>
</section>
<section><title>Schema-aware processing</title>

<para>The most radical difference between XSLT 2.0 and XSLT 1.0 is that the language has become
strongly typed, with a type system based on XML Schema. This has been done in such a way that
untyped (schemaless) processing is still possible as a fallback. There are many reasons this 
change has taken place, and much debate about the desirability of making such a radical change,
especially in view of the fact that XML Schema is widely criticized both for its complexity and
for the limitations in its capability. I would like to concentrate here, however, on its impact
for writing up-conversion applications.</para>

<para>Since up-conversion often starts with an input file that is not XML, it is unlikely that an XML
Schema will exist to describe its structure. Fortunately this is not a problem: XSLT is still perfectly
happy to work with untyped, schemaless data.</para>

<para>I have often found that it is best to structure an up-conversion as a sequence of two (maybe more)
transformations. The first transformation takes the raw input data in whatever legacy format it arrives
in, and translates it to an XML representation that is as close to the original structure as possible,
consistent with it being XML. The second transformation takes this raw XML and translates it to the
desired target XML vocabulary.</para>

<para>The target vocabulary typically represents XML that is designed to have significant 
visibility: it may be long-lived, widely-shared, or both. Therefore, it is very likely that there
will exist an XML Schema for this vocabulary. The schema-aware capabilities of XSLT that are 
relevant to up-conversion therefore tend to be those that are concerned with validating the result
tree, rather than those concerned with processing the source. In the case study I will show how
this validation assisted with the development process for creating correct XSLT transformations.
The case study in this paper is an artificial one, constructed largely for pedagogic purposes,
but I have had the same experiences in a real project involving the capture of human resources
data from Excel spreadsheets for transfer into an XML database.</para>
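
<para>As a hedged sketch of what result-tree validation can look like, the stylesheet below imports a target
schema and asks for the principal result to be validated strictly. It requires a schema-aware XSLT 2.0
processor. The namespace <code>http://example.org/target</code>, the schema location <code>target.xsd</code>,
and the element name <code>t:people</code> are placeholders, not part of the case study:</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:t="http://example.org/target">

  <!-- Placeholder schema describing the target vocabulary -->
  <xsl:import-schema namespace="http://example.org/target"
                     schema-location="target.xsd"/>

  <xsl:template match="/">
    <!-- validation="strict" requires the result to be valid against the
         imported schema; an invalid result is reported as an error -->
    <xsl:result-document validation="strict">
      <t:people>
        <!-- Templates that build the real content are omitted from this sketch -->
        <xsl:apply-templates/>
      </t:people>
    </xsl:result-document>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>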

</section>
</section>

<section>
<title>An up-conversion case study: GEDCOM</title>

<para>In the second part of this paper we will look at how the constructs introduced in the
previous section are used in a practical example of an up-conversion exercise.</para>

<section><title>Description of the Problem</title>

<para>Genealogical data is interesting for a number of reasons. Genealogy is one of the most
popular applications of the web for millions of people, and its success relies on the ability to
exchange data between different application packages. The data itself is quite complex, for two
reasons: the variety of information that people want to record, and the need to capture uncertain
information and conflicting versions of events. For many years genealogical data has been exchanged
using a format called GEDCOM <xref linkend="gedcom55"/>, 
devised by the Church of Jesus Christ of Latter-Day Saints (the Mormons).
GEDCOM 5.5 uses a hierarchic record format rather in the style of a COBOL data definition, 
typified by the following entry:</para>

<programlisting><![CDATA[0 @I53@ INDI
1 NAME Michael Howard /KAY/
1 SEX M
1 BIRT
2 DATE 11 OCT 1951
2 PLAC Hannover, Germany
3 MAP
4 LATI N52
4 LONG E9
1 OCCU Software Designer
2 DATE FROM 1975 TO 2004
1 EDUC Postgraduate
2 DATE FROM 1969 TO 1975
2 PLAC Cambridge, England
3 MAP
4 LATI N52
4 LONG E0
2 NOTE PhD in Computer Science
1 FAMS @F233@
1 FAMC @F221@]]></programlisting>

<para>The <code>@I53@</code> field is a record identifier, and the values <code>@F233@</code> and
<code>@F221@</code> are pointers to other records (specifically, the record describing the family 
in which this individual is a parent, and the record describing the family in which this individual
is a child).</para> 

<para>This can of course be directly translated to an XML syntax, such as this:</para>

<programlisting><![CDATA[<INDI ID="I53">
  <NAME>Michael Howard /KAY/</NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>11 OCT 1951</DATE>
    <PLAC>Hannover, Germany
      <MAP>
        <LATI>N52</LATI>
        <LONG>E9</LONG>
      </MAP>
    </PLAC>
  </BIRT>
  <OCCU>Software Designer
    <DATE>FROM 1975 TO 2004</DATE>
  </OCCU>
  <EDUC>Postgraduate
    <DATE>FROM 1969 TO 1975</DATE>
    <PLAC>Cambridge, England
      <MAP>
        <LATI>N52</LATI>
        <LONG>E0</LONG>
      </MAP>
    </PLAC>
    <NOTE>PhD in Computer Science</NOTE>
  </EDUC>
  <FAMS REF="F233"/>
  <FAMC REF="F221"/>
</INDI>]]></programlisting>

<para>The first stage of our up-conversion application will be to convert the data into this form. After
that we will see how to convert it further to the actual target XML vocabulary defined by the
proposed GEDCOM-XML standard.</para>

</section>

<section><title>Stage One: Conversion to Raw XML</title>

<para>In my book <emphasis role="italic">XSLT Programmer's Reference</emphasis> (including the latest
edition for XSLT 2.0) I describe how to perform this step by writing a GEDCOM parser in Java. The fact 
is, however, that it can be coded entirely in XSLT 2.0, and that the XSLT 2.0 code is actually shorter
than the Java implementation. Let's see what it looks like.</para>

<para>First we have to read the input file, which we can do like this:</para>

<programlisting><![CDATA[<xsl:param name="input" as="xs:string" required="yes"/> 

<xsl:variable name="input-text" 
              as="xs:string" 
              select="unparsed-text($input, 'iso-8859-1')"/>]]></programlisting>
                     
<para>(I've actually cheated here. GEDCOM requires files to be encoded in a character set
called ANSEL, otherwise known as ANSI Z39.47-1985, which is used for almost no other purpose. If ANSEL
were a mainstream character encoding, it could be specified in the second argument of the
<code>unparsed-text()</code> function call. In practice, however, it is rather unlikely that
any XSLT 2.0 processor would support this encoding natively. Therefore, the conversion from
ANSEL to a mainstream character encoding will still have to be done in a pre-processing phase.)</para>

<para>The next stage is to split the input into lines, which can be done using the XPath 2.0
<code>tokenize()</code> function. Since the <code>unparsed-text()</code> function does not
normalize line endings (this might yet change) the regular expression for matching the separator
between tokens accepts both UNIX and Windows line endings. The result is a sequence of strings, one
for each line of the input file:</para>

<programlisting><![CDATA[<xsl:variable name="lines" 
              as="xs:string*" 
              select="tokenize($input-text, '\r?\n')"/>]]></programlisting>
              
<para>Now we need to parse the individual lines. Each line in a GEDCOM file has up to five
fields: a level number, an identifier, a tag, a cross-reference, and a value. We will create
an XML <code>line</code> element representing the contents of the line, using attributes to represent each of 
these five components:</para>

<programlisting><![CDATA[<xsl:variable name="parsed-lines" 
              as="element(line)*">
  <xsl:for-each select="$lines">
    <xsl:analyze-string select="." flags="x"
                        regex="^([0-9]+)\s*
                              (@([A-Za-z0-9]+)@)?\s*
                              ([A-Za-z]*)?\s*
                              (@([A-Za-z0-9]+)@)?
                              (.*)$"> 
      <xsl:matching-substring>
        <line level="{regex-group(1)}"
              ID="{regex-group(3)}"
              tag="{regex-group(4)}"
              REF="{regex-group(6)}"
              text="{regex-group(7)}"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:message>
          Non-matching line "<xsl:value-of select="."/>"
        </xsl:message>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:for-each>
</xsl:variable>]]></programlisting>

<para>Note first the <code>as</code> attribute on the <code>xsl:variable</code> declaration.
I have consistently been declaring the types of my variables: this helps to pick up programming
errors and it documents the stylesheet for the reader. I can do this even with a non-schema-aware
stylesheet: the form <code>element(line)*</code> indicates that the variable holds a sequence
of elements whose name is <code>line</code>. I could further constrain them to conform to a
<code>line</code> element declaration in an XML schema by writing <code>schema-element(line)*</code>,
but I've chosen not to do that here, because it's too much effort to create a schema to describe
this transient data structure.</para>

<para>The actual content of the elements is constructed by analyzing the text of the input GEDCOM
line using a regular expression. The attribute <code>flags="x"</code> allows the regex to be split
into multiple lines for readability. The five lines of the regex correspond to the five fields that
may be present. I describe this usage of <code>xsl:analyze-string</code> as a "single-match" usage,
because the idea is that the regular expression matches the entire input string exactly once, and the
<code>xsl:non-matching-substring</code> instruction is used only to catch errors. Within the
<code>xsl:matching-substring</code> instruction, the content of the line is picked apart using the
<code>regex-group()</code> function, which returns the part of the matching substring that matched
the n'th parenthesized subexpression within the regex. If the relevant part of the regex wasn't matched
(for example, if the optional identifier was absent) then this returns a zero-length string, and our
XSLT code then creates a zero-length attribute.</para>
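
<para>As a worked illustration (derived by applying the regular expression above by hand to three lines
of the sample record, not captured from a processor run), the corresponding <code>line</code> elements
would be expected to look like this, with each source line shown in a comment:</para>

<programlisting><![CDATA[<!-- 0 @I53@ INDI -->
<line level="0" ID="I53" tag="INDI" REF="" text=""/>
<!-- 2 DATE 11 OCT 1951 -->
<line level="2" ID="" tag="DATE" REF="" text="11 OCT 1951"/>
<!-- 1 FAMS @F233@ -->
<line level="1" ID="" tag="FAMS" REF="F233" text=""/>]]></programlisting>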

<para>So we now have a sequence of XML elements each representing one line of the GEDCOM file, 
each containing attributes to represent the contents of the five fields in the input. The next stage
is to convert this flat sequence into a hierarchy, in which level 2 lines (for example) turn into
XML elements that contain the corresponding level 3 lines.</para>

<para>Any problem that involves adding hierarchic levels to the result tree, that were not present
in the source tree, can be regarded as a grouping problem, and it should therefore be no surprise
that we tackle it using the <code>xsl:for-each-group</code> instruction. This time a group consists of
a level N element together with the following elements up to the next one at level N. So this is
a positional grouping rather than a value-based grouping. The option that we use to tackle this is
the <code>group-starting-with</code> attribute, whose value is a match pattern that is used to
recognize the first element in each group.</para>

<para>A single application of <code>xsl:for-each-group</code> creates one extra level in the result
tree. In this example, we have a variable number of levels, so we want to apply the instruction
a variable number of times. First we group the overall sequence of <code>line</code> elements so that
each level 0 line starts a new group. Within this group, we perform a further grouping so that each
level 1 line starts a new group, and so on up to the maximum depth of the hierarchy. As one might expect,
the process is recursive: we write a recursive template that performs the grouping at level N, and that
calls itself to perform the level N+1 grouping. This is what it looks like:</para>

<programlisting><![CDATA[<xsl:template name="process-level">
  <!-- population: the lines still to be arranged; level: the level that starts a new group -->
  <xsl:param name="population" required="yes" as="element()*"/>
  <xsl:param name="level" required="yes" as="xs:integer"/>
  <xsl:for-each-group select="$population" 
       group-starting-with="*[xs:integer(@level) eq $level]">
    <!-- the context item is the first line of the group; its tag names the new element -->
    <xsl:element name="{@tag}">
      <!-- copy the ID and REF attributes only if they are non-empty -->
      <xsl:copy-of select="@ID[string(.)], @REF[string(.)]"/>
      <xsl:value-of select="normalize-space(@text)"/>
      <!-- the rest of the group becomes the content, grouped one level deeper -->
      <xsl:call-template name="process-level">
        <xsl:with-param name="population" 
                        select="current-group()[position() != 1]"/>
        <xsl:with-param name="level" 
                        select="$level + 1"/>
      </xsl:call-template>
    </xsl:element>
  </xsl:for-each-group>
</xsl:template>]]></programlisting>                  

<para>When this is called to process all the <code>line</code> elements with
the <code>$level</code> parameter set to zero, it forms one group for each
line having the attribute <code>level="0"</code>, containing that line and
all the following lines up to the next one with <code>level="0"</code>. It
then processes each of these groups by creating an element to represent the
level 0 line (the name of this element is taken from the GEDCOM tag, and its
<code>ID</code> and <code>REF</code> attributes are copied unless they are empty), and constructs the
content of this new element by means of a recursive call, processing all elements
in the group except the first, and looking this time for level 1 lines as the
ones that start a new group. The process continues until there are no lines at the
next level (the <code>for-each-group</code> instruction does nothing if the
population to be grouped is empty).</para>
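
<para>To make the recursion concrete, suppose that the population consists of the following
<code>line</code> elements. These lines are invented purely for illustration; they carry the five
attributes that the template reads.</para>

<programlisting><![CDATA[<line level="0" ID="I1" tag="INDI" REF=""   text=""/>
<line level="1" ID=""   tag="NAME" REF=""   text="John /Smith/"/>
<line level="1" ID=""   tag="BIRT" REF=""   text=""/>
<line level="2" ID=""   tag="DATE" REF=""   text="1 JAN 1900"/>
<line level="1" ID=""   tag="FAMS" REF="F1" text=""/>
<line level="0" ID="F1" tag="FAM"  REF=""   text=""/>
<line level="1" ID=""   tag="HUSB" REF="I1" text=""/>]]></programlisting>

<para>Tracing the template by hand with <code>$level</code> set to zero gives the following result
(indentation added here for readability):</para>

<programlisting><![CDATA[<INDI ID="I1">
  <NAME>John /Smith/</NAME>
  <BIRT>
    <DATE>1 JAN 1900</DATE>
  </BIRT>
  <FAMS REF="F1"/>
</INDI>
<FAM ID="F1">
  <HUSB REF="I1"/>
</FAM>]]></programlisting>

<para>The first call forms two groups, one starting at each level 0 line. Within the <code>INDI</code>
group, the recursive call at level 1 forms three groups (<code>NAME</code>, <code>BIRT</code> with its
<code>DATE</code>, and <code>FAMS</code>), and only the <code>BIRT</code> group has anything left to
process at level 2. Empty <code>ID</code> and <code>REF</code> attributes are filtered out by the
predicates in the <code>xsl:copy-of</code> instruction, and an empty <code>text</code> attribute
contributes no text to the result.</para>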

<para>The remaining code in the stylesheet simply invokes this recursive template,
passing it all the lines and starting the grouping at level 0:</para>

<programlisting><![CDATA[<xsl:template name="main">
  <xsl:call-template name="process-level">
    <xsl:with-param name="population" 
                    select="$parsed-lines/ged/line"/>
    <xsl:with-param name="level" 
                    select="0"/>
  </xsl:call-template>
</xsl:template>]]></programlisting>

<para>This <code>main</code> template represents the entry point to the stylesheet.
There is no <code>match="/"</code> template rule, because there is no source XML document
with a root node to be matched; instead, XSLT 2.0 allows a transformation to be 
invoked by specifying the name of a named template where execution is to start.
I use the name <code>main</code> as a matter of convention.</para> 

<para>We have now converted the GEDCOM data to XML. The next step is to convert it to
the actual XML vocabulary that the target application requires.</para>

</section>

<section><title>Stage Two: Conversion to the Target Schema</title>

<para>Like many up-conversion problems, the GEDCOM problem is best solved in two stages:
the first stage is essentially a syntactic transformation of the raw data into XML, and the
second stage is a semantic transformation to a different data model.</para>

<para>At the same time as moving to XML, the GEDCOM designers decided it was time to fix
some long-standing deficiencies in the data model. The draft GEDCOM 6.0 specification 
<xref linkend="gedcom60"/>
therefore not only moves from ANSEL character encoding to Unicode, and from COBOL-like
level numbers to nested XML tags, it also changes the structure of the data. Events, for example,
are now primary objects in their own right, rather than being always subsidiary to an individual
or family. This reflects the fact that there is often uncertainty as to whether two events involve
the same individual (rather than two distinct individuals having the same name), and it also makes
it easier to record all the individuals associated with an event - for example, the witnesses
at a marriage, or the godparents at a christening.</para>

<para>The transformation of GEDCOM 5.5 files to "raw XML", as described in the previous section,
is therefore followed by a second transformation, this time to XML that conforms to the target
schema defined by GEDCOM 6.0. (I'm taking it as read here that GEDCOM 6.0 exists and is stable and
is worth adopting as a target. This idealizes the actual state of affairs, but the debate isn't relevant to
this paper.)</para>

<para>Multi-phase transformations can be done in either of two ways: using a single stylesheet 
(typically using different modes for the two phases) or using one stylesheet for each phase. 
I usually find it easier to develop them using multiple stylesheets, and then to integrate them
later into a production application.</para>
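
<para>For readers who have not seen the single-stylesheet approach, the following is a minimal sketch of
how the two phases might be combined using a temporary tree and a mode. It is not taken from the GEDCOM
stylesheets: a literal fragment stands in for the output of the first phase, the mode name
<code>phase2</code> is arbitrary, and the element name <code>IndivName</code> is used for illustration
only and has not been checked against the target schema.</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template name="main">
    <!-- Phase 1: hold the raw XML in a temporary tree. In a real stylesheet
         this variable would contain the call on process-level. -->
    <xsl:variable name="raw">
      <INDI ID="I1"><NAME>John /Smith/</NAME></INDI>
    </xsl:variable>
    <!-- Phase 2: transform the temporary tree using rules in their own mode -->
    <xsl:apply-templates select="$raw/*" mode="phase2"/>
  </xsl:template>

  <!-- Phase 2 rules are kept apart from any phase 1 rules by the mode -->
  <xsl:template match="INDI" mode="phase2">
    <IndividualRec Id="{@ID}">
      <xsl:apply-templates mode="phase2"/>
    </IndividualRec>
  </xsl:template>

  <!-- Hypothetical rule: the output element name is an assumption -->
  <xsl:template match="NAME" mode="phase2">
    <IndivName><xsl:value-of select="."/></IndivName>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>

<para>Keeping each phase in its own mode means that the rules for one phase can never fire accidentally
during the other, which is much the same separation that two stylesheets give, at the cost of a larger
single module.</para>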

<para>The second transformation is rather more conventional than the first, because it starts with 
XML as its input. I've presented the full stylesheet in <emphasis role="italic">XSLT 2.0 Programmer's
Reference</emphasis>, and I won't repeat it here in full. What I would like to draw out, however, 
is the impact of making this stylesheet schema-aware.</para>

<para>The first stylesheet, presented in the previous section, didn't use an XML schema. The input
isn't XML, so it clearly has no schema; and the output uses a local transient XML vocabulary where
the effort of writing a schema probably isn't worthwhile. However, for the second stylesheet, the aim
is to produce output that conforms to a recognized standard XML vocabulary, for which an XML Schema
exists, and we clearly want to have as much confidence as we can that the stylesheet output will always
conform to this target schema.</para>

<para>With XSLT 1.0, the way you achieve this is to run your stylesheet against as many test cases as
you can, and validate the output of each test case against the target schema. If validation errors are reported,
you then have to debug the stylesheet to find out why it produced incorrect output in this particular case.</para>

<para>It would be far better if one could determine statically, purely from examination of the stylesheet,
that its output will be correct. In practice this is unlikely to be fully achievable, because of the
highly dynamic nature of XSLT template rules. However, there are many errors that could in principle
be detected statically, and each error that is found this way makes a significant contribution to easing
the testing and debugging burden. For example, here is an extract of the second-phase GEDCOM stylesheet:</para>

<programlisting><![CDATA[   
  <xsl:result-document validation="strict">
     <GEDCOM>
        <HeaderRec>
          <FileCreation Date="{format-date(current-date(), 
                               '[D1] [MN,*-3] [Y0001]')}"/>
          <Submitter>
             <Link Target="ContactRec" Ref="Contact-Submitter"/>
          </Submitter>
        </HeaderRec>
        <xsl:call-template name="families"/>
        <xsl:call-template name="individuals"/>
        <xsl:call-template name="events"/>
        <ContactRec Id="Contact-Submitter">
          <Name><xsl:value-of select="$submitter"/></Name>
        </ContactRec>
     </GEDCOM>
  </xsl:result-document>]]></programlisting>         

<para>One can see many potential errors that could be detected statically by the stylesheet 
compiler. It can check that there is a schema definition of the <code>GEDCOM</code> element, and
that <code>HeaderRec</code> and <code>ContactRec</code> are permitted respectively as the 
first and last child elements of the <code>GEDCOM</code> element. It can check similarly
that the elements within the <code>HeaderRec</code> are allowed to appear where they do, that they
are allowed to have the appropriate attributes, and that none of these elements have required
attributes which the stylesheet does not generate. In some cases the compiler can also check that the
textual content of elements and attributes is appropriate to their type. The analysis can extend
beyond the fragment shown here to the three named templates invoked by this fragment; for example
if the call on the <code>individuals</code> template preceded that on the <code>families</code>
template, then the compiler could deduce that the stylesheet was outputting <code>IndividualRec</code>
elements ahead of <code>FamilyRec</code> elements, which the schema does not allow.</para>

<para>As programmers, we are all familiar with the fact that errors detected at compile-time 
are much quicker to find and to fix than errors detected at run-time. This is as true for XSLT
as for any other programming language.</para>

<para>Currently the only schema-aware XSLT processor available is my own Saxon product, and the
current release (8.0) does not yet do the kind of static checking described above. Even run-time
checking, however, can pay substantial dividends. For example, one error that I made during development was
to write an attribute of a literal result element as <code>id="@ID"</code> instead of <code>id="{@ID}"</code>.
Ordinarily, this would cause the result document to contain the attribute value <code>id="@ID"</code>.
When the programmer gets round to validating the output (a stage which is often omitted during development
and testing) this would reveal an error, because the <code>id</code> attribute is declared as having
type <code>xs:ID</code>, and an <code>@</code> character is not allowed in values of this type.
Running with a schema-aware processor, this error was reported as soon as the offending code in the
stylesheet was executed, with the incorrect line in the stylesheet being accurately pinpointed.</para>
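
<para>The mistake is easy to reproduce in miniature. The sketch below is not part of the GEDCOM
stylesheet: it uses a hypothetical one-element schema, written inline, in place of the real target
schema, and the element name <code>person</code> and the template names <code>wrong</code> and
<code>right</code> are invented. It assumes a schema-aware processor that accepts an inline schema
document within <code>xsl:import-schema</code>.</para>

<programlisting><![CDATA[<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Hypothetical schema standing in for the real target schema -->
  <xsl:import-schema>
    <xs:schema>
      <xs:element name="person">
        <xs:complexType>
          <xs:attribute name="id" type="xs:ID" use="required"/>
        </xs:complexType>
      </xs:element>
    </xs:schema>
  </xsl:import-schema>

  <!-- An invented source element, used as the context item below -->
  <xsl:variable name="source" as="element()">
    <INDI ID="I1"/>
  </xsl:variable>

  <!-- Mistake: without curly braces the value is the literal string "@ID",
       which is not a valid xs:ID, so validation of this element fails -->
  <xsl:template name="wrong">
    <xsl:for-each select="$source">
      <person id="@ID" xsl:validation="strict"/>
    </xsl:for-each>
  </xsl:template>

  <!-- Intended: the attribute value template copies the value of @ID -->
  <xsl:template name="right">
    <xsl:for-each select="$source">
      <person id="{@ID}" xsl:validation="strict"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>]]></programlisting>

<para>Invoking the stylesheet at the <code>wrong</code> template should produce a validation error
that points at the literal result element, whereas invoking it at <code>right</code> should produce a
valid <code>person</code> element with <code>id="I1"</code>.</para>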

<para>I actually found that while developing this and other similar stylesheets, the number of errors
detected by validation of result trees was so large that it became a little frustrating. Sometimes one
actually wants to develop a stylesheet "top-down", getting the broad structure of the output right first,
and focusing on the detail later. As a response to this experience, Saxon 8.1 allows multiple validation
errors in the output to be reported in a single run, and it allows you to see the (invalid) result tree
that was generated, along with comments inserted into the XML showing where it is invalid and which
stylesheet instructions need to be changed to fix the errors. This provides another of the benefits
normally associated with compile-time checking: the ability to report many errors in a single run.</para>

<para>Like other new features in XSLT 2.0, such as <code>xsl:analyze-string</code> and 
<code>xsl:for-each-group</code>, the facility to validate result documents on-the-fly is 
useful for a wide range of applications, of which up-conversion applications are just one example.
But taken together, these features make a dramatic difference to the ease of developing
up-conversion applications when compared with XSLT 1.0.</para>  
       
</section>
</section>
<section><title>Conclusions</title>

<para>The first part of this paper described four specific features of XSLT 2.0 that make it
highly suitable for writing up-conversion applications, namely:</para>

<itemizedlist>
<listitem><para>The unparsed-text() function</para></listitem>
<listitem><para>Regular expression processing</para></listitem>
<listitem><para>Grouping facilities</para></listitem>
<listitem><para>Schema-aware processing</para></listitem>
</itemizedlist>

<para>The second part of the paper showed how these features can be used in a practical
up-conversion exercise, the translation of GEDCOM 5.5 genealogical data to the proposed
GEDCOM 6.0 XML vocabulary.</para>

<para>XSLT 1.0 has been widely deployed to achieve both XML-to-XML and XML-to-text 
transformations. The conclusion of this paper is that XSLT 2.0 is also highly suited to
a wide range of text-to-XML applications, thus greatly increasing the scope of applicability
of the language.</para>
</section>

<bibliography>
<bibliomixed id="xslt20"><abbrev>XSLT 2.0</abbrev><citetitle> <ulink url="http://www.w3.org/TR/xslt20/">XSL Transformations (XSLT) Version 2.0</ulink>.
W3C Working Draft 12 November 2003.</citetitle>
</bibliomixed>

<bibliomixed id="xpath20"><abbrev>XPath 2.0</abbrev><citetitle> <ulink url="http://www.w3.org/TR/xpath20/">XML Path Language (XPath) 2.0</ulink>.
W3C Working Draft 23 July 2004.</citetitle>
</bibliomixed>

<bibliomixed id="xslt20pr"><abbrev>Kay, 2004a</abbrev>
<citetitle>XSLT 2.0 Programmer's Reference, 3rd edition. </citetitle> Michael Kay. 
	<publishername>Wiley</publishername>, <pubdate>2004</pubdate>
</bibliomixed>

<bibliomixed id="xpath20pr"><abbrev>Kay, 2004b</abbrev>
<citetitle>XPath 2.0 Programmer's Reference. </citetitle> Michael Kay. 
	<publishername>Wiley</publishername>, <pubdate>2004</pubdate>
</bibliomixed>

<bibliomixed id="jeni-grouping"><abbrev>Tennison</abbrev><citetitle> <ulink url="http://www.jenitennison.com/xslt/grouping">Jeni's XSLT Pages: Grouping</ulink>. </citetitle>
Jeni Tennison.</bibliomixed>

<bibliomixed id="gedcom55"><abbrev>LDS, 1996 </abbrev><citetitle> <ulink url="http://homepages.rootsweb.com/~pmcbride/gedcom/55gctoc.htm">The GEDCOM Standard Release 5.5</ulink>. </citetitle>
Family History Department, The Church of Jesus Christ of Latter-day Saints.
<pubdate>January 2, 1996</pubdate></bibliomixed>

<bibliomixed id="gedcom60"><abbrev>LDS, 2002 </abbrev><citetitle> <ulink url="http://www.familysearch.org/GEDCOM/GedXML60.pdf">GEDCOM XML Specification, Release 6.0, Beta Version</ulink>. </citetitle>
Family and Church History Department, The Church of Jesus Christ of Latter-day Saints.
<pubdate>December 6, 2002</pubdate></bibliomixed>

<bibliomixed id="saxon81"><abbrev>Saxonica, 2004</abbrev><citetitle> <ulink url="http://www.saxonica.com/">Saxon 8.1 Documentation</ulink>. </citetitle>
Go to www.saxonica.com, follow links to Documentation, then Extensions, then
Extension Functions.
<publishername>Saxonica Limited</publishername>, <pubdate>to be published</pubdate></bibliomixed>
</bibliography>

</article>

