XML Schema 1.0 implementation

The command line interface com.saxonica.Validate has been completely redesigned, allowing multiple schema documents to be loaded and multiple instance documents to be validated.

This release of Saxon introduces preliminary support for assertions in a schema, based on the current (31 August 2006) draft of XML Schema version 1.1. This allows a complex type to contain an assertion about the content of the corresponding element expressed as an arbitrary XPath 2.0 expression. Please note that this facility in the Working Draft is likely to change, and the Saxon implementation will change accordingly. For further details see Assertions.

The XML Schema specification imposes a rule that when one type R is derived from another type B by restriction, then every element particle ER in the content model of R must be compatible with the corresponding element particle EB in B. One aspect of this is that the identity constraints defined in the declaration of ER (that is, unique, key, and keyref) must be a superset of the constraints defined for EB. The specification doesn't say how to decide whether two constraints are equivalent for this purpose, and Saxon has previously ignored this requirement. At this release a check is introduced which partially implements the rule. Specifically, Saxon will count the number of constraints that are defined, and will report an error if EB has more constraints of any particular kind (unique, key, or keyref) than ER has. If EB has at least one constraint and ER has one or more, then Saxon will output a warning saying that it was unable to check whether the constraints were compatible with each other.
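
The counting rule described above can be sketched in a few lines of Java. This is only an illustration of the rule as stated in these notes, not Saxon's internal code: the class ConstraintCountCheck, the enums Kind and Outcome, and the method names are all hypothetical.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout; this is not Saxon's implementation.
final class ConstraintCountCheck {

    enum Kind { UNIQUE, KEY, KEYREF }

    enum Outcome { OK, WARNING, ERROR }

    // Count how many constraints of each kind a declaration carries.
    static Map<Kind, Integer> countByKind(List<Kind> constraints) {
        Map<Kind, Integer> counts = new EnumMap<>(Kind.class);
        for (Kind k : Kind.values()) {
            counts.put(k, 0);
        }
        for (Kind k : constraints) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // base holds the constraints declared on EB, restricted those on ER.
    static Outcome check(List<Kind> base, List<Kind> restricted) {
        Map<Kind, Integer> eb = countByKind(base);
        Map<Kind, Integer> er = countByKind(restricted);
        for (Kind k : Kind.values()) {
            // Error: EB has more constraints of this kind than ER.
            if (eb.get(k) > er.get(k)) {
                return Outcome.ERROR;
            }
        }
        // The counts are acceptable, but the constraints themselves
        // cannot be compared, so only a warning is possible.
        if (!base.isEmpty() && !restricted.isEmpty()) {
            return Outcome.WARNING;
        }
        return Outcome.OK;
    }

    public static void main(String[] args) {
        System.out.println(check(List.of(Kind.KEY), List.of()));                      // ERROR
        System.out.println(check(List.of(Kind.KEY), List.of(Kind.KEY, Kind.UNIQUE))); // WARNING
        System.out.println(check(List.of(), List.of(Kind.UNIQUE)));                   // OK
    }
}
```

Note that the check looks only at counts per kind: two key constraints with entirely different selectors would pass with a warning, which is exactly the limitation described above.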

It is now possible when requesting validation of an instance to specify the required name of the top-level element in the document being validated. This is possible through the option -top:clarkname on the com.saxonica.Validate command, or via a new property on the AugmentedSource object. The property is also available on the DocumentBuilder in the .NET API and in the new s9api Java API. A validation error occurs if the document being validated has a top-level element with a different name.
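
To make the effect of this option concrete, the following sketch performs the equivalent check by hand using only the JDK's DOM parser. It does not use the Saxon API at all: the class TopLevelNameCheck and its methods are hypothetical, and the example namespace is invented. The required name is given in Clark notation, {uri}local, as with the -top option.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, JDK only; Saxon performs this check internally.
final class TopLevelNameCheck {

    // Build the Clark name "{uri}local", or just "local" for a no-namespace element.
    static String clarkName(Element e) {
        String uri = e.getNamespaceURI();
        String local = e.getLocalName();
        return (uri == null || uri.isEmpty()) ? local : "{" + uri + "}" + local;
    }

    // True if the document's top-level element has the required Clark name.
    static boolean hasTopLevelElement(String xml, String requiredClarkName) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed for getLocalName() and getNamespaceURI()
            Element root = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return clarkName(root).equals(requiredClarkName);
        } catch (Exception ex) {
            throw new IllegalArgumentException("Cannot parse document", ex);
        }
    }

    public static void main(String[] args) {
        String doc = "<inv:invoice xmlns:inv='http://example.com/invoice'/>";
        System.out.println(hasTopLevelElement(doc, "{http://example.com/invoice}invoice")); // true
        System.out.println(hasTopLevelElement(doc, "invoice"));                             // false
    }
}
```

The second call fails because the comparison covers the namespace URI as well as the local name, not the lexical prefix used in the document.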

I discovered that Saxon allows you to use the types xs:dayTimeDuration and xs:yearMonthDuration in a schema as built-in types. XML Schema 1.0 doesn't recognize these types (though I can't find a rule that says it is absolutely non-conformant to accept them). I have changed the code to give an interoperability warning if they are used. I have also disallowed the use of the type xs:anyAtomicType, which has no defined validation semantics.

The mechanisms for comparing values in the course of schema validation and processing have now been separated completely from the mechanisms used when implementing XPath operators. This means that the semantics of comparison and ordering should now follow the XML Schema specification precisely. Previously some operations were implemented according to the XPath semantics.

A duplicate xsi:schemaLocation or xsi:noNamespaceSchemaLocation attribute is now ignored. Previously it was rejected under the rule that such an attribute cannot appear after the first element in the relevant namespace. Duplicates can arise naturally from XInclude processing, which is why they are now tolerated; the schema specification permits this but does not require it. To be considered duplicates, the declarations must match in both the namespace URI and the absolutized schemaLocation URI.
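
The duplicate test can be sketched as follows, using java.net.URI to absolutize the location against the base URI of the element carrying the attribute. The class SchemaLocationTracker and its method are hypothetical and the URIs are invented; the sketch illustrates the matching rule only, and says nothing about how Saxon stores this information.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper illustrating the matching rule; not Saxon code.
final class SchemaLocationTracker {

    private final Set<String> seen = new HashSet<>();

    // Records the declaration and returns true if an equivalent one was seen before.
    // Use "" as the namespace for xsi:noNamespaceSchemaLocation.
    boolean recordAndCheckDuplicate(String namespaceUri, String location, String baseUri) {
        // Absolutize the location hint against the base URI, then normalize it.
        URI absolute = URI.create(baseUri).resolve(location).normalize();
        // A space is a safe separator: it cannot occur inside either URI.
        String key = namespaceUri + " " + absolute;
        return !seen.add(key);
    }

    public static void main(String[] args) {
        SchemaLocationTracker tracker = new SchemaLocationTracker();
        String ns = "http://example.com/po";
        // First declaration, in the main document.
        System.out.println(tracker.recordAndCheckDuplicate(
                ns, "schemas/po.xsd", "http://example.com/docs/a.xml"));        // false
        // Same schema, reached by a different relative path from an included document.
        System.out.println(tracker.recordAndCheckDuplicate(
                ns, "../schemas/po.xsd", "http://example.com/docs/parts/b.xml")); // true
        // Same namespace but a different location: not a duplicate.
        System.out.println(tracker.recordAndCheckDuplicate(
                ns, "schemas/po2.xsd", "http://example.com/docs/a.xml"));       // false
    }
}
```

Comparing absolutized URIs matters in the XInclude case, because an included fragment usually has a different base URI from the including document, so the same schema may be named by different relative references.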

Result tree validation

Saxon now does more extensive compile-time checking where an xsl:document or xsl:result-document instruction requests validation of the result tree. This means that validation errors that were previously detected at stylesheet execution time are now sometimes detected at compile time. Previously these checks were only done when validation was requested on an element-constructor instruction.

Expansion of attribute and element defaults

When the input or output of a query or transformation is validated, it is now possible to request that fixed and default element and attribute values defined in the schema should not be expanded. This is done using the option -expand:off on the command line, or equivalent options in the TransformerFactory and Configuration APIs.

The same option also applies to DTD-based attribute default expansion, provided that the XML parser reports sufficient information to the application.

Serializing a Schema Component Model

It is now possible to export the contents of the schema cache held in the Configuration object to an XML file (with the conventional extension .scm for Schema Component Model). The contents can subsequently be reloaded. This is faster than reloading the original source schema documents, because it allows most of the validation to be skipped, along with the sometimes expensive operation of constructing and determinizing finite state machines. This facility is intended to be used in conjunction with XQuery Java code generation: it allows the schemas that were imported by a compiled query to be saved on disk alongside the compiled query itself, for rapid reloading at run time.

The serialized SCM file is also designed to be easy for applications to process. The representation of schema components is more uniform than in source .xsd documents (there are fewer defaults, and fewer alternative ways of expressing the same information). This makes it a suitable representation for applications that need to process or analyze schema information, as an alternative to using the Java API.

This has proved useful within Saxon itself. Saxon's schema analyzer was previously written using ad-hoc parsing techniques to validate schemas against the rules defined in the schema-for-schemas. The addition of assert and report elements threatened to make this even more complex. So a simple XSLT transformation was written to take the finite state machines in the SCM version of the schema-for-schemas and generate Java code from them. This means that Saxon's schema validation logic is now derived directly from the published schema-for-schemas, while retaining the efficiency of hard-coded Java.

Changes to the Schema Component Model API

Changes have been made to the API for the schema component model (package com.saxonica.schema) to align it more closely with the abstract model defined in the W3C specifications.

All named components now consistently expose methods getName() and getTargetNamespace() to provide access to the local part of the name and the namespace URI respectively. The wide variety of existing names for these accessors has been retained for the time being in the form of deprecated methods. The new names were chosen because they correspond to the names used for these properties in the W3C schema component model.

The class FacetCollection has disappeared; its functionality has been merged into UserSimpleType.

The class Compositor has been renamed ModelGroup, and its subclasses such as ChoiceCompositor have been renamed accordingly. In the W3C schema model, the compositor (all, choice, sequence) is one of the properties of the ModelGroup. This is now available using the method getCompositorName() on the ModelGroup object.

Particle is now an abstract class rather than an interface, and the previous abstract class AbstractParticle no longer exists. There are three subclasses of Particle, namely ElementParticle, ElementWildcard, and ModelGroupParticle. This means there is now a distinction between the ModelGroupParticle, which represents a reference to a ModelGroup, and the ModelGroup itself. The class ModelGroupDefinition (which represents a named model group) no longer implements Particle; it is now a subclass of ModelGroup.

The class ModelGroupParticle replaces GroupReference; it is no longer necessarily a reference to a (named) ModelGroupDefinition, but now can be a reference to any (named or unnamed) ModelGroup.

ElementWildcard and AttributeWildcard are no longer subclasses of Wildcard; Wildcard is now a helper class to which these two classes delegate. ElementWildcard is now a subclass of Particle. The getTerm() method of ElementWildcard returns the Wildcard object (previously it returned the ElementWildcard object itself).
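
The relationships between these classes can be summarized as a skeleton. The class names and the methods getTerm(), getCompositorName(), and getName() are taken from the notes above; everything else (constructors, fields, return types, the getWildcard() accessor, and the enclosing class SchemaModelSketch) is an assumption made so that the sketch compiles, and should not be read as the actual signatures in com.saxonica.schema.

```java
// Skeleton only: class names come from the release notes, member details are assumed.
final class SchemaModelSketch {

    // Particle is an abstract class with three subclasses.
    abstract static class Particle {
        abstract Object getTerm(); // return type is an assumption
    }

    // Formerly Compositor; the compositor is now a property of the group.
    static class ModelGroup {
        private final String compositorName; // "all", "choice" or "sequence"
        ModelGroup(String compositorName) { this.compositorName = compositorName; }
        String getCompositorName() { return compositorName; }
    }

    // A named model group: a subclass of ModelGroup, no longer a Particle.
    static class ModelGroupDefinition extends ModelGroup {
        private final String name;
        ModelGroupDefinition(String name, String compositorName) {
            super(compositorName);
            this.name = name;
        }
        String getName() { return name; }
    }

    // Replaces GroupReference; may refer to a named or an unnamed ModelGroup.
    static class ModelGroupParticle extends Particle {
        private final ModelGroup group;
        ModelGroupParticle(ModelGroup group) { this.group = group; }
        @Override Object getTerm() { return group; }
    }

    // Helper class to which both wildcard classes delegate.
    static class Wildcard { }

    static class ElementWildcard extends Particle {
        private final Wildcard wildcard = new Wildcard();
        @Override Object getTerm() { return wildcard; } // the Wildcard, not this
    }

    static class AttributeWildcard {
        private final Wildcard wildcard = new Wildcard();
        Wildcard getWildcard() { return wildcard; } // hypothetical accessor
    }

    static class ElementParticle extends Particle {
        private final String elementName; // placeholder for an element declaration
        ElementParticle(String elementName) { this.elementName = elementName; }
        @Override Object getTerm() { return elementName; }
    }

    public static void main(String[] args) {
        ModelGroup unnamed = new ModelGroup("sequence");
        ModelGroup named = new ModelGroupDefinition("address", "choice");
        Particle p1 = new ModelGroupParticle(unnamed);
        Particle p2 = new ModelGroupParticle(named);
        System.out.println(((ModelGroup) p1.getTerm()).getCompositorName()); // sequence
        System.out.println(((ModelGroup) p2.getTerm()).getCompositorName()); // choice
        System.out.println(new ElementWildcard().getTerm() instanceof Wildcard); // true
    }
}
```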

The use of exceptions SchemaException and ValidationException has been made more consistent. A SchemaException indicates that the schema is invalid, and should occur only while the schema is being loaded and validated. A ValidationException indicates that an instance document is invalid against the schema, and should occur only during instance validation. Errors relating to the consistency of a stylesheet or query against a valid schema should result in an XPathException being thrown. An inconsistency in the schema found during instance validation is an internal error, and should result in an IllegalStateException. The one exception is an unresolved reference to a missing schema component, which the schema specification defines not to constitute a schema invalidity; this results in an UnresolvedReferenceException. Because it can occur almost anywhere, UnresolvedReferenceException is an unchecked exception.