Saxon Documentation

Collections

Saxon implements the collection() function by passing the given URI (or null, if the default collection is requested) to a user-provided CollectionURIResolver. This section describes how the standard collection resolver behaves, if no user-written collection resolver is supplied.

The default collection resolver returns the empty sequence as the default collection. The only way of specifying a default collection is to provide your own CollectionURIResolver.

If a collection URI is provided, Saxon attempts to dereference it. What happens next depends on whether the URI identifies a file or a directory.

Using catalog files

If a file is identified, Saxon treats this as a catalog file. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:


<catalog stable="true">
  <doc href="dir/chap1.xml"/>
  <doc href="dir/chap2.xml"/>
  <doc href="dir/chap3.xml"/>
  <doc href="dir/chap4.xml"/>
</catalog>

The stable attribute indicates whether the collection is stable or not. The default value is true. If a collection is stable, then the URIs listed in the doc elements are treated like URIs passed to the doc() function. Each URI is first looked up in the document pool to see if it is already loaded; if it is, then the document node is returned. Otherwise the URI is passed to the registered URIResolver, and the resulting document is added to the document pool. This has two effects. First, two calls on the collection() function passing the same collection URI will return the same nodes each time. Second, these results are consistent with the results of the doc() function: if the document-uri() of a node returned by the collection() function is passed to the doc() function, the original node will be returned.

If stable="false" is specified, however, the URI is dereferenced directly, and the document is not added to the document pool, which means that a subsequent retrieval of the same document will not return the same node.
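
To make the catalog format concrete, the following is a minimal sketch that reads a catalog with the standard JAXP DOM parser and extracts the stable flag and the list of href values. It is not Saxon's own catalog reader: the class name CatalogSketch, the Catalog holder and the read method are hypothetical, and treating any value other than "false" as stable is an assumption made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: this is not Saxon's own catalog reader.
class CatalogSketch {

    // Hypothetical holder for the two things a catalog declares.
    static final class Catalog {
        final boolean stable;
        final List<String> hrefs;

        Catalog(boolean stable, List<String> hrefs) {
            this.stable = stable;
            this.hrefs = hrefs;
        }
    }

    // Parses the catalog text and collects the href of every doc element.
    static Catalog read(String catalogXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(catalogXml)));
            Element root = dom.getDocumentElement();

            // getAttribute returns "" when the attribute is absent, so a
            // missing stable attribute falls through to the default (true).
            // Assumption: only the literal value "false" switches it off.
            boolean stable = !"false".equals(root.getAttribute("stable").trim());

            NodeList docs = root.getElementsByTagName("doc");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < docs.getLength(); i++) {
                hrefs.add(((Element) docs.item(i)).getAttribute("href"));
            }
            return new Catalog(stable, hrefs);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot read catalog", e);
        }
    }

    public static void main(String[] args) {
        Catalog c = read("<catalog>"
                + "<doc href=\"dir/chap1.xml\"/>"
                + "<doc href=\"dir/chap2.xml\"/>"
                + "</catalog>");
        System.out.println("stable=" + c.stable + ", documents=" + c.hrefs);
    }
}
```

The sketch only lists the href values as written; loading each document, and the document pool behaviour described above, are left to Saxon.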

Processing directories

If the URI passed to the collection() function (still assuming a default CollectionURIResolver) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form file:///a/b/c/d?keyword=value;keyword=value;.... The recognized keywords and their values are as follows:

keyword	values	effect
recurse	yes | no (default no)	determines whether subdirectories are searched recursively
strip-space	yes | ignorable | no	determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings.
validation	strip | preserve | lax | strict	determines whether and how schema validation is applied to each document. The default depends on the Configuration settings.
select	file name pattern	determines which files are selected (see below)
on-error	fail | warning | ignore	determines the action to be taken if one of the documents cannot be successfully parsed
parser	Java class name	class name of the Java XMLReader to be used. For example, John Cowan's TagSoup parser may be selected by specifying `parser=org.ccil.cowan.tagsoup.Parser` (this parses arbitrary ill-formed HTML and presents it to Saxon as well-formed XML).
xinclude	yes | no	determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the `Configuration` (or any command line option).
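
As an illustration of the query-parameter syntax, here is a small sketch that assembles a directory collection URI from a map of keywords and values, joining the pairs with semicolons after a single question mark. The class CollectionUriSketch and the method withParams are hypothetical helpers, not part of Saxon, and the sketch performs no %HH escaping of the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative helper, not part of Saxon.
class CollectionUriSketch {

    // Appends ?keyword=value;keyword=value to the directory URI.
    // Values are used as given; no escaping is applied here.
    static String withParams(String directoryUri, Map<String, String> params) {
        if (params.isEmpty()) {
            return directoryUri;
        }
        StringJoiner query = new StringJoiner(";", "?", "");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "=" + e.getValue());
        }
        return directoryUri + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the keywords in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("select", "*.xml");
        params.put("recurse", "yes");
        params.put("on-error", "warning");
        System.out.println(withParams("file:///a/b/c/d", params));
    }
}
```

The resulting string is what would be passed to the collection() function.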

The pattern used to select files can take the conventional form: for example, *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", and replacing "*" by ".*"; the result is then matched against the file names appearing in the directory using the Java regular expression rules. So, for example, you can write *.(xml|xhtml) to match files with either of these two file extensions. Note, however, that special characters used in the URL may need to be escaped using the %HH convention: this can be achieved using the iri-to-uri() function.
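
The conversion just described can be sketched in a few lines of Java. This follows the four steps given in the text rather than Saxon's source code, and the names SelectPatternSketch, toRegex and matches are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: follows the conversion steps described in the text.
class SelectPatternSketch {

    // Converts a select pattern such as *.xml into a Java regular expression.
    static String toRegex(String pattern) {
        // Escape the dots first, so that the dots introduced by ".*"
        // in the second step are not themselves escaped.
        String body = pattern.replace(".", "\\.").replace("*", ".*");
        return "^" + body + "$";
    }

    // Tests a single file name against the converted pattern.
    static boolean matches(String pattern, String fileName) {
        return Pattern.compile(toRegex(pattern)).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.xml"));
        System.out.println(matches("*.(xml|xhtml)", "index.xhtml"));
        System.out.println(matches("*.xml", "chap1.xml.bak"));
    }
}
```

Because only "." and "*" are rewritten, other regular expression syntax in the pattern, such as the parentheses and vertical bar in *.(xml|xhtml), passes through unchanged.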

A collection read in this way is not stable. Calling the collection() function again with the same URI will reprocess the directory, and return a different set of document nodes, even if the contents of the directory have not changed.