saxonica.com

Collections

Saxon implements the collection() function by passing the given URI (or null, if the default collection is requested) to a user-provided CollectionURIResolver. This section describes how the standard collection resolver behaves, if no user-written collection resolver is supplied.

The default collection resolver returns the empty sequence as the default collection. The only way of specifying a default collection is to provide your own CollectionURIResolver.

If a collection URI is provided, Saxon attempts to dereference it. What happens next depends on whether the URI identifies a file or a directory.

Using catalog files

If a file is identified, Saxon treats this as a catalog file. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:


<collection stable="true">
  <doc href="dir/chap1.xml"/>
  <doc href="dir/chap2.xml"/>
  <doc href="dir/chap3.xml"/>
  <doc href="dir/chap4.xml"/>
</collection>

The stable attribute indicates whether the collection is stable. The default value is true. If a collection is stable, the URIs listed in the doc elements are treated like URIs passed to the doc() function. Each URI is first looked up in the document pool to see if it is already loaded; if it is, the existing document node is returned. Otherwise the URI is passed to the registered URIResolver, and the resulting document is added to the document pool.

This process has two effects. First, two calls on the collection() function with the same collection URI return the same nodes each time. Second, the results are consistent with those of the doc() function: if the document-uri() of a node returned by the collection() function is passed to the doc() function, the original node is returned.

If stable="false" is specified, however, the URI is dereferenced directly and the document is not added to the document pool, so a subsequent retrieval of the same document will not return the same node.
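The difference between the two settings can be pictured with a toy model. The sketch below is not Saxon's implementation: the class DocumentPoolSketch, the nested Doc type, and the loader function are hypothetical stand-ins for the document pool, a document node, and the URIResolver respectively. It only illustrates why stable retrieval yields the same node twice while unstable retrieval does not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Toy model of stable vs. unstable retrieval; hypothetical, not Saxon code. */
class DocumentPoolSketch {
    /** Hypothetical stand-in for a document node; only object identity matters here. */
    static final class Doc {
        final String uri;
        Doc(String uri) { this.uri = uri; }
    }

    private final Map<String, Doc> pool = new HashMap<>();
    private final Function<String, Doc> loader; // stands in for URIResolver plus parser

    DocumentPoolSketch(Function<String, Doc> loader) {
        this.loader = loader;
    }

    /** Models stable="true": consult the pool first, remember newly loaded documents. */
    Doc fetchStable(String uri) {
        return pool.computeIfAbsent(uri, loader);
    }

    /** Models stable="false": dereference directly and leave the pool untouched. */
    Doc fetchUnstable(String uri) {
        return loader.apply(uri);
    }

    public static void main(String[] args) {
        DocumentPoolSketch sketch = new DocumentPoolSketch(Doc::new);
        String uri = "dir/chap1.xml";
        // Same node both times, because the second call finds it in the pool.
        System.out.println(sketch.fetchStable(uri) == sketch.fetchStable(uri));     // true
        // A fresh node each time, because nothing is cached.
        System.out.println(sketch.fetchUnstable(uri) == sketch.fetchUnstable(uri)); // false
    }
}
```

In this model, a later lookup of the same URI through the pool plays the role of the doc() function, which is why stable collections and doc() agree on node identity.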

Processing directories

If the URI passed to the collection() function (still assuming a default CollectionURIResolver) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form file:///a/b/c/d?keyword=value;keyword=value;.... The recognized keywords and their values are as follows:

Keyword: recurse
Values: yes | no (default no)
Effect: determines whether subdirectories are searched recursively.

Keyword: strip-space
Values: yes | ignorable | no
Effect: determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings.

Keyword: validation
Values: strip | preserve | lax | strict
Effect: determines whether and how schema validation is applied to each document. The default depends on the Configuration settings.

Keyword: select
Values: file name pattern
Effect: determines which files are selected (see below).

Keyword: on-error
Values: fail | warning | ignore
Effect: determines the action to be taken if one of the documents cannot be successfully parsed.

Keyword: parser
Values: Java class name
Effect: class name of the Java XMLReader to be used. For example, John Cowan's TagSoup parser may be selected by specifying parser=org.ccil.cowan.tagsoup.Parser (this parses arbitrary ill-formed HTML and presents it to Saxon as well-formed XML).

Keyword: xinclude
Values: yes | no
Effect: determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the Configuration (or any command line option).

Keyword: unparsed
Values: yes | no (default no)
Effect: determines whether the files contain unparsed text. If unparsed=yes is specified, the files are read as text using the platform default encoding. An error occurs if they contain characters that are not legal in XML. The parameters that are specific to XML, such as strip-space, parser, and validation, are ignored. The function returns a document node representing each file; the document node holds a single text node containing the file contents, and the document-uri() function returns the URI of the corresponding file.
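As an illustration of the query parameter syntax, the following sketch assembles a directory collection URI from keyword/value pairs. The class CollectionUriSketch and the method withParameters are hypothetical helpers, not part of Saxon. The sketch does no escaping, so it assumes the values contain no characters that need the %HH convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds a directory collection URI; not part of Saxon. */
class CollectionUriSketch {
    /** Appends keyword=value pairs, separated by ';', after a '?' on the directory URI. */
    static String withParameters(String directoryUri, Map<String, String> params) {
        if (params.isEmpty()) {
            return directoryUri;
        }
        StringJoiner query = new StringJoiner(";");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        return directoryUri + "?" + query;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("recurse", "yes");
        params.put("select", "*.xml");
        params.put("on-error", "warning");
        System.out.println(withParameters("file:///a/b/c/d", params));
        // prints file:///a/b/c/d?recurse=yes;select=*.xml;on-error=warning
    }
}
```

The resulting string is what would be passed as the argument to the collection() function.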

The pattern used in the select parameter can take the conventional form; for example, *.xml selects all files with the extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", and replacing "*" by ".*". The result is then matched against the file names appearing in the directory, using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions. Note, however, that characters with a special meaning in regular expressions may need to be escaped in the URL using the %HH convention. For example, a vertical bar needs to be written as %7C. This escaping can be achieved using the iri-to-uri() function.
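The conversion described above can be sketched in a few lines of Java. This is an illustration of the stated rules, not Saxon's source code: the class SelectPatternSketch and its methods are hypothetical, and the order of the two replacements (dots before asterisks) is an assumption, chosen so that the dot in ".*" is not itself escaped.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the select-pattern conversion; not Saxon's source code. */
class SelectPatternSketch {
    /** Converts a select pattern to a Java regular expression using the stated rules. */
    static Pattern toRegex(String selectPattern) {
        String body = selectPattern
                .replace(".", "\\.")  // "." becomes "\." (must happen before the next step)
                .replace("*", ".*");  // "*" becomes ".*"
        return Pattern.compile("^" + body + "$");
    }

    /** Reports whether a file name would be selected by the pattern. */
    static boolean selects(String selectPattern, String fileName) {
        return toRegex(selectPattern).matcher(fileName).find();
    }

    /** Escapes vertical bars with the %HH convention for use in the collection URI. */
    static String escapeVerticalBar(String selectPattern) {
        return selectPattern.replace("|", "%7C");
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.xml").pattern());                 // ^.*\.xml$
        System.out.println(selects("*.xml", "chap1.xml"));              // true
        System.out.println(selects("*.xml", "chap1.xml.bak"));          // false
        System.out.println(selects("*.(xml|xhtml)", "index.xhtml"));    // true
        System.out.println(selects("*.(xml|xhtml)", "index.html"));     // false
        System.out.println(escapeVerticalBar("*.(xml|xhtml)"));         // *.(xml%7Cxhtml)
    }
}
```

Because the pattern is anchored at both ends, a name such as chap1.xml.bak is rejected even though it contains ".xml".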

A collection read in this way is not stable. Calling the collection() function again with the same URI will reprocess the directory, and return a different set of document nodes, even if the contents of the directory have not changed.

Registered collections

On the .NET product there is a third way to use a collection URI (provided that you use the API rather than the command line): you can register a collection using the Processor.RegisterCollection method on the Saxon.Api.Processor class.
