Collections

Saxon implements the collection() and uri-collection() functions by passing the given collection URI (or null, if the default collection is requested) to a user-provided CollectionFinder. This section describes how the standard (default) collection finder behaves when no user-written CollectionFinder is supplied.

The default collection can be registered with the Configuration in the form of a collection URI. When the collection() function is called with no arguments, this is exactly the same as supplying this default collection URI. If no default collection URI has been registered, an empty collection is returned.

The standard collection finder supports four different kinds of collection: registered collections, catalog-based collections, directory-based collections, and ZIP-based collections.

From Saxon 9.7, provided XPath 3.1 is enabled, collections can return any kind of item (not only nodes, as previously). Saxon by default recognizes four kinds of resource: XML documents, JSON documents, unparsed text documents, and binary files. The standard collection finder attempts to identify which kind of resource to use based on the content type (media type), which in turn may be inferred from HTTP headers, from sniffing the initial bytes of the content, or from file extensions.

In the case of directory-based and ZIP-based collections, query parameters may be added to the collection URI to further control how it is to be processed.

Defining a collection using a catalog file

If the collection URI identifies a file, Saxon treats this as a catalog file. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:

<collection stable="true">
    <doc href="dir/chap1.xml"/>
    <doc href="dir/chap2.xml"/>
    <doc href="dir/chap3.xml"/>
    <doc href="dir/chap4.xml"/>
</collection>

The stable attribute indicates whether the collection is stable or not. The default value is true. If a collection is stable, then the URIs listed in the doc elements are treated like URIs passed to the doc() function. Each URI is first looked up in the document pool to see if it is already loaded; if it is, then the document node is returned. Otherwise the URI is passed to the registered URIResolver, and the resulting document is added to the document pool. The effect of this process is firstly, that two calls on the collection() function passing the same collection URI will return the same nodes each time, and secondly, that these results are consistent with the results of the doc() function: if the document-uri() of a node returned by the collection() function is passed to the doc() function, the original node will be returned. If stable="false" is specified, however, the URI is dereferenced directly, and the document is not added to the document pool, which means that a subsequent retrieval of the same document will not return the same node.
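To make the catalog structure concrete, here is a minimal sketch that reads a catalog document with the JDK's DOM parser and extracts the stable flag and the listed href values. It is not how Saxon itself processes catalogs. The class names CatalogSketch and Catalog are hypothetical, and treating every value other than "false" as stable is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CatalogSketch {

    /** Hypothetical holder for what a catalog declares; not a Saxon class. */
    public static final class Catalog {
        public final boolean stable;
        public final List<String> hrefs;

        public Catalog(boolean stable, List<String> hrefs) {
            this.stable = stable;
            this.hrefs = hrefs;
        }
    }

    /** Parses catalog XML held in a string and returns its stable flag and doc hrefs. */
    public static Catalog parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();
            // getAttribute returns "" when the attribute is absent; the documented default is true.
            // Assumption: any value other than "false" counts as stable.
            boolean stable = !"false".equals(root.getAttribute("stable"));
            NodeList docs = root.getElementsByTagName("doc");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < docs.getLength(); i++) {
                hrefs.add(((Element) docs.item(i)).getAttribute("href"));
            }
            return new Catalog(stable, hrefs);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable catalog", e);
        }
    }

    public static void main(String[] args) {
        Catalog c = parse("<collection stable=\"false\">"
                + "<doc href=\"dir/chap1.xml\"/><doc href=\"dir/chap2.xml\"/>"
                + "</collection>");
        System.out.println("stable=" + c.stable + " hrefs=" + c.hrefs);
    }
}
```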

Processing directories

If the URI passed to the collection() function (still assuming the standard collection finder) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form file:///a/b/c/d?keyword=value;keyword=value;.... The recognized keywords and their values are as follows:

Keyword: recurse
Values: yes | no (default no)
Effect: Determines whether subdirectories are searched recursively.

Keyword: strip-space
Values: yes | ignorable | no
Effect: Determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings.

Keyword: validation
Values: strip | preserve | lax | strict
Effect: Determines whether and how schema validation is applied to each document. The default depends on the Configuration settings.

Keyword: select
Values: file name pattern ("glob")
Effect: Determines which files are selected (see below).

Keyword: match
Values: regular expression
Effect: Determines which files are selected (see below).

Keyword: metadata
Values: yes | no
Effect: If set to yes, the item returned by the collection() function will be a map containing properties of the selected resource as well as its content. The keys of the map will be strings. Two entries with names "name" and "fetch" will always be available.

The value of the "fetch" entry is a function that can be called to retrieve the content (it returns the same item that would have been returned with the default setting of metadata=no: for example a node representing an XML document, or a map representing the content of a JSON file). This allows you to decide which items in the collection to fetch based on their properties, for example:

    for $m in collection('/data/folder?metadata=yes')
    return if ($m?content-type='application/xml') then $m?fetch() else ()

Failures in parsing a resource can be trapped by using try/catch around the call on the fetch function.

Other entries in the returned map represent properties of the file obtained from the operating system: for example last-modified, can-execute, length, or is-hidden.

Keyword: on-error
Values: fail | warning | ignore
Effect: Determines the action to be taken if one of the files cannot be successfully parsed.

Keyword: parser
Values: Java class name
Effect: Class name of the Java XMLReader to be used. For example, John Cowan's TagSoup parser may be selected by specifying parser=org.ccil.cowan.tagsoup.Parser (this parses arbitrary ill-formed HTML and presents it to Saxon as well-formed XML).

Keyword: xinclude
Values: yes | no
Effect: Determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the Configuration (or any command line option).

Keyword: stable
Values: yes | no
Effect: Determines whether the collection is to be stable.

The pattern used in the select parameter can use glob-like syntax, for example *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?", and it is then used to match the file names appearing in the directory using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions. Note however, that special characters used in the URL (that is, characters such as backslash and curly braces that are not allowed in the query part of a URI) must be escaped using the %HH convention. For example, vertical bar needs to be written as %7C. This escaping can be achieved using the encode-for-uri() function.
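The rewriting rules for the select pattern can be sketched in a few lines of Java. This is an illustration of the rules as stated above, not Saxon's own code. The class GlobSketch and its methods are hypothetical names, and the pattern is assumed to have already been un-escaped from its %HH form.

```java
import java.util.regex.Pattern;

public class GlobSketch {

    /** Converts a select pattern to a regular expression using the rules described above. */
    public static String globToRegex(String glob) {
        StringBuilder sb = new StringBuilder("^");
        // A single pass avoids re-escaping the "." that the "*" and "?" rules introduce.
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '.': sb.append("\\."); break;
                case '*': sb.append(".*"); break;
                case '?': sb.append(".?"); break;
                default:  sb.append(c);
            }
        }
        return sb.append('$').toString();
    }

    /** Tests a file name against the converted pattern using Java regular expression rules. */
    public static boolean selects(String glob, String fileName) {
        return Pattern.compile(globToRegex(glob)).matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(globToRegex("*.(xml|xhtml)"));
        System.out.println(selects("*.(xml|xhtml)", "index.xhtml"));
    }
}
```

Because all other characters pass through unchanged, regular expression syntax such as parentheses and vertical bar keeps its regex meaning, which is why a pattern like *.(xml|xhtml) works.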

As an alternative to the select parameter, the match parameter can be used. This accepts a standard XPath 3.1 regular expression as its value. For example, .+\.xml selects all files with extension "xml". Again, characters that are not allowed in the query part of a URI, such as backslash, curly braces, and vertical bar, must be escaped using the %HH convention, which can be achieved using the encode-for-uri() function.
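When the collection URI is assembled in Java rather than in XPath, the %HH escaping has to be done by hand. The sketch below percent-encodes every byte outside the RFC 3986 unreserved set, which is what encode-for-uri() does. The class QueryEscapeSketch and its method names are hypothetical, and the directory file:///a/b/c/d is the placeholder used earlier in this section.

```java
import java.nio.charset.StandardCharsets;

public class QueryEscapeSketch {

    /** Percent-encodes everything except letters, digits, "-", "_", "." and "~". */
    public static String percentEncode(String value) {
        StringBuilder sb = new StringBuilder();
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int ch = b & 0xFF;
            boolean unreserved = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
                    || (ch >= '0' && ch <= '9')
                    || ch == '-' || ch == '_' || ch == '.' || ch == '~';
            if (unreserved) {
                sb.append((char) ch);
            } else {
                sb.append(String.format("%%%02X", ch));
            }
        }
        return sb.toString();
    }

    /** Appends one keyword=value query parameter, escaping only the value. */
    public static String withParameter(String directoryUri, String keyword, String value) {
        return directoryUri + "?" + keyword + "=" + percentEncode(value);
    }

    public static void main(String[] args) {
        // The regular expression .+\.xml contains "+" and "\", both of which get escaped.
        System.out.println(withParameter("file:///a/b/c/d", "match", ".+\\.xml"));
    }
}
```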

A collection read in this way is not stable by default. (Stability can be expensive, and is rarely required, so the default setting is recommended.) Making a collection stable has the effect that the entire result of the collection() function is retained in a cache for the duration of the query or transformation, and any further calls on collection() with the same absolute URI return the saved collection from this cache.

Processing ZIP and JAR files

If the collection URI identifies a ZIP or JAR file then it is processed in much the same way as a directory. URI query parameters can be used in the same way, and have much the same effect, with the exceptions noted below.

A URI is recognized as a ZIP or JAR file URI if the scheme name is "jar", or if the file extension is "zip" or "jar".
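The recognition rule can be sketched as follows. This is an approximation for illustration only. The class ArchiveUriSketch and its method are hypothetical, and the case-sensitive comparison of the scheme and extension is an assumption, since the text does not say how case is handled.

```java
import java.net.URI;

public class ArchiveUriSketch {

    /** Returns true if the URI has scheme "jar" or a path ending in ".zip" or ".jar". */
    public static boolean looksLikeArchive(String collectionUri) {
        URI uri = URI.create(collectionUri);
        if ("jar".equals(uri.getScheme())) {
            return true;
        }
        // getPath() excludes any query parameters and is null for opaque URIs.
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String extension = path.substring(dot + 1);
        return extension.equals("zip") || extension.equals("jar");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeArchive("file:///data/books.zip?select=*.xml"));
        System.out.println(looksLikeArchive("file:///data/folder"));
    }
}
```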

The value of the recurse option is ignored in this case, and recurse=yes is assumed.

The option metadata=yes is available for ZIP-based collections as well as for directory-based collections. The set of properties returned in the resulting map is slightly different; for example, it includes any comment field associated with the ZIP file entry. Note that no items are returned in respect of directory nodes within the ZIP file; only leaf nodes are represented.

Writing your own CollectionFinder

The CollectionFinder interface in Saxon 9.7 replaces the CollectionURIResolver interface in previous releases. It has much more flexibility, in particular the ability to deliver non-XML resources. The old CollectionURIResolver interface continues to be available alongside the new interface for the time being.

Details of the interface can be found in the Javadoc. The basic steps are:

  1. Write a class that implements CollectionFinder. It has a single method, which accepts an absolute collection URI, and returns an object that implements ResourceCollection. Register an instance of your CollectionFinder with the Saxon Configuration.

  2. You can either reuse the existing implementations of ResourceCollection, namely CatalogCollection, DirectoryCollection, and JarCollection, or you can write your own. You can also of course subclass the existing collection classes. The ResourceCollection object provides two key methods that you need to implement: getResources(), which returns a sequence of Resource objects, and getResourceURIs(), which returns a sequence of URIs. These are invoked by the fn:collection() and fn:uri-collection() functions respectively.

  3. Again, you can either reuse existing implementations of Resource (such as XmlResource, JSONResource, UnparsedTextResource, BinaryResource, and MetadataResource), or you can create your own, perhaps by subclassing. The key method that the Resource object must provide is getItem(), which returns the resource in the form of an XDM item. It is good practice to delay any extensive work such as parsing until the getItem() method is called: this reduces the memory footprint, and enables parallel evaluation using multiple threads (Saxon-EE only).
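The advice in step 3 about deferring work until getItem() is called can be illustrated without the Saxon classes. In the sketch below, SketchResource is a hypothetical stand-in and is not Saxon's Resource interface (consult the Javadoc for the real signatures). The "parsing" is simulated by a counter so that the laziness is observable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyResourceSketch {

    /** Hypothetical stand-in for a resource; NOT Saxon's Resource interface. */
    public interface SketchResource {
        String getResourceURI();
        Object getItem();
    }

    /** Holds only the URI and a loader; the expensive work runs when getItem() is called. */
    public static final class LazyResource implements SketchResource {
        private final String uri;
        private final Supplier<Object> loader;

        public LazyResource(String uri, Supplier<Object> loader) {
            this.uri = uri;
            this.loader = loader;
        }

        public String getResourceURI() { return uri; }

        public Object getItem() { return loader.get(); }
    }

    /** Counts simulated parses so that callers can observe when work actually happens. */
    public static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    /** Builds the resource list cheaply: no parsing takes place here. */
    public static List<SketchResource> listResources(List<String> uris) {
        List<SketchResource> resources = new ArrayList<>();
        for (String uri : uris) {
            resources.add(new LazyResource(uri, () -> {
                PARSE_COUNT.incrementAndGet();   // stands in for real parsing
                return "parsed:" + uri;
            }));
        }
        return resources;
    }

    public static void main(String[] args) {
        List<SketchResource> resources = listResources(List.of("a.xml", "b.xml"));
        System.out.println("parses after listing: " + PARSE_COUNT.get());
        resources.get(0).getItem();
        System.out.println("parses after one getItem(): " + PARSE_COUNT.get());
    }
}
```

Keeping the listing step free of parsing means that a caller who only needs the URIs, or who fetches only some of the items, never pays for the rest.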

Registered Collections

On the .NET product there is another way to use a collection URI (provided that you use the API rather than the command line): you can register a collection using the Processor.RegisterCollection method on the Saxon.Api.Processor class.