Saxon implements the collection() and uri-collection() functions by passing the given collection URI (or null, if the default
collection is requested) to a user-provided CollectionFinder. This section describes how
the standard (default) collection finder behaves, if no user-written collection finder is
supplied. (For information on supplying a user-written
CollectionFinder, see Writing your own Collection Finder.)
There are some differences between Java and C#.
In XSLT 3.0 and XQuery 3.1, collections can contain resources other than XML documents: for example, JSON documents, plain text documents, and binary files.
The default collection can be registered with the Processor
(or with its underlying
Configuration) by setting the Feature DEFAULT_COLLECTION.
The value takes the form of a
collection URI. When the
collection() function is called with no arguments, this
is exactly the same as supplying this default collection URI. If no default collection URI has
been registered, an empty collection is returned.
The standard collection finder supports four different kinds of collection: registered collections, catalog-based collections, directory-based collections, and (on Java only) zip-based collections:
A registered collection is one that has been explicitly registered with the Configuration, by calling Configuration.registerCollection().
Note: this cannot currently be done at the level of the
If the collection URI corresponds to a directory name, then a directory-based collection is used: the collection contains selected files from the named directory.
If the collection URI identifies a ZIP or JAR file (for details see below) then a zip-based collection is used.
Otherwise, the collection URI must be the URI of an XML file which acts as a catalog, that is, it contains a list of the resources in the collection.
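The selection among the four kinds can be sketched in plain Java as follows. This is an illustration only, not Saxon's code: the class name CollectionKindSketch, the Kind enum, and the caller-supplied isZip predicate are hypothetical, and testing the kinds in the order the list presents them is an assumption.

```java
import java.io.File;
import java.net.URI;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative only: mirrors the order of the list above, not Saxon's implementation. */
final class CollectionKindSketch {

    enum Kind { REGISTERED, DIRECTORY, ZIP, CATALOG }

    /**
     * @param uri        the collection URI (query parameters are ignored here)
     * @param registered URIs that were explicitly registered (hypothetical stand-in)
     * @param isZip      caller-supplied test for ZIP/JAR URIs (hypothetical)
     */
    static Kind classify(URI uri, Set<URI> registered, Predicate<URI> isZip) {
        if (registered.contains(uri)) {
            return Kind.REGISTERED;
        }
        // getPath() drops any "?keyword=value" part; it is null for opaque URIs.
        String path = uri.getPath();
        if ("file".equals(uri.getScheme()) && path != null && new File(path).isDirectory()) {
            return Kind.DIRECTORY;
        }
        if (isZip.test(uri)) {
            return Kind.ZIP;
        }
        // Anything else is expected to be an XML catalog file.
        return Kind.CATALOG;
    }
}
```

A caller would pass its own set of registered URIs and its own ZIP test; the sketch only shows that the catalog interpretation is the fallback when nothing else applies.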
Saxon by default recognizes four kinds of resource: XML documents, JSON documents, unparsed text documents, and binary files. The standard collection resolver attempts to identify which kind of resource to use based on the content type (media type), which in turn may be inferred from HTTP headers, from sniffing the initial bytes of the content, or from file extensions.
In the case of directory-based and ZIP-based collections, query parameters may be added to the collection URI to further control how it is to be processed.
Defining a collection using a collection catalog
If the collection URI identifies a file, Saxon treats this as a collection catalog. This is a file in XML format that lists the documents comprising the collection. Here is an example of such a catalog file:
<collection stable="true">
    <doc href="dir/contents.json"/>
    <doc href="dir/chap1.xml"/>
    <doc href="dir/chap2.xml"/>
    <doc href="dir/chap3.xml"/>
    <doc href="dir/chap4.xml"/>
    <doc href="dir/index.json"/>
</collection>
The stable attribute indicates whether the collection is stable or not. The
default value is
true. If a collection is stable, then the URIs listed in the
doc elements are treated like URIs passed to the doc() function.
Each URI is first looked up in the document pool to see if it is already loaded; if it is,
then the corresponding value is returned (a document node in the case of XML resources).
Otherwise the URI is passed to the registered
URIResolver, and the resulting document is added to the document pool. The
effect of this process is firstly, that two calls on the collection() function
passing the same collection URI will return the same nodes each time, and secondly, that these
results are consistent with the results of the
doc() function: if the
document-uri() of a node returned by the
collection() function is
passed to the
doc() function, the original node will be returned. If
stable="false" is specified, however, the URI is dereferenced directly, and the
document is not added to the document pool, which means that a subsequent retrieval of the
same document will not return the same node.
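To make the catalog format concrete, here is a minimal sketch that reads such a catalog with the JDK's DOM parser. It is not how Saxon processes catalogs: the class name CatalogSketch is hypothetical, resolving each href against the catalog's own URI is an assumption, and so is treating every stable value other than "false" as true.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative reader for the catalog format shown above (not Saxon's code). */
final class CatalogSketch {
    final boolean stable;
    final List<URI> resources;

    private CatalogSketch(boolean stable, List<URI> resources) {
        this.stable = stable;
        this.resources = resources;
    }

    static CatalogSketch parse(String catalogXml, URI catalogUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations; a catalog does not need one.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(catalogXml.getBytes(StandardCharsets.UTF_8)));
            Element root = doc.getDocumentElement();

            // The default is true; only an explicit "false" switches stability off
            // (handling of other values is an assumption of this sketch).
            boolean stable = !"false".equals(root.getAttribute("stable"));

            List<URI> resources = new ArrayList<>();
            NodeList docs = root.getElementsByTagName("doc");
            for (int i = 0; i < docs.getLength(); i++) {
                Element d = (Element) docs.item(i);
                // Assumption: relative hrefs are resolved against the catalog's URI.
                resources.add(catalogUri.resolve(d.getAttribute("href")));
            }
            return new CatalogSketch(stable, resources);
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot parse collection catalog", e);
        }
    }
}
```

The resulting list preserves the order of the doc elements, so JSON and XML entries can be mixed exactly as in the example catalog.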
If the URI passed to the
collection() function (still assuming a default
CollectionFinder) identifies a directory, then the contents of the
directory are returned. Such a URI may have a number of query parameters, written in the form
file:///a/b/c/d?keyword=value;keyword=value;.... The recognized keywords and
their values are as follows:
yes | no (default no)
Determines whether subdirectories are searched recursively.
yes | ignorable | no
Determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings.
strip | preserve | lax | strict
Determines whether and how schema validation is applied to each document. The default depends on the Configuration settings.
file name pattern ("glob")
Determines which files are selected (see below).
Determines which files are selected (see below).
media type (for example
Determines how the resource is processed. For example if the media type is
If this parameter is absent, then the CollectionFinder attempts to discern the content type first by looking at the file extension, and then, if necessary, by examining the initial bytes of the content itself.
The set of content types that are recognized, and their mapping to implementations of the
class ResourceFactory, is defined in the
Configuration, and can be changed using the
Available from Saxon 10.1.
yes | no
If set to yes, the item returned by the
The value of the "fetch" entry is a function that can be called to retrieve the
content (it returns the same item that would have been returned with the default
Failures in parsing a resource can be trapped by using try/catch around the call on
Other entries in the returned map represent properties of the file obtained from the
operating system: for example
fail | warning | ignore
Determines the action to be taken if one of the files cannot be successfully parsed.
Java class name
Class name of the Java
yes | no
Determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the Configuration (or any command line option).
yes | no
Determines whether the collection is to be stable.
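The keyword=value;keyword=value form of the query part can be illustrated with a small parser. This is only a sketch using java.net.URI; the class name CollectionUriParams is hypothetical and Saxon's own parsing may differ in detail. Because getQuery() decodes %HH escapes before the string is split, a value containing an escaped semicolon would be split incorrectly here.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative parser for "?keyword=value;keyword=value" (not Saxon's code). */
final class CollectionUriParams {

    static Map<String, String> parse(URI collectionUri) {
        Map<String, String> params = new LinkedHashMap<>();
        // getQuery() returns the query with %HH escapes already decoded, or null.
        String query = collectionUri.getQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split(";")) {
            if (pair.isEmpty()) {
                continue; // tolerate a trailing or doubled semicolon
            }
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```

For example, parsing file:///a/b/c/d?recurse=yes;select=*.xml yields a map with the two entries recurse and select, in that order.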
The pattern used in the
select parameter can use glob-like syntax, for example
*.xml selects all files with extension "xml". More generally, the pattern is
converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?", and it is then used to match the file names appearing in the directory
using the Java regular expression rules. So, for example, you can write
?select=*.(xml|xhtml) to match files with either of these two file extensions.
Note however, that special characters used in the URL (that is, characters such as backslash
and curly braces that are not allowed in the query part of a URI) must be escaped using
the %HH convention. For example,
vertical bar needs to be written as
%7C. This escaping can be achieved using the encode-for-uri() function.
As an alternative to the
select parameter, the match parameter
can be used. This accepts a standard XPath 3.1 regular expression as its value. For example,
.+\.xml selects all files with extension "xml". Again, characters that are not allowed
in the query part of a URI, such as backslash, curly braces, and vertical bar, must be escaped
using the %HH convention, which can be achieved using the encode-for-uri() function.
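The following sketch shows a glob-to-regex conversion of the kind described for the select parameter (anchor with ^ and $, escape ".", turn "*" into ".*" and "?" into ".?"), together with one way of percent-encoding the query from Java. The class and method names are hypothetical. The multi-argument java.net.URI constructor is used as a JDK analogue of escaping; it is not the XPath encode-for-uri() function and escapes only characters that are illegal in a URI query.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

/** Illustrative helpers for the select parameter (not Saxon's code). */
final class SelectPatternSketch {

    /** Converts a glob to a Java regular expression using the rules described above. */
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder("^");
        for (char c : glob.toCharArray()) {
            switch (c) {
                case '.': sb.append("\\."); break; // literal dot
                case '*': sb.append(".*");  break; // any run of characters
                case '?': sb.append(".?");  break; // at most one character
                default:  sb.append(c);            // everything else passes through
            }
        }
        return Pattern.compile(sb.append('$').toString());
    }

    /** Builds a file: collection URI whose query is "select=<glob>", escaping illegal characters. */
    static URI withSelect(String directoryPath, String glob) {
        try {
            // This constructor quotes characters that are not legal in the query part.
            return new URI("file", null, directoryPath, "select=" + glob, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build collection URI", e);
        }
    }
}
```

Because characters other than ".", "*" and "?" pass through unchanged, a glob such as *.(xml|xhtml) keeps its parentheses and vertical bar as regular-expression syntax, which is why that pattern matches either extension.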
A collection read in this way is not stable by default. (Stability can be expensive, and is
rarely required, so the default setting is recommended.) Making a collection stable has the
effect that the entire result of the
collection() function is retained in a cache
for the duration of the query or transformation, and any further calls on
collection() with the same absolute URI return the saved collection
from this cache.
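The effect of stability can be pictured as a per-run cache keyed by the absolute collection URI. The sketch below is a simplified model under that assumption, not Saxon's implementation; StableCollectionCache and its loader function are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Simplified model of a stable collection: results are cached by absolute URI. */
final class StableCollectionCache<T> {
    private final Map<String, List<T>> cache = new HashMap<>();
    private final Function<String, List<T>> loader;

    /** @param loader reads the directory contents afresh (hypothetical stand-in) */
    StableCollectionCache(Function<String, List<T>> loader) {
        this.loader = loader;
    }

    List<T> collection(String absoluteUri, boolean stable) {
        if (!stable) {
            // Not stable: every call re-reads the collection and builds new items.
            return loader.apply(absoluteUri);
        }
        // Stable: the first result is retained and returned for later calls.
        return cache.computeIfAbsent(absoluteUri, loader);
    }
}
```

One instance of this cache would live for a single query or transformation, which matches the stated lifetime of the retained result; the cost is that the whole result stays in memory for that period.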
Processing ZIP and JAR files
If the collection URI identifies a ZIP or JAR file then it is processed in exactly the same way as a directory. URI query parameters can be used in the same way, and have much the same effect.
A URI is recognized as a ZIP or JAR file URI if any of the following conditions applies:
- The standard collection finder is used and either (a) the URI uses the (non-standard)
"jar" URI scheme, or (b) the collection URI ends with
- A custom collection finder is used and the method
- A regular expression is supplied as the value of the configuration property
ZIP_URI_PATTERN, and the collection URI matches this regular expression.
The value of the
recurse option is ignored in this case, and
recurse=yes is assumed.
metadata=yes is available for ZIP-based collections as well as for
directory-based collections. The set of properties returned in the resulting map is slightly
different: for example, it includes any
comment field associated with the ZIP file
entry. Note that no items are returned in respect of directory nodes within the ZIP file; only
leaf nodes are represented.
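The point about leaf nodes and entry comments can be illustrated with java.util.zip. This is a sketch of the idea only, not the code Saxon uses; the class name ZipLeafEntries and the "name (comment)" output format are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Lists the leaf entries of a ZIP or JAR file, with the entry comment where present. */
final class ZipLeafEntries {

    static List<String> list(Path zip) {
        List<String> result = new ArrayList<>();
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue; // directory nodes produce no item
                }
                String comment = entry.getComment(); // null if the entry has no comment
                result.add(comment == null
                        ? entry.getName()
                        : entry.getName() + " (" + comment + ")");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Entries nested at any depth are returned by a single enumeration of the archive.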
It is possible to register a collection explicitly with the Saxon Configuration.
This is done using the method Configuration.registerCollection();
the equivalent on .NET is
Processor.RegisterCollection(). When the URI provided to
fn:collection() has been registered in this way, the CollectionFinder
is not invoked.
On Java, the argument supplied to
Configuration.registerCollection() is an object of type
you can either use one of the standard collection types (
JarCollection, etc), or you can implement
Processor.RegisterCollection() accepts an object of type
IEnumerable<IResource>: you can implement your own kind of
and your own kind of
IResource, or you can use an existing implementation.