Directories as collections

If the URI passed to the collection() function (still assuming a default CollectionFinder) identifies a directory, then the contents of the directory are returned. Such a URI may have a number of query parameters, written in the form file:///a/b/c/d?keyword=value;keyword=value;.... The recognized keywords and their values are as follows:

keyword: recurse
values: yes | no (default no)
effect: Determines whether subdirectories are searched recursively.

keyword: strip-space
values: yes | ignorable | no
effect: Determines whether whitespace text nodes are to be stripped. The default depends on the Configuration settings.

keyword: validation
values: strip | preserve | lax | strict
effect: Determines whether and how schema validation is applied to each document. The default depends on the Configuration settings.

keyword: select
values: file name pattern ("glob")
effect: Determines which files are selected (see below).

keyword: match
values: regular expression
effect: Determines which files are selected (see below).

keyword: content-type
values: media type (for example application/xml or text/plain)
effect: Determines how the resource is processed. For example, if the media type is application/xml then it will be parsed as XML and returned as a document node; if it is text/plain then it is returned as an atomic value of type xs:string; if it is application/binary then it is returned as an atomic value of type xs:base64Binary.

If this parameter is absent, then the CollectionFinder attempts to discern the content type first by looking at the file extension, and then, if necessary, by examining the initial bytes of the content itself.

The set of content types that are recognized, and their mapping to implementations of the class ResourceFactory, is defined in the Configuration, and can be changed using the method Configuration.registerMediaType(). The set of file extensions that are recognized, and their mapping to media types, is also held in the Configuration, and can be changed using the method Configuration.registerFileExtension().

Available from Saxon 10.1.

keyword: metadata
values: yes | no
effect: If set to yes, the item returned by the collection() function will be a map containing properties of the selected resource as well as its content. The keys of the map will be strings. Two entries with names "name" and "fetch" will always be available.

The value of the "fetch" entry is a function that can be called to retrieve the content (it returns the same item that would have been returned with the default setting of metadata=no: for example a node representing an XML document, or a map representing the content of a JSON file). This allows you to decide which items in the collection to fetch based on their properties, for example:

for $m in collection('/data/folder?metadata=yes') return if ($m?content-type='application/xml') then $m?fetch() else ()

Failures in parsing a resource can be trapped by using try/catch around the call on the fetch function.

Other entries in the returned map represent properties of the file obtained from the operating system: for example last-modified, can-execute, length, or is-hidden.

keyword: on-error
values: fail | warning | ignore
effect: Determines the action to be taken if one of the files cannot be successfully parsed.

keyword: parser
values: Java class name
effect: Class name of the Java XMLReader to be used. For example, John Cowan's TagSoup parser may be selected by specifying parser=org.ccil.cowan.tagsoup.Parser (this parses arbitrary ill-formed HTML and presents it to Saxon as well-formed XML).

keyword: xinclude
values: yes | no
effect: Determines whether XInclude processing should be applied to the selected documents. This overrides any setting in the Configuration (or any command line option).

keyword: stable
values: yes | no
effect: Determines whether the collection is to be stable.
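As an illustration of the query syntax described above (file:///a/b/c/d?keyword=value;keyword=value;...), the following Java sketch assembles such a URI from a map of parameters. The class CollectionUriSketch and its build method are hypothetical helpers written for this example and are not part of Saxon; the sketch assumes the parameter values are already safe to place in the query part of a URI.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Saxon: joins keyword=value pairs with ';'
// and appends them to a directory URI as its query part.
class CollectionUriSketch {

    static String build(String directoryUri, Map<String, String> params) {
        if (params.isEmpty()) {
            return directoryUri;
        }
        StringJoiner query = new StringJoiner(";");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            // Assumption: values are already URI-safe (for example yes, no, ignore).
            query.add(entry.getKey() + "=" + entry.getValue());
        }
        String uri = directoryUri + "?" + query;
        // URI.create throws IllegalArgumentException if the result is not a valid URI.
        URI.create(uri);
        return uri;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("recurse", "yes");
        params.put("on-error", "warning");
        // Prints file:///a/b/c/d?recurse=yes;on-error=warning
        System.out.println(build("file:///a/b/c/d", params));
    }
}
```

The resulting string is what would be passed as the argument to collection(). A LinkedHashMap is used only so that the parameters appear in insertion order.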

The pattern used in the select parameter can use glob-like syntax: for example, *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", and replacing "." by "\.", "*" by ".*", and "?" by ".?"; the result is then matched against the file names appearing in the directory using the Java regular expression rules. So, for example, the pattern *.(xml|xhtml) matches files with either of these two file extensions. Note, however, that characters that are not allowed in the query part of a URI (such as backslash, curly braces, and vertical bar) must be escaped using the %HH convention. Vertical bar, for example, needs to be written as %7C, so this pattern appears in the URI as ?select=*.(xml%7Cxhtml). This escaping can be achieved using the encode-for-uri() function.
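The conversion rules just described can be sketched in Java as follows. This is an illustration of the stated rules only, not Saxon's actual implementation; the class GlobSketch and its methods are hypothetical names, and the use of Matcher.find() on the anchored expression is an assumption.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the glob-to-regex rules stated in the text.
class GlobSketch {

    // Prepend "^", append "$", and replace "." by "\.", "*" by ".*", "?" by ".?".
    static String toRegex(String glob) {
        StringBuilder regex = new StringBuilder("^");
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            switch (c) {
                case '.': regex.append("\\."); break;
                case '*': regex.append(".*"); break;
                case '?': regex.append(".?"); break;
                default:  regex.append(c);   // everything else is passed through unchanged
            }
        }
        return regex.append('$').toString();
    }

    // Tests a file name against the converted pattern using Java regex rules.
    static boolean matches(String glob, String fileName) {
        return Pattern.compile(toRegex(glob)).matcher(fileName).find();
    }

    public static void main(String[] args) {
        System.out.println(toRegex("*.(xml|xhtml)"));             // ^.*\.(xml|xhtml)$
        System.out.println(matches("*.xml", "chapter1.xml"));     // true
        System.out.println(matches("*.xml", "chapter1.xml.bak")); // false
    }
}
```

Because characters other than ".", "*", and "?" pass through unchanged, regular expression constructs such as parentheses and vertical bar keep their regex meaning, which is why a pattern like *.(xml|xhtml) works.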

As an alternative to the select parameter, the match parameter can be used. This accepts a standard XPath 3.1 regular expression as its value. For example, .+\.xml selects all files with extension "xml". Again, characters that are not allowed in the query part of a URI, such as backslash, curly braces, and vertical bar, must be escaped using the %HH convention, which can be achieved using the encode-for-uri() function.
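When the URI is built in Java code rather than in XPath, the %HH escaping can be done by hand. The sketch below mimics the behaviour of encode-for-uri() by escaping every character other than letters, digits, hyphen, underscore, full stop, and tilde; the class UriEscapeSketch is a hypothetical helper, not a Saxon API. Note that it escapes more characters than strictly necessary (for example "*" and parentheses), which is assumed to be harmless because the escapes are decoded again.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Saxon: percent-encodes a parameter value,
// leaving only the RFC 3986 "unreserved" characters unescaped.
class UriEscapeSketch {

    static String escape(String value) {
        StringBuilder out = new StringBuilder();
        // Work on UTF-8 bytes so that non-ASCII characters become %HH sequences.
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9')
                    || c == '-' || c == '_' || c == '.' || c == '~';
            if (unreserved) {
                out.append((char) c);
            } else {
                out.append('%').append(String.format("%02X", c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The regular expression .+\.xml from the text:
        System.out.println("file:///a/b/c/d?match=" + escape(".+\\.xml"));
        // Prints file:///a/b/c/d?match=.%2B%5C.xml
    }
}
```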

A collection read in this way is not stable by default. (Stability can be expensive, and is rarely required, so the default setting is recommended.) Making a collection stable means that the entire result of the collection() function is retained in a cache for the duration of the query or transformation, and any further call on collection() with the same absolute URI returns the saved result from that cache.
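The difference between a stable and a non-stable collection can be pictured with a small Java sketch. This is a conceptual model only and says nothing about how Saxon implements its cache; the class StableCollectionSketch, its loader function, and the use of a list of strings to stand in for the collection result are all assumptions made for the example.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Conceptual model, not Saxon code: a stable collection is read once and then
// served from a cache keyed by its absolute URI; a non-stable one is re-read.
class StableCollectionSketch {
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();
    private final Function<String, List<String>> loader;

    StableCollectionSketch(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    List<String> collection(String absoluteUri, boolean stable) {
        if (!stable) {
            return loader.apply(absoluteUri);               // read the directory again
        }
        return cache.computeIfAbsent(absoluteUri, loader);  // read at most once per URI
    }

    public static void main(String[] args) {
        int[] reads = {0};
        StableCollectionSketch sketch = new StableCollectionSketch(uri -> {
            reads[0]++;
            return Collections.singletonList(uri + " (read " + reads[0] + ")");
        });
        sketch.collection("file:///a/b/c/d?stable=yes", true);
        sketch.collection("file:///a/b/c/d?stable=yes", true);
        System.out.println(reads[0]); // 1: the second call was served from the cache
        sketch.collection("file:///a/b/c/d", false);
        sketch.collection("file:///a/b/c/d", false);
        System.out.println(reads[0]); // 3: each non-stable call read the directory
    }
}
```

The model also shows where the cost comes from: the whole result stays in memory for as long as the cache lives.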