Serialization
It should now be possible to use any output encoding that is supported by the Java VM, without defining
a custom CharacterSet class. As of JDK 1.4, Java allows the application to determine whether a
particular character can be encoded in a given character set, and this information is now used to
decide whether to replace the character with a numeric character reference. Because I don't know how
efficient this mechanism is, I still use the old mechanism for character sets that were previously
supported in Saxon, and the mechanism for defining user-defined character sets remains available
for the time being. It has been restricted, however: Saxon will only attempt to load a
PluggableCharacterSet for encoding XXX if the output property encoding.XXX="class-name"
is present.
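
To illustrate the JDK 1.4 mechanism described above, here is a minimal sketch (not Saxon's actual code) that uses java.nio.charset.CharsetEncoder.canEncode to decide, character by character, whether to write the character itself or a numeric character reference. The class name CharRefEscaper and its escape method are invented for this illustration, and the code relies on String.codePointAt and StringBuilder, which need a Java release later than 1.4.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only; not part of Saxon.
class CharRefEscaper {

    // Returns the text with every character that the named encoding
    // cannot represent replaced by a decimal numeric character reference.
    static String escape(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Read a whole code point so that surrogate pairs stay together.
            int codepoint = text.codePointAt(i);
            int length = Character.charCount(codepoint);
            String piece = text.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                out.append(piece);
            } else {
                out.append("&#").append(codepoint).append(';');
            }
            i += length;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00E9 is not encodable in US-ASCII, so it becomes a reference.
        System.out.println(escape("caf\u00e9", "US-ASCII")); // prints caf&#233;
    }
}
```

Asking the encoder once per character is the simplest approach; whether it is fast enough for large outputs is exactly the open question mentioned above.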
The code now allows for the possibility that character encodings other than UTF-8 and UTF-16 may be capable of encoding supplementary characters (characters whose Unicode codepoints are above 65535). Previously such characters were always output as numeric character references, except when using UTF-8 or UTF-16. A consequence of this is that user-written PluggableCharacterSet implementations must be prepared to categorize such characters.
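
As a rough sketch of what categorizing supplementary characters can look like, the class below decides encodability for codepoints both below and above 65535. It is self-contained and does not implement Saxon's PluggableCharacterSet interface: the class name, the inRange method, and the character repertoire (ASCII plus the range U+10330 to U+1034F) are all assumptions made up for this example, as is the premise that the categorization method receives a whole codepoint as an int rather than two surrogate chars.

```java
// Hypothetical character set, for illustration only; not part of Saxon.
class SupplementaryAwareSet {

    // Decides whether a codepoint can be written directly.
    // The repertoire here is invented: ASCII plus U+10330..U+1034F.
    static boolean inRange(int codepoint) {
        if (codepoint < 0x80) {
            return true;
        }
        // Codepoints above 0xFFFF must be handled too, not rejected blindly.
        return codepoint >= 0x10330 && codepoint <= 0x1034F;
    }

    // Writes encodable codepoints as-is and the rest as numeric references.
    static String serialize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codepoint = text.codePointAt(i); // combines surrogate pairs
            if (inRange(codepoint)) {
                out.appendCodePoint(codepoint);
            } else {
                out.append("&#").append(codepoint).append(';');
            }
            i += Character.charCount(codepoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is outside the invented repertoire.
        System.out.println(serialize("\uD83D\uDE00")); // prints &#128512;
    }
}
```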
There are some differences between the character encodings supported by the old java.io package
and the new java.nio package. If the requested encoding is not supported by the java.nio package, then
all non-ASCII characters will be represented using numeric character references. If the encoding is
not supported by the java.io package, then Saxon will revert to using UTF-8 as the actual output
encoding. A list of the character encodings supported in the java.nio package can be obtained by
running the command java net.sf.saxon.charcode.CharacterSetFactory
with no arguments.
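
The fallback rules above can be sketched roughly as follows. This is an illustration using only standard JDK calls (Charset.isSupported, Charset.availableCharsets, and the OutputStreamWriter constructor that takes an encoding name), not a copy of Saxon's logic. The class EncodingPlan and its fields are hypothetical, and the choice not to escape non-ASCII characters after falling back to UTF-8 is my assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, for illustration only; not part of Saxon.
class EncodingPlan {
    final String actualEncoding;     // encoding that will really be used
    final boolean escapeAllNonAscii; // true: write non-ASCII as &#...;

    private EncodingPlan(String actualEncoding, boolean escapeAllNonAscii) {
        this.actualEncoding = actualEncoding;
        this.escapeAllNonAscii = escapeAllNonAscii;
    }

    // Does java.nio know this encoding name?
    static boolean knownToNio(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // syntactically illegal name
        }
    }

    // Does java.io accept this encoding name for a writer?
    static boolean knownToIo(String name) {
        try {
            new OutputStreamWriter(new ByteArrayOutputStream(), name);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    static EncodingPlan choose(String requested) {
        if (!knownToIo(requested)) {
            // Not usable for output at all: revert to UTF-8.
            return new EncodingPlan("UTF-8", false);
        }
        // Usable for output; escape all non-ASCII if java.nio cannot help.
        return new EncodingPlan(requested, !knownToNio(requested));
    }

    public static void main(String[] args) {
        System.out.println(choose("x-no-such-encoding").actualEncoding); // prints UTF-8
        // List the encodings known to java.nio on this JVM.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```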
The HTML serialization method should now handle INS and DEL elements correctly.
User-written emitters were not working; the code has been fixed but not tested.