Saxonica.com

Character Encodings Supported

The encodings supported on input depend entirely on your choice of XML parser.

On output, any encoding supported by the Java VM or the .NET platform (as appropriate) may be used.

The encodings iso-646 and iso646 (in any mixture of upper and lower case) are recognized as synonyms of US-ASCII.
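The synonym rule above can be pictured as a small case-insensitive lookup. The following is only an illustrative sketch: the class `EncodingSynonyms` and its `normalize` method are hypothetical names invented for this example and are not part of the Saxon API.

```java
// Hypothetical helper, not Saxon code: maps the two documented synonyms
// onto the canonical name US-ASCII, ignoring upper/lower case.
final class EncodingSynonyms {

    static String normalize(String requested) {
        if (requested.equalsIgnoreCase("iso-646") || requested.equalsIgnoreCase("iso646")) {
            return "US-ASCII";
        }
        // Any other name is passed through unchanged.
        return requested;
    }

    public static void main(String[] args) {
        System.out.println(normalize("ISO-646"));
        System.out.println(normalize("Iso646"));
        System.out.println(normalize("UTF-8"));
    }
}
```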

On the Java platform, there are some differences between the character encodings supported by the old java.io package and the new java.nio package. If the requested encoding is not supported by the java.nio package, then all non-ASCII characters will be represented using numeric character references. If the encoding is not supported by the java.io package, then Saxon will revert to using UTF-8 as the actual output encoding.
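To make the two fallback rules concrete, here is a minimal sketch that probes both packages using standard JDK calls only. The class `EncodingSupportProbe` and its method names are assumptions made for illustration; Saxon's own logic is not shown here and may differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

// Hypothetical probe, not Saxon code.
final class EncodingSupportProbe {

    // java.nio check: is a Charset registered under this name?
    static boolean supportedByNio(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    // java.io check: there is no listing API, so try to construct a writer.
    static boolean supportedByIo(String name) {
        try {
            new OutputStreamWriter(new ByteArrayOutputStream(), name);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }

    // Mirrors the rule described above: fall back to UTF-8 when java.io
    // cannot handle the requested encoding.
    static String effectiveEncoding(String requested) {
        return supportedByIo(requested) ? requested : "UTF-8";
    }

    public static void main(String[] args) {
        String requested = args.length > 0 ? args[0] : "ISO-8859-1";
        System.out.println("java.nio support: " + supportedByNio(requested));
        System.out.println("actual output encoding: " + effectiveEncoding(requested));
    }
}
```

On recent Java releases the two checks usually agree, because java.io writers are themselves built on java.nio charsets.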

A list of the character encodings supported in the java.nio package can be obtained by using the command java net.sf.saxon.charcode.CharacterSetFactory, with no parameters. Java does not provide any means of determining the list of encodings supported by the java.io package.
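As an alternative to the command above, the list of java.nio encodings can also be read directly from the JDK. The sketch below uses only the standard `Charset.availableCharsets()` call; the class name `ListNioCharsets` is made up for this example, and its output is not guaranteed to match what the Saxon command prints.

```java
import java.nio.charset.Charset;
import java.util.Map;

// Hypothetical utility, not Saxon code.
final class ListNioCharsets {

    // Canonical charset names mapped to Charset objects, sorted by name.
    static Map<String, Charset> available() {
        return Charset.availableCharsets();
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Charset> entry : available().entrySet()) {
            // Print each canonical name followed by the aliases the JVM accepts for it.
            System.out.println(entry.getKey() + " " + entry.getValue().aliases());
        }
    }
}
```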

On output, character encoding is a two-stage process. First, Saxon itself has to decide whether a particular character is supported by the chosen encoding. If it is not, Saxon converts the character to a numeric character reference if it appears in a context where this would be valid; otherwise (for example, if it appears in an element name) it reports an error. Second, the character has to be converted to the appropriate sequence of bytes: this stage is delegated to the Java VM.
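The two stages can be sketched with the standard java.nio API. This is an illustration under stated assumptions, not Saxon's implementation: the class `TwoStageEncoder` and its methods are hypothetical, only text content is handled (so the error case for names is not modelled), and escaping of markup characters such as `&` and `<` is out of scope.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Saxon code.
final class TwoStageEncoder {

    // Stage 1: decide per character whether the encoding can represent it,
    // and substitute a decimal numeric character reference when it cannot.
    static String escapeUnencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            String ch = new String(Character.toChars(codePoint));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    // Stage 2: hand the now fully encodable text to the JVM for byte conversion.
    static byte[] toBytes(String text, Charset charset) {
        return escapeUnencodable(text, charset).getBytes(charset);
    }

    public static void main(String[] args) {
        // U+00E9 exists in ISO-8859-1; U+20AC (euro sign) does not.
        String escaped = escapeUnencodable("caf\u00e9 \u20ac", StandardCharsets.ISO_8859_1);
        System.out.println(escaped.endsWith("&#8364;"));
        System.out.println(toBytes("caf\u00e9 \u20ac", StandardCharsets.ISO_8859_1).length);
    }
}
```

Here `CharsetEncoder.canEncode` plays the role of asking the Java VM whether a character is encodable, which is the approach the text describes for encodings that Saxon does not handle itself.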

For the first stage, Saxon handles certain encodings itself, because this is more efficient and more reliable. If an encoding is used that is known to Java but not known to Saxon, Saxon attempts to discover from the Java VM whether particular characters are encodable or not. The encodings that Saxon recognizes directly (including synonyms) are ASCII, US-ASCII, iso-646, iso646, iso-8859-1, ISO8859_1, iso-8859-2, ISO8859_2, iso-8859-5, ISO8859_5, iso-8859-7, ISO8859_7, iso-8859-8, ISO8859_8, iso-8859-9, ISO8859_9, UTF-8, UTF8, UTF-16, UTF16, KOI8-R, Big5, SJIS, Shift_JIS, EUC_CN, GB2312, EUC-JP, EUC-KR, cp1250, windows-1250, cp1251, windows-1251, cp1252, windows-1252, cp852, windows-852.
