(recode.info)UCS-2



Universal Character Set, 2 bytes
================================

   One surface of `UCS' is usable for the subset defined by its first
sixty thousand characters (in fact, 31 * 2^11 = 63488 codes), and uses
exactly two bytes per character.  It is a mere dump of the internal
memory representation which is _natural_ for this subset and, as such,
carries endianness problems with it.
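
   As a small illustration of the two points above, the following
Python sketch checks the arithmetic and shows the same code dumped in
both byte orders.  It is not part of `recode'; it assumes that Python's
`utf-16-be' and `utf-16-le' codecs produce the same bytes as `UCS-2'
for a character in this subset.

```python
# Size of the subset: 31 blocks of 2^11 codes each.
subset_size = 31 * 2**11
print(subset_size)                      # 63488

# The same figure, seen as all 16-bit values minus one block of 2^11.
assert subset_size == 2**16 - 2**11

# One character, code 0x00E9, dumped as two bytes in either order.
ch = "\u00e9"
big_endian = ch.encode("utf-16-be")     # high order byte first
little_endian = ch.encode("utf-16-le")  # low order byte first
print(big_endian.hex())                 # 00e9
print(little_endian.hex())              # e900
```

   A reader who does not know which order was used cannot tell
`0x00E9' from `0xE900', which is the endianness problem in a nutshell.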

   A non-empty `UCS-2' file normally begins with a so-called "byte
order mark", having value `0xFEFF'.  The value `0xFFFE' is not a `UCS'
character, so if this value is seen at the beginning of a file,
`recode' reacts by swapping all pairs of bytes.  The library also
reacts properly to occurrences of `0xFEFF' or `0xFFFE' elsewhere than
at the beginning, because concatenation of `UCS-2' files should stay a
simple matter, but this might trigger a diagnostic about non-canonical
input.
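
   The reaction to a leading byte order mark can be sketched in Python
as follows.  This is only an illustration, not the `recode'
implementation: the names `swap_pairs' and `read_ucs2' are
hypothetical, input without a mark is assumed to have the high order
byte first, and marks occurring later in the data are not handled.

```python
def swap_pairs(data: bytes) -> bytes:
    """Exchange the two bytes of every 16-bit unit."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]
    swapped[1::2] = data[0::2]
    return bytes(swapped)


def read_ucs2(data: bytes) -> list:
    """Return the code values, honouring a leading byte order mark.

    Assumption: data without a mark has the high order byte first.
    """
    # Reading 0xFFFE high byte first means the file is byte-swapped.
    if data[:2] == b"\xff\xfe":
        data = swap_pairs(data)
    elif len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    codes = [int.from_bytes(data[i:i + 2], "big")
             for i in range(0, len(data), 2)]
    # The mark itself carries no text, so drop it.
    if codes and codes[0] == 0xFEFF:
        codes = codes[1:]
    return codes


print(read_ucs2(b"\xfe\xff\x00A"))   # [65]
print(read_ucs2(b"\xff\xfeA\x00"))   # [65]
```

   Both calls yield the same code, since the second input is merely
the first one with each pair of bytes exchanged.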

   By default, when producing a `UCS-2' file, `recode' always outputs
the high order byte before the low order byte.  But this can easily be
overridden through the `21-Permutation' surface (Note:
Permutations).  For example, the command:

     recode u8..u2/21 < INPUT > OUTPUT

asks for a `UTF-8' to `UCS-2' conversion, with swapped byte output.
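
   For readers who want to see the effect on the bytes themselves, here
is a rough Python approximation of such a request.  It is a sketch
under assumptions, not a description of what `recode' does internally:
the function name `utf8_to_swapped_ucs2' is hypothetical, and whether a
byte order mark is written is left to the `with_mark' parameter rather
than asserted about `recode'.

```python
def utf8_to_swapped_ucs2(data: bytes, with_mark: bool = True) -> bytes:
    """Convert UTF-8 bytes to two-byte codes, low order byte first."""
    text = data.decode("utf-8")
    # Only codes that fit in two bytes can be represented.
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("character outside the two-byte subset")
    if with_mark:
        text = "\ufeff" + text          # assumption: a mark is wanted
    units = text.encode("utf-16-be")    # high order byte first
    # Exchange the bytes of each pair, as a 21 permutation would.
    swapped = bytearray(len(units))
    swapped[0::2] = units[1::2]
    swapped[1::2] = units[0::2]
    return bytes(swapped)


print(utf8_to_swapped_ucs2(b"\xc3\xa9").hex())   # fffee900
```

   The mark comes out as the bytes `FF FE', so a reader that expects
the high order byte first sees `0xFFFE' and knows the pairs are
swapped.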

   Use `UCS-2' as a genuine charset.  This charset is available in
`recode' under the name `ISO-10646-UCS-2'.  Accepted aliases are
`UCS-2', `BMP', `rune' and `u2'.

   The `recode' library is able to combine some sequences of `UCS-2'
codes into single code characters, to represent a few diacriticized
characters, ligatures or diphthongs which have been included to ease
mapping with other existing charsets.  It is also able to explode such
single code characters into the corresponding sequence of codes.  The
request syntax for triggering such operations is rudimentary and
temporary.  The `combined-UCS-2' pseudo character set is a special form
of `UCS-2' in which known combinings have been replaced by the simpler
code.  Using `combined-UCS-2' instead of `UCS-2' in an _after_ position
of a request forces a combining step, while using `combined-UCS-2'
instead of `UCS-2' in a _before_ position of a request forces an
exploding step.  For the time being, one has to resort to advanced
request syntax to achieve other effects.  For example:

     recode u8..co,u2..u8 < INPUT > OUTPUT

copies a `UTF-8' INPUT over OUTPUT, still in `UTF-8', yet merging
combining characters into single codes whenever possible.
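
   The idea of combining and exploding can be tried out in Python with
the standard `unicodedata' module.  This is only an analogy: Unicode
normalization (NFC and NFD) is used here as a stand-in, and the set of
sequences it handles is not claimed to match the combinings known to
`recode'.  The names `combine' and `explode' are hypothetical.

```python
import unicodedata


def combine(text: str) -> str:
    """Merge base-plus-mark sequences into single codes where possible."""
    return unicodedata.normalize("NFC", text)


def explode(text: str) -> str:
    """Split single codes back into base-plus-mark sequences."""
    return unicodedata.normalize("NFD", text)


sequence = "e\u0301"            # 'e' followed by COMBINING ACUTE ACCENT
single = combine(sequence)
print([hex(ord(c)) for c in single])            # ['0xe9']
print([hex(ord(c)) for c in explode(single)])   # ['0x65', '0x301']
```

   Text with nothing to combine passes through unchanged in both
directions.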

