(recode.info)Universal


The universal charset
*********************

   Standard ISO 10646 defines a universal character set, intended to
encompass in the long run all languages written on this planet.  It is
based on wide characters, and offers room for about two billion
characters (2^31).

   This charset was to become available in `recode' under the name
`UCS', with many external surfaces for it.  But in the current version,
only surfaces of `UCS' are offered, each presented as a genuine charset
rather than a surface.  Such surfaces are only meaningful for the `UCS'
charset, so it is not that useful to draw a line between the surfaces
and the only charset to which they may apply.

   `UCS' stands for Universal Character Set.  `UCS-2' and `UCS-4' are
fixed length encodings, using two or four bytes per character
respectively.  `UTF' stands for `UCS' Transformation Format; the `UTF'
encodings are variable length encodings dedicated to `UCS'.  `UTF-1'
was based on ISO 2022, and did not succeed(1).  `UTF-2' replaced it; it
has been called `UTF-FSS' (File System Safe) in Unicode or Plan9
contexts, but is better known today as `UTF-8'.  To complete the
picture, there is `UTF-16', based on 16-bit units, and `UTF-7', which
is meant for transmissions limited to 7-bit bytes.  Most often, one
might see `UTF-8' used for external storage, and `UCS-2' used for
internal storage.
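   The difference between the fixed and variable length encodings can
be seen by counting bytes.  The following is only an illustrative
sketch in Python, not part of `recode'.  It assumes that Python's
`utf-16-be' and `utf-32-be' codecs give the same bytes as big-endian
`UCS-2' and `UCS-4' for the characters used here, which all fit in 16
bits; the helper name `encoded_sizes' is hypothetical.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed for one character in each encoding.

    Assumption: ch is a single character no greater than U+FFFF, so that
    Python's UTF-16 and UTF-32 codecs coincide with UCS-2 and UCS-4.
    """
    return {
        "UCS-2": len(ch.encode("utf-16-be")),  # always 2 bytes here
        "UCS-4": len(ch.encode("utf-32-be")),  # always 4 bytes
        "UTF-8": len(ch.encode("utf-8")),      # 1 to 3 bytes here
        "UTF-7": len(ch.encode("utf-7")),      # 7-bit safe, variable length
    }


# LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, EURO SIGN
for ch in ("A", "\u00e9", "\u20ac"):
    print("U+%04X" % ord(ch), encoded_sizes(ch))
```

   The `UCS-2' and `UCS-4' columns stay constant, while the `UTF-8'
size grows with the code point.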

   When `recode' is producing any representation of `UCS', it uses the
replacement character `U+FFFD' for any _valid_ character which is not
representable in the goal charset(2).  This happens, for example, when
`UCS-2' cannot hold a wide `UCS-4' character or, for a similar reason,
a character read from a `UTF-8' sequence using more than three bytes.
The replacement character is meant to stand for an existing character,
so it is never produced to represent an invalid sequence or ill-formed
character in the input text.  In such cases, `recode' just gets rid of
the noise, while taking note of the error in its usual ways.
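   The policy just described may be sketched as follows.  This is a
minimal Python illustration written for this text, not the code
`recode' actually uses: the function name `utf8_to_ucs2', the
big-endian output, and the boolean error flag are all assumptions made
for the example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def utf8_to_ucs2(data):
    """Convert UTF-8 bytes to big-endian UCS-2 bytes.

    Returns (output_bytes, had_errors).  Hypothetical helper that only
    mimics the policy described in the text.
    """
    # Detect ill-formed input so the error can be reported to the caller.
    try:
        data.decode("utf-8")
        had_errors = False
    except UnicodeDecodeError:
        had_errors = True

    # Invalid sequences are dropped, never turned into U+FFFD.
    text = data.decode("utf-8", errors="ignore")

    # Valid characters above U+FFFF do not fit in two bytes: replace them.
    narrowed = "".join(
        ch if ord(ch) <= 0xFFFF else REPLACEMENT for ch in text
    )
    return narrowed.encode("utf-16-be"), had_errors


# A valid four-byte UTF-8 character becomes U+FFFD (bytes FF FD).
print(utf8_to_ucs2("a\U0001F600".encode("utf-8")))
# An invalid byte is silently removed, and the error flag is set.
print(utf8_to_ucs2(b"a\xffb"))
```

   Keeping the two cases apart means that `U+FFFD' in the output always
marks a real character that could not be carried over, never garbage in
the input.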

   Even if `UTF-8' is really an encoding, it is the encoding of a
single character set, and nothing else.  It is useful to distinguish
between an encoding (a _surface_ within `recode') and a charset, but
only when the surface may be applied to several charsets.  Specifying a
charset is a bit simpler than specifying a surface in a `recode'
request.  There would be no practical advantage in imposing a more
complex syntax on `recode' users, when it is simple to treat `UTF-8' as
a charset.  Similar considerations apply to `UCS-2', `UCS-4', `UTF-16'
and `UTF-7'.  These are all considered to be charsets.

UCS-2: Universal Character Set, 2 bytes
UCS-4: Universal Character Set, 4 bytes
UTF-7: Universal Transformation Format, 7 bits
UTF-8: Universal Transformation Format, 8 bits
UTF-16: Universal Transformation Format, 16 bits
count-characters: Frequency count of characters
dump-with-names: Fully interpreted UCS dump

   ---------- Footnotes ----------

   (1) It is not probable that `recode' will ever support `UTF-1'.

   (2) This is when the goal charset allows for 16-bits.  For shorter
charsets, the `--strict' (`-s') option decides what happens: either the
character is dropped, or a reversible mapping is produced on the fly.
