The universal charset
*********************
Standard ISO 10646 defines a universal character set, intended to
encompass in the long run all languages written on this planet. It is
based on wide characters, and offers room for about two billion
characters (2^31).
This charset was to become available in `recode' under the name
`UCS', with many external surfaces for it. But in the current version,
only surfaces of `UCS' are offered, each presented as a genuine charset
rather than a surface. Such surfaces are only meaningful for the `UCS'
charset, so it is not that useful to draw a line between the surfaces
and the only charset to which they may apply.
`UCS' stands for Universal Character Set. `UCS-2' and `UCS-4' are
fixed length encodings, using two or four bytes per character
respectively. `UTF' stands for `UCS' Transformation Format; the `UTF'
encodings are variable length encodings dedicated to `UCS'. `UTF-1'
was based on ISO 2022, and did not succeed(1). `UTF-2' replaced it; it
has been called `UTF-FSS' (File System Safe) in Unicode or Plan9
context, but is better known today as `UTF-8'. To complete the
picture, there is `UTF-16', based on 16-bit units, and `UTF-7', which
is meant for transmissions limited to 7-bit bytes. Most often, one
might see `UTF-8' used for external storage, and `UCS-2' used for
internal storage.
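As a rough illustration of the difference between fixed and variable
length encodings, the following Python sketch counts the bytes each
representation needs for a few characters. It does not use `recode'
at all. Python has no codec literally named `UCS-2' or `UCS-4', so the
sketch assumes that `utf-16-be' and `utf-32-be' may stand in for them,
which holds for characters below U+10000. The names `SAMPLES',
`CODECS' and `encoded_lengths' are made up for this example.

```python
# Compare how many bytes several UCS representations need per character.
# Assumption: "utf-16-be" stands in for big-endian UCS-2 (identical for
# characters below U+10000) and "utf-32-be" for big-endian UCS-4.
SAMPLES = ["A", "\u00e9", "\u20ac"]  # LATIN A, E WITH ACUTE, EURO SIGN
CODECS = ["utf-16-be", "utf-32-be", "utf-8", "utf-7"]


def encoded_lengths(ch):
    """Return a {codec name: byte count} mapping for one character."""
    return {codec: len(ch.encode(codec)) for codec in CODECS}


if __name__ == "__main__":
    for ch in SAMPLES:
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The two fixed length columns stay at two and four bytes for every
sample, while the `UTF-8' and `UTF-7' columns grow with the code point.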
When `recode' is producing any representation of `UCS', it uses the
replacement character `U+FFFD' for any _valid_ character which is not
representable in the goal charset(2). This happens, for example, when
`UCS-2' cannot represent a wide `UCS-4' character or, for a similar
reason, a `UTF-8' sequence using more than three bytes. The
replacement character is meant to stand for an existing character, so
it is never produced to represent an invalid sequence or ill-formed
character in the input text. In such cases, `recode' just gets rid of
the noise, while taking note of the error in its usual ways.
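The two behaviours described above can be sketched in Python. This is
only an illustration of the stated policy, not the code `recode'
itself uses: the function names `to_ucs2_be' and
`decode_utf8_dropping_noise' are hypothetical, big-endian byte order
is an assumption, and the returned error flag merely stands in for
whatever error reporting `recode' performs.

```python
# Sketch of the policy: valid but unrepresentable characters become
# U+FFFD, while ill-formed input is dropped and flagged as an error.
REPLACEMENT = 0xFFFD


def to_ucs2_be(text):
    """Encode text as big-endian UCS-2 (byte order is an assumption).

    Characters above U+FFFF are valid but do not fit in two bytes,
    so each one is replaced by U+FFFD.
    """
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            code = REPLACEMENT
        out += code.to_bytes(2, "big")
    return bytes(out)


def decode_utf8_dropping_noise(data):
    """Decode UTF-8 bytes, silently discarding ill-formed sequences.

    Returns (text, had_error). No replacement character is inserted
    for the discarded bytes; the flag records that something was lost.
    """
    text = data.decode("utf-8", errors="ignore")
    had_error = text.encode("utf-8") != data
    return text, had_error


if __name__ == "__main__":
    # U+1F600 needs four bytes in UTF-8 and does not fit in UCS-2.
    print(to_ucs2_be("A\U0001F600").hex())
    # The byte 0xFF can never appear in well-formed UTF-8.
    print(decode_utf8_dropping_noise(b"ab\xffc"))
```

Keeping the two cases apart means a `U+FFFD' in the output always
marks a real character that was lost, never garbage in the input.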
Even if `UTF-8' is really an encoding, it is the encoding of a
single character set, and nothing else. It is useful to distinguish
between an encoding (a _surface_ within `recode') and a charset, but
only when the surface may be applied to several charsets. Specifying a
charset is a bit simpler than specifying a surface in a `recode'
request. There would be no practical advantage in imposing a more
complex syntax on `recode' users, when it is simple to treat
`UTF-8' as a charset. Similar considerations apply to `UCS-2',
`UCS-4', `UTF-16' and `UTF-7'. These are all considered to be charsets.
- UCS-2: Universal Character Set, 2 bytes
- UCS-4: Universal Character Set, 4 bytes
- UTF-7: Universal Transformation Format, 7 bits
- UTF-8: Universal Transformation Format, 8 bits
- UTF-16: Universal Transformation Format, 16 bits
- count-characters: Frequency count of characters
- dump-with-names: Fully interpreted UCS dump
---------- Footnotes ----------
(1) It is unlikely that `recode' will ever support `UTF-1'.
(2) This is when the goal charset allows for 16 bits. For shorter
charsets, the `--strict' (`-s') option decides what happens: either the
character is dropped, or a reversible mapping is produced on the fly.