(recode.info)UTF-8



Universal Transformation Format, 8 bits
=======================================

   Although `UTF-8' did not originate with the IETF, RFC 2279 now
describes it.  In letters sent on 1995-01-21 and 1995-04-20,
Markus Kuhn writes:

     `UTF-8' is an `ASCII' compatible multi-byte encoding of the
     ISO 10646 universal character set (`UCS').  `UCS' is a 31-bit
     superset of all other character set standards.  The first 256
     characters of `UCS' are identical to those of ISO 8859-1
     (Latin-1).  The `UCS-2' encoding of UCS is a sequence of bigendian
     16-bit words, the `UCS-4' encoding is a sequence of bigendian
     32-bit words.  The `UCS-2' subset of ISO 10646 is also known as
     "Unicode".  As both `UCS-2' and `UCS-4' require heavy
     modifications to traditional `ASCII' oriented system designs (e.g.
     Unix), the `UTF-8' encoding has been designed for these
     applications.

     In `UTF-8', only `ASCII' characters are encoded using bytes below
     128.  All other non-ASCII characters are encoded as multi-byte
     sequences consisting only of bytes in the range 128-253.  This
     avoids critical bytes like `NUL' and `/' in `UTF-8' strings, which
     makes the `UTF-8' encoding suitable for being handled by the
     standard C string library and being used in Unix file names.
     Other properties include the preserved lexical sorting order and
     that `UTF-8' allows easy self-synchronisation of software
     receiving `UTF-8' strings.
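
   The byte-range claim in the quotation can be checked with a short
Python sketch.  The sample string is an arbitrary choice, and Python's
built-in codec only produces sequences of up to four bytes, so this
illustrates the property on a few characters rather than proving it
for all of `UCS'.

```python
# Sketch: ASCII characters keep their one-byte value, and every byte of
# a non-ASCII character is 128 or more.
text = "caf\u00e9/\u20ac"   # U+00E9 and U+20AC are non-ASCII; '/' is ASCII
encoded = text.encode("utf-8")

for ch in text:
    raw = ch.encode("utf-8")
    if ord(ch) < 128:
        # ASCII is invariant: one byte, same value.
        assert len(raw) == 1 and raw[0] == ord(ch)
    else:
        # No byte of a multi-byte character can look like NUL or '/'.
        assert all(b >= 128 for b in raw)

slash_count = encoded.count(b"/")    # only the real '/' is found
nul_count = encoded.count(b"\x00")   # no stray NUL bytes appear
```

   Because of this, a byte-oriented search for `/' or `NUL' in the
encoded string only ever finds genuine `ASCII' characters.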

   `UTF-8' is the most common external surface of `UCS'.  Each character
uses from one to six bytes, and the encoding can represent all 2^31
characters of the `UCS'.  It is implemented as a charset, with the
following properties:

   * Strict 7-bit `ASCII' is completely invariant under `UTF-8', and
     those are the only one-byte characters.  `UCS' values and `ASCII'
     values coincide.  No multi-byte characters ever contain bytes less
     than 128.  `NUL' _is_ `NUL'.  A multi-byte character always starts
     with a byte of 192 or more, and is always followed by one or more
     bytes, each between 128 and 191.  That means that you may start
     reading at an arbitrary position on disk or in memory, and easily
     discover the start of the current, next or previous character.
     You can count, skip or extract characters with this knowledge
     alone.

   * If you read the first byte of a multi-byte character in binary, it
     begins with a run of at least two consecutive `1' bits, starting
     at the most significant bit (from the left).  The length of this
     run equals the number of bytes in the character.  All succeeding
     bytes start with `10'.  This is a lot of redundancy, making it
     fairly easy to guess that a file is valid `UTF-8', or to safely
     state that it is not.

   * If you remove all leading `1' bits from the first byte of a
     multi-byte character, and the initial `10' bits from each of the
     remaining bytes (so keeping 6 bits per byte for those), the
     remaining bits, concatenated, are the `UCS' value.
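
   The three properties above can be sketched in Python as follows.
The function names are hypothetical and are not part of `recode'.  No
validation is done: the sketch assumes well-formed input, and
`decode_at' assumes that its offset is the start of a character.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx, that is, 128 to 191."""
    return (byte & 0xC0) == 0x80

def char_start(data, pos):
    """Back up from an arbitrary offset to the first byte of its character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos

def sequence_length(first):
    """Length in bytes of the character whose first byte is given."""
    if first < 0x80:
        return 1            # plain ASCII
    n = 0
    mask = 0x80
    while first & mask:     # count the run of leading 1 bits
        n += 1
        mask >>= 1
    return n

def decode_at(data, pos):
    """Return (UCS value, byte length) for the character starting at pos."""
    first = data[pos]
    n = sequence_length(first)
    if n == 1:
        return first, 1
    # Drop the n leading 1 bits and the 0 bit that follows them.
    value = first & (0xFF >> (n + 1))
    for b in data[pos + 1:pos + n]:
        # Drop the leading 10 and keep the six payload bits.
        value = (value << 6) | (b & 0x3F)
    return value, n

# 'a', U+00E9 and U+20AC occupy 1, 2 and 3 bytes respectively.
data = "a\u00e9\u20ac".encode("utf-8")
```

   With this sample, `char_start(data, 5)' backs up to offset 3, and
`decode_at(data, 3)' yields the value 0x20AC with a length of 3.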

These properties also have a few nice consequences:

   * Conversion to/from values is algorithmically simple, and
     reasonably speedy.

   * A sequence of N bytes can hold characters needing up to 1 + 5N
     bits in their `UCS' representation when N is between 2 and 6; a
     single byte holds 7 bits.  That gives 7, 11, 16, 21, 26 and 31
     bits for one to six bytes.  So, `UTF-8' is most economical when
     mapping ASCII (1 byte), followed by `UCS-2' (1 to 3 bytes) and
     `UCS-4' (1 to 6 bytes).

   * The lexicographic sorting order of `UCS' strings is preserved.

   * Bytes with value 254 or 255 never appear, and because of that,
     these are sometimes used when escape mechanisms are needed.
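
   These consequences can be checked with a small encoder.  What
follows is only a sketch of the one-to-six-byte scheme described here;
`encode_ucs' and `LIMITS' are hypothetical names, and the code is
hand-written because Python's built-in codec stops at four bytes.

```python
# Exclusive upper limits for values that fit in 1, 2, ... 6 bytes.
LIMITS = [1 << 7, 1 << 11, 1 << 16, 1 << 21, 1 << 26, 1 << 31]

def encode_ucs(value):
    """Encode one UCS value (0 <= value < 2**31) as one to six bytes."""
    if not 0 <= value < (1 << 31):
        raise ValueError("value outside the 31-bit UCS range")
    n = next(i + 1 for i, limit in enumerate(LIMITS) if value < limit)
    if n == 1:
        return bytes([value])
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (value & 0x3F))   # 10xxxxxx, lowest six bits
        value >>= 6
    # First byte: n leading 1 bits, a 0 bit, then the remaining high bits.
    lead = ((0xFF << (8 - n)) & 0xFF) | value
    return bytes([lead] + tail[::-1])

def encode_string(values):
    """Encode a sequence of UCS values as one byte string."""
    return b"".join(encode_ucs(v) for v in values)

# Strings given as lists of UCS values, sorted two ways.
strings = [[0x41, 0x20AC], [0x41, 0xE9], [0xE9], [0x10000], [0x41]]
by_value = sorted(strings)
by_bytes = sorted(strings, key=encode_string)

# The largest value of each length, to look for bytes 254 and 255.
extremes = [encode_ucs(limit - 1) for limit in LIMITS]
highest_byte = max(b for seq in extremes for b in seq)
```

   Sorting by `UCS' values and sorting by encoded bytes give the same
order for this sample, and the highest byte produced is 253.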

   In some cases, when little processing is done on a lot of strings,
one may choose for efficiency reasons to handle `UTF-8' strings
directly, even though characters have variable length, as it is easy
to find the start of each character.  Character insertion or
replacement might require moving the remainder of the string in either
direction.  In most cases, it is faster and easier to convert from
`UTF-8' to `UCS-2' or `UCS-4' prior to processing.
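
   The trade-off can be illustrated with a sketch in Python.  The
helper names are hypothetical, and Python's `utf-32-be' codec is used
here as a stand-in for `UCS-4', which is an assumption of this
example rather than something `recode' does.

```python
def count_chars(data):
    """Count characters without decoding: each non-continuation byte starts one."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def nth_char_offset(data, n):
    """Byte offset of character number n (0-based); needs a linear scan."""
    seen = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:
            seen += 1
            if seen == n:
                return offset
    raise IndexError("string has fewer characters than requested")

# Eight characters: n, a, U+00EF, v, e, space, U+20AC, 5.
data = "na\u00efve \u20ac5".encode("utf-8")

# Working directly on the bytes: counting is one pass, indexing is a scan.
total = count_chars(data)
euro_offset = nth_char_offset(data, 6)

# After conversion to a fixed width, character 6 sits at byte 4 * 6.
fixed = data.decode("utf-8").encode("utf-32-be")
euro_fixed = fixed[4 * 6:4 * 7]
```

   Counting or skipping works well on the variable-length form, while
random access by character index is where the fixed-width form pays
off.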

   This charset is available in `recode' under the name `UTF-8'.
Accepted aliases are `UTF-2', `UTF-FSS', `FSS_UTF', `TF-8' and `u8'.

