6.1 What is an Encoding
=======================

This section is taken from the web pages of Alis Technologies Inc.
(http://www.alis.com/).

   Document encoding is the most important but also the most sensitive
and explosive topic in Internet internationalization.  It is an
essential factor since most of the information distributed over the
Internet is in text format.  But the history of the Internet is such
that the predominant - and in some cases the only possible - encoding is
the very limited ASCII, which can represent only a handful of languages,
only three of which are used to any great extent: English, Indonesian
and Swahili.

   All the other languages, spoken by more than 90% of the world's
population, must fall back on other character sets.  And there is a
plethora of them, created over the years to satisfy writing constraints
and constantly changing technological limitations.  The ISO
international character set registry contains only a small fraction of
them; IBM's character registry is over three centimeters thick;
Microsoft and Apple each have a number of their own, as do other
software manufacturers and publishers.

   The problem is not that there are too few but rather too many
choices, at least whenever Internet standards allow them.  And the
surplus is a real problem: if every Arabic-speaking user chose
independently among the three dozen or so codes available for that
language, there would be little likelihood that a "neighbor" would make
the same choice, and thus little likelihood that the two would be able
to understand each other.  This example is rather
extreme, but it does illustrate the importance of standards in the area
of internationalization.  For a group of users sharing the same language
to be able to communicate:

  1. the code used in the shared document must always be identified
     (labeling);

  2. they must agree on a small number of codes - only one, if possible
     (standards);

  3. their software must recognize and process all codes (versatility).
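
   The labeling requirement can be illustrated with a minimal Python
sketch.  The sample word and the two Arabic codes used here (ISO 8859-6
and Windows-1256, under the names Python's standard codec registry gives
them) are assumptions chosen for illustration; they do not come from the
original text.

```python
# Illustrative sketch: the same bytes read under two different labels.
# The sample word and both codec names are assumptions for this example.

def decode_as(raw: bytes, label: str) -> str:
    """Decode raw bytes using the encoding named by the label."""
    return raw.decode(label)

# An Arabic word, written with escapes so the source stays pure ASCII.
text = "\u0633\u0644\u0627\u0645"

# The sender encodes with ISO 8859-6: one byte per character.
data = text.encode("iso8859_6")

# A receiver that knows the correct label recovers the original text.
right = decode_as(data, "iso8859_6")

# A receiver that assumes another Arabic code gets different letters,
# and no error is raised to warn about it.
wrong = decode_as(data, "cp1256")

print(right == text)
print(wrong == text)
```

   Decoding under the wrong label does not fail; it silently yields
other letters, which is why the code must always be identified.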

   Certain character sets stand out either because of their status as an
official national or international standard, or simply because of their
widespread use.

   First, there is the ISO 8859 series of standards, which defines a
dozen character sets that are useful for a large number of languages
using the Latin, Cyrillic, Arabic, Greek and Hebrew alphabets.  These
standards have a limited range of application (8 bits per character, a
maximum of 190 characters, no combining) but where they suffice (as they
do for 10 of the 20 most widely used languages), they should be used on
the Internet in preference to other codes.  For all other languages,
national standards should preferably be chosen or, if none are
available, a well-known and widely-used code should be the second
choice.
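
   As a rough sketch of the "8 bits per character" limitation, the
following Python fragment checks which sample texts fit into which code.
The helper name `fits_in` and the sample strings are hypothetical; the
codec names are those of Python's standard codec registry.

```python
# Illustrative sketch: which texts fit into which 8-bit (or 7-bit) code.
# `fits_in` and the sample strings are hypothetical, for illustration.

def fits_in(text: str, codec: str) -> bool:
    """Return True if every character of text exists in the codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

french = "d\u00e9j\u00e0 vu"        # Latin letters with accents
russian = "\u043c\u0438\u0440"      # Cyrillic letters

# ISO 8859-1 stores each character in exactly one byte.
latin1_bytes = french.encode("iso8859_1")
print(len(latin1_bytes) == len(french))

print(fits_in(french, "ascii"))        # accented letters are not in ASCII
print(fits_in(french, "iso8859_1"))    # but they are in ISO 8859-1
print(fits_in(russian, "iso8859_1"))   # Cyrillic is not in ISO 8859-1
print(fits_in(russian, "iso8859_5"))   # Cyrillic is in ISO 8859-5
```

   Each part of the series covers its own alphabets, so a text mixing
French and Russian fits in neither ISO 8859-1 nor ISO 8859-5 alone.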

   Even when we limit ourselves to the most widely used standards, the
overabundance remains considerable, and this significantly complicates
life for truly international software developers and users of several
languages, especially when such languages can only be represented by a
single code.  It was to resolve this problem that both Unicode and the
ISO 10646 international standard were created.  Two standards?  Oh no!
Their designers soon realized the problem and were able to cooperate to
the extent of making the character set "repertoires" and coding
identical.

   ISO 10646 (and Unicode) contain over 30,000 characters capable of
representing most of the living languages within a single code.  All of
these characters, except for the _Han_ (Chinese characters also used in
Japanese and Korean), have a name.  And there is still room to encode
the missing languages as soon as enough of the necessary research is
done.  Unicode can be used to represent several languages, using
different alphabets, within the same electronic document.
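
   A minimal Python sketch of several alphabets sharing one document
follows.  The sample letters are assumptions chosen for illustration,
and UTF-8 is used only as one possible byte form of Unicode; the text
above does not mention it.

```python
# Illustrative sketch: letters from several alphabets in one document.
# The sample letters and the use of UTF-8 are assumptions.
import unicodedata

samples = {
    "Latin": "a",
    "Greek": "\u03b1",
    "Cyrillic": "\u044f",
    "Arabic": "\u0628",
    "Hebrew": "\u05d0",
}

# One string holds all five alphabets at once.
document = " ".join(samples.values())

# The whole document round-trips through a single encoding.
encoded = document.encode("utf-8")
print(encoded.decode("utf-8") == document)

# Each of these characters carries a name in the standard.
names = {script: unicodedata.name(ch) for script, ch in samples.items()}
for script, name in names.items():
    print(script, "->", name)
```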