(recode.info)HTML



World Wide Web representations
==============================

   Character entities have been introduced by SGML and made widely
popular through HTML, the markup language in use for the World Wide
Web, or Web or WWW for short.  For representing _unusual_ characters,
HTML texts use special sequences, beginning with an ampersand `&' and
ending with a semicolon `;'.  The sequence may itself start with a
number sign `#' and be followed by digits, so forming a "numeric
character reference", or else be an alphabetic identifier, so forming a
"character entity reference".
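   The two kinds of sequences can be compared with a short Python
sketch.  It relies on Python's standard `html' module rather than on
`recode', so it only illustrates the notation; the sample character
(e with acute accent, U+00E9) is an arbitrary choice.

```python
import html

# Two spellings of the same character, U+00E9 (e with acute accent).
numeric_ref = html.unescape("&#233;")    # numeric character reference
entity_ref = html.unescape("&eacute;")   # character entity reference

# Both sequences decode to the same single character.
print(numeric_ref == entity_ref == "\u00e9")
```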

   The HTML standards have been revised into different HTML levels over
time, and the list of allowable character entities differs among them.
The later XML, meant to simplify many things, has an option
(`standalone=yes') which greatly restricts that list.  The `recode'
library is able to convert character references between their mnemonic
form and their numeric form, depending on the targeted HTML standard level.
It also can, of course, convert between HTML and various other charsets.

   Here is a list of those HTML variants which `recode' supports.  Some
notes have been provided by François Yergeau <yergeau@alis.com>.

`XML-standalone'
     This charset is available in `recode' under the name
     `XML-standalone', with `h0' as an acceptable alias.  It is
     documented in section 4.1 of `http://www.w3.org/TR/REC-xml'.  It
     only knows `&amp;', `&gt;', `&lt;', `&quot;' and `&apos;'.

`HTML_1.1'
     This charset is available in `recode' under the name `HTML_1.1',
     with `h1' as an acceptable alias.  HTML 1.0 was never really
     documented.

`HTML_2.0'
     This charset is available in `recode' under the name `HTML_2.0',
     and has `RFC1866', `1866' and `h2' for aliases.  HTML 2.0 entities
     are listed in RFC 1866.  Basically, there is an entity for each
     _alphabetical_ character in the right part of ISO 8859-1.  In
     addition, there are four entities for syntax-significant ASCII
     characters: `&amp;', `&gt;', `&lt;' and `&quot;'.

`HTML-i18n'
     This charset is available in `recode' under the name `HTML-i18n',
     and has `RFC2070' and `2070' for aliases.  RFC 2070 added entities
     to cover the whole right part of ISO 8859-1.  The list is
     conveniently accessible at
     `http://www.alis.com:8085/ietf/html/html-latin1.sgml'.  In
     addition, four i18n-related entities were added: `&zwnj;'
     (`&#8204;'), `&zwj;' (`&#8205;'), `&lrm;' (`&#8206;') and `&rlm;'
     (`&#8207;').

`HTML_3.2'
     This charset is available in `recode' under the name `HTML_3.2',
     with `h3' as an acceptable alias.  HTML 3.2
     (http://www.w3.org/TR/REC-html32.html) took up the full Latin-1
     list but not the i18n-related entities from RFC 2070.

`HTML_4.0'
     This charset is available in `recode' under the name `HTML_4.0',
     and has `h4' and `h' for aliases.  Beware that the particular
     alias `h' is not _tied_ to HTML 4.0, but to the highest HTML level
     supported by `recode'; so it might later represent HTML level 5 if
     this is ever created.  HTML 4.0 (http://www.w3.org/TR/REC-html40/)
     has the whole Latin-1 list, a set of entities for symbols,
     mathematical symbols, and Greek letters, and another set for
     markup-significant and internationalization characters comprising
     the 4 ASCII entities, the 4 i18n-related from RFC 2070 plus some
     more.  See `http://www.w3.org/TR/REC-html40/sgml/entities.html'.

   Printable characters from Latin-1 may be used directly in an HTML
text.  However, partly because people have deficient keyboards, partly
because people want to transmit HTML texts over channels that are not
8-bit clean while not using MIME, it is common (yet debatable) to use
character entity references even for Latin-1 characters, when they fall
outside ASCII (that is, when they have the 8th bit set).

   When you recode from another charset to `HTML', beware that all
occurrences of double quotes, ampersands, and left or right angle
brackets are translated into special sequences.  However, in practice,
people often use ampersands and angle brackets in the other charset for
introducing HTML commands, compromising it: it is not pure HTML, nor is
it pure other charset.  As these particular translations can be rather
inconvenient, they may be specifically inhibited through the command
option `-d' (Note: Mixed).
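   The difference between the two situations can be sketched in Python.
This is only an illustration using the standard library, not a
description of how `recode' or its `-d' option is implemented; the
sample strings are made up, and the second conversion emits numeric
references where `recode' would prefer entity names.

```python
import html

# Pure text in another charset: every markup-significant character
# (double quote, ampersand, angle brackets) becomes a special sequence.
source = 'x < y & z > "w"'
fully_escaped = html.escape(source, quote=True)

# Mixed text: HTML commands are already present, so only the non-ASCII
# character is rewritten and the angle brackets are left alone.
mixed = "<b>caf\u00e9</b>"
markup_kept = mixed.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(fully_escaped)   # x &lt; y &amp; z &gt; &quot;w&quot;
print(markup_kept)     # <b>caf&#233;</b>
```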

   Codes not having a mnemonic entity are output by `recode' using the
`&#NNN;' notation, where NNN is a decimal representation of the UCS
code value.  When there is an entity name for a character, it is always
preferred over a numeric character reference.  ASCII printable
characters are always generated directly.  So is the newline.  While
reading HTML, `recode' supports numeric character references as alternate
writings, even when written as hexadecimal numbers, as in `&#xfffd;'.
This is documented in:

     http://www.w3.org/TR/REC-html40/intro/sgmltut.html#h-3.2.3
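   The output rules above may be summarised by the following Python
sketch.  It is an approximation under stated assumptions: the function
name `reference_for' is hypothetical, and Python's table of HTML 4
entity names stands in for whatever table `recode' uses internally.

```python
import html
from html.entities import codepoint2name

def reference_for(ch):
    """Return the output spelling of one character (illustrative only)."""
    cp = ord(ch)
    if cp in codepoint2name:
        # An entity name exists: prefer it over a numeric reference.
        return "&%s;" % codepoint2name[cp]
    if cp < 128:
        # Remaining ASCII characters, including the newline, go out as is.
        return ch
    # No mnemonic entity: decimal representation of the code value.
    return "&#%d;" % cp

print(reference_for("\u00e9"))   # &eacute;
print(reference_for("A"))        # A
print(reference_for("\u0416"))   # &#1046;

# On input, decimal and hexadecimal references are both understood.
print(html.unescape("&#1046;") == html.unescape("&#x416;"))   # True
```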

   When `recode' translates to HTML, the translation occurs according to
the HTML level as selected by the goal charset.  When translating _from_
HTML, `recode' not only accepts the character entity references known at
that level, but also those of all other levels, as well as a few
alternative special sequences, to be forgiving to files using other
HTML standards.
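   To make the asymmetry concrete, here is a Python sketch in which the
output depends on a per-level entity table while the input side accepts
anything.  The two tables are assumptions made for illustration:
`XML_STANDALONE' lists the five entities named earlier for that
charset, and `HTML_4' borrows Python's own HTML 4 table.  Neither is
taken from the `recode' sources.

```python
import html
from html.entities import codepoint2name

# Entity names known at each level (code point -> name).
XML_STANDALONE = {38: "amp", 62: "gt", 60: "lt", 34: "quot", 39: "apos"}
HTML_4 = dict(codepoint2name)

def encode(text, names):
    """Write text using only the entity names known at the chosen level."""
    table = {cp: "&%s;" % name for cp, name in names.items()}
    named = text.translate(table)
    # Whatever is still outside ASCII becomes a numeric reference.
    return named.encode("ascii", "xmlcharrefreplace").decode("ascii")

text = "caf\u00e9 & 'co'"
as_xml = encode(text, XML_STANDALONE)   # caf&#233; &amp; &apos;co&apos;
as_html4 = encode(text, HTML_4)         # caf&eacute; &amp; 'co'

# Reading is forgiving: both spellings decode to the same text.
print(html.unescape(as_xml) == html.unescape(as_html4) == text)
```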

   The `recode' program can be used to _normalise_ an HTML file using
oldish conventions.  For example, it accepts `&AE;', as this once was a
valid writing, somewhere.  However, it should always produce `&AElig;'
instead of `&AE;'.  Yet, this is not completely true.  If one does:

     recode h3..h3 < INPUT

the operation will be optimised into a mere copy, and you can get `&AE;'
this way, if you had some in your input file.  But if you explicitly
defeat the optimisation, like this maybe:

     recode h3..u2,u2..h3 < INPUT

then `&AE;' should be normalised into `&AElig;' by the operation.
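   The effect of such a normalising pass can be imitated in Python.
The sketch below is not how `recode' works internally; the function
`normalise' and the table `LEGACY_ALIASES' are hypothetical, and only
the `&AE;' entry comes from the text above.

```python
import re

# Obsolete spellings and their current equivalents (illustrative table).
LEGACY_ALIASES = {"AE": "AElig"}

def normalise(text):
    """Rewrite obsolete entity names, leaving every other reference alone."""
    def fix(match):
        name = match.group(1)
        return "&%s;" % LEGACY_ALIASES.get(name, name)
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", fix, text)

print(normalise("&AE;sop &amp; &AElig;"))   # &AElig;sop &amp; &AElig;
```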

