(recode.info)Mixed



Using mixed charset input
=========================

   In practice, text files often mix several charsets: some parts of
the file are encoded in one charset, other parts in another.  A file
rarely switches between more than two or three charsets, but there is
not always a reliable way to tell which charset is used at a given
place.  The `recode' program is able to handle only a few simple
cases of mixed input.

   The default `recode' behaviour is to expect pure charset files, to
be recoded as other pure charset files.  However, the following options
allow for a few precise kinds of mixed charset files.

`-d'
`--diacritics'
     While converting to or from one of `HTML' or `LaTeX' charset,
     limit conversion to some subset of all characters.  For `HTML',
     limit conversion to the subset of all non-ASCII characters.  For
     `LaTeX', limit conversion to the subset of all non-English
     letters.  This is particularly useful for files that would be
     valid `HTML', TeX or LaTeX if only their authors had used the
     provided sequences for applying diacritics, instead of typing
     the diacriticised characters directly from the underlying
     character set.

     While converting to the `HTML' or `LaTeX' charset, this option
     assumes that characters outside the said subset are already
     properly coded or protected, so `recode' transmits them
     literally.  While converting the other way, this option prevents
     translating back the coded or protected versions of characters
     outside the said subset.  See the `HTML' and `LaTeX' nodes.
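
     The subset rule for `HTML' can be pictured with a short Python
     sketch.  This is not how `recode' is implemented: the function
     name `encode_non_ascii_as_html' is made up for illustration, and
     the entity table comes from Python's standard `html.entities'
     module, not from `recode'.  Only non-ASCII characters are
     rewritten, so markup and entities already present are passed
     through literally.

```python
from html.entities import codepoint2name


def encode_non_ascii_as_html(text: str) -> str:
    """Replace only non-ASCII characters with HTML entities.

    Hypothetical helper: ASCII characters, including `<', `>' and `&',
    are assumed to be properly coded already and are copied as-is.
    """
    pieces = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Inside the "already protected" part: copy literally.
            pieces.append(ch)
        elif cp in codepoint2name:
            # Named entity known to Python's table, e.g. eacute.
            pieces.append(f"&{codepoint2name[cp]};")
        else:
            # Fall back to a numeric character reference.
            pieces.append(f"&#{cp};")
    return "".join(pieces)


print(encode_non_ascii_as_html("<b>café</b> &amp; thé"))
# prints: <b>caf&eacute;</b> &amp; th&eacute;
```

     A full conversion to `HTML' would also have to escape `<' and
     `&'; leaving them alone is the whole point of limiting the
     conversion to a subset.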

`-S[LANGUAGE]'
`--source[=LANGUAGE]'
     The bulk of the input file is expected to be written in `ASCII',
     except for some parts, like comments and string constants, which
     are written using a charset other than `ASCII'.  When LANGUAGE is
     `c', only the contents of comments and strings are recoded, while
     everything else is copied without recoding.  When LANGUAGE is
     `po', only translator comments (those having whitespace
     immediately following the initial `#') and the contents of
     `msgstr' strings are recoded.

     For the above things to work, the non-`ASCII' encoding of the
     comment or string should be such that an `ASCII' scan will
     successfully find where the comment or string ends.
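
     The requirement can be made concrete with a deliberately
     simplified Python sketch of such an `ASCII' scan.  It is an
     illustration only, not the scanner used by `recode': the name
     `recode_c_source' and the default charsets are assumptions, and
     the sketch ignores character constants, `//' comments and other
     details of C.  Regions are located purely by looking for the
     `ASCII' bytes `/*', `*/', `"' and `\'.

```python
def recode_c_source(data: bytes, src: str = "latin-1", dst: str = "utf-8") -> bytes:
    """Recode only /* ... */ comments and "..." strings (hypothetical sketch)."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        if data.startswith(b"/*", i):
            # Comment: runs up to and including the closing */.
            end = data.find(b"*/", i + 2)
            end = n if end < 0 else end + 2
        elif data[i:i + 1] == b'"':
            # String: runs up to the next unescaped double quote.
            end = i + 1
            while end < n and data[end:end + 1] != b'"':
                # A backslash escapes the byte that follows it.
                end += 2 if data[end:end + 1] == b"\\" else 1
            end = min(end + 1, n)
        else:
            # Plain program text: copied byte for byte.
            out.append(data[i])
            i += 1
            continue
        out += data[i:end].decode(src).encode(dst)
        i = end
    return bytes(out)


sample = b'/* caf\xe9 */ s = "a\\"\xe9"; x\n'
print(recode_c_source(sample))
```

     The scan works here because Latin-1 never uses the bytes for `"',
     `\' or `*/' inside another character.  An encoding whose
     multibyte characters may contain such bytes (the second byte of
     some Shift_JIS characters equals the `ASCII' backslash, for
     instance) would mislead a scan of this kind.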

     Although `ASCII' is the usual charset for writing programs, some
     compilers are able to read other charsets directly, such as
     `UTF-8'.  There is currently no provision in `recode' for reading
     mixed charset sources which are not based on `ASCII'.  The need
     for mixed recoding is probably less pressing in such cases.

     For example, after one does:

          recode -Spo pc/..u8 < INPUT.po > OUTPUT.po

     file `OUTPUT.po' holds a copy of `INPUT.po' in which _only_
     translator comments and the contents of `msgstr' strings have been
     recoded from the `IBM-PC' charset to pure `UTF-8', without
     attempting conversion of end-of-lines.  Machine-generated comments
     and original `msgid' strings are left untouched by this
     recoding.

     If LANGUAGE is not specified, `c' is assumed.
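
     The selection rule described for `po' files can be approximated
     by the following Python sketch.  It only mirrors the description
     above and is not taken from the `recode' sources: the function
     name `recode_po' is hypothetical, `cp437' is assumed to stand in
     for the `IBM-PC' charset, and the handling of continuation lines
     is a guess at reasonable behaviour.  Lines are split with their
     terminators kept, so end-of-lines are not converted.

```python
def recode_po(data: bytes, src: str = "cp437", dst: str = "utf-8") -> bytes:
    """Recode translator comments and msgstr strings only (hypothetical sketch)."""
    out = []
    in_msgstr = False
    for line in data.splitlines(keepends=True):
        if line.startswith(b"msgstr"):
            in_msgstr = True
        elif not line.startswith(b'"'):
            # Anything other than a quoted continuation line ends the msgstr.
            in_msgstr = False
        # Translator comments have whitespace right after the initial '#'.
        translator_comment = line[:1] == b"#" and line[1:2].isspace()
        if in_msgstr or translator_comment:
            line = line.decode(src).encode(dst)
        out.append(line)
    return b"".join(out)


sample = (
    b'#. auto \x82\n'        # machine-generated comment: copied
    b'# note \x82\n'         # translator comment: recoded
    b'msgid "caf\x82"\n'     # original string: copied
    b'msgstr "caf\x82"\n'    # translation: recoded
    b'"t\x82"\n'             # continuation of msgstr: recoded
)
print(recode_po(sample))
```

     In `cp437' the byte 0x82 is the letter e with acute accent, so
     the recoded lines carry the two-byte `UTF-8' sequence 0xC3 0xA9
     in its place, while the `#.' comment and the `msgid' line keep
     the original byte.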

