Using mixed charset input
=========================
In practice, text files often mix several charsets. Some parts of
the file are encoded in one charset, other parts in another, and so
forth. Usually, a file does not switch between more than two or three
charsets. The means to tell which charset is used at a given place is
not always available. The `recode' program is able to handle only a
few simple cases of mixed input.
The default `recode' behaviour is to expect pure charset files, to
be recoded as other pure charset files. However, the following options
allow for a few precise kinds of mixed charset files.
`-d'
`--diacritics'
While converting to or from the `HTML' or `LaTeX' charset, limit
conversion to a subset of all characters. For `HTML', the subset
is all non-ASCII characters. For `LaTeX', it is all non-English
letters. This is particularly useful when people write files that
would be valid `HTML', TeX or LaTeX if only they used the provided
sequences for applying diacritics, instead of typing the
diacriticised characters directly from the underlying character
set.
While converting to the `HTML' or `LaTeX' charset, this option
assumes that characters not in the said subset are properly coded
or protected already, so `recode' transmits them literally. While
converting the other way, this option prevents translating back
coded or protected versions of characters not in the said subset.
See the `HTML' and `LaTeX' nodes.
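The effect of `-d' in the `HTML' direction can be pictured with the
following Python sketch. It only illustrates the selection rule
described above and is not how `recode' is implemented. The function
name `to_html_diacritics_only' is hypothetical, and the entity names
come from Python's standard `html.entities' table, which may differ in
detail from the table `recode' uses.

```python
from html.entities import codepoint2name


def to_html_diacritics_only(text: str) -> str:
    """Replace non-ASCII characters by HTML entities; copy ASCII as is."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # ASCII, including '<', '>' and '&', is assumed to be
            # properly coded already and is transmitted literally.
            pieces.append(ch)
        elif code in codepoint2name:
            pieces.append("&" + codepoint2name[code] + ";")
        else:
            # No named entity available: use a numeric reference.
            pieces.append("&#%d;" % code)
    return "".join(pieces)


print(to_html_diacritics_only("<p>café &amp; thé</p>"))
# prints: <p>caf&eacute; &amp; th&eacute;</p>
```

The existing markup and the already protected `&amp;' survive
unchanged, which is the point of restricting the conversion.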
`-S[LANGUAGE]'
`--source[=LANGUAGE]'
The bulk of the input file is expected to be written in `ASCII',
except for parts, like comments and string constants, which are
written in a charset other than `ASCII'. When LANGUAGE is `c',
only the contents of comments and strings are recoded, while
everything else is copied without recoding. When LANGUAGE is `po',
recoding applies only to translator comments (those having
whitespace immediately following the initial `#') and to the
contents of `msgstr' strings.
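The `c' selection rule can be pictured with the Python sketch below.
It is an illustration under stated assumptions, not the scanner
`recode' actually uses: the function name `recode_c_source' is
hypothetical, `//' comments and character constants are treated like
`/* */' comments and strings, and Latin-1 to UTF-8 is an arbitrary
choice of charsets.

```python
def recode_c_source(data: bytes, src: str = "latin-1",
                    dst: str = "utf-8") -> bytes:
    """Recode only comments and quoted constants of a C source."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        if data.startswith(b"/*", i):
            end = data.find(b"*/", i + 2)
            end = n if end < 0 else end + 2
        elif data.startswith(b"//", i):
            end = data.find(b"\n", i)
            end = n if end < 0 else end
        elif data[i:i + 1] in (b'"', b"'"):
            quote = data[i:i + 1]
            end = i + 1
            while end < n and data[end:end + 1] != quote:
                # A backslash escapes the next byte, e.g. \" in a string.
                end += 2 if data[end:end + 1] == b"\\" else 1
            end = min(end + 1, n)
        else:
            # Ordinary program text: copied without recoding.
            out += data[i:i + 1]
            i += 1
            continue
        # Comment or quoted constant: recode this span only.
        out += data[i:end].decode(src).encode(dst)
        i = end
    return bytes(out)


sample = b'int n = 0; /* caf\xe9 */ char *s = "th\xe9";\n'
print(recode_c_source(sample))
```

The scan for `*/' and for the closing quote works on raw bytes, so it
relies on the embedded charset never producing those `ASCII' bytes by
accident.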
For the above things to work, the non-`ASCII' encoding of the
comment or string should be such that an `ASCII' scan will
successfully find where the comment or string ends.
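The requirement above can be checked on a few bytes. The Python
snippet below is only an illustration; the sample strings are
arbitrary. It shows that Latin-1 and UTF-8 never produce the `ASCII'
quote byte inside a non-`ASCII' character, while UTF-16 can.

```python
QUOTE = b'"'  # the byte an ASCII scan looks for to end a string

# In Latin-1 and UTF-8, non-ASCII characters never produce ASCII
# bytes, so the only quote bytes are the two real delimiters.
for charset in ("latin-1", "utf-8"):
    encoded = '"é"'.encode(charset)
    print(charset, encoded.count(QUOTE))

# In little-endian UTF-16, U+2022 (bullet) is stored as the bytes
# 0x22 0x20, so an ASCII scan would see a spurious closing quote.
bullet = "\u2022".encode("utf-16-le")
print("utf-16-le", bullet, QUOTE in bullet)
```

An encoding of the second kind would make the scan stop too early,
which is why it cannot be used for the embedded parts.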
Even if `ASCII' is the usual charset for writing programs, some
compilers are able to read other charsets directly, such as
`UTF-8'. There is currently no provision in `recode' for reading
mixed charset sources which are not based on `ASCII'. The need for
mixed recoding is probably less pressing in such cases.
For example, after one does:
    recode -Spo pc/..u8 < INPUT.po > OUTPUT.po
file `OUTPUT.po' holds a copy of `INPUT.po' in which _only_
translator comments and the contents of `msgstr' strings have been
recoded from the `IBM-PC' charset to pure `UTF-8', without
attempting conversion of end-of-lines. Machine-generated comments
and the original `msgid' strings are left untouched by this
recoding.
If LANGUAGE is not specified, `c' is assumed.
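The `po' selection rule described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
code `recode' runs: the function name `recode_po' is hypothetical,
Python's `cp437' codec stands in for `IBM-PC', and the PO parsing is
reduced to line prefixes.

```python
def recode_po(data: bytes, src: str = "cp437",
              dst: str = "utf-8") -> bytes:
    """Recode translator comments and msgstr strings of a PO file."""
    out = []
    in_msgstr = False
    for line in data.split(b"\n"):
        if line.startswith(b"msgstr"):
            in_msgstr = True
        elif line.startswith((b"msgid", b"msgctxt", b"#")) or not line.strip():
            in_msgstr = False
        # Otherwise the line continues the current entry, for example
        # a quoted continuation line of a multi-line msgstr.
        translator_comment = (line[:1] == b"#"
                              and line[1:2] in (b" ", b"\t"))
        if in_msgstr or translator_comment:
            line = line.decode(src).encode(dst)
        out.append(line)
    return b"\n".join(out)


sample = (b'# r\x82vis\x82\n'        # translator comment: recoded
          b'#. caf\x82\n'            # machine generated: untouched
          b'msgid "coffee"\n'
          b'msgstr ""\n'
          b'"caf\x82"\n')            # msgstr continuation: recoded
print(recode_po(sample))
```

Line ends are copied as found, matching the example above in which no
conversion of end-of-lines is attempted.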