The REQUEST parameter
=====================

   In the case where the REQUEST is merely written as BEFORE..AFTER,
then BEFORE and AFTER specify the start charset and the goal charset
for the recoding.

   For `recode', charset names may contain any character except a
comma, a forward slash, or two periods in a row.  But in practice,
charset names are currently limited to alphabetic letters (upper or
lower case), digits, hyphens, underlines, periods, colons or round
parentheses.

   The complete syntax for a valid REQUEST allows for unusual things,
which might surprise at first.  (Do not pay too much attention to these
facilities on first reading.)  For example, REQUEST may also contain
intermediate charsets, as in:

     BEFORE..INTERIM1..INTERIM2..AFTER

meaning that `recode' should internally produce the INTERIM1 charset
from the start charset, then work out of this INTERIM1 charset to
internally produce INTERIM2, and from there towards the goal charset.
In fact, `recode' internally combines recipes and automatically uses
interim charsets, when there is no direct recipe for transforming
BEFORE into AFTER.  But there might be many ways to do it.  When many
routes are possible, the above "chaining" syntax may be used to more
precisely force the program towards a particular route, which it might
not have naturally selected otherwise.  On the other hand, because
`recode' tries to choose good routes, chaining is only needed to
achieve some rare, unusual effects.

   Moreover, many such requests (sub-requests, more precisely) may be
separated with commas (but no spaces at all), indicating a sequence of
recodings, where the output of one has to serve as the input of the
following one.  For example, the two following requests are equivalent:

     BEFORE..INTERIM1..INTERIM2..AFTER
     BEFORE..INTERIM1,INTERIM1..INTERIM2,INTERIM2..AFTER

In this example, the charset input for any recoding sub-request is
identical to the charset output by the preceding sub-request.  But it
does not have to be so in the general case.  One might wonder what it
would mean to declare the input charset of a sub-request as different
from the charset output by the preceding sub-request, when recodings
are chained in this way.  Such strange usage might have a meaning and
be useful for the `recode' expert, but it is quite uncommon in
practice.
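
   The equivalence of the chained and comma-separated forms can be
sketched in Python.  This is only an illustration of the syntax
described above, not how `recode' itself parses requests; the function
name `split_request' is hypothetical, and surfaces and omitted charsets
are deliberately ignored.

```python
def split_request(request):
    """Split a request into (source, target) steps, one per '..' hop."""
    steps = []
    for sub in request.split(","):      # commas separate sub-requests
        charsets = sub.split("..")      # '..' separates charsets in a chain
        # each adjacent pair of charsets in a chain is one recoding step
        steps.extend(zip(charsets, charsets[1:]))
    return steps


chained = split_request("BEFORE..INTERIM1..INTERIM2..AFTER")
sequenced = split_request("BEFORE..INTERIM1,INTERIM1..INTERIM2,INTERIM2..AFTER")
print(chained == sequenced)   # True
```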

   More useful is the distinction between the concept of charset, and
the concept of surfaces.  An encoded charset is represented by:

     PURE-CHARSET/SURFACE1/SURFACE2...

using slashes to introduce surfaces, if any.  The order of application
of surfaces is usually important; they cannot be freely commuted.  In
the given example, SURFACE1 is first applied over the PURE-CHARSET,
then SURFACE2 is applied over the result.  Given this request:

     BEFORE/SURFACE1/SURFACE2..AFTER/SURFACE3

the `recode' program will understand that the input files should have
SURFACE2 removed first (because it was applied last), then SURFACE1
should be removed.  The next step will be to translate the codes from
charset BEFORE to charset AFTER, prior to applying SURFACE3 over the
result.
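
   The order of operations just described can be sketched as follows.
This is a minimal illustration, assuming a single BEFORE..AFTER step
with non-empty surface names; the function name `plan' and the wording
of the operations are invented for the example.

```python
def plan(request):
    """List the operations implied by one BEFORE[/S...]..AFTER[/S...] request."""
    before, after = request.split("..")
    src, *src_surfaces = before.split("/")
    dst, *dst_surfaces = after.split("/")
    # input surfaces are removed in reverse: last applied, first removed
    ops = [f"remove {s}" for s in reversed(src_surfaces)]
    ops.append(f"translate {src} to {dst}")
    # output surfaces are applied left to right
    ops += [f"apply {s}" for s in dst_surfaces]
    return ops


for op in plan("BEFORE/SURFACE1/SURFACE2..AFTER/SURFACE3"):
    print(op)
# remove SURFACE2
# remove SURFACE1
# translate BEFORE to AFTER
# apply SURFACE3
```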

   Some charsets have one or more _implied_ surfaces.  In this case, the
implied surfaces are automatically handled merely by naming the charset,
without any explicit surface to qualify it.  Let's take an example to
illustrate this feature.  The request `pc..l1' will indeed decode MS-DOS
end of lines prior to converting IBM-PC codes to Latin-1, because `pc'
is the name of a charset(1) which has `CR-LF' for its usual surface.
The request `pc/..l1' will _not_ decode end of lines, since the slash
introduces surfaces, and even if the surface list is empty, it
effectively defeats the automatic removal of surfaces for this charset.
So, empty surfaces are useful, indeed!

   Both charsets and surfaces may have predefined alternate names, or
aliases.  However, and this is rather important to understand, implied
surfaces are attached to individual aliases rather than to genuine
charsets.  Consequently, the official charset name and all of its
aliases do not necessarily share the same implied surfaces.  The
charset and all its aliases may each have their own set of implied
surfaces.
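
   The rule for implied surfaces can be sketched as a lookup keyed by
alias.  The table below is hypothetical: only the `pc' entry with its
`CR-LF' surface comes from the text above, and the function name
`effective_surfaces' is invented for the illustration.

```python
# Hypothetical table: implied surfaces are keyed by alias, not by charset.
# Only the `pc' entry is taken from the text; a real table would be larger.
IMPLIED = {"pc": ["CR-LF"]}


def effective_surfaces(spec):
    """Return the surfaces in effect for one side of a request."""
    name, slash, rest = spec.partition("/")
    if not slash:
        # no slash at all: the alias's implied surfaces apply
        return list(IMPLIED.get(name, []))
    # a slash, even followed by an empty list, overrides implied surfaces
    return [s for s in rest.split("/") if s]


print(effective_surfaces("pc"))    # ['CR-LF']
print(effective_surfaces("pc/"))   # []
```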

   Charset names, surface names, or their aliases may always be
abbreviated to any unambiguous prefix.  Internally in `recode',
disambiguating tables are kept separate for charset names and surface
names.

   While recognising a charset name or a surface name (or aliases
thereof), `recode' ignores all characters other than letters and
digits, so, for example, the hyphens and underlines that are part of an
official charset name may safely be omitted (no need to un-confuse
them!).  There is also no distinction between upper and lower case for
charset or surface names.
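
   The matching rules above (ignore punctuation, ignore case, accept
any unambiguous prefix) can be sketched like this.  The list of names
is illustrative only, the function names are hypothetical, and letting
an exact match win over a longer name sharing the same prefix is an
assumption of this sketch, not something the text specifies.

```python
def normalize(name):
    """Keep only letters and digits, and fold case."""
    return "".join(ch for ch in name if ch.isalnum()).lower()


def resolve(name, known):
    """Resolve `name' against `known' names; None if unknown or ambiguous."""
    key = normalize(name)
    table = {normalize(k): k for k in known}
    if key in table:                  # assumption: an exact match wins
        return table[key]
    hits = [full for norm, full in table.items() if norm.startswith(key)]
    return hits[0] if len(hits) == 1 else None


# Illustrative list, not the actual set of names known to `recode'.
names = ["IBM-PC", "ISO-8859-1", "ISO-8859-2"]
print(resolve("ibmpc", names))       # IBM-PC
print(resolve("ISO_8859-1", names))  # ISO-8859-1
print(resolve("iso8859", names))     # None (ambiguous prefix)
```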

   One of the BEFORE or AFTER keywords may be omitted.  If the double
dot separator is omitted too, then the charset is interpreted as the
BEFORE charset.(2)
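
   A small sketch of how one side of a request may be left out, under
the reading given above.  The function name `endpoints' is hypothetical
and the sketch handles a single sub-request without intermediate
charsets; an empty string stands for an omitted charset.

```python
def endpoints(sub_request):
    """Return (before, after) for one sub-request; '' means omitted."""
    # without a '..' separator, partition leaves everything in `before'
    before, _sep, after = sub_request.partition("..")
    return before, after


print(endpoints("pc..l1"))  # ('pc', 'l1')
print(endpoints("l1"))      # ('l1', '')
print(endpoints("..l1"))    # ('', 'l1')
```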

   When a charset name is omitted or left empty, the value of the
`DEFAULT_CHARSET' variable in the environment is used instead.  If this
variable is not defined, the `recode' library uses the current locale's
encoding.  On POSIX-compliant systems, this depends on the first
non-empty value among the environment variables LC_ALL, LC_CTYPE, LANG,
and can be determined through the command `locale charmap'.
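
   The lookup order described above can be sketched as follows.  The
function name is hypothetical, and the sketch only reports which
variable would decide the matter; turning a locale name into an
encoding is left to the system (for instance `locale charmap').

```python
import os


def default_charset_source(env=os.environ):
    """Return (variable, value) deciding an omitted charset, or None."""
    if "DEFAULT_CHARSET" in env:      # takes priority whenever defined
        return "DEFAULT_CHARSET", env["DEFAULT_CHARSET"]
    # otherwise the locale applies: first non-empty variable wins
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        if env.get(var):
            return var, env[var]
    return None


print(default_charset_source({"LC_ALL": "", "LANG": "en_US.UTF-8"}))
# ('LANG', 'en_US.UTF-8')
```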

   If the charset name is omitted but followed by surfaces, the surfaces
then qualify the usual or default charset.  For example, the request
`../x' is sufficient for applying a hexadecimal surface to the input
text(3).

   The allowable values for BEFORE or AFTER charsets, and various
surfaces, are described in the remainder of this document.

   ---------- Footnotes ----------

   (1) More precisely, `pc' is an alias for the charset `IBM-PC'.

   (2) Both BEFORE and AFTER may be omitted, in which case the double
dot separator is mandatory.  This is not very useful, as the recoding
reduces to a mere copy in that case.

   (3) MS-DOS is one of those systems for which the default charset has
implied surfaces, `CR-LF' here.  Such surfaces are automatically
removed or applied whenever the default charset is read or written,
exactly as for any other charset.  In the example above, on such
systems, the hexadecimal surface would then _replace_ the implied
surfaces.  For _adding_ a hexadecimal surface without removing any,
one should write the request as `/../x'.