File: recode.info,  Node: Main flow,  Next: New charsets,  Prev: Internals,  Up: Internals

Overall organisation
====================

   The `recode' mechanics slowly evolved over many years, and it would
be tedious to explain all the problems I met and mistakes I made along
the way before reaching the current behaviour.  Surely, one of the key
choices was to stop trying to do all conversions in memory, one line
or one buffer at a time.  It has been fruitful to use the character
stream paradigm, and the elementary recoding steps now convert a whole
stream into another.  Most of the control complexity in `recode'
exists so that each elementary recoding step stays simple, making it
easier to add new ones.  The whole point of `recode', as I see it, is
to provide a comfortable nest for growing new charset conversions.

   While initialising all conversion modules, the main `recode' driver
constructs a table of all the available conversion routines ("single
steps"), recording for each one its starting charset and its ending
charset.  If we consider these charsets as the nodes of a directed
graph, each single step may be seen as an oriented arc from one node
to another.  A cost is attributed to each arc: for example, a high
penalty is given to single steps which are prone to losing characters,
and a lower penalty to those which need to study more than one input
character to produce one output character, and so on.
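
   As a rough sketch of this idea (not the actual data structures of
`recode'; the type and table names below are invented), each single
step can be pictured as a weighted arc recorded in such a table:

     /* Hypothetical sketch only: each elementary conversion is an arc
        from one charset node to another, with a cost reflecting how
        risky or expensive the step is.  */

     enum charset { LATIN1, IBMPC, UCS2, UTF8, NUMBER_OF_CHARSETS };

     struct single_step
     {
       enum charset before;     /* starting charset (tail of the arc) */
       enum charset after;      /* ending charset (head of the arc) */
       int cost;                /* penalty: higher when characters may be
                                   lost, or when several input characters
                                   must be studied per output character */
     };

     /* A table like this one is built while the conversion modules are
        being initialised (the entries here are only examples).  */
     static const struct single_step single_step_table[] =
     {
       { LATIN1, UCS2, 2 },     /* cheap and reversible */
       { UCS2, UTF8, 3 },       /* several bytes per output character */
       { UTF8, LATIN1, 10 },    /* prone to losing characters */
     };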

   Given a starting code and a goal code, `recode' computes the most
economical route through the elementary recodings, that is, the best
sequence of conversions that will transform the input charset into the
final charset.  To speed up execution, `recode' looks for subsequences
of conversions which are simple enough to be merged, and then
dynamically creates new single steps to represent these mergings.
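
   The fragment below sketches only the cheapest-route part of this,
with invented names and a tiny cost matrix; the real `recode' planner
is organised differently, but the idea of minimising the total arc
cost over the graph of single steps is the same:

     /* Hypothetical sketch of choosing the most economical route:
        cost[i][j] is the penalty of the single step from charset i to
        charset j, or NO_ARC when no such single step exists.  */

     #include <stdio.h>
     #include <limits.h>

     #define CHARSETS 4
     #define NO_ARC INT_MAX

     static int cost[CHARSETS][CHARSETS];

     /* Cheapest total cost from START to GOAL, found by a plain
        Dijkstra search, which is ample for a few hundred charsets.  */
     static int
     cheapest_route (int start, int goal)
     {
       int best[CHARSETS], done[CHARSETS] = { 0 };

       for (int i = 0; i < CHARSETS; i++)
         best[i] = (i == start ? 0 : NO_ARC);
       while (1)
         {
           int node = -1;
           for (int i = 0; i < CHARSETS; i++)
             if (!done[i] && best[i] != NO_ARC
                 && (node < 0 || best[i] < best[node]))
               node = i;
           if (node < 0 || node == goal)
             return node < 0 ? NO_ARC : best[goal];
           done[node] = 1;
           for (int j = 0; j < CHARSETS; j++)
             if (cost[node][j] != NO_ARC
                 && best[node] + cost[node][j] < best[j])
               best[j] = best[node] + cost[node][j];
         }
     }

     int
     main (void)
     {
       for (int i = 0; i < CHARSETS; i++)
         for (int j = 0; j < CHARSETS; j++)
           cost[i][j] = NO_ARC;
       cost[0][2] = 2;           /* say, Latin-1 to UCS-2 */
       cost[2][1] = 3;           /* say, UCS-2 to UTF-8 */
       printf ("cheapest cost from 0 to 1: %d\n",
               cheapest_route (0, 1));
       return 0;
     }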

   A "double step" in `recode' is a special concept representing a
sequence of two single steps, the output of the first single step being
the special charset `UCS-2', the input of the second single step being
also `UCS-2'.  Special `recode' machinery dynamically produces
efficient, reversible, merge-able single steps out of these double
steps.
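
   The merging itself can be sketched as follows, with hypothetical
tables standing in for the two halves of a double step: when both
halves are driven by lookup tables through `UCS-2', they compose into
a single direct table, which then behaves like any other single step:

     /* Hypothetical sketch of merging a double step.  TO_UCS2 drives
        the first half (input byte to UCS-2 code point), FROM_UCS2 the
        second half (UCS-2 code point to output byte).  */

     #include <stdint.h>

     static uint16_t to_ucs2[256];
     static unsigned char from_ucs2[65536];

     /* The dynamically produced merged single step: one direct
        byte-to-byte table, built once and reused afterwards.  */
     static unsigned char merged[256];

     static void
     merge_double_step (void)
     {
       for (int byte = 0; byte < 256; byte++)
         merged[byte] = from_ucs2[to_ucs2[byte]];
     }

Under this sketch, the merged step is efficient because each character
costs a single lookup, and checking that the merged table is
one-to-one is what would make the combined step reversible.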

   I gathered some statistics about how many internal recoding steps
are required between any two charsets chosen at random.  The initial
recoding layout, before optimisation, always uses between 1 and 5
steps.  Optimisation can sometimes produce mere copies, which are
counted as no steps at all.  In other cases, optimisation is unable to
save any step.  The number of steps after optimisation is currently
between 0 and 5.  Of course, the _expected_ number of steps is
affected by optimisation: it drops from 2.8 to 1.8.  This means that
`recode' uses a theoretical average of a bit less than two steps per
recoding job.  This looks good.  These figures were computed using
reversible recodings.  In strict mode, optimisation might be defeated
somewhat: the number of steps runs between 1 and 6, both before and
after optimisation, and the expected number of steps decreases by a
lesser amount, going from 2.2 to 1.3.  This is still manageable.

