(flex.info)Are certain equivalent patterns faster than others?


Are certain equivalent patterns faster than others?
===================================================

     To: Adoram Rogel <adoram@orna.hybridge.com>
     Subject: Re: Flex 2.5.2 performance questions
     In-reply-to: Your message of Wed, 18 Sep 96 11:12:17 EDT.
     Date: Wed, 18 Sep 96 10:51:02 PDT
     From: Vern Paxson <vern>
     
     [Note, the most recent flex release is 2.5.4, which you can get from
     ftp.ee.lbl.gov.  It has bug fixes over 2.5.2 and 2.5.3.]
     
     > 1. Using the pattern
     >    ([Ff](oot)?)?[Nn](ote)?(\.)?
     >    instead of
     >    (((F|f)oot(N|n)ote)|((N|n)ote)|((N|n)\.)|((F|f)(N|n)(\.)))
     >    (in a very complicated flex program) caused the program to slow from
     >    300K+/min to 100K/min (no other changes were done).
     
     These two are not equivalent.  For example, the first can match "footnote."
     but the second can only match "footnote".  This is almost certainly the
     cause in the discrepancy - the slower scanner run is matching more tokens,
     and/or having to do more backing up.
     
     > 2. Which of these two are better: [Ff]oot or (F|f)oot ?
     
     From a performance point of view, they're equivalent (modulo presumably
     minor effects such as memory cache hit rates; and the presence of trailing
     context, see below).  From a space point of view, the first is slightly
     preferable.
     
     > 3. I have a pattern that look like this:
     >    pats {p1}|{p2}|{p3}|...|{p50}     (50 patterns ORd)
     >
     >    running yet another complicated program that includes the following rule:
     >    <snext>{and}/{no4}{bb}{pats}
     >
     >    gets me to "too complicated - over 32,000 states"...
     
     I can't tell from this example whether the trailing context is variable-length
     or fixed-length (it could be the latter if {and} is fixed-length).  If it's
     variable length, which flex -p will tell you, then this reflects a basic
     performance problem, and if you can eliminate it by restructuring your
     scanner, you will see significant improvement.
     
     >    so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
     >    10 patterns and changed the rule to be 5 rules.
     >    This did compile, but what is the rule of thumb here ?
     
     The rule is to avoid trailing context other than fixed-length, in which for
     a/b, either the 'a' pattern or the 'b' pattern have a fixed length.  Use
     of the '|' operator automatically makes the pattern variable length, so in
     this case '[Ff]oot' is preferred to '(F|f)oot'.
     
     > 4. I changed a rule that looked like this:
     >    <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
     >
     >    to the next 2 rules:
     >    <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
     >    <snext8>{and}{bb}/{ROMAN}         { BEGIN...
     >
     >    Again, I understand the using [^...] will cause a great performance loss
     
     Actually, it doesn't cause any sort of performance loss.  It's a surprising
     fact about regular expressions that they always match in linear time
     regardless of how complex they are.
     
     >    but are there any specific rules about it ?
     
     See the "Performance Considerations" section of the man page, and also
     the example in MISC/fastwc/.
     
     		Vern
automatically generated by info2www version 1.2.2.9