(grep.info)Performance


Next: Reporting Bugs Prev: Usage Up: Top
Enter node , (file) or (file)node

5 Performance
*************

Typically ‘grep’ is an efficient way to search text.  However, it can be
quite slow in some cases, and it can search large files where even minor
performance tweaking can help significantly.  Although the algorithm
used by ‘grep’ is an implementation detail that can change from release
to release, understanding its basic strengths and weaknesses can help
you improve its performance.

   The ‘grep’ command operates partly via a set of automata that are
designed for efficiency, and partly via a slower matcher that takes over
when the fast matchers run into unusual features like back-references.
When feasible, the Boyer–Moore fast string searching algorithm is used
to match a single fixed pattern, and the Aho–Corasick algorithm is used
to match multiple fixed patterns.

   Generally speaking ‘grep’ operates more efficiently in single-byte
locales, since it can avoid the special processing needed for multi-byte
characters.  If your patterns will work just as well that way, setting
‘LC_ALL’ to a single-byte locale can help performance considerably.
Setting ‘LC_ALL='C'’ can be particularly efficient, as ‘grep’ is tuned
for that locale.

   Outside the ‘C’ locale, case-insensitive search, and search for
bracket expressions like ‘[a-z]’ and ‘[[=a=]b]’, can be surprisingly
inefficient due to difficulties in fast portable access to concepts like
multi-character collating elements.

   A back-reference such as ‘\1’ can hurt performance significantly in
some cases, since back-references cannot in general be implemented via a
finite state automaton, and instead trigger a backtracking algorithm
that can be quite inefficient.  For example, although the pattern
‘^(.*)\1{14}(.*)\2{13}$’ matches only lines whose lengths can be written
as a sum 15x + 14y for nonnegative integers x and y, the pattern matcher
does not perform linear Diophantine analysis and instead backtracks
through all possible matching strings, using an algorithm that is
exponential in the worst case.

   On some operating systems that support files with holes—large regions
of zeros that are not physically present on secondary storage—‘grep’ can
skip over the holes efficiently without needing to read the zeros.  This
optimization is not available if the ‘-a’ (‘--binary-files=text’) option
is used (Note: File and Directory Selection), unless the ‘-z’
(‘--null-data’) option is also used (Note: Other Options).

   For more about the algorithms used by ‘grep’ and about related string
matching algorithms, see:

   • Aho AV. Algorithms for finding patterns in strings. In: van Leeuwen
     J. _Handbook of Theoretical Computer Science_, vol. A. New York:
     Elsevier; 1990. p. 255–300. This surveys classic string matching
     algorithms, some of which are used by ‘grep’.

   • Aho AV, Corasick MJ. Efficient string matching: an aid to
     bibliographic search. _CACM_. 1975;18(6):333–40.
     <https://dx.doi.org/10.1145/360825.360855>. This introduces the
     Aho–Corasick algorithm.

   • Boyer RS, Moore JS. A fast string searching algorithm. _CACM_.
     1977;20(10):762–72. <https://dx.doi.org/10.1145/359842.359859>.
     This introduces the Boyer–Moore algorithm.

   • Faro S, Lecroq T. The exact online string matching problem: a
     review of the most recent results. _ACM Comput Surv_.
     2013;45(2):13. <https://dx.doi.org/10.1145/2431211.2431212>. This
     surveys string matching algorithms that might help improve the
     performance of ‘grep’ in the future.


automatically generated by info2www version 1.2.2.9