Comments on Usage.
QTGrep and QTGrepb have been designed to scan
for
regular expressions and keywords.
The keyword scan uses one of three algorithms:
- Brute force: Used for single keywords only. Method: Scan
for first letter of single keyword and then match remaining characters
of keyword to scanned text. This method used for single keywords of
three characters or less only. This method fastest for these keywords
as per reference: "Information Retrieval, Data Structures and
Algorithms"
Ed. William B. Frakes & Ricardo Baeza-Yates
1992 Prentice Hall
- Boyer-Moore-Horspool Algorithm. Used for single keywords
greater than three characters in length. This method fastest as per
above reference. Algorithm detailed in above reference or:
"Compilers, Principles, Techniques and Tools"
by: Alfred V. Aho,
Ravi Sethi, and
Jeffrey D. Ullman
Published by: Addison-Wesley, Copyright 1986
- Failure Trie as detailed in "Compilers, Principles,
Techniques and Tools" reference above, page 151. This method used
exclusively for multiple keywords.
The Regular Expression scan uses a single algorithm for all scans. The
method is described in the reference: "Compilers, Principles,
Techniques and Tools". The method first converts all regular
expressions into a Non-definite Finite Automaton, NFA. For Case
Insensitive searches, the NFA is then converted such
that all characters are in lower case. For backward searches the NFAs
are
then reversed. Four distinct NFAs are obtained thusly. All NFAs have
the
same number of states and transitions. Unfortunately, scanning text
based on NFAs is slow and requires maintaining lots of information on
all possible transitions. Converting the NFA to a Definite Finite
Automaton, DFA, and scanning
the text based on a DFA is faster and requires that only the
information of
the current state be maintained. The above reference details the
methods needed
to convert regular expressions to NFAs and the conversion of NFAs to
DFAs.
QTGrep follows those methods closely except for the conversion of NFAs
to
DFAs. The reference details a two step process for converting an NFA to
a
DFA and then optimizing the DFA. QTGrep converts the NFA to an
optimized DFA
in one step. The optimized DFA is then used for all text scanning. Four
possible
DFAs are created from the corresponding four NFAs:
- DFA for case sensitve forward scanning,
- DFA for case insensitive forward scanning,
- DFA for case sensitive backward scanning (qtgrepb only)
- DFA for case insensitive backward scanning (qtgrepb only)
QTGrep also departs from the reference in the use of linked lists for
the representation of the final DFAs instead of compressed tables.
Since only known transitions are allowed, the use of linked lists is
just as fast as a table lookup.
The construction of the backward scanning DFAs from the reversed NFAs
can consume more memory. This is a result of the fact that most, if not
all,
regular expression patterns have a direction implicit in their
construction.
Consider the simple regular expression:
^[a_][a9_]*(/\*.*\*/)?
Executing:
qtgrepb "^[a_][a9_]*(/\*.*\*/)?" -W test-simple.QTGrep.b.compiled
prnpat test_simple test-simple.QTGrep.b.compiled >
test-simple.QTGrep.b.c
will result in two "C" language files containg the structures defining
the DFA structures used by QTGrep for forward and backward scanning
using the simple regular expression. From this simple regular
expression, seven forward scanning DFA states are created and nine
backward scanning DFA states are created. The greater number of DFA
states for the backward scanning DFA is a direct result of the implicit
direction built into the construction of the
regular expression pattern. For this simple regular expression, the
difference is negligible.
Next consider the regular expression from the file "glbvars.exp" used
for recognizing functions in a "C" source file:
#e c_c (/\*.*\*/)
#e c_slc ({_w}*{c_c}{_w}*)?
#e c_n [A-Za-z_][A-Za-z0-9_]*
#e c_nb [A-Za-z_][A-Za-z0-9_\[\]]*
#e c_np \**{_w}*{c_nb}
#e c_ni [\*&]*{c_nb}
#e c_first_arg ({_w}*{c_ni}{_w}*)+
#e c_rem_arg (,{c_first_arg})*
#e c_arg_list {_w}*\(({c_first_arg}{c_rem_arg})?{_w}*\)
#e c_fname {c_n}({_w}+{c_np})+
^{c_fname}{c_arg_list}{c_slc}$
This regular expression will recognize functions all on one line in the
old K&R style definition of the language. This regular expression
will
result in a DFA with 30 states in the forward scanning DFA and 38
states
in the backward scanning DFA. The backward scanning DFA has 27% more
states. Adding the regular expression:
#e c_cname ({_w}*,{_w}*{c_np})*
^{c_fname}{c_cname};{c_slc}$
expands the number of states in the forward scanning DFA from 30 to 42
and from 38 to 65 in the backward scanning DFA. The backward scanning
DFA now has 55% more states. As the complexity of the regular
expressions increase, the expansion of the number of states in the
backward scanning DFA increases at a far greater rate than the number
of states in the forward scanning DFA. Consider all six regular
expressions in the file "glbvars.exp". These regular
expressions will find:
- Find All C Function Definitions - Old K&R Style Definition
- find function parameters for old K&R style definitions
- Find All C Function Definitions - New ANSI C Style Definition
- Find C Variable Declarations/Initializers
- find function prototypes - only difference from pattern #1,
ending ';'
- Function Block Delimitors "{}"
This file will generate a forward scanning DFA with 247 states and a
backward scanning DFA with 2375 states. The backward scanning DFA now
has 862% more states than the forward scanning DFA.
The above statitstics may be obtained for any regular expression file
by running QTGrep to write the internal form of the regular expressions
to a file, executing the auxilary program "prnpat" to convert the
internal form file to a "C" language file, and then running QTGrep
again with the "C" language file as input and using the following as
the Regular Expression file (available as 'count-states.exp'):
############ Start Regular Expression File
#o
#o Regular Expression file to obtain statistics on number of:
#o DFA states, transitions from states, tagged string structures
#o and character list strings both for case sensitive and insensitive
#o and forward and backward scans.
#o
#
# count number of forward search states, Case Sensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_CS_{_d}+;$
#m forward search states, Case Sensitive
# count number of forward search states, Case Insensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_CI_{_d}+;$
#m forward search states, Case Insensitive
# count number of backward search states, Case Sensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_r_CS_{_d}+;$
#m backward search states, Case Sensitive
# count number of backward search states, Case Insensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_r_CI_{_d}+;$
#m backward search states, Case Insensitive
# count number of character list strings - Case Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_ss_{_d}+_CS\[\]{_w}+={_w}+
#m character list strings - Case Sensitive
# count number of character list strings - Case Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_ss_{_d}+_CI\[\]{_w}+={_w}+
#m character list strings - Case Insensitive
# count number of forward transitions from states - Case Sensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_CS_{_d}+_{_d}+{_w}+={_w}+
#m transitions - forward, Case Sensitive
# count number of forward transitions from states - Case Insensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_CI_{_d}+_{_d}+{_w}+={_w}+
#m transitions - forward, Case Insensitive
# count number of backward transitions from states - Case Sensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_r_CS_{_d}+_{_d}+{_w}+={_w}+
#m transitions - backward, Case Sensitive
# count number of backward transitions from states - Case Insensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_r_CI_{_d}+_{_d}+{_w}+={_w}+
#m transitions - backward, Case Insensitive
# count number of tagged strings from forward transitions - Case
Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_CS_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, forward, Case Sensitive
# count number of tagged strings from forward transitions - Case
Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_CI_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, forward, Case Insensitive
# count number of tagged strings from backward transitions - Case
Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_r_CS_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, backward, Case Sensitive
# count number of tagged strings from backward transitions - Case
Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_r_CI_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, backward, Case Insensitive
############## End Regular Expression File
Running the following commands will provide
output on the desired statistics.
qtgrepb -f glbvars.exp -W glbvars.exp.QTGrep.b.compiled
prnpat glbvars glbvars.exp.QTGrep.b.compiled > glbvars.exp.QTGrep.b.c
qtgrep -csf count-states.exp glbvars.exp.QTGrep.b.c
The following output is obtained from this sequence of commands:
/home/terry/tmp/glbvars.exp.QTGrep.b.c: 103364
Total Matches: 103364
Pattern 1 Count: 247 forward search states, Case Sensitive
Pattern 2 Count: 247 forward search states, Case Insensitive
Pattern 3 Count: 2375 backward search states, Case Sensitive
Pattern 4 Count: 2375 backward search states, Case Insensitive
Pattern 5 Count: 10 character list strings - Case Sensitive
Pattern 6 Count: 4 character list strings - Case Insensitive
Pattern 7 Count: 879 transitions - forward, Case Sensitive
Pattern 8 Count: 879 transitions - forward, Case Insensitive
Pattern 9 Count: 20499 transitions - backward, Case Sensitive
Pattern 10 Count: 20499 transitions - backward, Case Insensitive
Pattern 11 Count: 1057 tagged strings, forward, Case Sensitive
Pattern 12 Count: 1057 tagged strings, forward, Case Insensitive
Pattern 13 Count: 26618 tagged strings, backward, Case Sensitive
Pattern 14 Count: 26618 tagged strings, backward, Case Insensitive
Total Records Matched: 103364
Total Records Scanned: 286677
Care should taken in the complexity of regular expressions used for
backward scanning.
© Terry D. Boldt 1997-2005
All Rights Reserved
Last Updated: Feb. 03, 2005