Comments on Usage.

QTGrep and QTGrepb have been designed to scan for regular expressions and keywords.

The keyword scan uses one of three algorithms:
  1.  Brute force: Used for single keywords only. Method: Scan for first letter of single keyword and then match remaining characters of keyword to scanned text. This method used for single keywords of three characters or less only. This method fastest for these keywords as per reference: "Information Retrieval, Data Structures and Algorithms"
    Ed. William B. Frakes & Ricardo Baeza-Yates
    1992 Prentice Hall
  2.  Boyer-Moore-Horspool Algorithm. Used for single keywords greater than three characters in length. This method fastest as per above reference. Algorithm detailed in above reference or:
    "Compilers, Principles, Techniques and Tools"
    by: Alfred V. Aho,
    Ravi Sethi, and
    Jeffrey D. Ullman
    Published by: Addison-Wesley, Copyright 1986
  3.  Failure Trie as detailed in "Compilers, Principles, Techniques and Tools" reference above, page 151. This method used exclusively for multiple keywords.
The Regular Expression scan uses a single algorithm for all scans. The method is described in the reference: "Compilers, Principles, Techniques and Tools". The method first converts all regular expressions into a Non-definite Finite Automaton, NFA. For Case Insensitive searches, the NFA is then converted such that all characters are in lower case. For backward searches the NFAs are then reversed. Four distinct NFAs are obtained thusly. All NFAs have the same number of states and transitions. Unfortunately, scanning text based on NFAs is slow and requires maintaining lots of information on all possible transitions. Converting the NFA to a Definite Finite Automaton, DFA, and scanning the text based on a DFA is faster and requires that only the information of the current state be maintained. The above reference details the methods needed to convert regular expressions to NFAs and the conversion of NFAs to DFAs. QTGrep follows those methods closely except for the conversion of NFAs to DFAs. The reference details a two step process for converting an NFA to a DFA and then optimizing the DFA. QTGrep converts the NFA to an optimized DFA in one step. The optimized DFA is then used for all text scanning. Four possible DFAs are created from the corresponding four NFAs:
  1.  DFA for case sensitve forward scanning,
  2.  DFA for case insensitive forward scanning,
  3.  DFA for case sensitive backward scanning (qtgrepb only)
  4.  DFA for case insensitive backward scanning (qtgrepb only)
QTGrep also departs from the reference in the use of linked lists for the representation of the final DFAs instead of compressed tables. Since only known transitions are allowed, the use of linked lists is just as fast as a table lookup.

The construction of the backward scanning DFAs from the reversed NFAs can consume more memory. This is a result of the fact that most, if not all, regular expression patterns have a direction implicit in their construction. Consider the simple regular expression:

^[a_][a9_]*(/\*.*\*/)?

Executing:

qtgrepb "^[a_][a9_]*(/\*.*\*/)?" -W test-simple.QTGrep.b.compiled
prnpat test_simple test-simple.QTGrep.b.compiled > test-simple.QTGrep.b.c

will result in two "C" language files containg the structures defining the DFA structures used by QTGrep for forward and backward scanning using the simple regular expression. From this simple regular expression, seven forward scanning DFA states are created and nine backward scanning DFA states are created. The greater number of DFA states for the backward scanning DFA is a direct result of the implicit direction built into the construction of the regular expression pattern. For this simple regular expression, the difference is negligible.

Next consider the regular expression from the file "glbvars.exp" used for recognizing functions in a "C" source file:

#e c_c (/\*.*\*/)
#e c_slc ({_w}*{c_c}{_w}*)?
#e c_n [A-Za-z_][A-Za-z0-9_]*
#e c_nb [A-Za-z_][A-Za-z0-9_\[\]]*
#e c_np \**{_w}*{c_nb}
#e c_ni [\*&]*{c_nb}
#e c_first_arg ({_w}*{c_ni}{_w}*)+
#e c_rem_arg (,{c_first_arg})*
#e c_arg_list {_w}*\(({c_first_arg}{c_rem_arg})?{_w}*\)
#e c_fname {c_n}({_w}+{c_np})+
^{c_fname}{c_arg_list}{c_slc}$

This regular expression will recognize functions all on one line in the old K&R style definition of the language. This regular expression will result in a DFA with 30 states in the forward scanning DFA and 38 states in the backward scanning DFA. The backward scanning DFA has 27% more states.  Adding the regular expression:

#e c_cname ({_w}*,{_w}*{c_np})*
^{c_fname}{c_cname};{c_slc}$

expands the number of states in the forward scanning DFA from 30 to 42 and from 38 to 65 in the backward scanning DFA. The backward scanning DFA now has 55% more states. As the complexity of the regular expressions increase, the expansion of the number of states in the backward scanning DFA increases at a far greater rate than the number of states in the forward scanning DFA. Consider all six regular expressions in the file "glbvars.exp". These regular expressions will find:

  1. Find All C Function Definitions - Old K&R Style Definition
  2.  find function parameters for old K&R style definitions
  3. Find All C Function Definitions - New ANSI C Style Definition
  4. Find C Variable Declarations/Initializers
  5. find function prototypes - only difference from pattern #1, ending ';'
  6. Function Block Delimitors "{}"
This file will generate a forward scanning DFA with 247 states and a backward scanning DFA with 2375 states. The backward scanning DFA now has 862% more states than the forward scanning DFA.

The above statitstics may be obtained for any regular expression file by running QTGrep to write the internal form of the regular expressions to a file, executing the auxilary program "prnpat" to convert the internal form file to a "C" language file, and then running QTGrep again with the "C" language file as input and using the following as the Regular Expression file (available as 'count-states.exp'):
############ Start Regular Expression File
#o
#o Regular Expression file to obtain statistics on number of:
#o DFA states, transitions from states, tagged string structures
#o and character list strings both for case sensitive and insensitive
#o and forward and backward scans.
#o
#
#  count number of forward search states, Case Sensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_CS_{_d}+;$
#m forward search states, Case Sensitive

#  count number of forward search states, Case Insensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_CI_{_d}+;$
#m forward search states, Case Insensitive

#  count number of backward search states, Case Sensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_r_CS_{_d}+;$
#m backward search states, Case Sensitive

#  count number of backward search states, Case Insensitive
^static{_w}+state_rep{_w}+[A-Za-z0-9]+_r_CI_{_d}+;$
#m backward search states, Case Insensitive

# count number of character list strings - Case Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_ss_{_d}+_CS\[\]{_w}+={_w}+
#m character list strings - Case Sensitive

# count number of character list strings - Case Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_ss_{_d}+_CI\[\]{_w}+={_w}+
#m character list strings - Case Insensitive

# count number of forward transitions from states - Case Sensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_CS_{_d}+_{_d}+{_w}+={_w}+
#m transitions - forward, Case Sensitive

# count number of forward transitions from states - Case Insensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_CI_{_d}+_{_d}+{_w}+={_w}+
#m transitions - forward, Case Insensitive

# count number of backward transitions from states - Case Sensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_r_CS_{_d}+_{_d}+{_w}+={_w}+
#m transitions - backward, Case Sensitive

# count number of backward transitions from states - Case Insensitive
^static{_w}+state_trns{_w}+[A-Za-z0-9]+_r_CI_{_d}+_{_d}+{_w}+={_w}+
#m transitions - backward, Case Insensitive

#  count number of tagged strings from forward transitions - Case Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_CS_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, forward, Case Sensitive

#  count number of tagged strings from forward transitions - Case Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_CI_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, forward, Case Insensitive

#  count number of tagged strings from backward transitions - Case Sensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_r_CS_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, backward, Case Sensitive

#  count number of tagged strings from backward transitions - Case Insensitive
^static{_w}+unsigned{_w}+char{_w}+[A-Za-z0-9]+_r_CI_{_d}+_{_d}+_{_d}+\[\]
#m tagged strings, backward, Case Insensitive
############## End Regular Expression File

Running the following commands will provide output on the desired statistics.

qtgrepb -f glbvars.exp -W glbvars.exp.QTGrep.b.compiled
prnpat glbvars glbvars.exp.QTGrep.b.compiled > glbvars.exp.QTGrep.b.c
qtgrep -csf count-states.exp glbvars.exp.QTGrep.b.c

The following output is obtained from this sequence of commands:
/home/terry/tmp/glbvars.exp.QTGrep.b.c:       103364

Total         Matches:       103364
  Pattern     1 Count:          247 forward search states, Case Sensitive
  Pattern     2 Count:          247 forward search states, Case Insensitive
  Pattern     3 Count:         2375 backward search states, Case Sensitive
  Pattern     4 Count:         2375 backward search states, Case Insensitive
  Pattern     5 Count:           10 character list strings - Case Sensitive
  Pattern     6 Count:            4 character list strings - Case Insensitive
  Pattern     7 Count:          879 transitions - forward, Case Sensitive
  Pattern     8 Count:          879 transitions - forward, Case Insensitive
  Pattern     9 Count:        20499 transitions - backward, Case Sensitive
  Pattern    10 Count:        20499 transitions - backward, Case Insensitive
  Pattern    11 Count:         1057 tagged strings, forward, Case Sensitive
  Pattern    12 Count:         1057 tagged strings, forward, Case Insensitive
  Pattern    13 Count:        26618 tagged strings, backward, Case Sensitive
  Pattern    14 Count:        26618 tagged strings, backward, Case Insensitive
Total Records Matched:       103364
Total Records Scanned:       286677

Care should taken in the complexity of regular expressions used for backward scanning.


© Terry D. Boldt 1997-2005
All Rights Reserved
Last Updated: Feb. 03, 2005