Regular expressions are a means of describing sequences of characters. In the discussion of QTAwk, character will be taken to mean any character from the extended ASCII sequence of characters from ASCII '1' to ASCII '255'.
A string is a finite sequence of characters. The length of a string is the number of string characters contained in the string. A special string is the empty string, also called the null string, which is of zero length, i.e., it contains no characters. We shall use the symbol 'e' below to refer to the null string.
Another way to think of a string is as the concatenation of a sequence of characters. Two strings may be concatenated to form another string. Concatenating the two strings:
abcdef
and
ghijklmn
forms the third string:
abcdefghijklmn
In many instances, it is desirable to describe a string with several alternatives for one or more of the characters. Thus we may wish to find the strings:
FRED
or
TED
A convenient manner of describing both strings with the same regular expression is
/(FR|T)ED/
Strings in QTAwk are enclosed in double quotes, ", and regular expressions are enclosed in slashes, '/'.
OR Operator, |
The symbol '|' means OR, so the above regular expression would be read as: the string FR OR the string T, concatenated with the string ED. The parentheses are used to group strings into equivalent positions in the resultant regular expression. In this manner it is possible to build a regular expression for several alternative strings.
In many instances it is also desirable to build regular expressions that contain many alternatives for one character, i.e., one character strings. For example, we may want to find all instances of the words doing or going. We could build the regular expression:
/(d|g)oing/
Character Lists, [...]
Although the last regular expression is a fairly simple example, it serves to introduce the notion of character list. If we define the notation:
[dg] = (d|g)
then we may write the regular expression as:
/[dg]oing/
The character list notation saves us from having to explicitly write the OR symbols in the regular expression. The OR is implied between each character of the list.
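The equivalence between the OR form and the character list can be checked mechanically. The following is a minimal sketch in Python rather than QTAwk, assuming Python's `re` module, which uses the same notation for these two constructs but writes patterns without the enclosing slashes. The helper name `same_verdict` is invented for this illustration.

```python
import re

# QTAwk writes these patterns between slashes: /(d|g)oing/ and /[dg]oing/.
alternation = re.compile(r"(d|g)oing")
char_list = re.compile(r"[dg]oing")

def same_verdict(word):
    """Return True when both patterns agree on whether word matches in full."""
    return bool(alternation.fullmatch(word)) == bool(char_list.fullmatch(word))

for word in ["doing", "going", "boing", "oing"]:
    print(word, bool(char_list.fullmatch(word)), same_verdict(word))
```

Both patterns accept doing and going and reject the other two words, so the two forms can be used interchangeably for one character alternatives.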
Now suppose that we wanted to expand our search to all five letter words ending in ing and starting with any lower case letter and having any lower case letter as the second character. We would write the regular expression:
/(a|b|c|d|...|x|y|z)(a|b|c|d|...|x|y|z)ing/
or
/[abcd...xyz][abcd...xyz]ing/
Regular expressions in these cases can not only get very long, but can be very tedious to write and are very prone to error. We introduce the notion of a range of characters into the character list and define:
[a-z] = [abcd...xyz] = (a|b|c|d|...|x|y|z)
The above regular expression can now be written:
/[a-z][a-z]ing/
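This is considerably shorter and less error prone. The hyphen, '-', is recognized as expressing a range of characters only when it occurs within a character list. Within character lists, the hyphen loses this significance in the following three cases: when it is the first character of the list, when it is the last character of the list, and when the character before the hyphen comes after the character following it in the collating sequence: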
[-b] = (-|b)
[b-] = (b|-)
[z-a]
would be recognized as:
(z|-|a)
In interpreting the range notation in character lists, QTAwk uses the ASCII collating sequence.
[0-Z]
is equivalent to:
[0123456789:;<=>?@A-Z]
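The claim about the collating sequence can be verified by enumerating the characters each list accepts. This is a sketch in Python, not QTAwk, under the assumption that Python's `re` module orders characters by code point, which coincides with ASCII for the range shown. The function name `accepted` is hypothetical.

```python
import re

ranged = re.compile(r"[0-Z]")
explicit = re.compile(r"[0123456789:;<=>?@A-Z]")

def accepted(pattern):
    """Return the set of characters with codes 1..255 matched by a one-character pattern."""
    return {chr(code) for code in range(1, 256) if pattern.fullmatch(chr(code))}

print(accepted(ranged) == accepted(explicit))
print("".join(sorted(accepted(ranged))))
```

One difference worth noting: Python rejects a reversed range such as `[z-a]` with an error, whereas the text above says QTAwk treats the hyphen in that case as an ordinary character.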
Character Class, [:alpha:]
If the documents to be scanned may be written in various languages or character sets, then representing alphabetic characters as "[A-Za-z]" would not work in many of them. The notion of character classes within character lists solves this problem. The notation:
[:alpha:]
within a character list denotes all alphabetic characters in the current character set. Note that this notation is only valid within a character list. Other character classes have been defined by the POSIX standard for dealing with varying character sets, among them [:alnum:], [:digit:], [:lower:], [:upper:] and [:space:].
Continuing the last example, if we did not want to limit the first character to lower case, but also wanted to include the possibility of upper case letters, we could use the following regular expression:
/([A-Z]|[a-z])[a-z]ing/
This regular expression allows the first letter to be any character in the range from A to Z or in the range from a to z. But the OR is implied in character lists, shortening the above regular expression to:
/[A-Za-z][a-z]ing/
If we now wish to expand the above from all five letter words ending in ing to all six letter words ending in ing, we could write the regular expression as:
/[A-Za-z][a-z][a-z]ing/
In general, if we did not want to specify the number of characters between the first letter and the ing ending, we could specify a regular expression as:
/[A-Za-z](e|[a-z])(e|[a-z])...(e|[a-z])ing/
By specifying the null string in the OR regular expression, the regular expression allows a character in the range a to z or no character to match. The shortest string matched by this regular expression would be a single upper or lower case letter followed by ing. The regular expression would also match any string starting with an upper or lower case letter with any number of lower case letters following and ending in ing.
What we need to describe this regular expression is a notation for specifying zero or more copies of a character or string. Such a notation exists and is written as:
/[A-Za-z][a-z]*ing/
where the notation
[a-z]*
means zero or more occurrences of the character list [a-z]. This operation is called closure and the '*' is called the closure operator. In general, the notation may be used for any regular expression within a regular expression. The following are valid regular expressions using the notion of zero or more occurrences of a regular expression within another regular expression:
/mis*ion/
would match miion, mision, mission, misssion, missssion, etc.
/bot*om/
would match boom, botom, bottom, botttom, bottttom, etc.
/(Fr|T)*ed/
would match ed, Fred, Ted, FrFred, TTed, FrFrFred, TTTed, FrTFred, FrFrTed, TFrFred, etc.
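The three closure examples can be exercised directly. The sketch below uses Python's `re` module as a stand-in for QTAwk, on the assumption that '*' behaves the same way in both; `all_match` is a helper invented for the illustration.

```python
import re

# Each pattern is paired with strings the text above says it should match.
samples = {
    r"mis*ion": ["miion", "mision", "mission", "misssion"],
    r"bot*om": ["boom", "botom", "bottom", "botttom"],
    r"(Fr|T)*ed": ["ed", "Fred", "Ted", "FrFred", "TTed", "FrTFred"],
}

def all_match(pattern, words):
    """Return True when every word matches the pattern in full."""
    compiled = re.compile(pattern)
    return all(compiled.fullmatch(word) is not None for word in words)

for pattern, words in samples.items():
    print(pattern, all_match(pattern, words))
```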
As an extension to the '*' operator, we frequently would want to search for one or more occurrences of a regular expression. As above we would write this as:
/[A-Za-z][a-z][a-z]*ing/
The [a-z][a-z]* construct would ensure that at least one letter occurred between the initial letter and the string ing. This occurs often enough that the notation
[a-z]+ = [a-z][a-z]*
has been adopted to handle this situation. Thus, use the operator '*' for zero or more occurrences and the operator '+' for one or more occurrences. The '+' operator is called the positive closure operator.
In many cases it is desirable to search for either zero or one occurrence of a regular expression. For example, it would be desirable to search for names preceded by either Mr or Mrs. The regular expression:
/Mrs*/
would find Mr, or Mrs, or Mrss, or Mrsss etc.
The following regular expression will accomplish what we really want in this case:
/Mr(e|s)/
This regular expression would find 'Mr' followed by zero or one 's'.
The operator '?' has been selected to denote zero or one of the preceding regular expression. Thus,
/Mrs?/ = /Mr(e|s)/
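The difference between the three operators is easiest to see side by side. This is a small Python sketch, assuming that '*', '?' and '+' carry the same meaning in Python's `re` module as in QTAwk; the function `verdicts` is hypothetical.

```python
import re

zero_or_more = re.compile(r"Mrs*")   # closure: Mr, Mrs, Mrss, ...
zero_or_one = re.compile(r"Mrs?")    # optional: Mr or Mrs only
one_or_more = re.compile(r"Mrs+")    # positive closure: Mrs, Mrss, ...

def verdicts(title):
    """Return the full-match result of title against the three patterns, in order."""
    patterns = (zero_or_more, zero_or_one, one_or_more)
    return tuple(p.fullmatch(title) is not None for p in patterns)

for title in ["Mr", "Mrs", "Mrss"]:
    print(title, verdicts(title))
```

Only the '?' form rejects Mrss while still accepting both Mr and Mrs.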
Repetition Operator, {n1,n2}
In some cases we wish to specify a minimum and maximum repeat count for a regular expression. For example, suppose it was desirable for a regular expression to contain a minimum of 2 and a maximum of 4 copies of abc. We could specify this as:
/abcabc(abc)?(abc)?/
The notation {2,4} has been adopted for expressing this. The general form of the repetition operator is {n1,n2}. n1 and n2 are integers, with n1 greater than or equal to 0 and n2 greater than or equal to n1, 0 <= n1 <= n2. A repetition count would be specified as:
/r{n1,n2}/ = /rrrrrrrrrrrrrrr?r?r?r?r?r?/
              |<--- n1 ---->|
              |<-------- n2 --------->|
The above could be expressed as:
/(abc){2,4}/ = /(abc)(abc)(abc)?(abc)?/
Since the repetition operator repeats the immediately preceding regular expression, the parentheses around abc are necessary to repeat the whole string. Without the parentheses the regular expression would expand as:
/abc{2,4}/ = /abccc?c?/
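The effect of the parentheses on the repetition operator can be confirmed with a short test. The sketch is in Python, assuming its `{n1,n2}` operator has the meaning described here; the helper `copies` is invented for the example.

```python
import re

grouped = re.compile(r"(abc){2,4}")             # repeats the whole string abc
expanded = re.compile(r"(abc)(abc)(abc)?(abc)?")  # the hand-written equivalent
ungrouped = re.compile(r"abc{2,4}")             # repeats only the final c

def copies(n):
    """Return the string abc repeated n times."""
    return "abc" * n

for n in range(0, 6):
    print(n, bool(grouped.fullmatch(copies(n))), bool(expanded.fullmatch(copies(n))))

print(bool(ungrouped.fullmatch("abccc")), bool(ungrouped.fullmatch("abcabc")))
```

The grouped and expanded forms agree for every count, accepting exactly 2, 3 and 4 copies, while the ungrouped form accepts ab followed by 2 to 4 c characters.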
The repetition operator can be used to repeat either single characters, groups of characters, character lists or quoted strings. The use of the operator is illustrated below for each case:
/abc{2,4}/ = /abccc?c?/
/(abc){2,4}/ = /(abc)(abc)(abc)?(abc)?/
/[abc]{2,4}/ = /[abc][abc][abc]?[abc]?/
/"abc"{2,4}/ = /"abcabc(abc)?(abc)?"/
For quoted strings, the whole of the string contained within quotes is repeated, with all repetitions maintained within the quotes.
/{abc}{2,4}/ = /{abc}{abc}({abc})?({abc})?/
or
r{n1}r*
or
r{n3}r+, with n3 = n1 - 1
r{0,n2} = r?r?r?r?r?r?
          |<- n2 -->|
A special case exists for character lists in which the list of characters to exclude is greater than the list of characters to include. For example, suppose that we wanted, in a certain character position, to include all characters that weren't numerics. We could build a character list of all characters and leave the numerics out. An easier method is to use the complemented or negated character list. A special operator has been introduced for this purpose. The logical NOT symbol, '!', occurring as the first character in a character list, negates the list, i.e., any character NOT in the list is recognized at the character position.
Thus, to define the negated character list of all characters which are not numerics, we would specify:
[!0-9]
To define all characters except the semi-colon, we would specify:
[!;]
Note that the symbol '!' has this special meaning only as the FIRST character in a character list. The caret symbol, '^', as the FIRST character in a character list may also be used to negate a character list. Traditionally, the caret has been used for this purpose, but QTAwk also allows the logical NOT operator, '!'.
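Negated lists behave as follows in practice. The sketch is written in Python, which supports only the traditional caret form; the '!' form is specific to QTAwk, and in Python a leading '!' is an ordinary member of the list. The helper `is_single` is hypothetical.

```python
import re

non_digit = re.compile(r"[^0-9]")      # QTAwk also accepts [!0-9]
not_semicolon = re.compile(r"[^;]")    # QTAwk also accepts [!;]

# In Python, '!' inside a list is an ordinary member, not a negation.
python_bang_list = re.compile(r"[!0-9]")

def is_single(pattern, ch):
    """Return True when the one-character string ch matches the pattern."""
    return pattern.fullmatch(ch) is not None

print(is_single(non_digit, "a"), is_single(non_digit, "7"))
print(is_single(not_semicolon, ","), is_single(not_semicolon, ";"))
print(is_single(python_bang_list, "!"), is_single(python_bang_list, "a"))
```

When porting a QTAwk expression to another tool, a leading '!' in a character list therefore has to be rewritten as '^'.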
Utilizing the above concepts for building regular expressions by concatenating characters, concatenating regular expressions to build more complicated regular expressions, using parentheses to nest or group regular expressions within regular expressions, using character lists to denote constructs with implied ORs, negated character lists to specify characters to exclude from a given position, using the closure operators, '*', '+' and '?', and the repetition operator, {n1,n2}, for expressing multiple copies, very complicated regular expressions may be built for searching for strings in files.
Escape Sequences, \c
To round out the ability for building regular expressions for searching, we need only a few more tools. In some cases we may wish for the regular expression to contain blanks or tab characters. In addition, other non-printable characters may be included in regular expressions. These characters are defined with escape sequences. Escape sequences are two or more characters used to denote a single character. The first character is always the backslash, '\'. The second character is by convention a letter as follows:
| Escape Sequence | Character | Hexadecimal Value |
|---|---|---|
| \a | bell (alert) | \x07 |
| \b | backspace | \x08 |
| \f | formfeed | \x0c |
| \n | newline | \x0a |
| \r | carriage return | \x0d |
| \s | space (blank) | \x20 |
| \t | horizontal tab | \x09 |
| \v | vertical tab | \x0b |
| \c | the character c itself [ \\ == \ ] | |
| \ooo | character represented by octal value ooo; 1 to 3 octal digits acceptable | |
| \xhhh | character represented by hexadecimal value hhh; 1 to 3 hexadecimal digits acceptable | |
Any other character following the backslash is translated to mean that character. Thus \c would become a single c, \[ would become [, etc. The latter is necessary in order to include such characters as [, ], -, !, (, ), *, +, ? in regular expressions without invoking their special meanings as regular expression operators.
Position Operators, ^ . $
Three additional special characters have, by convention, been defined for use in writing regular expressions, namely the period, '.', the caret, '^', and the dollar sign, '$'. The period has been assigned to mean any character in the set of characters except the newline character, \n. For our use, the period means any character from ASCII 1 to ASCII 9 and from ASCII 11 to ASCII 255, inclusive, that is, every character except the newline, ASCII 10.
The caret and the dollar sign are position indicators and not character indicators. The caret, ^, is used to indicate the beginning or start of the search string. Thus, any character following the caret in a regular expression must be the first character of the string to be searched, otherwise the match fails. The dollar sign, $, is used to indicate the end of the search string. Thus, any character preceding the dollar sign in a regular expression must be the last character of the string to be searched or the match fails.
To indicate beginning of line, the caret must be in the first character position of a regular expression. Similarly, to indicate end of line, the dollar sign must be in the last character position of a regular expression. In any other position, these characters lose their special significance. Thus, the regular expression:
/(^|[\s\t])A/
or
/([\s\t]|^)A/
means that 'A' must be the first character on a line, or be preceded by a space or tab character to match. Similarly
/A($|[\s\t])/
or
/A([\s\t]|$)/
means that 'A' must be the last character on a line or be followed by a space or tab character.
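The two position patterns can be tried on a few strings. This is a Python sketch rather than QTAwk. One assumption matters here: in QTAwk \s is the blank character only, while in Python `\s` means any white space, so the sketch writes the list as `[ \t]` to keep the QTAwk meaning.

```python
import re

# QTAwk: /(^|[\s\t])A/ and /A($|[\s\t])/
starts_word = re.compile(r"(^|[ \t])A")
ends_word = re.compile(r"A($|[ \t])")

for line in ["Apple", "an Apple", "bAd"]:
    print(repr(line), starts_word.search(line) is not None)

for line in ["DATA", "A B", "AB"]:
    print(repr(line), ends_word.search(line) is not None)
```

In each group the first two lines are found and the third is not, because there the 'A' is neither at the line boundary nor next to a blank or tab.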
Examples
The regular expression:
/[A-Za-z][a-z]\s+.*/
will match an upper or lower case letter followed by a lower case letter followed by one or more blanks followed by any character except a newline zero or more times.
The regular expression:
/\([A-Z]+\)[!\s]+/
will match a left parenthesis followed by one or more upper case letters followed by a right parenthesis followed by one or more characters which are not blanks.
The regular expression:
/[\s\t]+ARCHIVE([\s\t]+|$)/
will match one or more blanks or tabs followed by the word (in upper case) ARCHIVE followed either by one or more blanks or tabs or by the end of line. Note this kind of construct is handy for finding words as independent units and not buried within other words.
The regular expression:
/([\s\t]+|$)/
is necessary to find words with trailing blanks or that end the search line. If only [\s\t]+ had been used then words ending the search line would not be found, since there are no trailing blanks or tabs.
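The need for the end of line alternative shows up as soon as the word ends the line. The sketch below is in Python, again writing the QTAwk list [\s\t] as `[ \t]`; the names `whole_word` and `blanks_only` are invented for the comparison.

```python
import re

whole_word = re.compile(r"[ \t]+ARCHIVE([ \t]+|$)")   # QTAwk: /[\s\t]+ARCHIVE([\s\t]+|$)/
blanks_only = re.compile(r"[ \t]+ARCHIVE[ \t]+")      # trailing blanks required

lines = ["copy to ARCHIVE", "the ARCHIVE bit", "the ARCHIVED bit"]
for line in lines:
    found_full = whole_word.search(line) is not None
    found_blanks = blanks_only.search(line) is not None
    print(repr(line), found_full, found_blanks)
```

Only the first pattern finds ARCHIVE at the end of a line, and neither one is fooled by ARCHIVED. Note that both patterns still require a leading blank or tab, so a line that begins with ARCHIVE is not found by either.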
Note that for files with the newline character, '\n', at the end of all lines, commonly called ASCII text files, it is possible to search for regular expressions that may span more than one line. For example, if we wanted to find all sequences of the names
Ted, Alice, George and Mary
that were separated by spaces, tabs or line boundaries, we would write the following regular expression:
/[\s\t\n]+Ted[\s\t\n]+Alice[\s\t\n]+George[\s\t\n]+Mary[\s\t\n]/
The regular expression:
/^As\s+(Fred|Ted|Jed|Ned)\s+(began|ended)(\s+|$)/
will match the beginning of the search line followed by As, i.e., 'A' as the first character of the search line, followed by 's', followed by one or more blanks followed by Fred or Ted or Jed or Ned followed by one or more blanks followed by began or ended followed by one or more blanks or the end of the search line. This could be modified slightly to be:
/^As\s+(Fr|T|J|N)ed\s+(began|ended)(\s+|$)/
or
/^As\s+(Fr|[TJN])ed\s+(began|ended)(\s+|$)/
Either form will result in exactly the same search.
Look Ahead Operator, @
Sometimes it is necessary to find a regular expression, but only when it is followed by another regular expression. Thus we wish to find Mr, but only when it is followed by Smith. The look-ahead operator, '@', is used to denote this situation. In general, if r is a regular expression we wish to match, but only when followed by the regular expression s, then we would express this as:
/r@s/
Thus, to find Mr, but only when followed by Smith, we have:
/Mr@[\s\t]+Smith/
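The '@' operator is specific to QTAwk. A comparable effect in Python is obtained with the look-ahead group `(?=...)`, so the following sketch uses that construct as an assumed equivalent, not as QTAwk syntax. The point to observe is that the matched text is Mr alone; Smith is checked but not consumed.

```python
import re

# Assumed Python counterpart of the QTAwk expression /Mr@[\s\t]+Smith/
mr_before_smith = re.compile(r"Mr(?=[ \t]+Smith)")

for line in ["Dear Mr  Smith,", "Dear Mr Jones,", "Dear Mrs Smith,"]:
    match = mr_before_smith.search(line)
    print(repr(line), match.group() if match else None)
```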
Match Lists, [#...]
There are also circumstances in which we wish to find pairs of characters. For example, we wish to find all clauses in a letter enclosed within parentheses, (), braces, {}, or brackets, []. We could write several separate regular expressions which are identical except that one would use parentheses, another braces, etc. A simpler method has been introduced using the concept of matched character lists. A matched character list is denoted as:
[#\(\{\[] and [#\)\}\]]
The first instance of a matched character list in a regular expression will match any character in the list. The second instance will match only the character in the position of the list matched by the first instance. For example, in the above two lists, if the character that matched the first list was '[', then only a ']' would match the second list and not a ')' or a '}'. Note the use of the backslash above to avoid any confusion in interpreting the characters (), {}, and [] as characters and regular expression operators. Except for ']', the backslash is not needed since the characters do not act as operators within a character list. For the character ']', the backslash is necessary to prevent early termination of the character list.
Note that matched character lists cannot be nested. Thus, the span of characters between two different matched character lists cannot overlap. If we wanted to find regular expressions contained within ([ and )] or within {[ and }], the instances of each in the regular expression could not overlap, i.e., we could NOT write a regular expression like:
/this [#\(\[] exp [#\{\[] contains [#\)\]] two [#\}\]] matched/
      |<-------------------------------->|           |
      |           |<-------------------------------->|
This regular expression would be interpreted as:
/this [#\(\[] exp [#\{\[] contains [#\)\]] two [#\}\]] matched/
      |<--------------->|          |<--------------->|
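Matched character lists have no direct counterpart in most other regular expression tools. The pairing rule itself, that the closing character must sit at the same list position as the opening character that was found, can be emulated as in the Python sketch below. The names `OPENERS`, `CLOSERS` and `find_enclosed` are invented for the illustration, and the function is a simplification: it takes the first closing character after each opener and makes no attempt to handle nesting.

```python
import re

OPENERS = "({["   # plays the role of [#\(\{\[]
CLOSERS = ")}]"   # plays the role of [#\)\}\]]

def find_enclosed(text):
    """Return clauses delimited by an opener and the closer at the same list position."""
    clauses = []
    for match in re.finditer(r"[({\[]", text):
        # Look up the closer that shares the opener's position in its list.
        closer = CLOSERS[OPENERS.index(match.group())]
        end = text.find(closer, match.end())
        if end != -1:
            clauses.append(text[match.start():end + 1])
    return clauses

print(find_enclosed("see (a) and [b} or {c}"))
```

The clause starting with '[' is not reported because no ']' follows it, even though a '}' does.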
If the strings to be found using regular expressions are complicated, the associated regular expressions can become very difficult to understand. This makes it very hard to determine if the regular expression is correct. For example, the regular expression (as one line):
/^[A-Za-z_][A-Za-z0-9_]*([\s\t]+\**[A-Za-z_][A-Za-z0-9_]*)*
\((([\s\t]*[\*&]*[A-Za-z_][A-Za-z0-9_]*[\s\t]*)(,([\s\t]*
[\*&]*[A-Za-z_][A-Za-z0-9_]*[\s\t]*))*)*\)([\s\t]*
(\/\*.*\*\/)[\s\t]*)*$/
will find function definitions in C language programs. Constructing and analyzing this regular expression as a single entity is difficult.
Breaking such regular expressions into smaller units, which are shorter and simpler, makes the task much easier. QTAwk has introduced the concept of named expressions for this purpose: a variable name enclosed in braces within a regular expression is replaced by the value of that variable.
For example, define a variable:
fst = "first words";
Then the following regular expression:
/The {fst} of the child/
would expand into:
/The first words of the child/
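The expansion step can be pictured as a textual substitution. The following Python sketch is an illustration of that idea only, not QTAwk's implementation; the function `expand_named` and its rule for telling a name from a repetition count are assumptions made for the example.

```python
import re

def expand_named(pattern, variables):
    """Replace each {name} in pattern with the value bound to name.

    Only brace groups that look like identifiers are expanded, so a
    repetition count such as {2,4} is left alone. Unknown names are
    left as written. Self-referential definitions are not guarded against.
    """
    def substitute(match):
        name = match.group(1)
        if name in variables:
            # Expand recursively so that named expressions may nest.
            return expand_named(variables[name], variables)
        return match.group(0)

    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_]*)\}", substitute, pattern)

print(expand_named("The {fst} of the child", {"fst": "first words"}))

parts = {"c_n": r"[A-Za-z_][A-Za-z0-9_]*", "c_np": r"\**{c_n}"}
print(expand_named("{c_np}", parts))
```

The second call shows nesting: c_np refers to c_n, and both are expanded in the result.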
Named expressions allow for building up regular expressions from smaller more easily understood regular expressions and for re-using the smaller regular expressions. The following example QTAwk utility builds the previous regular expression for recognizing C language function definitions (all on one line) from many smaller regular expressions. Each constituent regular expression is built to recognize a particular part of the function definition. When combined into the final regular expression, the three parts of the definition can be easily understood. The final regular expression is expanded in the final print statement. It spans several 80 character lines and is much more difficult to understand due to its length and complexity.
Example:
BEGIN {
    # Define regular expressions for finding functions in C
    # source code.
    #
    # Functions of form: (K&R Style)
    #   int int_func(fparm, sparm, tparm)  /* possible comment */
    #
    # Functions of form: (ANSI Style)
    #   int int_func(            /* possible comment */
    #       unsigned int fparm,  /* possible comment */
    #       char sparm,          /* possible comment */
    #       char *tparm)         /* possible comment */
    #
    # define variables for use in regular expressions:
    #
    # Define C Name Expression
    c_n = /[A-Za-z_][A-Za-z0-9_]*/;
    #
    # Define C Name Expression with braces
    c_nb = /[A-Za-z_][A-Za-z0-9_\[\]]*/;
    #
    # Define C Comment Expression
    # Note: Does NOT Allow Comment To Span Lines
    # The slashes of the comment delimiters are escaped so that they
    # do not end the regular expression early.
    c_c = /(\/\*.*\*\/)/;
    #
    # Define Single Line Comment
    c_slc = /({_w}*{c_c}{_w}*)?/;
    #
    # Define C Name With Pointer
    c_np = /\**{c_nb}/;
    #
    # Define C Name With Pointer Or Address
    c_ni = /[\*&]*{c_nb}/;
    #
    # Define C Function Type And Name Declaration
    c_fname = /{c_n}({_w}+{c_np})+/;
    #
    # K & R Style Function Lists:
    #
    # Define Expression For First Argument In Function List
    c_first_arg = /({_w}*{c_ni}{_w}*)+/;
    #
    # Define Expression For Remaining Argument In Function List
    c_rem_arg = /(,{c_first_arg})*/;
    #
    # Define C Function Argument List
    c_arg_list = /\(({c_first_arg}{c_rem_arg})?\)/;
    #
    # ANSI Style Function Lists:
    #
    # Define C Arguments - Type And Name
    c_argn = /{_w}*{c_n}({_w}+{c_np})*/;
    # Define expression for argument in function list
    c_arg_ns = /({c_slc}\n{c_argn})/;
    # Define C function argument list
    c_arg_list_ns = /\(({c_arg_ns}(,{_w}*{c_arg_ns})*)?\)/;
    #
    # Expression To Find All C Function Definitions - K&R Style Definition
    totl_name_KR = /^{c_fname}{c_arg_list}{c_slc}$/;
    #
    # Expression To Find All C Function Definitions - ANSI C Style Definition
    totl_name_ANSI = /^{c_fname}{c_arg_list_ns}{c_slc}$/;
    #
    # print total expression to illustrate expansion of named
    # expressions
    # Refer to the description of the 'replace' function
    #
    print replace(totl_name_KR);
}
The string output by this utility is:
^[A-Za-z_][A-Za-z0-9_]*([\s\t]+\**[A-Za-z_][A-Za-z0-9_]*)*
\((([\s\t]*[\*&]*[A-Za-z_][A-Za-z0-9_]*[\s\t]*)(,([\s\t]*
[\*&]*[A-Za-z_][A-Za-z0-9_]*[\s\t]*))*)*\)([\s\t]*
(\/\*.*\*\/)[\s\t]*)*$
Note that in printing the regular expression, the leading and trailing slash, '/', were not printed.
The QTAwk utility glbvars.exp is a more complete version of the above utility.
Predefined Names, [A-Za-z]
In translating regular expressions, names starting with an underscore and followed by a single upper or lower case letter are reserved as predefined. The following predefined names are currently available for use in named expressions:
| Alphabetic | {_a} | == | [[:alpha:]] |
| Brackets | {_b} | == | [{}()[\]<>] |
| Control Character | {_c} | == | [[:cntrl:]] |
| Decimal Digit | {_d} | == | [[:digit:]] |
| Exponent | {_e} | == | [DdEe][-+]?{_d}{1,3} |
| Floating point number | {_f} | == | [-+]?({_d}+\.{_d}*|{_d}*\.{_d}+) |
| Floating, optional exponent | {_g} | == | {_f}({_e})? |
| Hexadecimal digit | {_h} | == | [[:xdigit:]] |
| decimal Integer | {_i} | == | [-+]?{_d}+ |
| lower-case alphabetic | {_l} | == | [[:lower:]] |
| upper-case alphabetic | {_m} | == | [[:upper:]] |
| alpha-Numeric | {_n} | == | [[:alnum:]] |
| Octal digit | {_o} | == | [0-7] |
| Punctuation | {_p} | == | [[:punct:]] |
| double or single Quote | {_q} | == | {_s}["'`] |
| Real number | {_r} | == | {_f}{_e} |
| zero or even number of Slashes | {_s} | == | (^|[!\\](\\\\)*) |
| printable character | {_t} | == | [[:print:]] |
| graphical character | {_u} | == | [[:graph:]] |
| White space | {_w} | == | [[:blank:]] |
| space, \t, \n, \v, \f, \r, \s | {_z} | == | [[:space:]] |
For {_f} and {_r}, the decimal point, '.', will be replaced by the decimal point appropriate for the current character set.
The above predefined names will take precedence over any variables with identical names in replacing named expressions in regular expressions and the replace function.
Tagged Strings, (...)
QTAwk recognizes and searches for regular expressions containing parenthesized regular expressions. QTAwk makes special use of the strings which match regular expressions contained within parentheses. The strings matching regular expressions within parentheses are called Tagged Strings and the QTAwk Tag Operator, [< >], is used to refer to tagged strings. The use of the tag operator to refer to tagged strings is explained in QTAwk expressions. The discussion here will explain how tagged strings are counted. It is important to understand how QTAwk counts tagged strings to use the tag operator.
A pair of numbers, ln and cn, are used to label parenthesized regular expressions according to the nesting level, ln, and the count, cn, at a given level. There is no theoretical limit on the number of parenthesized regular expressions or the level to which the parenthesized regular expressions may be nested. However, while matching a regular expression, QTAwk only keeps track of tagged strings to a nesting level of 7, 1 <= ln <= 7, and a maximum count of 31, 1 <= cn <= 31, for each level. QTAwk can utilize regular expressions with parentheses nested deeper than 7 and a count greater than 31 at each level, but for use with the tag operator, these limits apply.
The following examples illustrate the method for counting tagged strings. Tagged strings are identified with a pair of integers, written in the examples as i for the nesting level and j for the count at that level. Tagged strings are counted according to the parenthesis sets in the regular expression. Thus, the examples below show parenthesis nesting level and count using the regular expression and not the strings matching the regular expressions.
For the regular expression:
/[Tt]he matching ((string|digit)s (can|will))/
i == 1, j == 1
Nesting Level 1, First regular expression at this level
i == 2, j == 1
Nesting Level 2, First regular expression at this level
i == 2, j == 2
Nesting Level 2, Second regular expression at this level
Using the same regular expression, but omitting one set of parentheses, we have:
/[Tt]he matching (string|digit)s (can|will)/
i == 1, j == 1
Nesting Level 1, First regular expression at this level
i == 1, j == 2
Nesting Level 1, Second regular expression at this level
Care must be used in determining parenthesis level and counts for regular expressions containing variable names, predefined or user defined. For example, the regular expression:
/({_g}|{_i}) (--> ([rfi]))/
matches floating point numbers with optional exponents or integers followed by
--> r or
--> f or
--> i
When the predefined names are expanded, the regular expression is /([-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([DdEe][-+]?[0-9]([0-9])?([0-9])?)?|[-+]?[0-9]+) (--> ([rfi]))/
The parenthesis set at level 1, count 1 contains the string matching the floating point number or integer. The parenthesis set at level 2, count 1 contains the string matching the mantissa of the floating point number. The parenthesis set at level 2, count 2 contains the string matching the optional exponent of the floating point number (whether the exponent is present in the matching string or not). The parenthesis set
(--> ([rfi]))
is at level 1, count 2 and
([rfi])
is at level 2, count 3. The various levels and counts are listed below:
i == 1, j == 1, level 1, count 1
i == 2, j == 1, level 2, count 1 ---> mantissa portion of [-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([DdEe][-+]?[0-9]([0-9])?([0-9])?)? expanded
i == 2, j == 2, level 2, count 2 ---> exponent portion of [-+]?([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([DdEe][-+]?[0-9]([0-9])?([0-9])?)? expanded
i == 3, j == 1 && 2, level 3, count 1 and 2
i == 1, j == 2, level 1, count 2
i == 2, j == 3, level 2, count 3
Thus, when named expressions, predefined or user defined, are included in regular expressions, care must be taken to account for any parenthesis set contained within the named variable when determining parenthesis set level and count for use with the tag operator.
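The counting rule can be applied mechanically to an expression that has already been expanded. The Python sketch below is an illustration only and assumes the rule as the examples show it: the count for a nesting level runs across the whole expression and is not reset when an enclosing parenthesis set closes. The function `label_groups` is hypothetical, and its handling of character lists is simplified (it skips escaped characters and treats the first ']' as the end of a list).

```python
def label_groups(pattern):
    """Return a (level, count) pair for each '(' in pattern, in order of appearance."""
    labels = []
    counts = {}        # nesting level -> parenthesis sets seen so far at that level
    level = 0
    in_list = False    # True while scanning inside a [...] character list
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2     # skip the backslash and the character it escapes
            continue
        if in_list:
            if ch == "]":
                in_list = False
        elif ch == "[":
            in_list = True
        elif ch == "(":
            level += 1
            counts[level] = counts.get(level, 0) + 1
            labels.append((level, counts[level]))
        elif ch == ")":
            level -= 1
        i += 1
    return labels

print(label_groups("[Tt]he matching ((string|digit)s (can|will))"))
print(label_groups("[Tt]he matching (string|digit)s (can|will)"))
```

For the first expression this gives level 1 count 1, level 2 count 1 and level 2 count 2; for the second, level 1 count 1 and level 1 count 2, in agreement with the examples above. Running it on the fully expanded form of an expression that uses named expressions is a convenient way to find the pair needed for the tag operator.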
Regular Expression Operator Summary
The QTAwk regular expression operators are summarized below:
| [:alpha:] | alphabetic characters |
| [:alnum:] | alphabetic and numeric |
| [:blank:] | blank and tab characters |
| [:cntrl:] | control characters |
| [:digit:] | numeric digits |
| [:graph:] | printable and visible characters |
| [:lower:] | lower-case alphabetic |
| [:print:] | printable characters |
| [:punct:] | punctuation characters |
| [:space:] | white space characters |
| [:upper:] | upper-case characters |
| [:xdigit:] | hexadecimal digit characters |
Special forms of repetition operator: