Let's first take a grand tour of all the regular expression
operators. The table below shows all of the regular expression character operators along with
a brief description of how they operate. Any character that is not in this list and is
used in a regular expression stands for itself and nothing else. Such characters are often called ordinary
characters. For example, all of the alphabetic and numeric characters stand for themselves
when used in a regular expression.
In this section we will start to explore the building of regular expressions by examining
simple regular expression patterns. The point of showing these somewhat trivial examples
is to ensure that the difference between ordinary and special characters is understood.
The following string is a regular expression that consists of the character q and would match
just the character q and nothing else.
q
While this seems trivial, it shows that q is an ordinary character that can by itself form a
regular expression. Simple regular expressions contain no special operators and simply
represent themselves. Further, the regular expression
quit
is a simple regular expression that would match the characters q, u, i, t in succession. Any
character that is not listed as a special operator character above can occur in a simple
regular expression and will stand for itself and nothing else. For example
column; row%
is a simple regular expression that matches the string "column; row%", space included. If for
some reason we need to include a special character as part of a simple regular expression then
we can do so by preceding it with the escape character \. This character means that the next
character should not be considered a special character but simply an ordinary character in the
regular expression. The following example would match the string ABC*
ABC\*
It should also be noted that a space does participate in a regular expression as an ordinary
character that must be matched by a space.
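The rules for ordinary characters, spaces and the escape character can be tried out with a short script. The sketch below uses Python's re module purely as a convenient test bench. That choice is an assumption of this example, since Python is not one of the tools discussed here, and the sample strings are made up.

```python
import re

# Ordinary characters stand for themselves.
quit_found = re.search(r"quit", "do not quit now") is not None

# A space in the pattern must be matched by a space in the text.
space_needed = re.search(r"column; row%", "column;row%") is None

# The backslash turns the special character * into an ordinary one.
star_literal = re.search(r"ABC\*", "xyzABC*") is not None

# Without the escape, * means "zero or more C", so AB alone matches.
star_repeat = re.search(r"ABC*", "AB") is not None

print(quit_found, space_needed, star_literal, star_repeat)
```

All four values print as True, which shows both the literal behavior of ordinary characters and the effect of leaving out the escape.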
A simple regular expression can stand alone as shown here,
but it is often concatenated with regular expression operators and other simple regular
expressions to form complex regular expressions. This is of course the real power of regular
expressions and deserves close inspection.
A period is a special character in regular expressions that will match any single character.
You use this in a regular expression anytime you want a single character to occur but it does
not matter what that character is. For example if we wanted to match all three character
strings that start with the letter r and end with the letter n the following regular expression
would do the trick.
r.n
This would of course match the strings run, ran and ron, which form valid words, but it would also
match a long list of nonsensical character strings such as r1n, rxn and r&n, as well as
longer strings that contain a match for r.n as a substring, such as ronald or rink.
The point here is not to overlook the fact that the period matches any character in the ASCII
character set, not just numbers and letters.
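A quick way to see how permissive the period is would be to run r.n over a list of candidate strings. This is a minimal sketch using Python's re module (an assumption of the example, as is the candidate list); re.search reports a match anywhere in the string, much as grep does for a line.

```python
import re

candidates = ["run", "ran", "ron", "r1n", "r&n", "ronald", "rink", "rn", "barn"]

# Keep every string that contains r, any one character, then n.
matched = [s for s in candidates if re.search(r"r.n", s)]

print(matched)
```

Only rn and barn are rejected, because in both of them the r is followed directly by the n and the period has no character to consume.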
As another example let's say that we were filtering a file
that contained five character part codes. We want to match any part code that has the character
Z as the third character. The regular expression that would accomplish this task could be
written as
..Z..
This regular expression would of course match any character string 5 characters long that had
a Z as the third character. In this example we will assume that our file contains only part
numbers and no other strings. If that were not the case this regular expression could get us into
trouble, since it also matches inside any longer string, unless we took care to isolate just the
part number.
The left bracket starts the definition of a character set in a regular expression. Any
characters that occur between the left bracket and the right bracket are considered part of
the set. The regular expression matches if any of the characters in the set occur in the
string being examined. As an example let's return to our three character string that begins
with r and ends with n. We saw previously that the period special character gave us many
matches. If our intention were just to match three character words that start with r and end
with n we might utilize the [] construct in our regular expression to help us narrow our
matches. We might begin by assuming that we should narrow our search to just lower case
alphabetic characters. We can do this using a character class. Of course we could list
every lower case character between the brackets but luckily we can use a shorthand notation to
represent this. The - is used to represent a range of characters. So the range [a-z]
represents [abcdefghi...z] and thus forms our desired regular expression representing all
lower case alphabetic characters. Now getting back to our example the full regular expression
would be
r[a-z]n
This would eliminate any string that did not have a lower case alphabetic character in the
second position. But clearly this can still match lots of nonsense words (rbn, rcn etc.).
If we assume that a vowel needs to occur in the second position then we can further
limit our regular expression by placing just the vowels in our character set. For example
r[aeiou]n
would match more closely with our intentions of matching words that contain a substring that
starts with r and ends with n (ran, run, ron, ronald, rink). Placing a space after the n in
the regular expression would make it more closely match just three character words that start
with r and end with n followed by a space.
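The narrowing effect of the two character sets can be compared side by side. The following is a small sketch in Python (the re module and the word list are assumptions of the example).

```python
import re

words = ["ran", "run", "ron", "rbn", "r1n", "rAn", "ronald", "rink"]

# Any lower case letter in the second position.
lower = [w for w in words if re.search(r"r[a-z]n", w)]

# Only a vowel in the second position.
vowel = [w for w in words if re.search(r"r[aeiou]n", w)]

print(lower)
print(vowel)
```

The first list still contains rbn, while the second drops it. Both lists keep ronald and rink because the pattern is free to match a substring.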
The [] construct is very powerful at limiting the scope of
regular expression matching to particular characters or sets of characters. The range operator -
can specify any sequence of characters as long as they are contiguous in the ASCII sequence.
For example [A-C] represents just the uppercase characters ABC. There are several ranges that
are commonly used and are listed below:
[A-Z] - all upper case alphabetic characters
[a-z] - all lower case alphabetic characters
[0-9] - all digit characters
[A-Za-z] - all alphabetic characters
In addition to the range operator there is another operator that has special meaning within
the brackets. If the first character after the left bracket is a ^ then the complement of the
character set defined between the brackets is matched. This means that any character that
is not in the set defined will be matched. For example the regular expression
r[^A-Z]n
would match any three character string that started with r and ended with n and did not contain
an uppercase alphabetic character in the second position. This operator can be applied to the
common ranges developed above to form other very useful ranges. For example:
[^A-Za-z] - match all non alphabetic characters
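The complement operator can be checked in the same way. This sketch again assumes Python's re module, and the test strings are invented for illustration. re.fullmatch is used so that the whole three character string has to match.

```python
import re

strings = ["rAn", "ran", "r1n", "r n", "rZn"]

# Second character may be anything except an upper case letter.
not_upper = [s for s in strings if re.fullmatch(r"r[^A-Z]n", s)]

# Collect every non alphabetic character in a string.
non_alpha = re.findall(r"[^A-Za-z]", "a1b-2 c")

print(not_upper)
print(non_alpha)
```

Note that the space in "r n" is accepted, since the complemented set matches any character that is not listed, not just letters and digits.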
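One last point concerning the [] construct. All characters
that occur inside the brackets are ordinary. The only characters that have special meaning
within the [] are the range operator - and the ^ complement operator. In many tools these can
stand for themselves if preceded by the \ escape character. Where the tool treats \ as an
ordinary character inside the brackets, as POSIX specifies for grep and sed, place the - last
in the set and the ^ anywhere but first.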
If the ^ character occurs outside of the character class operator, [], and at the start of a
regular expression then the expression that follows must match at the beginning of the line.
For a very simple example consider the following regular expression:
^Windy
This regular expression would only match the characters "Windy" if they occurred at the
beginning of a line. All other occurrences of "Windy" would not match. Matching the
beginning of a line can be a very useful tool when editing files and is a frequently used
regular expression in sed and ed. The ^ character alone matches just the beginning of a
line and can be used to insert information onto the front of a line using sed. For example
the regular expression:
^
will match the start of every line in the file.
But the ^ character can really occur before any regular expression. Let's return to our
previous example where we wanted to match a part number located in a file. If you recall
this was accomplished using the regular expression:
..Z..
As we pointed out then, this regular expression could match any character string in the file
that contained a Z in the third position. We were safe as long as we just had part numbers in
the file. Let's assume instead that the part number was at the beginning of each line in the
file. Then the following regular expression, using the ^ special character, could limit the
matching performed to just the start of the line:
^..Z..
You will find the ^ operator to be very useful in forming regular expressions since key
information is often located at the start of a line of data. In addition it is often easy to
arrange it so that important information is located at the start of a line.
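The difference the ^ makes can be seen by running the part number pattern with and without it. This is a sketch under stated assumptions: it uses Python's re module, the sample lines are made up, and the pattern is written with an upper case Z to agree with the description of the part number. The last statement imitates inserting text at the front of every line, which is the kind of edit sed would perform with a bare ^.

```python
import re

lines = ["ABZ12 widget", "spare ABZ12", "XXZ99 gasket", "ABC12 washer"]

# Unanchored: the part number may sit anywhere on the line.
anywhere = [ln for ln in lines if re.search(r"..Z..", ln)]

# Anchored: the part number must be the first five characters.
at_start = [ln for ln in lines if re.search(r"^..Z..", ln)]

# A bare ^ matches the (empty) start of each line; MULTILINE makes
# ^ apply after every newline, not only at the start of the string.
prefixed = re.sub(r"^", "> ", "one\ntwo", flags=re.MULTILINE)

print(anywhere)
print(at_start)
print(prefixed)
```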
Like the ^ special character the $ character following a regular expression is used to cause
a match to occur only if the preceding regular expression occurs at the end of a line. For
example:
END$
would match only if the string "END" occurred at the end of a line. All other occurrences of
the string "END" would be ignored. As with ^ the $ character can be used after any valid
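regular expression to force the match to occur only at the end of the line. For example,
suppose we want to match the word end at the end of a line in any mixture of case:
[Ee][Nn][Dd]$
Each character set lists both cases of one letter. No commas or spaces are used, since inside
the brackets a comma would simply be one more character to match and a space between the sets
would have to be matched by a space. This is often a good technique for matching user input
where the input can be in any case and you want to look simply for the word regardless of case.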
Consider the following example which uses both the ^ and
the $ regular expression characters.
^$
This regular expression will match all empty lines, that is, lines that contain no characters at all.
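The end of line examples can be exercised with a few sample lines. The sketch below assumes Python's re module and invented input. The mixed case pattern is written as [Ee][Nn][Dd]$, with no commas or spaces inside or between the character sets, which is this example's reading of the intended expression.

```python
import re

lines = ["THE END", "END of file", "the end", "", "   ", "End"]

# END must be the last thing on the line.
ends_with_end = [ln for ln in lines if re.search(r"END$", ln)]

# Same idea, but each letter may be in either case.
any_case = [ln for ln in lines if re.search(r"[Ee][Nn][Dd]$", ln)]

# ^$ leaves no room for any character, so only the empty line matches.
empty_positions = [i for i, ln in enumerate(lines) if re.search(r"^$", ln)]

print(ends_with_end)
print(any_case)
print(empty_positions)
```

The line of three spaces is not matched by ^$, since it still contains characters between the start and the end of the line.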
The * character is used to match zero or more occurrences of the preceding character or
regular expression. Note that this is different from the file name generation *
symbol, which says place any string in this position (this is accomplished in
regular expressions by the formation .* which is any character repeated any
number of times). The * in regular expressions is simply a repeat symbol. The
preceding character can repeat any number of times, including zero times,
and still match. As an example consider the following regular expression
ZA*P
This regular expression would match the string ZP as well as the string
ZAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP
If we wanted to assure that at least one occurrence of the A were to appear
in the string we could write the regular expression
ZAA*P
which would at a minimum match ZAP.
Of course the * special character can be
used after any regular expression to indicate that it is to repeat zero or
more times. For example here are four widely used regular expressions
involving *.
[A-Za-z][A-Za-z]* - matches any string of alphabetic characters (almost matches
all words)
[+\-][0-9][0-9]* - matches any integer with a preceding + or -
.* - matches any string of characters
^ *$ - matches lines that are empty or have only spaces
Note that in the third example a .* alone will match the entire line since
regular expressions match the longest possible string if any ambiguity exists.
This is a good way to match all the lines in a file.
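The repeat examples above can be checked with a short script. This is a minimal sketch that assumes Python's re module; the sample strings are made up for illustration.

```python
import re

samples = ["ZP", "ZAP", "ZAAAP", "ZBP"]

# A* allows zero A's, so ZP is accepted.
zero_or_more = [s for s in samples if re.fullmatch(r"ZA*P", s)]

# AA* insists on one A before the optional repeats.
one_or_more = [s for s in samples if re.fullmatch(r"ZAA*P", s)]

# Runs of letters, which is close to "words".
words = re.findall(r"[A-Za-z][A-Za-z]*", "copy 2 files now")

# Integers that carry an explicit sign.
signed = re.findall(r"[+\-][0-9][0-9]*", "+12 -7 42")

# .* takes the longest string it can, here the whole line.
longest = re.search(r".*", "any text at all").group()

print(zero_or_more, one_or_more, words, signed, longest)
```

The unsigned 42 is missing from the signed list, which is the limitation of that expression: the sign is required rather than optional.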
The * character introduced in the last section is very powerful but provides
no control over how many occurrences of a character are considered valid as a
match. The count can range from 0 to some very large number of occurrences. Sometimes
we would like to have a little more control when building regular expressions.
The \{m,n\} construct provides this control. This construct provides three
ways to control matching of repeating characters. These are
1. Character must repeat within a range specified by m and n with m being
the minimum number of times the character can repeat and n the maximum.
This has the form \{m,n\}.
2. Character must repeat at least m number of times. This has the form
\{m,\}.
3. Character must repeat exactly m number of times. This has the form
\{m\}.
These forms often provide the control that we need when forming
complex regular expressions. For example consider the expression from the
last section which described integers
[+\-][0-9][0-9]*
Actually this forces our integers to have a preceding + or - sign. This is
not often the case when trying to match integers in general, which may or may
not have a + or - sign in front. But without the \{m,n\} construct it would
be very difficult to write a regular expression to match the general case
integer. Using [+\-]* would not work, because it would also accept several
signs in a row, as in +-+5. But now that
we can control the number of occurrences the following regular expression
should do the trick.
[+\-]\{0,1\}[0-9][0-9]* - matches any legal integer expression
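Likewise we could now form an expression that would match a real or
integer number by using the following regular expression
[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9][0-9]*
- matches a real / integer decimal expression
Note that, as written, this expression requires at least one digit on each side of the
optional decimal point, so a single digit such as 7 is not matched. Writing the final
digits as [0-9]* instead of [0-9][0-9]* would cover that case.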
As another example let's return to our
part number and assume that the first two positions in the part number are
alphabetic and the final two positions are numeric. Then our part number
would look something like AAZ23. We may want to match a part number
anywhere it is found and can assume that part numbers always have this
form. Using the repeating control construct we can form a regular expression
which matches part numbers as follows
[A-Z]\{2\}Z[0-9]\{2\}
The first part of the expression, [A-Z]\{2\}, says that we must have exactly
two occurrences of uppercase alphabetic character followed by a Z which of
course is followed by two numeric characters.
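The repeat counts can be tried out as follows. This is a sketch in Python, which is an assumption of the example and which spells the construct {m,n} without backslashes; the \{m,n\} spelling is the one used by grep, sed and ed. The sample strings are invented.

```python
import re

# Optional single sign, then one or more digits.
integer = re.compile(r"[+\-]{0,1}[0-9][0-9]*")
int_ok = [s for s in ["42", "+42", "-7", "+-7", "x"] if integer.fullmatch(s)]

# Real or integer form from the text, transcribed to Python syntax.
real = re.compile(r"[+\-]{0,1}[0-9][0-9]*\.{0,1}[0-9][0-9]*")
real_ok = [s for s in ["3.14", "-10", "7"] if real.fullmatch(s)]

# Exactly two upper case letters, a Z, then exactly two digits.
part = re.compile(r"[A-Z]{2}Z[0-9]{2}")
parts = part.findall("order AAZ23, BCZ07 and AZ123")

print(int_ok)
print(real_ok)
print(parts)
```

The string +-7 is rejected because at most one sign is allowed, and AZ123 is rejected because only one letter precedes the Z. The lone digit 7 fails the real number form, which needs a digit on both sides of the optional decimal point.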
Save a Match and Compare Later Using \(...\) and \n
The \(...\) construct provides the ability to save a matched string in
memory for comparison later in the regular expression. Any regular expression
can be placed between the \( \) parentheses. Each time that a string is stored
using this construct it is assigned a number 1 through 9 based on its position
in the regular expression. These stored strings can then be referenced
later in the same regular expression by using the \n construct in the place
of a regular expression. For example let's suppose that we had a file which
contained lines that were divided into fields based on the field delimiter
":". There are four fields on each line. We want to only select lines where
the first field matches the third field and the second field matches the fourth.
The following regular expression should do the trick and demonstrates how saved
expressions are assigned numbers
^\(.*\):\(.*\):\1:\2$
The first field is matched and stored as \1 using \(.*\): which says to match a
character string followed by a : which is our field delimiter. The second field
is stored as \2 by the second occurrence of \(.*\):. Finally we recall the
saved strings by referencing \1 in the third field position and \2 in the
fourth field position. The ^ and $ anchors tie the expression to the whole line, so
that with exactly four fields on a line each saved string is one complete field. If the
third field matches the string saved for the first field and the fourth field matches
the string saved for the second field then the line matches.
Saving matched patterns is a very useful
tool especially when using sed and ed which allow the saved pattern to be used
as a replacement value when editing a file. Note that this construct \( ...\)
is not available when using awk.
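The saved match idea can be sketched in Python, with two caveats that are assumptions of this example: Python writes the grouping as plain ( ) rather than \( \), and the pattern is anchored with ^ and $ so that each saved string covers one whole field of a four field line. The sample lines are made up. The last statement shows a saved match reused as a replacement value, in the spirit of a sed substitution.

```python
import re

# Field 1 must equal field 3, and field 2 must equal field 4.
same_pairs = re.compile(r"^(.*):(.*):\1:\2$")

lines = ["a:b:a:b", "a:b:a:c", "x:y:y:x", "red:10:red:10"]
selected = [ln for ln in lines if same_pairs.search(ln)]

# Saved matches can also be used in the replacement text.
swapped = re.sub(r"^(.*):(.*)$", r"\2:\1", "left:right")

print(selected)
print(swapped)
```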
Creating More Complex Regular Expressions
In the previous sections we have developed several examples used to demonstrate
each of the regular expression special characters. As we progressed we developed
increasingly complex expressions. As you can imagine regular expressions are
very powerful and can match or recognize a wide variety of strings. This is
done by concatenating smaller regular expressions into longer strings. We saw
this in the expression to match an integer or real decimal, which was given by
[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9][0-9]*
This is of course a concatenation of several smaller regular expressions. We
can extend this concatenation into a wide array of complex regular expressions
that will match all kinds of classes of strings. In fact regular expressions
are used in computer language compilers to help recognize valid syntactic
components of the language.
Egrep and Awk Extensions to Regular Expressions
In addition to the regular expression characters listed above egrep and awk
extend the capabilities of regular expressions by adding several other regular
expression characters. These are outlined in the following table. Note that
these are only available in egrep and awk and cannot be used in other tools
such as sed and ed. As was mentioned above the \( ... \) construct is not
part of the egrep/awk regular expression special characters.
+      Match the previous character one or more times. This is different from
       the * operator because zero occurrences do not match.
?      Match the previous character zero or one time only.
|      The OR operator, which means to match the regular expression pattern
       occurring on either side of the OR symbol: reg_exp1|reg_exp2
( )    A regular expression grouper that can be used to group entire regular
       expressions. Aids in removing ambiguities in complex regular expressions.
The + operator is used much like the * operator described above except that
the + operator does not consider zero occurrences to be a match. The preceding
character must occur at least once. This is a very convenient operator since
there are often situations where we want to assure that the regular
expression character occurs at least once. In the previous sections we saw
examples of regular expressions that describe words and integers. In each of
these regular expressions we had to take measures to assure that a letter
or digit occurred at least once, as is shown in the following
[A-Za-z][A-Za-z]*
By using the + operator we can simplify the expression to be the following
[A-Za-z]+
This implies that at least a single alphabetic character must be found in order
for a match to occur.
We could simulate the behavior of the +
character using the \{1,\} construct which implies that the preceding character
must occur at least 1 time.
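The three spellings of "one or more" can be compared directly. This sketch assumes Python's re module, where the interval is written {1,} without backslashes, and uses a made-up sample string.

```python
import re

text = "3 blind mice"

# Three ways to ask for one or more letters.
with_star = re.findall(r"[A-Za-z][A-Za-z]*", text)
with_plus = re.findall(r"[A-Za-z]+", text)
with_interval = re.findall(r"[A-Za-z]{1,}", text)

# + never accepts zero occurrences, so the empty string fails.
empty_rejected = re.fullmatch(r"[A-Za-z]+", "") is None

print(with_star, with_plus, with_interval, empty_rejected)
```

All three lists come out the same, so the + form is purely a shorter way of writing the same requirement.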
Similar to the * and + operators, the ? matches zero or one occurrence of the
previous character. This can be thought of as a special case of the \{m,n\}
construct where m is 0 and n is 1. The ? operator is provided because the
need to specify zero or one occurrence arises frequently when developing
regular expressions. In our previous example of a regular expression that
matched integer or real number decimal representations we looked for zero or
one occurrence of a + or - sign and of a decimal point. This regular
expression was as follows:
[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9][0-9]*
Using the ? and the + operator as was demonstrated in the previous section
we can simplify this regular expression to the following
[+\-]?[0-9]+\.?[0-9]+
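One way to gain confidence in a simplification like this is to run the long and short forms over the same inputs and confirm that they agree. The sketch below assumes Python's re module (with the interval written {0,1}) and a hand-picked set of samples, so it is a spot check rather than a proof of equivalence.

```python
import re

long_form = re.compile(r"[+\-]{0,1}[0-9][0-9]*\.{0,1}[0-9][0-9]*")
short_form = re.compile(r"[+\-]?[0-9]+\.?[0-9]+")

samples = ["3.14", "-10", "+0.5", "7", "1.", "abc", "--5"]

# Both forms should accept and reject exactly the same samples.
agree = all(
    bool(long_form.fullmatch(s)) == bool(short_form.fullmatch(s))
    for s in samples
)
accepted = [s for s in samples if short_form.fullmatch(s)]

print(agree)
print(accepted)
```

Both forms reject 7 and 1. because each requires at least one digit on either side of the optional decimal point.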
The | is an or operator that can be used to specify two full regular expressions,
either of which can match to cause a match of the overall regular expression. The syntax
for using the or operator is
reg_exp1|reg_exp2
which means either reg_exp1 or reg_exp2 can match to cause an overall match.
For example the simple regular expression
RED|TED
would match either the string RED or the string TED. This can be useful
when we want to match several conditions in a single position in a regular
expression.
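The or operator, and the way the ( ) grouper from the table removes ambiguity around it, can be illustrated as follows. This is a sketch assuming Python's re module, whose | and ( ) behave like the egrep and awk forms for these simple cases; the sample strings are invented.

```python
import re

samples = ["RED", "TED", "BED", "REDTED"]

# Either alternative may match the whole string.
either = [s for s in samples if re.fullmatch(r"RED|TED", s)]

# Grouping limits the choice to the first letter only.
grouped = [s for s in ["RED", "TED", "RTED", "R"] if re.fullmatch(r"(R|T)ED", s)]

# Without the group, | splits the whole expression into R or TED.
ungrouped = [s for s in ["RED", "TED", "RTED", "R"] if re.fullmatch(r"R|TED", s)]

print(either)
print(grouped)
print(ungrouped)
```

The ungrouped form accepts the single letter R and rejects RED, which shows why the grouper matters when | is used inside a longer expression.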