UNIX SHELL PROGRAMMING, FOURTH EDITION

Appendix Z - REGULAR EXPRESSIONS

Throughout UNIX SHELL PROGRAMMING, Fourth Edition, we have been using UNIX tools that use regular expressions. A regular expression is better described as a pattern-matching expression. Regular expressions are formed by combining letters and numbers with special characters that act as operators, and they greatly aid in finding and filtering information in files. The most common UNIX tools that use regular expressions are ed, sed, awk, the various forms of grep, and the emacs editor (which offers an even richer set of regular expression operators). You will also find tools that use a more limited form of regular expressions. Filename generation in the shell is one example: it does not support a full implementation of regular expressions, but several of the regular expression operators are at work (unfortunately the implementation is not completely consistent; note * and ? below). The pg command is another tool that supports regular expressions. In this section we will cover each of the regular expression operators and learn how to build regular expressions using them. You will find this information to be applicable in many areas of UNIX.
Let's first take a grand tour of all the regular expression operators. The table below shows all of the regular expression character operators along with a brief description of how they operate. Any character that is not in this list stands for itself and nothing else when used in a regular expression. Such characters are often called ordinary characters. For example, all of the alphabetic and numeric characters stand for themselves when used in a regular expression.

Simple Regular Expressions

In this section we will start to explore the building of regular expressions by examining simple regular expression patterns. The point of showing these somewhat trivial examples is to ensure that the difference between ordinary and special characters is understood. The following string is a regular expression that consists of the character q and would match just the character q and nothing else.

q

While this seems trivial, it shows that q is an ordinary character that can by itself form a regular expression. Simple regular expressions contain no special operators and simply represent themselves. Further, the regular expression

quit

is a simple regular expression that would match the characters q, u, i, t in succession. Any character that is not listed as a special operator character above can occur in a simple regular expression and will stand for itself and nothing else. For example

column; row%

is a simple regular expression that in essence matches the string "column; row%". If for some reason we need to include a special character as part of a simple regular expression, we can do so by preceding it with the escape character \. This character means that the next character should not be considered a special character but simply an ordinary character in the regular expression. The following example would match the string ABC*

ABC\*

It should also be noted that a space does participate in a regular expression as an ordinary character that must be matched by a space.
A simple regular expression can stand alone as shown here, but it is often combined or concatenated with regular expression operators and other simple regular expressions to form complex regular expressions. This is of course the real power of regular expressions and deserves close inspection.
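To make the difference between ordinary and escaped characters concrete, here is a minimal sketch that uses grep. The sample lines and the variable names are made up for illustration; they are not from the book.

```shell
# Hypothetical sample lines, one per line.
sample='ABC*
ABCD
column; row%
column;row%'

# The backslash makes * an ordinary character, so only the line
# containing the literal text ABC* is selected.
escaped=$(printf '%s\n' "$sample" | grep 'ABC\*')

# Without the backslash, C* means "zero or more C", so any line
# containing AB is selected (both ABC* and ABCD).
unescaped_count=$(printf '%s\n' "$sample" | grep -c 'ABC*')

# The space in the pattern must be matched by a space in the input.
spaced=$(printf '%s\n' "$sample" | grep 'column; row%')

printf '%s\n' "$escaped" "$unescaped_count" "$spaced"
```

The single quotes around each pattern keep the shell from interpreting the backslash and the * before grep sees them.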



Matching Any Single Character Using .

A period is a special character in regular expressions that will match any single character. You use it in a regular expression any time you want a single character to occur but it does not matter what that character is. For example, if we wanted to match all three-character strings that start with the letter r and end with the letter n, the following regular expression would do the trick.

r.n

This would of course match the strings run, ran, and ron, which form valid words, but it would also match a long list of nonsensical character strings such as r1n, rxn, and r&n, as well as longer strings that contain a match for r.n as a substring, such as ronald or rink. The point here is not to overlook the fact that the period matches any character in the ASCII character set, not just numbers and letters.
As another example, let's say that we were filtering a file that contained five-character part codes. We want to match any part code that has the character Z as the third character. The regular expression that would accomplish this task could be written as

..Z..

This regular expression would of course match any character string 5 characters long that had a Z as the third character. In this example we will assume that our file only contained part numbers and no other strings. If that were not the case, this regular expression could get us into trouble if we did not use care to isolate just the part number.
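As a quick check of how freely the period matches, the following sketch runs r.n through grep. The word list is invented for the example.

```shell
# Hypothetical word list; "rn" and "barn" have no character between r and n.
words='run
r&n
ronald
rink
rn
barn'

# Lines that contain r, then any single character, then n.
matched=$(printf '%s\n' "$words" | grep 'r.n')

# The same selection, reported as a count of matching lines.
matched_count=$(printf '%s\n' "$words" | grep -c 'r.n')

printf '%s\n' "$matched" "$matched_count"
```

Note that grep selects a line if the pattern matches anywhere in it, which is why ronald and rink are selected along with run.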



Matching Sets of Characters Using [ ]

The left bracket starts the definition of a character set in a regular expression. Any characters that occur between the left bracket and the right bracket are considered part of the set. The regular expression matches if any of the characters in the set occur in the string being examined. As an example let's return to our three character string that begins with r and ends with n. We saw previously that the period special character gave us many matches. If our intention were just to match three character words that start with r and end with n we might utilize the [] construct in our regular expression to help us narrow our matches. We might begin by assuming that we should narrow our search to just lower case alphabetic characters. We can do this using a character class. Of course we could list every lower case character between the brackets but luckily we can use a shorthand notation to represent this. The - is used to represent a range of characters. So the range [a-z] represents [abcdefghi...z] and thus forms our desired regular expression representing all lower case alphabetic characters. Now getting back to our example the full regular expression would be

r[a-z]n

This would eliminate any string that did not have a lower case alphabetic character in the second position. But clearly this can still match lots of nonsense words (rbn, rcn, etc.). If we assumed that a vowel needs to occur in the second position, then we could further limit our regular expression by placing just the vowels in our character set. For example

r[aeiou]n

would come closer to our intention, matching words that contain a three-character substring that starts with r, ends with n, and has a vowel in between (ran, run, ron, ronald, rink). Placing a space after the n in the regular expression would narrow the match further, to three-character words that start with r and end with n followed by a space.
The [] construct is very powerful at limiting the scope of regular expression matching to particular characters or sets of characters. The range operator - can specify any sequence of characters as long as they are contiguous in the ASCII sequence. For example [A-C] represents just the uppercase characters ABC. There are several ranges that are commonly used and are listed below:

[A-Z] - all upper case alphabetic characters.

[a-z] - all lower case alphabetic characters.

[0-9] - all digit characters

[A-Za-z] - all alphabetic characters

In addition to the range operator there is another operator that has special meaning within the brackets. If the first character after the left bracket is a ^ then the complement of the character set defined between the brackets is matched. This means that any character that is not in the set defined will be matched. For example the regular expression

r[^A-Z]n

would match any three character string that started with r and ended with n and did not contain an uppercase alphabetic character in the second position. This operator can be applied to the common ranges developed above to form other very useful ranges. For example:

[^A-Za-z] - match all non alphabetic characters

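One last point concerning the [] construct. Characters that occur inside the brackets are ordinary, with two exceptions: the range operator - and the ^ complement operator. In some tools, such as awk, these can stand for themselves if preceded by the \ escape character. In grep, sed, and ed, however, a backslash inside the brackets is itself an ordinary character. There the portable way to include a literal - is to place it first or last in the set, as in [+-], and the way to include a literal ^ is to place it anywhere but first. An expression such as [+\-] still matches + and - in those tools, but it matches a backslash as well.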
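The three character sets used in this section can be compared side by side with a short sketch. The word list is made up, and LC_ALL=C is an assumption added here so that ranges such as a-z follow ASCII order regardless of the user's locale.

```shell
# Hypothetical three-character strings to filter.
words='ran
rbn
r1n
rAn
run'

# Any lower case letter in the middle position.
lower=$(printf '%s\n' "$words" | LC_ALL=C grep 'r[a-z]n')

# Only a vowel in the middle position.
vowel=$(printf '%s\n' "$words" | LC_ALL=C grep 'r[aeiou]n')

# Anything except an upper case letter in the middle position.
not_upper=$(printf '%s\n' "$words" | LC_ALL=C grep 'r[^A-Z]n')

printf '%s\n' "lower:" "$lower" "vowel:" "$vowel" "not_upper:" "$not_upper"
```

Each narrower set drops more of the nonsense strings, while the complemented set keeps everything except rAn, including the digit in r1n.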



Matching the Start of a Line Using ^

If the ^ character occurs outside of the character class operator, [], and at the start of a regular expression then the expression that follows must match at the beginning of the line. For a very simple example consider the following regular expression:

^Windy

This regular expression would only match the characters "Windy" if they occurred at the beginning of a line. All other occurrences of "Windy" would not match. Matching the beginning of a line can be a very useful tool when editing files and is a frequently used regular expression in sed and ed. The ^ character alone matches just the beginning of a line and can be used to insert information onto the front of a line using sed. For example the regular expression:

^

will match the start of every line in the file.
But the ^ character can really occur before any regular expression. Let's return to our previous example where we wanted to match a part number located in a file. If you recall, this was accomplished using the regular expression:

..Z..

As we pointed out then, this regular expression could match any character string in the file that contained a Z in the third position. We were safe as long as we just had part numbers in the file. Let's assume instead that the part number was at the beginning of each line in the file. Then the following regular expression, using the ^ special character, could limit the matching performed to just the start of the line:

^..Z..

You will find the ^ operator to be very useful in forming regular expressions since key information is often located at the start of a line of data. In addition it is often easy to arrange it so that important information is located at the start of a line.
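The effect of anchoring can be seen by counting matches with and without the ^. This is only a sketch: the part records are invented, and the "> " prefix in the sed command is an arbitrary choice for the example.

```shell
# Hypothetical records; the second line mentions a part number mid-line.
parts='ABZ12 widget
note: see ABZ12
XYZ99 gasket
AZZZ1'

# Unanchored: any line containing a Z with two characters on each side.
anywhere=$(printf '%s\n' "$parts" | grep -c '..Z..')

# Anchored: the five-character pattern must begin the line.
at_start=$(printf '%s\n' "$parts" | grep -c '^..Z..')

# ^ alone matches the start of every line, so sed can insert text there.
quoted=$(printf 'a\nb\n' | sed 's/^/> /')

printf '%s\n' "$anywhere" "$at_start" "$quoted"
```

The anchored pattern skips the "note:" line because its first five characters do not have a Z in the third position.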



Matching the End of a Line Using $

Like the ^ special character the $ character following a regular expression is used to cause a match to occur only if the preceding regular expression occurs at the end of a line. For example:

END$

would match only if the string "END" occurred at the end of a line. All other occurrences of the string "END" would be ignored. As with ^, the $ character can be used after any valid regular expression to force the match to occur only at the end of the line. For example, suppose we want to match the word end in any mixture of upper and lower case:

[Ee][Nn][Dd]$

This is often a good technique for matching user input where the input can be in any case and you want to look simply for the word regardless of case.
Consider the following example which uses both the ^ and the $ regular expression characters.

^$

This regular expression will match all empty lines, that is, lines that contain no characters at all.
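A small sketch shows the three end-of-line patterns from this section at work. The input text is made up for the example and includes one empty line.

```shell
# Hypothetical input; the third line is empty.
text='THE END
END OF FILE

the end
weekend'

# END only at the end of a line, upper case only.
upper_end=$(printf '%s\n' "$text" | grep -c 'END$')

# The word end in any case mixture at the end of a line.
any_case=$(printf '%s\n' "$text" | grep -c '[Ee][Nn][Dd]$')

# Lines with nothing between the start and the end.
empty=$(printf '%s\n' "$text" | grep -c '^$')

printf '%s\n' "$upper_end" "$any_case" "$empty"
```

Notice that the case-insensitive pattern also selects weekend, since nothing in the expression requires end to be a separate word.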

Matching Zero or More Characters Using *

The * character is used to match zero or more of the preceding character or regular expression. Note that this is different from the file name generation * symbol, which says place any string in this position (this is accomplished in regular expressions by the formation .* which is any character repeated any number of times). The * in regular expressions is simply a repeat symbol. The preceding character can repeat any number of times, including zero times, and still match. As an example consider the following regular expression

ZA*P

This regular expression would match the string ZP as well as the string

ZAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP

If we wanted to ensure that at least one occurrence of the A appears in the string we could write the regular expression

ZAA*P

which would at a minimum match ZAP.
Of course the * special character can be used after any regular expression to indicate that it is to repeat zero or more times. For example, here are four widely used regular expressions involving *.

[A-Za-z][A-Za-z]*     - matches any string of alphabetic characters (matches
                        almost all words)
[+\-][0-9][0-9]*      - matches any integer with a preceding + or -
.*                    - matches any string of characters
^ *$                  - matches lines that have only spaces (including
                        empty lines)

Note that .* alone will match the entire line, since a regular expression matches the longest possible string when more than one match is possible. This is a good way to match all the lines in a file.
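The difference between the regular expression * and the file name generation * can be sketched as follows. The strings are invented, and the patterns are anchored with ^ and $ (an addition for this example) so that each one must account for the whole line.

```shell
# Hypothetical strings to test.
lines='ZP
ZAP
ZAAAP
ZBP'

# Zero or more A between Z and P: ZP, ZAP, ZAAAP.
zero_or_more=$(printf '%s\n' "$lines" | grep -c '^ZA*P$')

# At least one A between Z and P: ZAP, ZAAAP.
one_or_more=$(printf '%s\n' "$lines" | grep -c '^ZAA*P$')

# .* is the regular expression spelling of "any string here".
any_middle=$(printf '%s\n' "$lines" | grep -c '^Z.*P$')

# In a shell pattern, * by itself already means "any string here".
case ZBP in
  Z*P) glob_hit=yes ;;
  *)   glob_hit=no ;;
esac

printf '%s\n' "$zero_or_more" "$one_or_more" "$any_middle" "$glob_hit"
```

The shell pattern Z*P accepts ZBP, while the regular expression ZA*P does not, because there the * only repeats the A.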

Matching a Specified Number of Characters Using \{m,n\}

The * character introduced in the last section is very powerful but provides no control over how many occurrences of a character are considered valid as a match. It can range from 0 to some very large number of occurrences. Sometimes we would like to have a little more control when building regular expressions. The \{m,n\} construct provides this control. This construct provides three ways to control matching of repeating characters. These are

1. Character must repeat within a range specified by m and n with m being the minimum number of times the character can repeat and n the maximum. This has the form \{m,n\}.

2. Character must repeat at least m number of times. This has the form \{m,\}.

3. Character must repeat exactly m number of times. This has the form \{m\}.

These forms often provide the control that we need when forming complex regular expressions. For example, consider the expression from the last section which described integers

[+\-][0-9][0-9]*

Actually this forces our integers to have a preceding + or - sign. This is often not the case when trying to match integers in general, which may or may not have a + or - sign in front. But without the \{m,n\} construct it would be very difficult to write a regular expression to match the general case integer. Using [+\-]* would not work, because it would also accept any number of signs in a row, such as +-+5. But now that we can control the number of occurrences the following regular expression should do the trick.

[+\-]\{0,1\}[0-9][0-9]*    - matches any legal integer expression

Likewise we could now form an expression that would match any real or integer number by using the following regular expression

[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9]*    - matches integer and real decimal expressions such as 7, -12, and +3.25

As another example let's return to our part number and assume that the first two positions in the part number are alphabetic and the final two positions are numeric. Then our part number would look something like AAZ23. We may want to match a part number anywhere it is found and can assume that part numbers always have this form. Using the repeating control construct we can form a regular expression which matches part numbers as follows

[A-Z]\{2\}Z[0-9]\{2\}

The first part of the expression, [A-Z]\{2\}, says that we must have exactly two occurrences of an uppercase alphabetic character. These are followed by a Z, which in turn is followed by exactly two numeric characters.
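The following sketch tries the interval forms with grep. The test values are made up. Two things here are assumptions added for the example rather than part of the text: the patterns are anchored so the whole line must match, and the sign set is written [+-], with the hyphen last, which is a portable spelling for grep.

```shell
# Hypothetical candidate integers.
nums='42
-7
+15
+-3
abc'

# Zero or one sign, then one or more digits: 42, -7, +15.
ints=$(printf '%s\n' "$nums" | LC_ALL=C grep -c '^[+-]\{0,1\}[0-9][0-9]*$')

# With * instead of \{0,1\} the doubled sign in +-3 is accepted too.
too_loose=$(printf '%s\n' "$nums" | LC_ALL=C grep -c '^[+-]*[0-9][0-9]*$')

# Exactly two letters, a Z, then exactly two digits.
part=$(printf '%s\n' AAZ23 AZ23 AAZ2 ABZ123 |
  LC_ALL=C grep '^[A-Z]\{2\}Z[0-9]\{2\}$')

printf '%s\n' "$ints" "$too_loose" "$part"
```

Without the anchors, grep would also select ABZ123, since the five-character pattern occurs inside it.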

Save a Match and Compare Later Using \(...\) and \n

The \(...\) construct provides the ability to save a matched string in memory for comparison later in the regular expression. Any regular expression can be placed between the \( \) parentheses. Each time that a string is stored using this construct it is assigned a number 1 through 9 based on its position in the regular expression. These stored expressions can then be referenced later in the same regular expression by using the \n construct in the place of a regular expression. For example, let's suppose that we had a file which contained lines that were divided into fields based on the field delimiter ":". There are four fields on each line. We want to select only lines where the first field matches the third field and the second field matches the fourth. The following regular expression should do the trick and demonstrates how saved expressions are assigned numbers

\(.*\):\(.*\):\1:\2:

The first field is matched and stored as \1 using \(.*\): which says to match any character string up to a : which is our field delimiter. The second field is stored as \2 by the second occurrence of \(.*\):. Finally we recall the saved patterns by referencing \1 in the third field position and \2 in the fourth field position. If the third field matches the pattern saved for the first field and the fourth field matches the pattern saved for the second field then the line matches. Note that the expression as written expects every field, including the fourth, to be followed by the : delimiter.
Saving matched patterns is a very useful tool especially when using sed and ed which allow the saved pattern to be used as a replacement value when editing a file. Note that this construct \( ...\) is not available when using awk.
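Both uses of saved matches, comparison with grep and replacement with sed, can be sketched briefly. The records are invented, and each field is followed by a colon to suit the expression above. The sed command that swaps two fields is our own illustration, not an example from the text.

```shell
# Hypothetical records with four colon-terminated fields.
records='a:b:a:b:
a:b:c:d:
x:y:x:z:'

# Select lines whose third field repeats the first and whose
# fourth field repeats the second.
same=$(printf '%s\n' "$records" | grep '\(.*\):\(.*\):\1:\2:')

# In sed the saved strings can also be used in the replacement text;
# here the two fields are written back in the opposite order.
swapped=$(printf 'first:second\n' | sed 's/\(.*\):\(.*\)/\2:\1/')

printf '%s\n' "$same" "$swapped"
```

Only the first record is selected; in the third record the fourth field is z rather than a repeat of y.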

Creating More Complex Regular Expressions

In the previous sections we have developed several examples to demonstrate each of the regular expression special characters. As we progressed we developed increasingly complex expressions. As you can imagine, regular expressions are very powerful and can match or recognize a wide variety of strings. This is done by concatenating smaller regular expressions into longer ones. We saw this in the expression that matches an integer or real decimal number, which can be written as

[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9]*

This is of course a concatenation of several smaller regular expressions. We can extend this concatenation into a wide array of complex regular expressions that will match all kinds of classes of strings. In fact regular expressions are used in computer language compilers to help recognize valid syntactic components of the language.

Egrep and Awk Extensions to Regular Expressions

In addition to the regular expression characters listed above egrep and awk extend the capabilities of regular expressions by adding several other regular expression characters. These are outlined in the following table. Note that these are only available in egrep and awk and cannot be used in other tools such as sed and ed. As was mentioned above the \( ... \) construct is not part of the egrep/awk regular expression special characters.

Table I.1 Awk Extensions to Regular Expression Operators

Character Description

+   Match the previous character one or more times. This is different from 
    the * operator because zero occurrences do not match. 
?   Match the previous character zero or one time only.  
|   The OR operator which means to match either regular expression pattern
    occurring on either side of the OR symbol.  reg_exp1|reg_exp2
( ) A regular expression grouper that can be used to group entire regular 
    expressions. Aids in removing ambiguities in complex regular expressions

Matching One or More Characters Using +

The + operator is used much like the * operator described above except that the + operator does not consider zero occurrences to be a match. The preceding character must occur at least once. This is a very convenient operator since there are often situations where we want to ensure that the regular expression character occurs at least once. In the previous sections we saw examples of regular expressions that describe words and integers. In each of these regular expressions we had to take measures to ensure that a letter or digit occurred at least once, as is shown in the following

[A-Za-z][A-Za-z]*

By using the + operator we can simplify the expression to be the following

[A-Za-z]+

This implies that at least a single alphabetic character must be found in order for a match to occur.
We could simulate the behavior of the + character using the \{1,\} construct which implies that the preceding character must occur at least 1 time.
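The three spellings of "one or more letters" can be checked against each other with a short sketch. The input is made up and includes an empty line. grep -E is used here as the standard way to get egrep-style expressions, and the anchors are added so that the whole line must match.

```shell
# Hypothetical input; the third line is empty.
tokens='hello
x

123'

# Extended syntax: one or more letters.
plus=$(printf '%s\n' "$tokens" | LC_ALL=C grep -E -c '^[A-Za-z]+$')

# Basic syntax: one letter followed by zero or more letters.
star_form=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '^[A-Za-z][A-Za-z]*$')

# Basic syntax: the \{1,\} interval.
interval=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '^[A-Za-z]\{1,\}$')

# For contrast, * alone also accepts the empty line.
star_only=$(printf '%s\n' "$tokens" | LC_ALL=C grep -c '^[A-Za-z]*$')

printf '%s\n' "$plus" "$star_form" "$interval" "$star_only"
```

The first three counts agree, which is the equivalence described above; only the bare * form counts the empty line as well.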

Matching Zero or One Occurrence of a Character Using ?

Similar to the * and + operators, the ? operator matches zero or one occurrence of the previous character. This can be thought of as a special case of the \{m,n\} construct where m is 0 and n is 1. The ? operator is provided because the need to specify zero or one occurrence arises frequently when developing regular expressions. In our previous example of a regular expression that matched integer or real number decimal representations we looked for zero or one occurrence of a + or - sign and of a decimal point. This regular expression can be written as follows:

[+\-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9]*

Using the ? operator and the + operator demonstrated in the previous section, we can simplify this regular expression to the following

[+\-]?[0-9]+\.?[0-9]*
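To check that the ? form and the \{0,1\} form select the same lines, here is a sketch with made-up input. The patterns are whole-line variants chosen for this example: they are anchored, the sign set is written [+-] with the hyphen last, and the trailing digit group uses * so that a single-digit number such as 7 is accepted.

```shell
# Hypothetical candidate numbers; the last one has two decimal points.
nums='7
-7.5
+12.
3..5'

# Extended syntax with ? and +.
ere=$(printf '%s\n' "$nums" | LC_ALL=C grep -E '^[+-]?[0-9]+\.?[0-9]*$')

# Basic syntax with \{0,1\} and the doubled-character idiom.
bre=$(printf '%s\n' "$nums" |
  LC_ALL=C grep '^[+-]\{0,1\}[0-9][0-9]*\.\{0,1\}[0-9]*$')

printf '%s\n' "ere:" "$ere" "bre:" "$bre"
```

Both commands select the first three lines and reject 3..5, since at most one decimal point is allowed.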

Matching Either of Two Regular Expressions Using |

The | is an OR operator that can be used to specify two full regular expressions, either of which can match to cause a match of the overall regular expression. The syntax for using the OR operator is

reg_exp1|reg_exp2

which means either reg_exp1 or reg_exp2 can match to cause an overall match. For example the simple regular expression

RED|TED

would match either the string RED or the string TED. This can be useful when we want to match several conditions in a single position in a regular expression.
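A last sketch shows the OR operator together with the ( ) grouper from the table above. The names are invented, and grep -E is used as the standard way to get egrep-style expressions.

```shell
# Hypothetical input lines.
names='RED
TED
BED
FRED'

# Either alternative anywhere in the line; FRED is selected because
# it contains RED.
anywhere=$(printf '%s\n' "$names" | grep -E -c 'RED|TED')

# Grouping lets ^ and $ apply to both alternatives, so the whole
# line must be RED or TED.
whole=$(printf '%s\n' "$names" | grep -E '^(RED|TED)$')

printf '%s\n' "$anywhere" "$whole"
```

Without the parentheses, ^RED|TED$ would mean "RED at the start of a line, or TED at the end of a line", which is one of the ambiguities that grouping removes.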

ISBN 0471168947

Wiley Computer Publishing