UNIX SHELL PROGRAMMING, FOURTH EDITION
Shell does 98 percent of what you will need for most prototyping and tool building.
Sometimes, however, you will need to write a custom filter. Although it's a bit afield
from Shell, lex -- the lexical analyzer -- can do wonderfully complex filtering,
allowing you to design and build your own custom filters.
Lex is a powerful tool for looking at text -- documents, C language, or data -- to either
transform or analyze it. Lex lets you build filters quickly and easily.
Lex lets you specify regular expressions (REs) for words, strings, or constants and then
generate a C language program from the lex source specifications. You can then execute the
compiled program to process the matched regular expressions. Lex acts as a program generator --
it takes your specifications and generates the C language statements to lexically analyze input
text.
Programs written in lex can act as filters -- transforming
the input according to your rules (Figure X.1) -- or they can pass information about matched
text strings back to a calling C language module.
You can use lex to create filters if the change depends on
just the expressions found. If you need to know the grammar or syntax of the input
(e.g., a compiler or English analyzer) to process it, then a module that understands syntax
(e.g., yacc) should call lex. Lex then returns a value that the syntax analyzer can understand,
called a token, which it uses to examine the syntax.
Figure X.1 Lex Filters
Let's look at the structure of lex source shown in Figure X.2. First, in the definitions
section you can specify global data used within the rules and user routines.
Lex also has a number of internal tables. As the number of regular expressions grows, these
tables need to grow. The lex parameters allow you to specify sizes for these tables. Don't
worry about them initially; lex will let you know when you exceed one.
Then, in the rules section (within the %% delimiters),
you can specify up to 256 rules. Each rule consists of an RE to be matched and the actions to
take when a match occurs. For example, a simple lex filter might have the specifications
shown in Figure X.3. Given these rules, the resulting program takes three actions:
- When the analyzer program finds the word "lex", it prints "LEX" on standard output.
- When it finds a string of one or more alphabetic characters, the lex statement, ECHO,
prints the matched text on standard output.
- The program passes any unmatched characters onto standard output. This is the default
action for all unmatched input.
%{
/* global C language Data and Definitions */
%}
lex parameters (%e, %p, etc.)
%%
RE { actions }
%%
C language subroutines called by actions
main( )
{
. . .
}
yywrap( )
{
. . .
}
Figure X.2 Detailed Lex Structure
%%
lex printf("%s", "LEX"); /* print uppercase LEX */
[A-Za-z]+ ECHO; /* match longer words like LEXical */
%%
Figure X.3 The Longest Match Rule
This example also highlights one of lex's rules for resolving ambiguity: lex always
prefers the longest match. Without the second rule, the first one would
match the first three characters of words such as lexical; the output would be "LEXical".
These ambiguity rules are essential to keeping lex specifications simple.
The user routines section is a good place to put
actions that would otherwise clutter the rules sections. Write subroutines to handle
complex actions and call them from the rules section. Sometimes, you will want to perform
some processing following the lexical analysis. At the end of input, the program
generated by lex automatically calls yywrap. You can code your own yywrap and include this
subroutine in the user routines section to handle any postprocessing or compile and link it
separately. It should return a value of one (1) for end of input or zero (0) if there's more
input.
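A minimal yywrap that does no postprocessing looks like this (a sketch; the default routine in the lex library behaves the same way):

```c
/* Minimal yywrap: no postprocessing, no further input to supply. */
int yywrap(void)
{
    return 1;   /* 1 = end of input; return 0 if you reopened more input */
}
```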
Now, let's look at a couple more examples and compare them to Shell commands. The first example
is a simple translator from lower- to uppercase. The command to accomplish this is:
tr "[a-z]" "[A-Z]"
When a lexical analyzer program finds text that matches an
RE, it puts the
%{
#include <stdio.h>
#include <ctype.h>
char *c; /* character pointer to matched text */
%}
%%
[a-z]+ {
/* convert matched lowercase to uppercase */
for(c=yytext; *c != '\0'; c++)
	*c = toupper(*c);
printf("%s", yytext);
}
%%
Figure X.4 A Filter to Translate Lowercase to Uppercase
%%
[Aa]m |
[Aa]re |
[Bb]e |
[Bb]een |
[Bb]eing |
[Ii]s |
[Ww]as |
[Ww]ere {
printf("\\fB%s\\fR", yytext); /* BOLD to Roman */
}
[A-Za-z]+ {
/* match longer words that contain the above
like care, his, aware */
ECHO; /* print them as is */
}
%%
Figure X.5 A Filter to Embolden Passive Verbs in nroff Text
matched text in an external character array called yytext. Using yytext, an identical
translator could have been written as shown in Figure X.4.
The lex statement, ECHO, uses yytext to reproduce
matched input on standard output. ECHO is defined as printf("%s",yytext);.
Another example (Figure X.5) specifies a simple
filter to embolden passive verbs (is, was, were) in nroff text. The output could be piped
into nroff to aid the writer in finding passive verbs. In this example, rather than
repeating the action for each word found, the vertical bar (|) symbol works as a logical
OR to connect many regular expressions with the same action.
By now, you're probably wondering how this lex source
code becomes an executable program (Figure X.6). Suppose the lex code, for this example,
was in a
Figure X.6 Lex Filter Creation
CFLAGS = -O
OBJECTS = main.o lex.yy.o yywrap.o
LIBS = -ll
shellmet: $(OBJECTS)
	cc $(CFLAGS) $(OBJECTS) $(LIBS) -o shellmet
main.o: shellext.h
lex.yy.o: shellmet.h lex.yy.c
lex.yy.c: shellmet.l
yywrap.o: shellext.h
Figure X.7 Shellmet Makefile
file called passive.l. (The make command, Figure X.7, recognizes files with a .l
suffix as lexical analyzer code.) To generate, compile, link, and execute passive.l,
you would enter the following commands:
lex passive.l # Generate lex.yy.c
cc lex.yy.c -ll -o passive # Compile and link lex (-ll)
passive < textfile | nroff
The command lex passive.l creates a function
called yylex() in the file lex.yy.c. The yylex function will process the input according
to the specifications in passive.l. Next, you have to compile the generated C language
(lex.yy.c) and link in the lex library (-ll), which contains default main and yywrap routines.
#include <stdio.h>
main(argc,argv)
int argc;
char **argv;
{
for(argc--,argv++ ; argc > 0 ; argc--,argv++)
{ /* for all arguments */
if( freopen(*argv, "r", stdin) ) {
yylex(); /* call yylex */
} else { /* NULL */
fprintf(stderr, "Unable to open %s\n", *argv);
} /* END IF */
} /* END FOR */
}
Figure X.8 Listing of main Module That Calls yylex
Of course, redirecting standard input can be tiring.
To specify file names on the command line, you can write your own main module that calls yylex
(see Figure X.8). This module will reopen standard input and call yylex for each file on the
command line. The main module can do most of the preprocessing that any lexical analyzer
requires. Any postprocessing can be handled by yywrap.
Let's use these parts -- preprocessor, lexical analyzer, and postprocessor -- to write
a complexity analyzer for Shell script programs. Complexity is a function of the number of
decisions and size of a program. The verbs in the Shell programming language are case,
for, foreach, if, repeat, switch, until, and while. Let's write a lexical analyzer that
counts the occurrences of each of these keywords and prints them if the decision count
exceeds 7 +/- 2 (one of the rigorously proven limits of human understanding).
We'll use the main program shown in Figure X.8.
Figure X.9 shows the lex source for the analyzer, shellmet.l. Figure X.10 shows
the include file for the keyword counters, and Figure X.11 shows the yywrap module
that will accumulate the decision count, compare it to a maximum value, print the
results, and reinitialize the counters. Figure X.12 shows the include file that
yywrap and main use to reference the external data declared by shellmet.l.
%{
#include "shellmet.h"
%}
%%
case case_cnt++;
for for_cnt++;
foreach foreach_cnt++;
if if_cnt++;
repeat repeat_cnt++;
switch switch_cnt++;
until until_cnt++;
while while_cnt++;
#[^\n]* comment_cnt++;
[\n] line_cnt++;
[ \t] ;
[A-Za-z0-9_.]+ { ; /* match and delete anything else that
might contain the above (e.g. filenames) */
}
[^#A-Za-z0-9_.] { ; /* match and delete anything else
(e.g. numbers, punctuation) */
}
%%
Figure X.9 The Lex Code for the Analyzer
/* count decisions and lines */
int case_cnt =0;
int comment_cnt =0;
int for_cnt =0;
int foreach_cnt =0;
int if_cnt =0;
int repeat_cnt =0;
int switch_cnt =0;
int until_cnt =0;
int while_cnt =0;
int line_cnt =0;
#include <stdio.h>
Figure X.10 The include File for shellmet.l
Note that in Figure X.9 we had to make allowances for
Shell comments, any part of a line beginning with a pound sign (#) and ending before a newline
(\n). We don't want to count the strings case, for, foreach, if, repeat, switch, until, and
while if they occur inside a comment. Similarly, I had to define a regular expression,
[A-Za-z0-9_.]+, to rule out the possibility of these words occurring inside another word.
For example, consider a file name (e.g., case01 or foreign). It's simple to
create regular expressions for the words you want to find, but not so easy to create the REs
that represent larger strings that might include your keywords.
Also, since I didn't want any of the input to fall through
to standard output, I had to include rules for blanks, tabs [\t], and any other input
[^#A-Za-z0-9_.\t\n] to keep them from passing onto standard output. The circumflex (^)
causes the RE to match any character other than the ones within the square brackets.
I could have used a period (.), which matches any
character, instead of the more complex RE with the circumflex, but I prefer to specify REs
exactly to simplify future maintenance. For example, '.*' matches any number of occurrences
of any character. Because of the ambiguity rule (lex always looks for the longest match),
this RE could easily override other REs that have been specified.
When two rules match text of the same length, the rule listed first takes precedence.
Because of these two ambiguity rules, the most general REs should be placed at the end
of the rules section. Consider the order of the
following lex rules:
[a-z]+ identifier action;
for keyword action;
In this example, the first rule would always match the word for: both rules match the
same three characters, and the first rule has precedence, so the keyword action would never be
executed. If you have problems with a lexical analyzer, examine your lex code for ambiguity and
precedence violations.
yywrap()
{
#define MAX_DECISIONS 10
#define MAX_LINES 100
int decision_cnt = 0;
decision_cnt = case_cnt + for_cnt + foreach_cnt + if_cnt +
repeat_cnt + switch_cnt + until_cnt + while_cnt;
if(decision_cnt <= MAX_DECISIONS && line_cnt <= MAX_LINES) {
printf("Quality Okay!\n");
printf("Comment count = %3d\n", comment_cnt);
} else {
printf("Case count = %4d\n", case_cnt);
printf("For count = %4d\n", for_cnt);
printf("Foreach count = %4d\n", foreach_cnt);
printf("If count = %4d\n", if_cnt);
printf("Repeat count = %4d\n", repeat_cnt);
printf("Switch count = %4d\n", switch_cnt);
printf("Until count = %4d\n", until_cnt);
printf("While count = %4d\n", while_cnt);
printf("Total = %4d\n", decision_cnt);
printf("\nTotal Lines = %4d\n", line_cnt);
if(decision_cnt > MAX_DECISIONS) {
printf("\nDecisions exceed quality standards\n");
}
if(line_cnt > MAX_LINES) {
printf("\nTotal Lines exceed quality standards\n");
} /* END IF */
} /* END IF */
case_cnt = 0; /* reinitialize the variables */
comment_cnt = 0;
for_cnt = 0;
foreach_cnt = 0;
if_cnt = 0;
repeat_cnt = 0;
switch_cnt = 0;
until_cnt = 0;
while_cnt = 0;
line_cnt = 0;
return(1); /* end of input */
}
Figure X.11 Listing of the yywrap Module
extern int case_cnt;
extern int comment_cnt;
extern int for_cnt;
extern int foreach_cnt;
extern int if_cnt;
extern int repeat_cnt;
extern int switch_cnt;
extern int until_cnt;
extern int while_cnt;
extern int line_cnt;
Figure X.12 The shellext.h include File
If you're looking at code hour after hour, your mind can begin to miss key constructs and
keywords. To make these stand out on the printed page, we could create a lex filter that
emboldens the keywords in C language, Shell, or anything else for that matter. A simple
program, pretty_pr, to embolden these words would be:
%{
#define BOLD printf("\\fB%s\\fR", yytext);
%}
%%
if |
else |
switch |
case |
default |
until |
while |
[{}] BOLD;
[A-Za-z]+ ECHO;
%%
We could then embolden and print file.c as follows:
pretty_pr < file.c | nroff | lp
Let's look at some of the more infrequently used features of lex: input(),
output(), unput(), yyleng, yymore(), and yyless().
input()    gets a character from stdin
output()   writes a character on stdout
unput()    puts a character back on stdin
yymore()   looks for additional matching characters
yyless()   trims characters from yytext and puts them back on stdin
The input() and unput() routines provide a "look-ahead" capability. The two routines,
yymore() and yyless(), tell lex to look for longer matches or cut characters from the matched
text. The lexical analyzer keeps the string length of yytext in an external integer variable
called yyleng. We could use these functions to analyze C language comments. In C language,
comments begin with the string "/*" and end with "*/". Figure X.13 shows a lex source program
to strip and print comments from C language.
The lex rule begins with "/*" and looks ahead for every
character up to, but not including, the next "/". Using C language, I then check for a leading
/* and a trailing * in yytext and add the trailing "/" using input(). Next, if yytext
begins with a /* but doesn't end with a "*", then the trailing "/" is embedded in the comment.
I use yymore() to expand the comment. Finally, if yytext begins with other than "/*", I
delete the matched text.
I should warn you that you can get into trouble if the
matched text gets too long. The yytext[] array can be up to 200 characters in length. It
isn't big enough to handle comments that span multiple sentences. To handle such problems,
you'll need a syntax analyzer.
%%
[/][*][^/]+ {
if( yytext[yyleng - 1] == '*') { /* end comment */
yytext[yyleng++] = input(); /* get '/' */
yytext[yyleng] = '\0';
printf("%s\n", yytext);
} else {
yymore(); /* keep looking */
} /* END IF */
}
[^/]+ /* delete everything else */ ;
[/] ;
%%
Figure X.13 A Lex Program to Strip and Print Comments from C Language
#include <stdio.h>
#include "shellext.h"
#include "shelltoken.h"
main (argc,argv) /* main for shellmet */
int argc;
char **argv;
{
int token = 0;
for(argc--,argv++ ; argc > 0 ; argc--,argv++)
{
if( freopen(*argv, "r", stdin) ) { /* successful */
while(( token = yylex() )) /* NULL at EOF */
{
switch(token)
{
case CASE:
case_cnt++; /* increment token counter */
break;
case COMMENT_START:
comment_cnt++; /* increment token counter */
/* delete tokens except terminating NEWLINE */
while(( token = yylex()) != NEWLINE && token != 0); /* stop at EOF too */
/* fall through */
case NEWLINE:
line_cnt++; /* increment token counter */
break;
case FOR:
for_cnt++; /* increment token counter */
break;
case FOREACH:
foreach_cnt++; /* increment token counter */
break;
case IF:
if_cnt++; /* increment token counter */
break;
case REPEAT:
repeat_cnt++; /* increment token counter */
break;
case SWITCH:
switch_cnt++; /* increment token counter */
break;
case UNTIL:
until_cnt++; /* increment token counter */
break;
case WHILE:
while_cnt++; /* increment token counter */
break;
default:
break; /* ignore other tokens */
} /* END SWITCH */
} /* END WHILE */
} else {
fprintf(stderr, "Unable to open %s\n", *argv);
} /* END IF */
} /* END FOR */
}
Figure X.14 The main Module (Syntax Analyzer Version)
Let's use the previous example, the Shell complexity analyzer, to explain how to analyze
syntax. Figure X.14 shows the new main program and Figure X.15 shows the new lex program
that returns tokens to the main program. Figure X.16 shows the definition of the tokens.
The yywrap routine stays the same.
%{
#include "shelltoken.h"
#include "shellmet.h"
%}
%%
case return(CASE);
for return(FOR);
foreach return(FOREACH);
if return(IF);
repeat return(REPEAT);
switch return(SWITCH);
until return(UNTIL);
while return(WHILE);
[#] return(COMMENT_START);
[\n] return(NEWLINE);
[ \t] ;
[A-Za-z0-9_.]+ { /* match and delete anything else that
might contain the above */
}
[^#A-Za-z0-9_.] { ; /* match and delete anything else
(e.g. numbers, punctuation) */
}
%%
Figure X.15 Lex Code (Syntax Analyzer Version)
/* shell tokens */
#define CASE 1
#define FOR 2
#define FOREACH 3
#define IF 4
#define REPEAT 5
#define SWITCH 6
#define UNTIL 7
#define WHILE 8
#define COMMENT_START 9
#define NEWLINE 10
Figure X.16 The shelltoken.h include File
Notice how the yylex routine returns tokens to the main
program, which now tallies them. Also notice that "#" is defined as COMMENT_START and that
a NEWLINE is the end of a Shell comment.
In the main module, a return of COMMENT_START causes the
logic to loop until it finds a NEWLINE to end the comment. Then, it falls through and
increments the NEWLINE counter. As you can see, syntax analysis of a language simplifies the
regular expressions used in the lex code and places the burden of analysis on the calling module.
If you need to get into complicated grammars or syntax,
yacc -- the parser generator -- generates programs from more easily understood syntax specifications.
The two, in tandem, are powerful tools to analyze text and its grammar or syntax.
I've covered the lexical analyzer and many of its robust features in just enough detail to
make you dangerous. I recommend reading the documentation on lex to discover its other
capabilities. To write your own lexical analyzer, see Kernighan and Plauger (1976). For a
more thorough understanding of parsers and lexical analyzers, I recommend Aho et al. (1985).
I hope this chapter has given you some understanding of
lex and how to use it to build filters and syntax analyzers. And I hope you are intrigued by
the joy of lex.
1. Write a lexical analyzer to embolden all of the control keywords in C language: if, else,
switch, while, until, and {}.
2. Expand the Shell metrics analyzer to count all files and variables.
ISBN 0471168947
Wiley Computer Publishing