
UNIX SHELL PROGRAMMING, FOURTH EDITION

Appendix X- The Shell Filter Builder

Shell does 98 percent of what you will need for most prototyping and tool building. Sometimes, however, you will need to write a custom filter. Although it's a bit afield from Shell, lex -- the lexical analyzer -- can do wonderfully complex filtering, allowing you to design and build your own custom filters.
Lex is a powerful tool for looking at text -- documents, C language, or data -- to either transform or analyze it. Lex lets you build filters quickly and easily.
Lex lets you specify regular expressions (REs) for words, strings, or constants and then generate a C language program from the lex source specifications. You can then execute the compiled program to process the matched regular expressions. Lex acts as a program generator -- it takes your specifications and generates the C language statements to lexically analyze input text.
Programs written in lex can act as filters -- transforming the input according to your rules (Figure X.1) -- or they can pass information about matched text strings back to a calling C language module.
You can use lex by itself to create filters when the transformation depends only on the expressions found. If you need to know the grammar or syntax of the input (e.g., a compiler or English analyzer) to process it, then a module that understands syntax (e.g., one built with yacc) should call lex. Lex then returns a value the syntax analyzer can understand, called a token, which the analyzer uses to examine the syntax.

Figure X.1 Lex Filters

LEX SOURCE STRUCTURE

Let's look at the structure of lex source shown in Figure X.2. First, in the definitions section you can specify global data used within the rules and user routines. Lex also has a number of internal tables. As the number of regular expressions grows, these tables need to grow. The lex parameters allow you to specify sizes for these tables. Don't worry about them initially; lex will let you know when you exceed one.
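When lex does complain, you can enlarge the offending table with declarations like these in the definitions section -- %e sets the number of parse tree nodes, %p the number of positions, %n the number of states, and %a the number of transitions. The sizes shown here are illustrative only:

     %e 2000
     %p 5000
     %n 1000
     %a 4000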
Then, in the rules section (within the %% delimiters), you can specify up to 256 rules. Each rule consists of an RE to be matched and the actions to take when a match occurs. For example, a simple lex filter might have the specifications shown in Figure X.3. Given these rules, the resulting program takes three actions:

  1. When the analyzer program finds the word "lex", it prints "LEX" on standard output.
  2. When it finds a string of one or more alphabetic characters, the lex statement, ECHO, prints the matched text on standard output.
  3. The program passes any unmatched characters onto standard output. This is the default action for all unmatched input.

%{
/* global C language data and definitions */
%}

lex parameters (%e, %p, etc.)

%%
RE { actions }
%%

C language subroutines called by actions
main( )
{
. . .
}
yywrap( )
{
. . .
}

Figure X.2 Detailed Lex Structure

%%

lex           printf("%s", "LEX");  /* print uppercase LEX */

[A-Za-z]+ ECHO;                     /* match longer words like LEXical */

%%

Figure X.3 The Longest Match Rule

This example also highlights one of lex's major ambiguity rules: lex always prefers the longest match. Without the second rule, the first one would match the first three characters of words such as lexical; the output would be "LEXical". These ambiguity rules are essential to keeping lexical specifications simple.
The user routines section is a good place to put actions that would otherwise clutter the rules section. Write subroutines to handle complex actions and call them from the rules section. Sometimes you will want to perform some processing after the lexical analysis. At the end of input, the program generated by lex automatically calls yywrap. You can code your own yywrap and include it in the user routines section to handle any postprocessing, or compile and link it separately. It should return a value of one (1) at end of input or zero (0) if there is more input to process.
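A minimal yywrap -- assuming no postprocessing is needed -- simply tells lex that the input is exhausted:

     yywrap()
     {
          /* nothing to wrap up; signal end of input */
          return(1);   /* return 0 instead to make lex keep scanning */
     }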





LEX FILTER PROGRAMS

Now, let's look at a couple more examples and compare them to Shell commands. The first example is a simple translator from lower- to uppercase. The command to accomplish this is:

tr "[a-z]" "[A-Z]"

When a lexical analyzer program finds text that matches an RE, it puts the matched text in an external character array called yytext.



          %{
          #include <ctype.h>       /* for toupper */
          char *c;  /* character pointer to matched text */
          %}
          %%
          [a-z]+  {
                        /* convert matched lowercase to uppercase */
                        for(c = yytext; (*c = toupper(*c)) != '\0'; c++)
                              ;
                        printf("%s", yytext);
                  }
          %%

Figure X.4 A Filter to Translate Lowercase to Uppercase




     %%
     [Aa]m          |
     [Aa]re         |
     [Bb]e          |
     [Bb]een        |
     [Bb]eing       |
     [Ii]s          |
     [Ww]as         |
     [Ww]ere        {
                           printf("\\fB%s\\fR", yytext);  /* BOLD to Roman */
                    }
     [A-Za-z]+      {
                           /* match longer words that contain the above
                                like care, his, aware */
                           ECHO;     /* print them as is */
                    }
     %%

Figure X.5 A Filter to Embolden Passive Verbs in nroff Text




Using yytext, an identical translator could have been written as shown in Figure X.4. The lex statement, ECHO, uses yytext to reproduce matched input on standard output; ECHO is defined as printf("%s", yytext);.
Another example (Figure X.5) specifies a simple filter to embolden passive verbs (is, was, were) in nroff text. The output could be piped into nroff to aid the writer in finding passive verbs. In this example, rather than repeating the action for each word found, the vertical bar (|) symbol works as a logical OR to connect many regular expressions with the same action.
By now, you're probably wondering how this lex source code becomes an executable program (Figure X.6). Suppose the lex code for this example was in a file called passive.l.

Figure X.6 Lex Filter Creation




               CFLAGS       = -O
               OBJECTS      = main.o lex.yy.o yywrap.o
               LIBS         = -ll
               shellmet:    $(OBJECTS)
                            cc $(CFLAGS) $(OBJECTS) $(LIBS) -o shellmet
               main.o:      shellext.h
               lex.yy.o:    shellmet.h lex.yy.c
               lex.yy.c:    shellmet.l
                            lex shellmet.l
               yywrap.o:    shellext.h

Figure X.7 Shellmet Makefile






(The make command, Figure X.7, recognizes files with a .l suffix as lexical analyzer code.) To generate, compile, link, and execute passive.l, you would enter the following commands:

lex passive.l                    # Generate lex.yy.c
cc lex.yy.c -ll -o passive       # Compile and link with the lex library (-ll)
passive < textfile | nroff

The command lex passive.l creates a function called yylex() in the file lex.yy.c. The yylex function will process the input according to the specifications in passive.l. Next, you have to compile the generated C language (lex.yy.c) and link in the lex library (-ll), which contains default main and yywrap routines.
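The library versions are trivial. The default main amounts to something like this sketch (not the actual library source):

     main()
     {
          yylex();   /* scan standard input until end of input */
          exit(0);
     }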




                #include <stdio.h>           /* for freopen and fprintf */

                main(argc,argv)
                int argc;
                char **argv;
                {
                     for(argc--,argv++ ; argc > 0 ; argc--,argv++)
                     {   /* for all arguments */

                         if( freopen(*argv, "r", stdin) ) {  /* successful */
                              yylex();                       /* call yylex */
                          } else {                           /* NULL */
                             fprintf(stderr, "Unable to open %s\n", *argv);
                          }    /* END IF */

                     }     /* END FOR */
                }

Figure X.8 Listing of main Module That Calls yylex



Of course, redirecting standard input can be tiring. To specify file names on the command line, you can write your own main module that calls yylex (see Figure X.8). This module will reopen standard input and call yylex for each file on the command line. The main module can do most of the preprocessing that any lexical analyzer requires. Any postprocessing can be handled by yywrap.
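With this main module linked in place of the library version, the filter takes file names directly on the command line (the file names here are hypothetical):

    passive chapter1.t chapter2.t | nroff | lp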





A SHELL QUALITY ANALYZER

Let's use these parts -- preprocessor, lexical analyzer, and postprocessor -- to write a complexity analyzer for Shell script programs. Complexity is a function of the number of decisions and the size of a program. The verbs in the Shell programming language are case, for, foreach, if, repeat, switch, until, and while. Let's write a lexical analyzer that counts the occurrences of each of these keywords and prints them if the decision count exceeds 7 +/- 2 (the classic limit on human short-term memory).
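For example, a short hypothetical script such as this one contains three decisions -- one for, one if, and one while -- comfortably under the limit:

     # hypothetical sample input for the analyzer
     for f in *.c
     do
          if grep TODO $f > /dev/null
          then
               echo "$f needs work"
          fi
     done
     while read name
     do
          echo "$name"
     done < names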
We'll use the main program shown in Figure X.8. Figure X.9 shows the lex source for the analyzer, shellmet.l. Figure X.10 shows the include file for the keyword counters, and Figure X.11 shows the yywrap module that will accumulate the decision count, compare it to a maximum value, print the results, and reinitialize the counters. Figure X.12 shows the include file that yywrap and main use to reference the external data declared by shellmet.l.





     %{
     #include "shellmet.h"
     %}
     %%
     case            case_cnt++;
     for             for_cnt++;
     foreach         foreach_cnt++;
     if              if_cnt++;
     repeat          repeat_cnt++;
     switch          switch_cnt++;
     until           until_cnt++;
     while           while_cnt++;
     #[^\n]*         comment_cnt++;
     [\n]            line_cnt++;
     [ \t]           ;
     [A-Za-z0-9_.]+  { ;  /* match and delete anything else that
                           might contain the above (e.g. filenames) */
                     }
     [^#A-Za-z0-9_.] { ;  /* match and delete anything else
                           (e.g. numbers, punctuation) */
                     }
     %%

Figure X.9 The Lex Code for the Analyzer




                         /* count decisions and lines */

                         int case_cnt          =0;
                         int comment_cnt       =0;
                         int for_cnt           =0;
                         int foreach_cnt       =0;
                         int if_cnt            =0;
                         int repeat_cnt        =0;
                         int switch_cnt        =0;
                         int until_cnt         =0;
                         int while_cnt         =0;
                         int line_cnt          =0;

                         #include <stdio.h>
                         #include <ctype.h>

Figure X.10 The include File for shellmet.l




Note that in Figure X.9 we had to make allowances for Shell comments -- any part of a line beginning with a pound sign (#) and ending before a newline (\n). We don't want to count the strings case, for, foreach, if, repeat, switch, until, and while if they occur inside a comment. Similarly, I had to define a regular expression, [A-Za-z0-9._]+, to rule out the possibility of these words occurring inside another word -- consider a file name such as case01 or a word such as foreign. It's simple to create regular expressions for the words you want to find, but not so easy to create the REs that represent larger strings that might include your keywords.
Also, since I didn't want any of the input to fall through to standard output, I had to include rules for blanks and tabs [ \t] and for any other input [^#A-Za-z0-9._\t\n] to keep them from passing through to standard output. The circumflex (^) causes the RE to match any character other than the ones within the square brackets.
I could have used a period (.), which matches any character, instead of the more complex RE with the circumflex, but I prefer to specify REs exactly to simplify future maintenance. For example, '.*' matches any number of occurrences of any character. Because of the ambiguity rule (lex always looks for the longest match), this RE could easily override other REs that have been specified.
Lex also takes the first match: when two rules match strings of the same length, the rule specified first takes precedence. Because of these two ambiguity rules, the longest REs should be placed at the end of the rules section. Consider the order of the following lex rules:

[a-z]+ identifier action;
for keyword action;

In this example, the first rule would always match the word for because the first rule has precedence; the keyword action would never be executed. If you have problems with a lexical analyzer, examine your lex code for ambiguity and precedence violations.
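Reversing the two rules cures the problem -- the keyword rule now wins the tie on for, while [a-z]+ still picks up anything longer:

for keyword action;
[a-z]+ identifier action;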

     yywrap()
     {

     #define MAX_DECISIONS 10
     #define MAX_LINES    100
     int decision_cnt = 0;

     decision_cnt = case_cnt + for_cnt + foreach_cnt + if_cnt +
          repeat_cnt + switch_cnt + until_cnt + while_cnt;

     if(decision_cnt <= MAX_DECISIONS && line_cnt <= MAX_LINES) {
          printf("Quality Okay!\n");
          printf("Comment count = %3d\n", comment_cnt);
     } else {
          printf("Case count = %4d\n", case_cnt);
          printf("For count = %4d\n", for_cnt);
          printf("Foreach count = %4d\n", for_cnt);
          printf("If count = %4d\n", if_cnt);
          printf("Repeat count = %4d\n", Repeat_cnt);
          printf("Switch count = %4d\n", switch_cnt);
          printf("Until count = %4d\n", until_cnt);
          printf("While count = %4d\n", while_cnt);
          printf("Total = %4d\n", decision_cnt);
          printf("\nTotal Lines = %4d\n", line_cnt);

          if(decision_cnt > MAX_DECISIONS) {
           printf("\nDecisions exceed quality standards\n");
          }
          if(line_cnt > MAX_LINES) {
           printf("\nTotal Lines exceed quality standards\n");
          }   /* END IF */
     }    /* END IF */

     case_cnt        = 0;   /* reinitialize the variables */
     comment_cnt     = 0;
     for_cnt         = 0;
     foreach_cnt     = 0;
     if_cnt          = 0;
     repeat_cnt      = 0;
     switch_cnt      = 0;
     until_cnt       = 0;
     while_cnt       = 0;
     line_cnt        = 0;

     return(1);  /* end of input */
     }

Figure X.11 Listing of the yywrap Module



                           extern int case_cnt;
                           extern int comment_cnt;
                           extern int for_cnt;
                           extern int foreach_cnt;
                           extern int if_cnt;
                           extern int repeat_cnt;
                           extern int switch_cnt;
                           extern int until_cnt;
                           extern int while_cnt;
                           extern int line_cnt;

Figure X.12 The shellext.h include File



A PRETTY FILTER

If you're looking at code hour after hour, your mind can begin to miss key constructs and keywords. To make these stand out on the printed page, we could create a lex filter that emboldens the keywords in C language, Shell, or anything else for that matter. A simple program, pretty_pr, to embolden these words would be:

     %{
     #define BOLD printf("\\fB%s\\fR", yytext)
     %}
     %%
     if      |
     else    |
     switch  |
     case    |
     default |
     until   |
     while   |
     [{}]    BOLD;
     [a-z]+  ECHO;
     %%

We could then embolden and print file.c as follows:

    pretty_pr < file.c | nroff | lp



OTHER LEX ROUTINES

Let's look at some of the more infrequently used features of lex: input(), output(), unput(), yyleng, yymore(), and yyless().

input()    gets a character from stdin

output()   writes a character on stdout

unput()    puts a character back on stdin

yymore()   looks for additional matching characters

yyless()   trims characters from yytext and puts them back on stdin

The input() and unput() routines provide a "look-ahead" capability. The two routines, yymore() and yyless(), tell lex to look for longer matches or cut characters from the matched text. The lexical analyzer keeps the string length of yytext in an external integer variable called yyleng. We could use these functions to analyze C language comments. In C language, comments begin with the string "/*" and end with "*/". Figure X.13 shows a lex source program to strip and print comments from C language.
The lex rule matches "/*" and then looks ahead at every character up to, but not including, the next "/". In the C language action, I check whether yytext ends with a trailing "*"; if so, the comment is complete, and I add the trailing "/" using input(). If yytext begins with "/*" but doesn't end with a "*", the "/" just scanned was embedded in the comment, so I use yymore() to expand the match. Finally, the last two rules delete any text outside a comment.
I should warn you that you can get into trouble if the matched text gets too long. The yytext[] array can hold only about 200 characters; it isn't big enough for comments that run on for many lines. To handle such problems, you'll need a syntax analyzer.
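Figure X.13 exercises input() and yymore() but not yyless(). As a hypothetical sketch, this rule matches a word plus the single character that follows it, then uses yyless() to give the extra character back to the input:

     %%
     [a-z]+[^a-z]   {
                          yyless(yyleng - 1);  /* keep the word; return the extra character */
                          ECHO;                /* print just the word */
                    }
     %%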





     %%
     [/][*][^/]+  {
                  if(  yytext[yyleng - 1] == '*')  { /* end comment */

                        yytext[yyleng++] = input();  /* get '/' */
                        yytext[yyleng] = '\0';       /* terminate the string */
                        printf("%s\n", yytext);

                  } else {

                        yymore();  /* keep looking */

                  }  /* END IF */
            }
     [^/]+              /* delete everything else */ ;
     [/]          ;
     %%

Figure X.13 A Lex Program to Strip and Print Comments from C Language



          #include <stdio.h>
          #include "shellext.h"
          #include "shelltoken.h"

          main (argc,argv)           /* main for shellmet */
          int argc;
          char **argv;
          {
          int token = 0;
             for(argc--,argv++ ; argc > 0 ; argc--,argv++)
             {
                if( freopen(*argv, "r", stdin) ) { /* successful */

                    while(( token = yylex() ))     /* zero at EOF */
                     {
                        switch(token)
                         {
                            case CASE:
                               case_cnt++;  /* increment token counter */
                                break;
                            case COMMENT_START:
                               comment_cnt++;  /* increment token counter */
                               /* delete tokens until the terminating NEWLINE (or end of input) */
                               while(( token = yylex()) && token != NEWLINE)
                                    ;
                                   /* fall through */
                            case NEWLINE:
                               line_cnt++;  /* increment token counter */
                                break;
                            case FOR:
                               for_cnt++;  /* increment token counter */
                                break;
                            case FOREACH:
                               foreach_cnt++;  /* increment token counter */
                                break;
                            case IF:
                               if_cnt++;  /* increment token counter */
                                break;
                            case REPEAT:
                               repeat_cnt++;  /* increment token counter */
                                break;
                            case SWITCH:
                               switch_cnt++;  /* increment token counter */
                                break;
                            case UNTIL:
                               until_cnt++;  /* increment token counter */
                                break;
                            case WHILE:
                               while_cnt++;  /* increment token counter */
                                break;
                            default:
                               break;  /* ignore other tokens */
                         }  /* END SWITCH */
                     }   /* END WHILE */
                 } else  {
                    fprintf(stderr, "Unable to open %s\n", *argv);
                 }   /* END IF */
             }   /* END FOR */
          }

Figure X.14 The main Module (Syntax Analyzer Version)



USING LEX WITH A SYNTAX ANALYZER

Let's use the previous example, the Shell complexity analyzer, to explain how to analyze syntax. Figure X.14 shows the new main program and Figure X.15 shows the new lex program that returns tokens to the main program. Figure X.16 shows the definition of the tokens. The yywrap routine stays the same.



     %{
     #include "shelltoken.h"
     #include "shellmet.h"
     %}
     %%
     case      return(CASE);
     for       return(FOR);
     foreach   return(FOREACH);
     if        return(IF);
     repeat    return(REPEAT);
     switch    return(SWITCH);
     until     return(UNTIL);
     while     return(WHILE);
     [#]       return(COMMENT_START);
     [\n]      return(NEWLINE);
     [ \t]          ;
     [A-Za-z0-9_.]+  {   /* match and delete anything else that
                          might contain the above */
                     }
     [^#A-Za-z0-9_.] { ;  /* match and delete anything else
                           (e.g. numbers, punctuation) */
                     }
     %%

Figure X.15 Lex Code (Syntax Analyzer Version)





                         /* shell tokens */

                        #define CASE                1
                        #define FOR                 2
                        #define FOREACH             3
                        #define IF                  4
                        #define REPEAT              5
                        #define SWITCH              6
                        #define UNTIL               7
                        #define WHILE               8
                        #define COMMENT_START       9
                        #define NEWLINE             10

Figure X.16 The shelltoken.h include File



Notice how the yylex routine returns tokens to the main program, which now tallies them. Also notice that "#" is defined as COMMENT_START and that a NEWLINE is the end of a Shell comment.
In the main module, a return of COMMENT_START causes the logic to loop until it finds a NEWLINE to end the comment. Then it falls through and increments the NEWLINE counter. As you can see, syntax analysis of the language simplifies the regular expressions used in the lex code and places the burden of analysis on the calling module.
If you need to get into complicated grammars or syntax, yacc -- the parser generator -- generates programs from more easily understood syntax specifications. The two, in tandem, are powerful tools to analyze text and its grammar or syntax.
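To give you a taste of the tandem arrangement, here is a hypothetical yacc fragment that consumes the tokens shellmet.l returns, handling just two of them for brevity. (In a real pairing, the lex source would include the yacc-generated y.tab.h for its token values rather than shelltoken.h.)

     %{
     extern int if_cnt, line_cnt;   /* counters declared in shellext.h */
     %}
     %token CASE FOR FOREACH IF REPEAT SWITCH UNTIL WHILE COMMENT_START NEWLINE
     %%
     script  :   /* empty */
             |   script item
             ;
     item    :   IF          { if_cnt++; }
             |   NEWLINE     { line_cnt++; }
             ;
     %%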



SUMMARY

I've covered the lexical analyzer and many of its robust features in just enough detail to make you dangerous. I recommend reading the documentation on lex to discover its other capabilities. To write your own lexical analyzer, see Kernighan and Plauger (1976). For a more thorough understanding of parsers and lexical analyzers, I recommend Aho et al. (1985).
I hope this chapter has given you some understanding of lex and how to use it to build filters and syntax analyzers. And I hope you are intrigued by the joy of lex.



EXERCISES

1. Write a lexical analyzer to embolden all of the control keywords in C language: if, else, switch, while, until, and {}.

2. Expand the Shell metrics analyzer to count all files and variables.
