Scanning & Parsing with Lex and YACC
Hans-Arno Jacobsen
ECE 297
Can we generate code to support mundane coding tasks and save time?
Powerful, but not easy
Give you an example for Milestone 1.
• Submissions: 99
• Average for A2: 71%
• Early submission bonus: 1
• Full marks: 5
• 16 teams attempted nonce bonus
• 7 got full marks
• 7 teams attempted ACC bonus
• 7 got full marks
CoursePeer – try it out!
• Developed by a former ECE297 student
  – Many of the videos under tips & tricks are from him too
• Short video about CoursePeer
• To sign up and auto-enrol under ECE297, use this link
  – http://www.crspr.com/?rid=339
• Will have a quick demo and use it on Wednesday for our Q&A session
Know your tools!
• Can we generate code based on a specification of what we want?
• Is the specification simpler than writing a program for doing the same task?
• Fully automated program generation has been a dream since the early days of computing.
Where do we need parsing in the storage server?
• Configuration file (file)
• Bulk loading of data files (file)
• Protocol messages (network)
• Command line arguments (string)
Parsing
• default.conf – the way the disk may see it
server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

server_host localhost
server_port 1111
table marks
data_directory ./data

Tokens
PROPERTY VALUE
PROPERTY VALUE
(TABLE TABLE-NAME)+
PROPERTY VALUE
Scenarios
Where would we like to save time in writing a quick language processor?
Conceptually speaking
• Languages
  – Data description language
  – Script language
  – Markup language
• System configurations
• Workload generation
In our storage servers
• Languages
  – Data schema & data
  – Query language
  – Output formatting (Web, LaTeX, PDF, Word, Excel)
• Storage server configuration
• Benchmarking
Parser generation from 30K feet
[Figure: Specification → Generator → Generated code; Generated code + Other code → Compiler / Linker → Executable. Callout "Written by developer" appears twice.]
Scanning & parsing I
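[Figure: the input characters (server_host localhost \n server_port 1111 \n table marks \n # This data …) pass through three stages.]
• Scanning: the characters become tokens (PROPERTY VALUE PROPERTY VALUE …)
• Parsing: the tokens are matched against PROPERTY VALUE PROPERTY VALUE (TABLE TABLE-NAME)+ PROPERTY VALUE
• Processing: verify content, add to data structures, …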
Regular expressions
• (TABLE TABLE-NAME)+
  – TABLE TABLE-NAME
  – TABLE TABLE-NAME TABLE TABLE-NAME
  – …
• Regular expressions (formal languages)
• Extended regular expressions (UNIX)
Patterns
Scanning & parsing II
• Parsing is really two steps
  – Scanning (a.k.a. tokenizing or lexical analysis)
  – Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
• flex is the scanner generator (open source)
  – Fast Lex for lexical analysis
• YACC is the parser generator
  – Yet Another Compiler Compiler for structural and syntax analysis
• Lex and YACC work together
• Generated parser drives the generated scanner (yyparse() calls yylex())
• We use flex (fast Lex) and Bison (GNU YACC)
• There are myriads of other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)
Objectives for today
• Cover the basics of Lex & Yacc
• Everybody should have an appreciation of the potential of these tools
• There is a lot more detail that remains unsaid
• To challenge you
Lex & YACC overview
[Figure: input stream → Lexical Analyzer → token stream → Structural Analyzer → output defined by actions in parser specification (often an in-memory representation of input).]
Input stream: server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
Token stream: PROPERTY VALUE PROPERTY VALUE
LEXICAL ANALYSIS WITH LEX
Lex introduction
[Figure: flex turns the input specification (*.l) into lex.yy.c; the C compiler builds it into the lexical analyzer, which turns the input stream into a token stream.]
• You generate the lexical analyzer by using flex
• flex is fast Lex
• You can control the name of the generated file
• Synonyms: lexical analyzer, scanner, lexer, tokenizer
Lex
• Input specification for lex – the “program”
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Options used by flex inside the scanner
  – Defines variables & macros
  – Code within “%{” and “%}” directly copied into the scanner (e.g., global variables, header files)
• Second part: Rules
  – Patterns and corresponding actions
    • Actions are executed when the corresponding pattern(s) match
  – Patterns are defined by regular expressions
Parsing the configuration file of Milestone 1
config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z     [a-zA-Z]
host    server_host
port    server_port
dir     data_directory

%%

{host}      { return HOST_PROPERTY; }
{port}      { return PORT_PROPERTY; }
table       { return TABLE; }
{dir}       { return DDIR_PROPERTY; }
[\t\n ]+    { }
#.*\n       { }
{a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
[0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.           { return yytext[0]; }
…

• a2Z, host, port and dir are shorthands for use below
• In each rule, the left column is the pattern and the right column is the action
flex pattern matching principles
• Actions are executed when patterns match
  – Tokens are returned to caller; next pattern …
• Patterns match a given input character or string only once
  – Input stream is consumed
• flex executes the action for the longest possible matching input
  – Order of patterns in the spec. is important: when two patterns match input of the same length, the pattern listed first wins
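The two matching principles above (longest match, then first rule on a tie) can be tried out by hand. The following is a minimal sketch in plain C, not what flex generates: the token codes, the helper names and the three-rule table are assumptions made up for illustration, loosely modelled on the host, table and letter-string rules of config_parser.l.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes for this sketch (not the values bison generates). */
enum token { TOK_NONE = 0, TOK_HOST, TOK_TABLE, TOK_STRING };

/* Length of the literal keyword kw at the start of s, or 0 if it is not there. */
static size_t match_keyword(const char *s, const char *kw)
{
    size_t n = strlen(kw);
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the leading run of letters, mirroring the [a-zA-Z] shorthand. */
static size_t match_letters(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/*
 * Try every rule at the start of s, in specification order, and keep the
 * longest match. A later rule replaces the current best only if it is
 * strictly longer, so the earlier rule wins a tie.
 * Stores the matched length in *len; returns TOK_NONE if no rule matches.
 */
enum token next_token(const char *s, size_t *len)
{
    const enum token toks[3] = { TOK_HOST, TOK_TABLE, TOK_STRING };
    size_t lens[3];
    enum token best = TOK_NONE;
    size_t best_len = 0;

    lens[0] = match_keyword(s, "server_host");
    lens[1] = match_keyword(s, "table");
    lens[2] = match_letters(s);

    for (int i = 0; i < 3; i++) {
        if (lens[i] > best_len) {
            best_len = lens[i];
            best = toks[i];
        }
    }
    *len = best_len;
    return best;
}
```

With these rules, the input "table" is matched by both the keyword rule and the letter rule with length 5, so the keyword rule listed first wins; "tables" is matched with length 6 by the letter rule only, so it comes back as a string.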
flex regular expressions by example I
(Really: extended regular expressions)
`x'          match the character 'x'
`.'          any character (byte) except newline
`[xyz]'      match either an 'x', a 'y', or a 'z'
`[abj-oZ]'   match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
`[^A-Z]'     a "negated character class", i.e., any character EXCEPT those in the class
`[^A-Z\n]'   any character EXCEPT an uppercase letter or a newline
flex regular expressions by example II
(r is any regular expression)
`r*'        zero or more r's
`r+'        one or more r's
`r?'        zero or one r (that is, “an optional r”)
`r{2,5}'    anywhere from two to five r's
`r{2,}'     two or more r's
`r{4}'      exactly 4 r's
`<<EOF>>'   an end-of-file
flex regular expressions
• There are many more expressions, see manual
• Form complex expressions
  – E.g.: IP address, names, …
• The expression syntax is used in other tools as well (well worth learning)
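One of those other tools is the POSIX regular expression API in C, which makes it easy to experiment with the shared core of the syntax (character classes, *, +, ?, {m,n}). The sketch below assumes a POSIX system with <regex.h>; the helper name matches is made up for illustration. flex-only constructs such as <<EOF>> or {name} shorthands are not part of POSIX extended regular expressions, so they cannot be tested this way.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Return 1 if the POSIX extended regular expression `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern does not
 * compile. Anchor the pattern with ^ and $ to require a whole-string match.
 */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int found;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

For example, matches("^[0-9]+$", "1111") accepts a port number, and a rough dotted-quad check such as matches("^[0-9]+(\\.[0-9]+){3}$", "127.0.0.1") shows how simple pieces combine into a more complex expression (it does not check that each part is at most 255).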
Parsing the configuration file of Milestone 1
config_parser.l

%{
#include "config_parser.tab.h"
...
%}
a2Z     [a-zA-Z]
host    server_host
port    server_port
dir     data_directory

%%

{host}      { return HOST_PROPERTY; }
{port}      { return PORT_PROPERTY; }
table       { return TABLE; }
{dir}       { return DDIR_PROPERTY; }
[\t\n ]+    { }
#.*\n       { }
{a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
[0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
.           { return yytext[0]; }
<<EOF>>     { return 0; }

• yylval.sval, yylval.pval: user-defined variable in YACC (conveys token value to YACC)

Input:
server_host localhost
server_port 1111
table marks
data_directory ./data
PARSING WITH YACC
YACC introduction
[Figure: YACC turns the input specification (*.y) into y.tab.c; the C compiler builds it into the syntax analyzer / parser, which reads a token stream, e.g., via flex, and produces output defined by actions in the parser specification.]
• From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar
• You can control the name of the generated file
YACC
• Input specification for YACC (similar to flex)
  – Three parts: Definitions, Rules, User code
  – Use “%%” as a delimiter for each part
• First part: Definitions
  – Definition of tokens for the second part and for use by flex
  – Definition of variables for use by the parser code
• Second part: Rules
  – Grammar for the parser
• Third part: User code
  – The code in this part is copied into the parser generated by YACC
Configuration file parser Milestone 1
Definition section, config_parser.y

%{
#include <stdio.h>
#include <stdlib.h>   /* malloc, used in the rule actions */
#include <string.h>

struct table *tl, *t;
struct configuration *c;

/* define a linked list of table names */
struct table {
    char *table_name;
    struct table *next;
};

/* define a structure for the configuration information */
struct configuration {
    char *host;
    int port;
    struct table *tlist;
    char *data_dir;
};
Configuration file parser Milestone 1
Definition section cont’d., config_parser.y

%}
%union {
    char *sval;   // String value (user defined)
    int pval;     // Port number value (user defined)
}
%token <sval> STRING
%token <pval> PORT_NUMBER
%token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
%%
Configuration file parser Milestone 1
(Grammar) Rules section (simplified), config_parser.y

property_list:
    HOST_PROPERTY STRING
    PORT_PROPERTY PORT_NUMBER
    table_list
    data_directory
    ;
table_list:
    table_list TABLE STRING
    | TABLE STRING
    ;
data_directory:
    DDIR_PROPERTY STRING
    ;
%%
(Grammar) Rules section (details), config_parser.y

data_directory:
    DDIR_PROPERTY STRING
    {
        c = (struct configuration *) malloc(sizeof(struct configuration));
        // Check c for NULL
        c->data_dir = strdup( $2 );
    }
    ;

• $1 refers to DDIR_PROPERTY, $2 to STRING
(Grammar) Rules section (details), config_parser.y

property_list:
    HOST_PROPERTY STRING PORT_PROPERTY PORT_NUMBER table_list data_directory
    {
        c->host = strdup( $2 );
        c->port = $4;
        c->tlist = tl;
    }
    ;

struct configuration { char *host; int port; struct table *tlist; char *data_dir; };
struct configuration *c;

• table_list: … TABLE STRING TABLE STRING
table_list is a recursive rule
• Example table specification in configuration file
    table MyCourses
    table MyMarks
    table MyFriends
• table_list: table_list TABLE STRING | TABLE STRING ;
• Terminology
  – table_list is called a non-terminal
  – TABLE & STRING are terminals
Recursive rule execution
table_list: table_list TABLE STRING | TABLE STRING ;

Expansion of the rule:
table_list : table_list TABLE STRING
table_list TABLE STRING TABLE STRING
TABLE STRING TABLE STRING TABLE STRING

Input:
table MyCourses
table MyMarks
table MyFriends

Steps shown on the slide:
table MyCourses
table MyMarks table MyCourses
table MyMarks table MyCourses table MyFriends
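config_parser.y

table_list:
    table_list TABLE STRING
    {
        t = (struct table *) malloc(sizeof(struct table));
        t->table_name = strdup( $3 );
        t->next = tl;
        tl = t;
    }
    | TABLE STRING
    {
        tl = (struct table *) malloc(sizeof(struct table));
        tl->table_name = strdup( $2 );
        tl->next = NULL;
    }
    ;

struct table { char *table_name; struct table *next; };
struct table *tl, *t;

• In the first alternative, $1 is table_list, $2 is TABLE and $3 is STRING; in the second, $1 is TABLE and $2 is STRING
[Figure: linked list of table nodes. tl points to the head; the first node is created with tl->next = NULL, and each further node t is linked in with t->next = tl.]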
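Because each new node is linked in front of tl, the finished list holds the tables in the reverse of their order in the configuration file. The sketch below isolates that list handling in plain C so it can be run without a parser; the function names prepend_table and free_tables are made up for illustration, and the name is copied with malloc and memcpy as a portable stand-in for strdup.

```c
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/*
 * Link a copy of name in front of head, as the table_list actions do
 * with t->next = tl; tl = t;. Returns the new head, or NULL if an
 * allocation fails (the old list is left untouched in that case).
 */
struct table *prepend_table(struct table *head, const char *name)
{
    size_t len = strlen(name) + 1;
    struct table *t = malloc(sizeof *t);

    if (t == NULL)
        return NULL;
    t->table_name = malloc(len);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    memcpy(t->table_name, name, len);
    t->next = head;
    return t;
}

/* Release every node and its name. */
void free_tables(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Prepending MyCourses, MyMarks and MyFriends in that order leaves MyFriends at the head and MyCourses at the tail. If file order matters to later processing, the list would have to be reversed or appended to instead.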
How to invoke the parser
int main (int argc, char **argv)
{
    FILE *f;
    extern FILE *yyin;
    if (argc == 2) {
        f = fopen(argv[1], "r");
        if (!f) {
            … // error handling …
        }
        yyin = f;
        while ( ! feof(yyin) ) {
            if (yyparse() != 0) {
                …
                yyerror("");
                exit(1);   /* non-zero exit status signals the parse error */
            }
        }
        fclose(f);
    }
    …

• yylex() for calling generated scanner
• by default called within yyparse()
In the Makefile
lexer: config_parser.l
	${LEX} config_parser.l
	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

yaccer: config_parser.y
	${YACC} -d config_parser.y
	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

parser: config_parser.tab.o lex.yy.o
	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
		config_parser.tab.o \
		parser.o
Benefits
• Faster development
  – Compared to manual implementation
• Easier to change the specification and generate new parser
  – Than to modify 1000s of lines of code to add, change, delete an existing feature
• Less error-prone, as code is generated
• Cost: Learning curve
  – Invest once, amortized over 40+ years career
If you want to know more
• Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4
• 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC
• Lectures on Computability and Theory of Computation may also show you these algorithms
A flex specification

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%
" "            ;
[a-z]          { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]          { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]    { c = yytext[0]; return(c); }

• The header: everything up to the %% line
• Temporary variable(s), e.g., c
• The “guts”: regular expressions annotated with actions
The header

%{
#include <stdio.h>
#include "y.tab.h"
int c;
extern int yylval;
%}
%%

• yylval: special variable
  – set in scanner (declared extern here; the generated parser defines it)
  – used in parser
  – for transferring values associated with tokens to parser
• %%: dividing line between header and rules section
The rules

%%
" "            ;
[a-z]          { c = yytext[0]; yylval = c - 'a'; return (LETTER); }
[0-9]          { c = yytext[0]; yylval = c - '0'; return (DIGIT); }
[^a-z0-9\b]    { c = yytext[0]; return(c); }

• yytext: the string associated with the token
The rules

%%
" "            ;
[a-z]          { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
[0-9]          { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
[^a-z0-9\b]    { c = yytext[0]; return(c); }

• [a-z]: sets yylval to the character’s alphabetical order
• [0-9]: sets yylval to digit’s numerical value
• otherwise simply returns that character; presumably it’s an operator: +*-, etc.
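What the three non-blank rules do to a single character can be written as an ordinary C function. This is a sketch for experimenting, not scanner code: the token codes are hypothetical placeholders for the ones y.tab.h would define, the value comes back through a pointer instead of yylval, and the blank and backspace cases of the real rules are left out.

```c
/* Hypothetical token codes; the real ones would come from y.tab.h. */
enum { LETTER = 258, DIGIT = 259 };

/*
 * Mirror the letter, digit and catch-all rules for one input character.
 * A lowercase letter yields LETTER with *value set to 0..25, a digit
 * yields DIGIT with *value set to its numeric value, and any other
 * character is returned as itself with *value left unchanged.
 */
int classify(int ch, int *value)
{
    if (ch >= 'a' && ch <= 'z') {
        *value = ch - 'a';
        return LETTER;
    }
    if (ch >= '0' && ch <= '9') {
        *value = ch - '0';
        return DIGIT;
    }
    return ch;
}
```

Returning the character itself for everything else is what lets the grammar refer to operators directly as '+' or '-' without declaring a token for each of them.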
Simple example
• Implement a calculator which can recognize addition and subtraction of numbers

[linux33]% ./y_calc
1+101
= 102
[linux33]% ./y_calc
1000-300+200+100
= 1000
[linux33]%
Example – the Lex part

%{
#include <math.h>
#include <stdlib.h>   /* atoi */
#include "y.tab.h"
extern int yylval;
%}

%%
[0-9]+    { yylval = atoi(yytext); return NUMBER; }
[\t ]+    ;             /* Do nothing for white space */
\n        return 0;     /* End of the logic */
.         return yytext[0];
%%

• Definitions: the part before the first %%
• Rules: each line is a pattern followed by an action
Example – the Yacc part

%{
#include <stdio.h>   /* printf */
%}
%token NAME NUMBER
%%
statement: NAME '=' expression
    | expression                    { printf("= %d\n", $1); }
    ;
expression: expression '+' NUMBER   { $$ = $1 + $3; }
    | expression '-' NUMBER         { $$ = $1 - $3; }
    | NUMBER                        { $$ = $1; }
    ;

• Definitions: the part before %%
• Rules: the part after %%
• Include Yacc library (-ly)
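Because the expression rule is left-recursive, the generated parser combines the numbers from left to right, so 10-3-2 gives 5 rather than 9. The sketch below reproduces that evaluation order by hand in plain C for comparison with the calculator's output. It is not generated code, the function names are made up for illustration, and integer overflow is not checked.

```c
#include <ctype.h>
#include <stddef.h>

/* Skip blanks and tabs, like the scanner's white-space rule. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Read one NUMBER ([0-9]+) into *value; returns NULL if none is present. */
static const char *read_number(const char *s, int *value)
{
    s = skip_blanks(s);
    if (!isdigit((unsigned char)*s))
        return NULL;
    *value = 0;
    while (isdigit((unsigned char)*s))
        *value = *value * 10 + (*s++ - '0');
    return skip_blanks(s);
}

/*
 * Evaluate NUMBER (('+' | '-') NUMBER)* from left to right, the order
 * in which the left-recursive expression rule is reduced. The input
 * may end with a newline. Returns 0 and stores the value in *result
 * on success, or -1 on a syntax error.
 */
int eval_expression(const char *s, int *result)
{
    int acc, rhs;

    s = read_number(s, &acc);                 /* expression: NUMBER */
    if (s == NULL)
        return -1;
    while (*s == '+' || *s == '-') {
        char op = *s++;
        s = read_number(s, &rhs);
        if (s == NULL)
            return -1;
        acc = (op == '+') ? acc + rhs : acc - rhs;   /* $$ = $1 op $3 */
    }
    if (*s != '\0' && *s != '\n')
        return -1;
    *result = acc;
    return 0;
}
```

Compared with the two short specifications above, even this small hand-written version needs its own scanning, error handling and loop structure, which is the development cost the generator approach is meant to avoid.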