
Chap. 3, Theory and Practice of Scanning

By J. H. Wang, Mar. 8, 2011

Outline

• Overview of a Scanner
• Regular Expressions
• Examples
• Finite Automata and Scanners
• The Lex Scanner Generator
• Other Scanner Generators
• Practical Considerations of Building Scanners
• Regular Expressions and Finite Automata
• Summary

Overview of a Scanner

• Interactions between the scanner and the parser

(Figure: the parser calls getNextToken; the scanner reads the source program and returns a token; both access the symbol table; tokens flow on to semantic analysis.)

Overview of a Scanner

• Lexical analyzer, or lexer
• Token structure can be more detailed and subtle than one might expect
  – String constants: ""
    • Escape sequences: \", \n, …
    • Null string
  – Rational constants
    • 0.1, 10.01
    • .1, 10. vs. 1..10

• Possible to examine a language for design flaws

• Scanner generator avoids reimplementing common components

• Programming a scanner generator is declarative programming
  – What to scan, not how to scan
  – E.g. database query languages, Prolog, …

• Performance of scanners is important for production compilers, for example:
  – 30,000 lines per minute (500 lines per second)
  – 10,000 characters per second (for an average line of 20 characters)
  – For a processor that executes 10,000,000 instructions per second, that is 1,000 instructions per input character
  – Considering other tasks in compilers, 250 instructions per character is more realistic

Regular Expressions

• Convenient way to specify various simple sets of strings

• Search patterns in the Unix utility grep

• Context search in most editors

• Regular set: a set of strings defined by regular expressions

• Lexeme: an instance of a token class
  – E.g.: identifier
• Vocabulary (Σ): a finite character set
  – ASCII, Unicode
• Empty or null string (λ)
• Meta-characters: ( ) ' * + |
  – E.g.: ('('|')'|;|,)

• Operations
  – Catenation: joining individual characters to form a string
    • sλ ≡ λs ≡ s
    • If s1 ∈ P and s2 ∈ Q, then s1s2 ∈ (P Q)
  – Alternation (|): separates alternatives
    • E.g. D = (0|1|2|3|4|5|6|7|8|9)
    • The string s ∈ (P|Q) iff s ∈ P or s ∈ Q
    • E.g. (LC|UC)
  – Kleene closure (*): postfix Kleene closure operator
    • P*: the catenation of zero or more selections from P
    • s ∈ P* iff s = s1s2…sn such that si ∈ P (1 ≤ i ≤ n)

• Regular expressions can be defined as follows:
  – ∅ is a regular expression denoting the empty set
  – λ is a regular expression for the set that contains only the empty string
  – s is a regular expression denoting {s}
  – If A and B are regular expressions, then A|B, AB, and A* are also regular expressions

• Additional operations
  – P+: positive closure
    • P* = (P+|λ), P+ = P*P
    • E.g.: (0|1)+
  – Not(A): all characters in Σ not included in A (Σ − A)
    • E.g. Not(Eol)
    • Not(S) = (Σ* − S) if S is a set of strings
  – A^k: all strings formed by catenating k strings from A
    • E.g. (0|1)^32

Examples

• D: the set of the ten single digits
• L: the set of all upper- and lower-case letters
• Java or C++ single-line comment
  – Comment = //(Not(Eol))* Eol
• Fixed-decimal literal
  – Lit = D+.D+
• Optionally signed integer literal
  – IntLiteral = ('+'|'-'|λ) D+
• Comments delimited by ## markers, allowing single #'s within the comment
  – Comment2 = ##((#|λ)Not(#))* ##

• All finite sets are regular
• Some, but not all, infinite sets are regular
  – E.g.: {[m]m | m ≥ 1} (m open brackets followed by m matching close brackets) is not regular
• All regular sets can be defined by CFGs
• Regular expressions are quite adequate for specifying token-level syntax
• For every regular expression we can create an efficient device (a finite automaton) that recognizes exactly those strings matching the regular expression's pattern

Finite Automata and Scanners

• A finite automaton (FA) can recognize the tokens specified by a regular expression
  – A finite set of states
  – A finite vocabulary Σ
  – A set of transitions (or moves) from one state to another
  – A start state
  – A subset of the states called the accepting (or final) states

• E.g. Fig. 3.1 - (abc+)+

Deterministic Finite Automata

• DFA: an FA in which the transition taken from any state on any given character is unique
• Transition table T: a two-dimensional array indexed by a DFA state s and a vocabulary symbol c
  – T[s, c]
  – E.g.: Fig. 3.2 - // (Not(Eol))* Eol
• A full transition table contains one column for each character
  – To save space, table compression can be used, where only nonerror entries are explicitly represented (using hashing or linked structures)
• Any regular expression can be translated into a DFA that accepts the set of strings denoted by the regular expression

Coding the DFA

• A DFA can be coded in one of two forms
  – Table-driven
    • The transition table is explicitly represented in a runtime table that is "interpreted" by a driver program
    • Token independent
    • E.g. Fig. 3.3 (a sketch follows below)
  – Explicit control
    • The transition table appears implicitly as the control logic of the program
    • Easy to read and more efficient, but specific to a single token definition
    • E.g. Fig. 3.4
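For concreteness, here is a minimal table-driven driver in C for the comment DFA // (Not(Eol))* Eol of Fig. 3.2, in the spirit of Fig. 3.3. The state numbering and the collapsing of the vocabulary into three character classes are assumptions of this sketch, not details taken from the figures.

#include <stdio.h>

#define ERR (-1)
#define NSTATES 4

/* Character classes: collapsing the vocabulary keeps the table small. */
enum { C_SLASH, C_EOL, C_OTHER, NCLASSES };

/* Transition table for // (Not(Eol))* Eol:
   state 0: start; 1: saw "/"; 2: inside comment; 3: accept. */
static const int T[NSTATES][NCLASSES] = {
    { 1,   ERR, ERR },   /* 0 */
    { 2,   ERR, ERR },   /* 1 */
    { 2,   3,   2   },   /* 2 */
    { ERR, ERR, ERR },   /* 3 */
};
static const int accepting[NSTATES] = { 0, 0, 0, 1 };

static int char_class(int c) {
    if (c == '/')  return C_SLASH;
    if (c == '\n') return C_EOL;
    return C_OTHER;
}

/* The token-independent driver: interpret the table on the input s. */
static int accepts(const char *s) {
    int state = 0;
    while (*s != '\0' && state != ERR)
        state = T[state][char_class(*s++)];
    return state != ERR && accepting[state];
}

int main(void) {
    printf("%d\n", accepts("// a comment\n"));    /* prints 1 */
    printf("%d\n", accepts("/ not a comment\n")); /* prints 0 */
    return 0;
}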


• Two more examples of regular expressions
  – Fortran-like real literal
    • RealLit = (D+(λ|.)) | (D*.D+)
    • Fig. 3.5(a)
  – Identifier
    • ID = L(L|D)*(_(L|D)+)*
    • Fig. 3.5(b)

Transducers

• An FA that analyzes or transforms its input beyond simply accepting tokens
  – E.g. identifier processing in a symbol table
• An action table can be formulated that parallels the transition table

The Lex Scanner Generator

• Lex
  – Developed by M. E. Lesk and E. Schmidt at AT&T Bell Labs
  – Flex: a free reimplementation that produces faster and more reliable scanners
  – JFlex: for Java
  – (Fig. 3.6)

The Operation of the Lex Scanner Generator

• Steps
  – Write a scanner specification
  – Lex generates a scanner in C
  – The scanner is compiled and linked with other compiler components

Defining Tokens in Lex

• Lex allows the user to associate regular expressions with commands coded in C (or C++)
• Lex creates a file lex.yy.c that contains an integer function yylex()
  – It is normally called from the parser when a token is needed
  – It returns the token code of the token scanned by Lex
• It is important that the token codes returned are identical to those expected by the parser
  – The definition of token codes is shared through the file y.tab.h (a sketch of the parser side follows below)
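A hedged sketch of the parser side of this contract, following the usual Lex/Yacc conventions; the token-printing stand-in for a real parser is illustrative:

#include <stdio.h>
#include "y.tab.h"      /* token codes shared with the Lex specification */

extern int yylex(void); /* generated by Lex in lex.yy.c */
extern char *yytext;    /* Flex's declaration; AT&T Lex uses char yytext[] */

int main(void) {
    int tok;
    /* Pull tokens on demand, as a parser would, until EndFile (code 0). */
    while ((tok = yylex()) != 0)
        printf("token code %d, text \"%s\"\n", tok, yytext);
    return 0;
}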

The Character Class

• A set of characters treated identically in a token definition
  – identifier, number
  – Delimited by [ ]
  – \, ^, ], and - must be escaped
    • [\])]
  – Range: -
    • [x-z], [0-9], [a-zA-Z]
  – Escape character: \
    • \t, \n, \\, \010
  – Complement: ^ (the Not() operation)
    • [^xy], [^0-9], [^]

Using Regular Expressions to Define Tokens

• Catenation: juxtaposition of two expressions
  – [ab][cd]
• Alternation: |
• Case is significant
  – (w|W)(h|H)(i|I)(l|L)(e|E)
• Kleene closure * and positive closure +
• Optional inclusion: ? (zero times or once)
  – expr? ≡ expr|λ
• . (any single character other than a newline)
• ^ (beginning of a line), $ (end of line)
  – ^A.*e$

• Three sections
  – First section
    • Symbolic names associated with character classes and regular expressions
    • Source code: %{ … %}
      – Variable, procedure, and type declarations
      – E.g.

        %{
        #include "tokens.h"
        %}

– Second section: a table of regular expressions and corresponding commands
  • Input that is matched is stored in a global string variable yytext (whose length is yyleng)
  • The default size of yytext is determined by YYLMAX (default: 200)
    – May need to redefine YYLMAX to avoid overflow
  • The content of yytext is overwritten as each new token is scanned
    – It is safer to copy the contents of yytext (using strcpy()) before the next call to yylex(); see the sketch below
  • In the case of overlap
    – The longest possible match is preferred
    – Among matches of equal length, the earlier expression is preferred
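A small sketch of that precaution, assuming Flex's pointer declaration of yytext (AT&T Lex declares a char array) and an illustrative fixed buffer size:

#include <string.h>

extern int yylex(void);
extern char *yytext;          /* overwritten on every match */

static char saved[256];       /* illustrative fixed-size buffer */

int next_token(void) {
    int code = yylex();
    strcpy(saved, yytext);    /* keep a private copy before the next match */
    return code;
}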

Character Processing Using Lex

• A general-purpose character processing tool
  – Definitions of subroutines may be placed in the final section
    • E.g. {Non_f_i_p} { insert(yytext); return(ID); }
    • insert() could also be placed in a separate file
  – End-of-file is not handled by regular expressions
    • A predefined token EndFile, with token code zero, is automatically returned by yylex()
  – yylex() uses input(), output(), and unput()
    • When end-of-file is encountered, yylex() calls yywrap()
    • yywrap() returns 1 if there is no more input

• The longest possible match could sometimes be a problem
  – E.g. 1..10 vs. 1. and .10
  – Lex allows us to define a regular expression that applies only if some other expression immediately follows it
    • r/s: match r only if s immediately follows it
    • s: the right-context
    • E.g. [0-9]+/".."
• Symbols might have different meanings in a regular expression and in a character class
  – Fig. 3.13

Summary of Lex

• Lex is a very flexible generator
  – The difficult part is learning its notation and rules
• Lex's notation for representing regular expressions is used in other programs
  – E.g. the grep utility
• Lex can also transform input as a preprocessor
• Code segments must be written in C
  – Not language-independent

Creating a Lexical Analyzer with Lex

• Lex source program lex.l → Lex compiler → lex.yy.c
• lex.yy.c → C compiler → a.out
• Input stream → a.out → sequence of tokens

Another Example

• Patterns for tokens in the grammar
  – digit → [0-9]
  – digits → digit+
  – number → digits (. digits)? (E [+-]? digits)?
  – letter → [A-Za-z]
  – id → letter (letter | digit)*
  – if → if
  – then → then
  – else → else
  – relop → < | > | <= | >= | = | <>
  – ws → (blank | tab | newline)+

Example Lex Program

%{
/* definitions of manifest constants:
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     { }
if       { return(IF); }
then     { return(THEN); }
else     { return(ELSE); }
{id}     { yylval = (int) installID(); return(ID); }
{number} { yylval = (int) installNum(); return(NUMBER); }
"<"      { yylval = LT; return(RELOP); }
"<="     { yylval = LE; return(RELOP); }
"="      { yylval = EQ; return(RELOP); }
"<>"     { yylval = NE; return(RELOP); }
">"      { yylval = GT; return(RELOP); }
">="     { yylval = GE; return(RELOP); }
%%
int installID() {}
int installNum() {}

Other Scanner Generators

• Flex: free
  – It produces scanners that are faster than the ones produced by Lex
  – Options allow tuning of scanner size vs. speed
• JFlex: in Java
• GLA: Generator for Lexical Analyzers
  – It produces a directly executable scanner in C
  – It is typically twice as fast as Flex, and competitive with the best hand-written scanners
• re2c
  – It produces directly executable scanners
• Alex, Lexgen, …
• Others are parts of complete suites of compiler development tools
  – DLG: part of the PCCTS suite
  – Coco/R
  – Rex: part of the Karlsruhe/CocoLab cocktail toolbox

Practical Considerations of Building Scanners

• Finite automata sometimes fall short
• Efficiency concerns
• Error handling

Processing Identifiers and Literals

• Identifiers can be used in many contexts
  – The scanner cannot know when to enter an identifier into the symbol table for the current scope, or when to return a pointer to an instance from an earlier scope
• String space: an extendable block of memory used to store the text of identifiers
  – It avoids frequent calls to new or malloc, and the space overhead of storing multiple copies of the same string
• Hash table: assigns a unique serial number to each identifier (a sketch follows below)
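A minimal sketch of the two structures working together; the sizes, hash function, and entry layout are illustrative assumptions. Each distinct identifier spelling is copied once into the string space and given a unique serial number:

#include <stdio.h>
#include <string.h>

#define SPACE_SIZE 4096
#define NBUCKETS   211

static char space[SPACE_SIZE];      /* string space: one copy per spelling */
static int  space_top = 0;

static struct entry { int text; int serial; int next; } table[512];
static int nentries = 0;
static int bucket[NBUCKETS];        /* 0 = empty; otherwise entry index + 1 */

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the identifier's serial number, interning it on first sight. */
int intern(const char *id) {
    unsigned h = hash(id);
    for (int e = bucket[h]; e != 0; e = table[e - 1].next)
        if (strcmp(space + table[e - 1].text, id) == 0)
            return table[e - 1].serial;   /* already in the string space */
    table[nentries].text = space_top;     /* copy the text exactly once */
    strcpy(space + space_top, id);
    space_top += (int)strlen(id) + 1;
    table[nentries].serial = nentries + 1;
    table[nentries].next = bucket[h];
    bucket[h] = nentries + 1;
    return table[nentries++].serial;
}

int main(void) {
    printf("%d %d %d\n", intern("count"), intern("i"), intern("count"));
    /* prints "1 2 1": the second "count" reuses the stored copy */
    return 0;
}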

• Literals require processing before they are returned
  – Numeric conversion can be tricky: overflow or roundoff errors
    • Standard library routines: atoi(), atof() (a sketch follows)
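A sketch of a more careful integer conversion than atoi(): strtol() reports overflow through errno, so the scanner can flag the literal instead of silently wrapping:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert an integer literal's lexeme, detecting overflow. */
long convert_int(const char *lexeme, int *ok) {
    char *end;
    errno = 0;
    long v = strtol(lexeme, &end, 10);
    *ok = (errno != ERANGE && *end == '\0');
    return v;
}

int main(void) {
    int ok;
    long v = convert_int("12345", &ok);
    printf("%ld ok=%d\n", v, ok);        /* ok=1 */
    convert_int("99999999999999999999", &ok);
    printf("overflow: ok=%d\n", ok);     /* ok=0: literal out of range */
    return 0;
}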

• Ex. (in C): a (* b) could be either
  – A call to procedure a, or
  – A declaration of an identifier b that is a pointer variable (if a has been declared in a typedef)
• Solution: create a table of currently visible identifiers and return a special token typeid for typedef declarations

Processing Reserved Words

• Keywords: if, while, …
  – Most programming languages choose to make keywords reserved
    • To simplify parsing
    • To make programs more readable
  – Ex. (in Pascal and Ada, if keywords were not reserved):
    • begin begin; end; end; begin; end
  – Ex. (in PL/I, where keywords are not reserved)
    • An explicit call keyword
    • if if then else = then;
  – Ex. (in COBOL)
    • Several hundred reserved words, such as zero, zeros, zeroes

• How to recognize reserved words
  – By creating distinct regular expressions for each
    • Ex. (in Pascal, 35 reserved words): the number of states grows from 37 to 165
  – Neither Lex nor Flex provides a complement operator for regular expressions
    • Ex. Nonreserved identifiers: not(not(ident)|if|while|…)
    • Ex.: L|(LL)|((LLL)L+)|((L-'E')L*)|(L(L-'N')L*)|(LL(L-'D')L*)
      – Too complex!

• Treat reserved words as ordinary identifiers, and use an exception table to detect them (a sketch follows below)
  – Sorted list: for binary search
  – Hash table
  – Perfect hash functions
  – Enter reserved words into the string space in advance
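A sketch of the sorted-list variant, using the C library's bsearch(); the keyword set and token codes are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { ID = 257, BEGIN_T, ELSE_T, END_T, IF_T, WHILE_T };

struct kw { const char *name; int code; };

static const struct kw reserved[] = {       /* must stay sorted by name */
    { "begin", BEGIN_T }, { "else", ELSE_T }, { "end", END_T },
    { "if", IF_T }, { "while", WHILE_T },
};

static int cmp(const void *key, const void *elem) {
    return strcmp((const char *)key, ((const struct kw *)elem)->name);
}

/* Screen an identifier: reserved-word code, or ID if not reserved. */
int screen(const char *lexeme) {
    const struct kw *hit =
        bsearch(lexeme, reserved, sizeof reserved / sizeof reserved[0],
                sizeof reserved[0], cmp);
    return hit ? hit->code : ID;
}

int main(void) {
    printf("%d %d\n", screen("while"), screen("whilst"));
    /* "while" maps to WHILE_T; "whilst" falls through to ID */
    return 0;
}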

Using Compiler Directives and Listing Source Lines

• Compiler options may be processed either by the scanner or by subsequent compiler phases
  – In C, source inclusion and macro processing directives are typically handled by a preprocessing phase prior to scanning and parsing
  – Conditional compilation directives
  – Source line listing
    • Error messages
    • Inserting, deleting, replacing, or reformulating symbols in a source line
    • Source lines for reading and writing are not always 1-1
    • Line numbers and position markers

Terminating the Scanner

• End-of-file pseudocharacter
  – Eof (-1): what InputStream.read() returns in Java
  – Corresponds to the EndFile token
• What if the scanner is called after Eof?
  – Continue to return the EndFile token

Multicharacter Lookahead

• To look ahead beyond the next input character
  – Ex. (in Fortran)
    • DO 10 J = 1,100 (a DO loop)
    • DO 10 J = 1.100 (an assignment)
    • Space is not significant in Fortran
  – Ex. (in Pascal or Ada): 10..100
  – Ex. (in C): 12.3e+q
    • Back up and return 4 tokens
    • Or flag a syntax error

Performance Considerations

• To increase scanner speed
  – Use a scanner generator such as Flex or GLA
  – General principles
    • Try to block character-level operations whenever possible (see the sketch below)
      – Ex.: InputStream.read() vs. InputStream.read(buffer)
      – The end of a block won't usually correspond to the end of a token
        » Double-buffering (Fig. 3.16)
    • Avoid unnecessary copying of characters
      – No copying is needed from the input buffer unless we recognize a token whose text must be saved or processed
    • Use a profiling tool such as gpt, prof, gprof, or pixie to find unexpected performance bottlenecks
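A sketch of blocked input in C: one read() call fills a whole buffer rather than fetching characters one at a time. A production scanner would keep two buffers (the double-buffering of Fig. 3.16) so it can back up across a block boundary; that refinement is omitted here:

#include <stdio.h>
#include <unistd.h>

#define BLOCK 4096

static char buf[BLOCK];
static int  len = 0, pos = 0;

int next_char(void) {
    if (pos == len) {                    /* buffer exhausted: refill */
        len = (int)read(0, buf, BLOCK);  /* stdin, one syscall per block */
        pos = 0;
        if (len <= 0) return -1;         /* the Eof pseudocharacter */
    }
    return (unsigned char)buf[pos++];
}

int main(void) {
    long count = 0;
    while (next_char() != -1) count++;
    printf("%ld characters\n", count);
    return 0;
}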

Lexical Error Recovery

• Two approaches
  – Delete the characters read so far, and restart scanning
  – Delete the first character read, and resume scanning
    • A bit harder, but a bit safer

Handling Runaway Strings and Comments Using Error Tokens

• In Java, strings are not allowed to cross line boundaries
  – Introduce an error token
    • Valid string: " (Not("|Eol|\) | \" | \\)* "
    • Runaway string: " (Not("|Eol|\) | \" | \\)* Eol
  – Similar problem for multiline comments in C, C++, Java, and Pascal
    • Pascal comments: { Not(})* }
    • Runaway comment: { Not(})* Eof
    • Correct, but suspect, open comments: { (Not(})* { Not({|})*)+ }

Regular Expressions and Finite Automata

• To transform a regular expression into an equivalent FA
  – Transform the regular expression into an NFA (nondeterministic FA)
  – Transform the NFA into a DFA
• An NFA allows
  – Transitions labeled with λ
  – Multiple transitions from a state on the same character

Transforming a Regular Expression into an NFA

• A regular expression is built from the atomic regular expressions a (where a is a character in Σ) and λ by using three operations: AB, A|B, and A*
  – Atomic regular expressions (Fig. 3.19)
  – A|B (Fig. 3.20)
  – AB (Fig. 3.21)
  – A* (Fig. 3.22)

Creating the DFA

• Subset construction algorithm (Fig. 3.23)
  – D will be in state {x,y,z} iff N could be in any of the states x, y, or z
  – CLOSE(): computes the set of states reachable by following only λ transitions (a sketch follows below)
  – Ex. (Fig. 3.24 & Fig. 3.25)

(Algorithm listing of Fig. 3.23: MakeDeterministic, RecordState, Close)
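A small sketch of the CLOSE() step in C on a made-up NFA fragment: starting from a set of NFA states, keep adding every state reachable by λ-moves alone:

#include <stdio.h>

#define NSTATES 6
#define MAXEPS  3

/* eps[s] lists state s's λ-successors, terminated by -1 (made-up NFA). */
static const int eps[NSTATES][MAXEPS] = {
    {1, 3, -1}, {2, -1}, {-1}, {4, -1}, {-1}, {-1}
};

/* CLOSE: extend the membership vector `in` with all λ-reachable states. */
void close_set(int in[NSTATES]) {
    int work[NSTATES], top = 0;
    for (int s = 0; s < NSTATES; s++)
        if (in[s]) work[top++] = s;
    while (top > 0) {
        int s = work[--top];
        for (int i = 0; eps[s][i] >= 0; i++)
            if (!in[eps[s][i]]) {
                in[eps[s][i]] = 1;       /* newly reachable by λ alone */
                work[top++] = eps[s][i];
            }
    }
}

int main(void) {
    int set[NSTATES] = {1};              /* start from NFA state 0 */
    close_set(set);
    printf("CLOSE({0}) =");
    for (int s = 0; s < NSTATES; s++)
        if (set[s]) printf(" %d", s);
    printf("\n");                        /* prints: CLOSE({0}) = 0 1 2 3 4 */
    return 0;
}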

• The resulting DFA can sometimes be much larger than the original NFA
  – If the NFA has n states, the DFA may have as many as 2^n states
  – Fortunately, the NFAs for regular expressions in programming language tokens do not exhibit this problem

Optimizing Finite Automata

• For every DFA, there is a unique smallest equivalent DFA
  – Unreachable states: states that cannot be reached from the start state
  – Dead states: states that cannot reach any accepting state
• We optimize a DFA by merging states we know to be equivalent
  – If two states s1 and s2 are equivalent, all transitions to s2 can be replaced with transitions to s1
• How to decide what states to merge?
  – A greedy approach
    • Start with two merged states: all accepting states and all nonaccepting states
    • If all constituents of a merged state do not agree on the transition for some character, the merged state is split into two or more smaller states that do agree
  – Ex. (Fig. 3.26 & Fig. 3.28)
  – Split algorithm (Fig. 3.27); a sketch in C follows below

(Algorithm listing of Fig. 3.27: Split, TargetBlock)
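A sketch in C of the split-based refinement described above: begin with accepting and nonaccepting blocks, then repeatedly regroup states by a signature made of their own block and the blocks their transitions reach. The four-state DFA is made up for the example; states 1 and 2 end up merged:

#include <stdio.h>
#include <string.h>

#define NSTATES 4
#define NSYMS   2

/* A made-up DFA over a two-symbol alphabet; states 1 and 2 behave
   identically and should end up in the same block. */
static const int delta[NSTATES][NSYMS] = {
    {1, 2}, {3, 0}, {3, 0}, {3, 3}
};
static const int accepting[NSTATES] = {0, 0, 0, 1};

int main(void) {
    int cls[NSTATES], newcls[NSTATES], sig[NSTATES][NSYMS + 1];

    /* Start with two blocks: nonaccepting (0) and accepting (1). */
    for (int s = 0; s < NSTATES; s++) cls[s] = accepting[s];

    for (;;) {
        /* Signature: my block, plus the block each character leads to;
           states whose signatures disagree get split apart. */
        for (int s = 0; s < NSTATES; s++) {
            sig[s][0] = cls[s];
            for (int c = 0; c < NSYMS; c++)
                sig[s][c + 1] = cls[delta[s][c]];
        }
        /* Regroup: states with equal signatures share a block. */
        int nblocks = 0;
        for (int s = 0; s < NSTATES; s++) {
            int t;
            for (t = 0; t < s; t++)
                if (memcmp(sig[s], sig[t], sizeof sig[s]) == 0) break;
            newcls[s] = (t < s) ? newcls[t] : nblocks++;
        }
        if (memcmp(cls, newcls, sizeof cls) == 0) break;   /* stable */
        memcpy(cls, newcls, sizeof cls);
    }
    for (int s = 0; s < NSTATES; s++)
        printf("state %d -> block %d\n", s, cls[s]);  /* blocks 0 1 1 2 */
    return 0;
}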

Translating Finite Automata into Regular Expressions

• Useful when you already have an FA and need a regular expression to program Lex
  – FindRE algorithm (Fig. 3.31)
  – Start with an FA that has a start state and a single accepting state
  – Remove states one by one using three simple transformations
    • T1: R|S (or), Fig. 3.30(a)
    • T2: XY (bypass), Fig. 3.30(b)
    • T3: XZ*Y (bypass), Fig. 3.30(c)
  – Until we have an FA with a single transition from the start state to a single accepting state
• Ex. (Fig. 3.32)
  – b*ab(a|b|λ) | b*aa | b*a
  – b*aba | b*abb | b*ab | b*aa | b*a
  – b*a(ba|bb|b|a|λ)

Thanks for Your Attention!
