CS375
Compilers
Lexical Analysis
4th February, 2010
Outline
• Overview of a compiler.
• What is lexical analysis?
• Writing a Lexer
– Specifying tokens: regular expressions
– Converting regular expressions to NFA, DFA
• Optimizations.
How It Works
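[Figure: compiler front end, showing each phase and the program representation it produces for the statement if (b == 0) a = b;]
Source code (character stream): if (b == 0) a = b;
– Lexical Analysis produces the token stream: if ( b == 0 ) a = b ;
– Syntax Analysis (Parsing) produces the abstract syntax tree (AST): an if node with subtrees == (b, 0) and = (a, b)
– Semantic Analysis produces the decorated AST: the same tree annotated with types (boolean for ==, int for b and 0, int lvalue for a)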
What is a lexical analyzer
• What?
– Reads in a stream of characters and groups them into “tokens” or “lexemes”.
– Language definition describes what tokens are valid.
• Why?
– Makes writing the parser a lot easier: the parser operates on “tokens”.
– Isolates input-dependent functionality such as character codes, EOF, and newline characters.
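First Step: Lexical Analysis
[Figure: the compiler pipeline with the first phase highlighted. Lexical Analysis turns the source code (character stream) if (b == 0) a = b; into the token stream if ( b == 0 ) a = b ; which then feeds Syntax Analysis and Semantic Analysis.]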
What should it do?
[Figure: a box marked “?” takes an input string w and a token description, and answers “Token? {Yes/No}”.]
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.
STARTING OFF
Tokens
• Logical grouping of characters.
• Identifiers: x y11 elsen _i00
• Keywords: if else while break
• Constants:
– Integer: 2 1000 -500 5L 0x777
– Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10
– String: "x" "He said, \"Are you?\"\n"
– Character: 'c' '\000'
• Symbols: + * { } ++ < << [ ] >=
• Whitespace (typically recognized and discarded):
– Comment: /** don't change this **/
– Space: <space>
– Format characters: <newline> <return>
Ad-hoc Lexer
• Hand-write code to generate tokens
• How to read identifier tokens?

    Token readIdentifier() {
        String id = "";
        while (true) {
            char c = input.read();
            if (!identifierChar(c))
                return new Token(ID, id, lineNumber);
            id = id + String(c);
        }
    }

• Problems
– How to start?
– What to do with following character?
– How to avoid quadratic complexity of repeated concatenation?
– How to recognize keywords?
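One possible answer to the last two problems, as a minimal sketch rather than the course's reference solution: collect characters in a StringBuilder (appends are amortized constant time) and look the finished lexeme up in a keyword table. The class name IdentifierScanner, the KEYWORDS table and the "KIND:lexeme" return format are assumptions made for illustration; the input is a plain String instead of the slides' input stream.

```java
import java.util.Map;

// Hypothetical helper, not the Lexer class from the slides.
class IdentifierScanner {
    // Assumed keyword table, using the keywords listed on the Tokens slide.
    static final Map<String, String> KEYWORDS = Map.of(
        "if", "IF", "else", "ELSE", "while", "WHILE", "break", "BREAK");

    static boolean identifierChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Scans an identifier or keyword starting at `start`; returns "KIND:lexeme".
    static String readIdentifier(String input, int start) {
        StringBuilder id = new StringBuilder();   // avoids quadratic concatenation
        int pos = start;
        while (pos < input.length() && identifierChar(input.charAt(pos))) {
            id.append(input.charAt(pos));
            pos++;                                // the non-identifier char is left unread
        }
        String lexeme = id.toString();
        // Keywords are recognized after the fact by a table lookup.
        return KEYWORDS.getOrDefault(lexeme, "ID") + ":" + lexeme;
    }

    public static void main(String[] args) {
        System.out.println(readIdentifier("elsen = 1", 0));
        System.out.println(readIdentifier("else x", 0));
    }
}
```

Because the whole lexeme is read before the lookup, elsen is classified as an identifier even though it starts with the keyword else.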
Look-ahead Character
• Scan text one character at a time
• Use look-ahead character (next) to determine what kind of token to read and when the current token ends

    char next;
    …
    while (identifierChar(next)) {
        id = id + String(next);
        next = input.read();
    }

[Figure: the input characters e l s e n with a pointer labeled next (lookahead)]
Ad-hoc Lexer: Top-level Loop
    class Lexer {
        InputStream s;
        char next;
        Lexer(InputStream _s) { s = _s; next = s.read(); }
        Token nextToken() {
            if (identifierFirstChar(next))   // starts with a letter
                return readIdentifier();     // is an identifier
            if (numericFirstChar(next))      // starts with a digit
                return readNumber();         // is a number
            if (next == '\"') return readStringConst();
            …
        }
    }
Problems
• Might not know what kind of token we are going to read from seeing first character
– if token begins with "i", is it an identifier? (what about int, if)
– if token begins with "2", is it an integer constant?
– interleaved tokenizer code hard to write correctly, harder to maintain
– in general, unbounded look-ahead may be needed
Problems (cont.)
• How to specify (unambiguously) tokens.
• Once specified, how to implement them in a systematic way?
• How to implement them efficiently?
Problems (cont.)
• For instance, consider:
– How to describe tokens unambiguously
    2.e0  20.e-01  2.0000
    ""  "x"  "\\"  "\"\'"
– How to break up text into tokens
    if (x == 0) a = x<<1;
    if (x == 0) a = x<1;
– How to tokenize efficiently
  • tokens may have similar prefixes
  • want to look at each character ~1 time
Principled Approach
• Need a principled approach
1. Lexer Generators
– lexer generator that generates efficient tokenizer automatically (e.g., lex, flex, JLex), a.k.a. scanner generator
2. Your own Lexer
– Describe programming language's tokens with a set of regular expressions
– Generate scanning automaton from that set of regular expressions
Top level idea…
• Have a formal language to describe tokens.
– Use regular expressions.
• Have a mechanical way of converting this formal description to code.
– Convert regular expressions to finite automata (acceptors/state machines)
• Run the code on actual inputs.
– Simulate the finite automaton.
An Example : Integers
• Consider integers.
• We can describe integers using the following grammar:
– Num -> '-' Pos
– Num -> Pos
– Pos -> 0 | 1 | … | 9
– Pos -> (0 | 1 | … | 9) Pos
• Or in a more compact notation, we have:
– Num -> -? [0-9]+
An Example : Integers
• Using Num -> -? [0-9]+ we can generate integers such as -12, 23, 0.
• We can also represent the above regular expression as a state machine.
– This is useful for simulating the regular expression.
An Example : Integers
• The Non-deterministic Finite Automaton is as follows.
• We can verify that -123, 65, 0 are accepted by the state machine.
• But which path to take? ε-paths?
[Figure: NFA for -?[0-9]+ with states 0 to 3 and edges labeled -, 0-9 and ε]
An Example : Integers
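• The NFA can be converted to an equivalent Deterministic FA as below.
– We shall see later how.
• It accepts the same tokens.
– -123
– 65
– 0
[Figure: DFA with states {0,1}, {1} and {2,3}: {0,1} goes to {1} on -, {0,1} and {1} go to {2,3} on 0-9, and {2,3} loops on 0-9]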
An Example : Integers
• The deterministic finite automaton makes implementation much easier, as we shall see later.
• So, all we have to do is:
– Express tokens as regular expressions
– Convert RE to NFA
– Convert NFA to DFA
– Simulate the DFA on inputs
The larger picture…
[Figure: pipeline. A regular expression R describing tokens goes through RE ⇒ NFA conversion and NFA ⇒ DFA conversion; DFA simulation then reads an input string w and answers Yes if w is a valid token, No if not.]
QUICK LANGUAGE THEORY REVIEW…
Language Theory Review
• Let Σ be a finite set called an alphabet
– a ∈ Σ is called a symbol
• Σ* is the set of all finite strings consisting of symbols from Σ
• A subset L ⊆ Σ* is called a language
• If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively
Language Theory Review, ctd.
• Let L ⊆ Σ* be a language
• Then
– L^0 = {ε} (the set containing only the empty string)
– L^(n+1) = L L^n for all n ≥ 0
• Examples
– if L = {a, b} then
  • L^1 = L = {a, b}
  • L^2 = {aa, ab, ba, bb}
  • L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
  • …
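The definition of language powers can be checked mechanically. The following is a small sketch, assuming finite languages represented as sets of Java strings; the class LanguagePower and its method names are made up for this illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for experimenting with finite languages.
class LanguagePower {
    // Concatenation L1 L2: every string of L1 followed by every string of L2.
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new TreeSet<>();
        for (String x : l1)
            for (String y : l2)
                out.add(x + y);
        return out;
    }

    // L^0 = {""} and L^(n+1) = L L^n, applied n times.
    static Set<String> power(Set<String> l, int n) {
        Set<String> result = new TreeSet<>();
        result.add("");                       // L^0 contains only the empty string
        for (int i = 0; i < n; i++)
            result = concat(l, result);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(power(Set.of("a", "b"), 3));
    }
}
```

For L = {a, b}, power(L, 3) has 2^3 = 8 distinct strings, which is how the duplicate-free list for L^3 can be verified.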
Syntax of Regular Expressions
• Set of regular expressions (RE) over alphabet Σ is defined inductively by
– Let a ∈ Σ and R, S ∈ RE. Then:
  • a ∈ RE
  • ε ∈ RE
  • ∅ ∈ RE
  • R|S ∈ RE
  • RS ∈ RE
  • R* ∈ RE
• In concrete syntax, we also use precedence rules, parentheses, and abbreviations
Semantics of Regular Expressions
• Regular expression R ∈ RE denotes the language L(R) ⊆ Σ* given according to the inductive structure of R:
– L(a) = {a}  the string “a”
– L(ε) = {“”}  the empty string
– L(∅) = {}  the empty set
– L(R|S) = L(R) ∪ L(S)  alternation
– L(RS) = L(R) L(S)  concatenation
– L(R*) = {“”} ∪ L(R) ∪ L(R)^2 ∪ L(R)^3 ∪ L(R)^4 ∪ …  Kleene closure
Simple Examples
• L(R) = the “language” defined by R
– L( abc ) = { abc }
– L( hello|goodbye ) = {hello, goodbye}
  • OR operator: L(R|S) contains every string of L(R) and every string of L(S), so L(a|b) = {a, b}.
– L( 1(0|1)* ) = all binary numerals beginning with 1 (the non-zero binary numerals without leading zeros)
  • Kleene star: zero or more repetitions of strings matched by the parenthesized expression.
Convenient RE Shorthand
R+        one or more strings from L(R): R(R*)
R?        optional R: (R|ε)
[abce]    one of the listed characters: (a|b|c|e)
[a-z]     one character from this range: (a|b|c|d|e|…|y|z)
[^ab]     anything but one of the listed chars
[^a-z]    one character not from this range
"abc"     the string “abc”
\(        the character '('
. . .
id=R      named non-recursive regular expressions
More Examples
Regular Expression R                    Strings in L(R)
digit = [0-9]                           “0” “1” “2” “3” …
posint = digit+                         “8” “412” …
int = -? posint                         “-42” “1024” …
real = int ((. posint)?)                “-1.56” “12” “1.0”
     = (-|ε)([0-9]+)((. [0-9]+)|ε)
[a-zA-Z_][a-zA-Z0-9_]*                  C identifiers
else                                    the keyword “else”
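The named expressions above can be tried out with java.util.regex. This is only a sketch: the class TokenPatterns and the helper isToken are invented for the example, and the syntax is Java's regex dialect rather than the slide notation, so the literal dot in real has to be escaped.

```java
import java.util.regex.Pattern;

// Hypothetical holder for the example token definitions.
class TokenPatterns {
    static final String DIGIT  = "[0-9]";
    static final String POSINT = DIGIT + "+";
    static final String INT    = "-?" + POSINT;
    static final String REAL   = INT + "(\\." + POSINT + ")?";   // "\\." is a literal dot
    static final String IDENT  = "[a-zA-Z_][a-zA-Z0-9_]*";

    // True if the whole string w is in the language of the regex.
    static boolean isToken(String regex, String w) {
        return Pattern.matches(regex, w);
    }

    public static void main(String[] args) {
        System.out.println(isToken(INT, "-42"));
        System.out.println(isToken(REAL, "-1.56"));
        System.out.println(isToken(IDENT, "_i00"));
    }
}
```

Building each pattern from the previous one by string concatenation mirrors the "named non-recursive regular expressions" shorthand.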
Historical Anomalies
• PL/I
– Keywords not reserved
  • IF IF THEN THEN ELSE ELSE;
• FORTRAN
– Whitespace stripped out prior to scanning
  • DO 123 I = 1 (an assignment to the variable DO123I)
  • DO 123 I = 1 , 2 (the start of a DO loop)
• By and large, modern language design intentionally makes scanning easier
WRITING A LEXER
Writing a Lexer
• Regular Expressions can be very useful in describing languages (tokens).
– Use an automatic Lexer generator (Flex, Lex) to generate a Lexer from language specification.
– Have a systematic way of writing a Lexer from a specification such as regular expressions.
WRITING YOUR OWN LEXER
How To Use Regular Expressions
• Given R ∈ RE and input string w, need a mechanism to determine if w ∈ L(R)
• Such a mechanism is called an acceptor
[Figure: a box marked “?” takes an input string w (from the program) and R ∈ RE (describing a token family), and answers Yes if w is a token, No if w is not a token]
Acceptors
• Acceptor determines if an input string belongs to a language L
• Finite Automata are acceptors for languages described by regular expressions
[Figure: an acceptor (here a finite automaton) takes an input string w and a description of a language L, and answers Yes if w ∈ L, No if w ∉ L]
Finite Automata
• Informally, a finite automaton consists of:
– A finite set of states
– Transitions between states
– An initial state (start state)
– A set of final states (accepting states)
• Two kinds of finite automata:
– Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character
– Non-deterministic finite automata (NFA): there may be multiple possible choices, and some “spontaneous” transitions without input
DFA Example
• Finite automaton that accepts the strings in the language denoted by regular expression ab*a
• Can be represented as a graph or a transition table.
– A graph:
  • Read symbol
  • Follow outgoing edge
[Figure: DFA with states 0, 1, 2: 0 goes to 1 on a, 1 loops on b, 1 goes to 2 on a; state 2 is accepting]
DFA Example (cont.)
• Representing FA as transition tables makes the implementation very easy.
• The above FA can be represented as:

         a      b
    0    1      Error
    1    2      1
    2    Error  Error

– Current state and current symbol determine next state.
– Until
  • error state.
  • End of input.
Simulating the DFA
• Determine if the DFA accepts an input string

    trans_table[NumSTATES][NumCHARS]
    accept_states[NumSTATES]
    state = INITIAL

    while (state != Error) {
        c = input.read();
        if (c == EOF) break;
        state = trans_table[state][c];
    }
    return (state != Error) && accept_states[state];

[Figure: the DFA for ab*a from the previous example]
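The simulation loop can be written out in Java for the ab*a automaton. A minimal sketch, assuming the input is a String rather than a stream and that the alphabet is just a and b; the class name TableDrivenDfa and the column helper are illustrative choices, not part of the slides.

```java
// Hypothetical table-driven acceptor for the regular expression ab*a.
class TableDrivenDfa {
    static final int ERROR = -1;

    // Rows are states 0..2; columns are the symbols a (0) and b (1).
    static final int[][] TRANS = {
        {1, ERROR},       // state 0
        {2, 1},           // state 1
        {ERROR, ERROR},   // state 2
    };
    static final boolean[] ACCEPT = {false, false, true};

    // Maps an input character to a table column, or -1 if it is outside the alphabet.
    static int column(char c) {
        return c == 'a' ? 0 : c == 'b' ? 1 : -1;
    }

    static boolean accepts(String w) {
        int state = 0;
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            int col = column(w.charAt(i));
            state = (col < 0) ? ERROR : TRANS[state][col];
        }
        return state != ERROR && ACCEPT[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("abba"));
        System.out.println(accepts("ab"));
    }
}
```

Each input character costs one table lookup, which is the "look at each character about once" property asked for earlier.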
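RE ⇒ Finite automaton?
• Can we build a finite automaton for every regular expression?
• Strategy: build the finite automaton inductively, based on the definition of regular expressions
[Figure: base cases. The automaton for a symbol a has a single edge labeled a; the automaton for ε has a single ε-edge]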
RE ⇒ Finite automaton?
• Alternation: R|S
• Concatenation: RS
• Recall ε implies optional move.
[Figure: for R|S, the R automaton and the S automaton side by side, joined by ε-edges; for RS, the R automaton followed by the S automaton]
NFA Definition
• A non-deterministic finite automaton (NFA) is an automaton where:
– There may be ε-transitions (transitions that do not consume input characters)
– There may be multiple transitions from the same state on the same input character
[Figure: example NFA with edges labeled a and b]
RE ⇒ NFA intuition
-?[0-9]+
[Figure: NFA for -?[0-9]+ with an optional - edge followed by 0-9 edges]
When to take the ε-path?
NFA construction (Thompson)
• NFA only needs one stop state (why?)
• Canonical NFA form: [Figure: canonical NFA form]
• Use this canonical form to inductively construct NFAs for regular expressions
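Inductive NFA Construction
[Figure: Thompson constructions.
 RS: the NFA for R followed by the NFA for S.
 R|S: a new start state with ε-edges into the NFAs for R and S, and ε-edges from each into a new stop state.
 R*: the NFA for R wrapped in ε-edges that allow skipping it or repeating it.]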
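The inductive construction can be sketched as a small builder in which every fragment has exactly one start and one stop state. Everything here is an illustrative assumption: the names ThompsonBuilder and Frag, the encoding of ε as -1, and the exact placement of ε-edges, which follows the textbook Thompson construction and may differ in detail from the figure. The accepts method is a simple set-based simulation included only to check the result.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical builder, not code from the slides.
class ThompsonBuilder {
    static final int EPS = -1;                    // assumed encoding of an ε label

    // An NFA fragment in canonical form: one start state, one stop state.
    static final class Frag {
        final int start, stop;
        Frag(int start, int stop) { this.start = start; this.stop = stop; }
    }

    final List<int[]> edges = new ArrayList<>();  // each edge is {from, label, to}
    int stateCount = 0;

    Frag fresh() { return new Frag(stateCount++, stateCount++); }
    void edge(int from, int label, int to) { edges.add(new int[] {from, label, to}); }

    Frag symbol(char a) {                         // base case: one edge labeled a
        Frag f = fresh();
        edge(f.start, a, f.stop);
        return f;
    }
    Frag concat(Frag r, Frag s) {                 // RS
        edge(r.stop, EPS, s.start);
        return new Frag(r.start, s.stop);
    }
    Frag alt(Frag r, Frag s) {                    // R|S
        Frag f = fresh();
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, s.start);
        edge(r.stop, EPS, f.stop);
        edge(s.stop, EPS, f.stop);
        return f;
    }
    Frag star(Frag r) {                           // R*
        Frag f = fresh();
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, f.stop);               // zero repetitions
        edge(r.stop, EPS, r.start);               // one more repetition
        edge(r.stop, EPS, f.stop);
        return f;
    }

    // All states reachable from `states` by zero or more ε-edges.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> t = new HashSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges)
                if (e[1] == EPS && t.contains(e[0]) && t.add(e[2])) changed = true;
        }
        return t;
    }

    // Checks the construction by tracking the set of reachable states.
    boolean accepts(Frag f, String w) {
        Set<Integer> current = closure(Set.of(f.start));
        for (char c : w.toCharArray()) {
            Set<Integer> moved = new HashSet<>();
            for (int[] e : edges)
                if (e[1] == c && current.contains(e[0])) moved.add(e[2]);
            current = closure(moved);
        }
        return current.contains(f.stop);
    }
}
```

For example, ab*a is built as concat(concat(symbol('a'), star(symbol('b'))), symbol('a')), and the resulting fragment accepts aa and abba but not ab.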
DFA vs NFA
• DFA: action of automaton on each input symbol is fully determined
– obvious table-driven implementation
• NFA:
– automaton may have choice on each step
– automaton accepts a string if there is any way to make choices to arrive at accepting state
– every path from start state to an accept state is a string accepted by automaton
– not obvious how to implement!
Simulating an NFA
• Problem: how to execute NFA?
  “strings accepted are those for which there is some corresponding path from start state to an accept state”
• Solution: search all paths in graph consistent with the string in parallel
– Keep track of the subset of NFA states that search could be in after seeing string prefix
– “Multiple fingers” pointing to graph
Example
• Input string: -23
• NFA states:
– Start: {0,1}
– “-”: {1}
– “2”: {2, 3}
– “3”: {2, 3}
• But this is very difficult to implement directly.
[Figure: the NFA for -?[0-9]+ with states 0 to 3]
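The "multiple fingers" idea can be sketched as follows. The edge list is an assumption: the figure's exact wiring is not recoverable from the text, so the edges below are one wiring chosen to reproduce the state sets listed in the example. The class name NfaSimulation and the label encoding are likewise made up.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulation of an NFA for -?[0-9]+ by tracking sets of states.
class NfaSimulation {
    static final int EPS = 0, DIGIT = 1, MINUS = 2;     // edge labels
    // Assumed wiring {from, label, to}, consistent with the sets in the example.
    static final int[][] EDGES = {
        {0, EPS, 1}, {0, MINUS, 1}, {1, DIGIT, 2}, {2, DIGIT, 2}, {2, EPS, 3},
    };
    static final int ACCEPT = 3;

    static int label(char c) {
        if (c == '-') return MINUS;
        if (c >= '0' && c <= '9') return DIGIT;
        return -1;                                      // matches no edge
    }

    // States reachable from `states` by zero or more ε-edges.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> t = new TreeSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : EDGES)
                if (e[1] == EPS && t.contains(e[0]) && t.add(e[2])) changed = true;
        }
        return t;
    }

    // Moves every "finger" across an edge labeled with c, then follows ε-edges.
    static Set<Integer> step(Set<Integer> states, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int[] e : EDGES)
            if (e[1] == label(c) && states.contains(e[0])) moved.add(e[2]);
        return closure(moved);
    }

    static Set<Integer> start() { return closure(Set.of(0)); }

    static boolean accepts(String w) {
        Set<Integer> current = start();
        for (char c : w.toCharArray()) current = step(current, c);
        return current.contains(ACCEPT);
    }

    public static void main(String[] args) {
        Set<Integer> s = start();
        System.out.println("start: " + s);
        for (char c : "-23".toCharArray()) {
            s = step(s, c);
            System.out.println(c + ": " + s);
        }
    }
}
```

On the input -23 the tracked sets are {0,1}, {1}, {2,3}, {2,3}. Every character costs a pass over the edges plus a closure computation, which is the overhead the DFA conversion removes.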
NFA ⇒ DFA conversion
• Can convert NFA directly to DFA by same approach
• Create one DFA state for each distinct subset of NFA states that could arise
• States: {0,1}, {1}, {2, 3}
• Called the “subset construction”
[Figure: the NFA with states 0 to 3 next to the resulting DFA with states {0,1}, {1} and {2,3} and edges labeled - and 0-9]
Algorithm
• For a set S of states in the NFA, compute ε-closure(S) = set of states reachable from states in S by zero or more ε-transitions

    T = S
    Repeat
        T = T ∪ {s' | s ∈ T, (s,s') is ε-transition}
    Until T remains unchanged
    ε-closure(S) = T

• For a set S of ε-closed states in the NFA, compute DFAedge(S,c) = the set of states reachable from states in S by one transition on symbol c followed by any number of ε-transitions

    DFAedge(S,c) = ε-closure( {s' | s ∈ S, (s,s') is c-transition} )
Algorithm

    DFA-initial-state = ε-closure(NFA-initial-state)
    DFA states = { DFA-initial-state }
    Worklist = { DFA-initial-state }
    While ( Worklist not empty )
        Pick and remove state S from Worklist
        For each character c
            S' = DFAedge(S,c)
            if (S' not in DFA states)
                Add S' to DFA states and worklist
            Add an edge (S, S') labeled c in DFA
    For each DFA-state S
        If S contains an NFA-final state
            Mark S as DFA-final-state
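The worklist algorithm can be sketched in Java on the -?[0-9]+ example. The NFA wiring is an assumption (one set of edges consistent with the state sets shown earlier), as are the class name SubsetConstruction and the label encoding. One deliberate difference from the pseudocode: an empty target set is treated as the Error state and is not added as a DFA state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical subset construction over a small hard-coded NFA.
class SubsetConstruction {
    static final int EPS = 0, DIGIT = 1, MINUS = 2;      // edge labels
    static final int[] ALPHABET = {MINUS, DIGIT};
    // Assumed NFA wiring {from, label, to} for -?[0-9]+.
    static final int[][] EDGES = {
        {0, EPS, 1}, {0, MINUS, 1}, {1, DIGIT, 2}, {2, DIGIT, 2}, {2, EPS, 3},
    };
    static final int NFA_FINAL = 3;

    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> t = new TreeSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : EDGES)
                if (e[1] == EPS && t.contains(e[0]) && t.add(e[2])) changed = true;
        }
        return t;
    }

    static Set<Integer> dfaEdge(Set<Integer> s, int c) {
        Set<Integer> moved = new TreeSet<>();
        for (int[] e : EDGES)
            if (e[1] == c && s.contains(e[0])) moved.add(e[2]);
        return closure(moved);
    }

    // Returns the DFA as: state (a set of NFA states) -> symbol -> target state.
    static Map<Set<Integer>, Map<Integer, Set<Integer>>> build() {
        Map<Set<Integer>, Map<Integer, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>();
        Set<Integer> start = closure(Set.of(0));
        dfa.put(start, new HashMap<>());
        worklist.add(start);
        while (!worklist.isEmpty()) {
            Set<Integer> s = worklist.remove();
            for (int c : ALPHABET) {
                Set<Integer> target = dfaEdge(s, c);
                if (target.isEmpty()) continue;           // Error: no DFA state created
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new HashMap<>());
                    worklist.add(target);
                }
                dfa.get(s).put(c, target);
            }
        }
        return dfa;
    }

    static boolean isFinal(Set<Integer> dfaState) {
        return dfaState.contains(NFA_FINAL);
    }

    public static void main(String[] args) {
        build().forEach((state, out) -> System.out.println(state + " -> " + out));
    }
}
```

On this NFA the construction produces the three states {0,1}, {1} and {2,3}, of which only {2,3} is final.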
Putting the Pieces Together
[Figure: pipeline. Regular expression R goes through RE ⇒ NFA conversion and NFA ⇒ DFA conversion; DFA simulation then reads input string w and answers Yes if w ∈ L(R), No if w ∉ L(R).]
OPTIMIZATIONS
State minimization
• State Minimization is an optimization that converts a DFA to another DFA that recognizes the same language and has a minimum number of states.
– Divide all states into “equivalence” groups.
  • Two states p and q are equivalent if for all symbols, the outgoing edges either lead to error or the same destination group.
  • Collapse the states of a group to a single state (instead of p and q, have a single state).
State Minimization (Equivalence)
• More formally, all states in group Gi are equivalent iff for any two states p and q in Gi, and for every symbol σ, transition(p,σ) and transition(q,σ) are either both Error, or are states in the same group Gj (possibly Gi itself).
• For example:
[Figure: states p, q and r in group Gi, each with edges labeled a, b and c leading into groups Gj and Gk (or Error)]
State Minimization
• Step 1. Partition states of original DFA into maximal-sized groups of “equivalent” states S = {G1, … ,Gn}
• Step 2. Construct the minimized DFA such that there is a state for each group Gi
[Figure: example DFA with states 0 to 4 and edges labeled a and b]
DFA Minimization
• Step 1. Partition states of original DFA into maximal-sized groups of equivalent states
– Step 1a. Discard states not reachable from start state
– Step 1b. Initial partition is S = {Final, Non-final}
– Step 1c. Repeatedly refine the partition {G1,…,Gn} while some group Gi contains states p and q such that for some symbol σ, transitions from p and q on σ are to different groups
[Figure: p and q both in Gi; on a, p goes to Gj and q goes to Gk (or Error), with j ≠ k, so Gi must be split into Gi and Gi']
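Steps 1b and 1c can be sketched as a signature-based refinement loop (often attributed to Moore). This is one possible realization, not necessarily the one intended by the slides; Step 1a (removing unreachable states) is assumed to have been done already. The class name DfaMinimizer and the five-line example DFA in main are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical partition refinement for DFA minimization.
class DfaMinimizer {
    static final int ERROR = -1;

    // Returns group[s] = index of the equivalence group containing state s.
    static int[] partition(int[][] trans, boolean[] accept) {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = accept[s] ? 1 : 0;   // Step 1b
        int count = 0;
        while (true) {                                              // Step 1c
            // Two states stay together only if they are in the same group now
            // and every symbol leads them to the same group (or both to Error).
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] next = new int[n];
            for (int s = 0; s < n; s++) {
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int target : trans[s])
                    signature.add(target == ERROR ? ERROR : group[target]);
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                next[s] = id;
            }
            if (ids.size() == count) return next;   // no group was split: done
            count = ids.size();
            group = next;
        }
    }

    public static void main(String[] args) {
        // Assumed example over {a, b}: ab*a with the b-loop unrolled once,
        // so states 1 and 3 behave identically.
        int[][] trans = {{1, ERROR}, {2, 3}, {ERROR, ERROR}, {2, 1}};
        boolean[] accept = {false, false, true, false};
        System.out.println(java.util.Arrays.toString(partition(trans, accept)));
    }
}
```

In the example, states 1 and 3 end up in the same group, so the minimized DFA has three states, one per group (Step 2).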
After state minimization.
• We have an optimized acceptor.
[Figure: pipeline. Regular expression R goes through RE ⇒ NFA, NFA ⇒ DFA and Minimize DFA; DFA simulation then reads input string w and answers Yes if w ∈ L(R), No if w ∉ L(R).]
Lexical Analyzers vs Acceptors
• We really need a Lexer, not an acceptor.
• Lexical analyzers use the same mechanism, but they:
– Have multiple RE descriptions for multiple tokens
– Output a sequence of matching tokens (or an error)
– Always return the longest matching token
– For multiple longest matching tokens, use rule priorities
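The longest-match and priority rules can be illustrated without building automata, by trying each rule at the current position with java.util.regex. This is a sketch under assumptions: the rule set, the class name MaximalMunchLexer and the "KIND:lexeme" output format are invented, and a real lexer would run one combined DFA instead of one matcher per rule. Java's regex engine does not guarantee the longest match for alternations in general; the patterns below are chosen so that its match is the longest one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer: longest match wins, ties go to the earlier rule.
class MaximalMunchLexer {
    // Rules in priority order; names and patterns are assumptions.
    static final String[] NAMES = {"WS", "IF", "ID", "INT"};
    static final Pattern[] RULES = {
        Pattern.compile("[ \\t\\n\\r]+"),
        Pattern.compile("if"),
        Pattern.compile("[A-Za-z][A-Za-z0-9_]*"),
        Pattern.compile("0|[1-9][0-9]*"),
    };

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int bestLen = 0, bestRule = -1;
            for (int r = 0; r < RULES.length; r++) {
                Matcher m = RULES[r].matcher(input).region(pos, input.length());
                // Strictly longer matches replace the current best, so on a tie
                // the earlier (higher priority) rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = r;
                }
            }
            if (bestRule < 0)
                throw new IllegalArgumentException("Unexpected character at " + pos);
            if (!NAMES[bestRule].equals("WS"))            // whitespace is discarded
                tokens.add(NAMES[bestRule] + ":" + input.substring(pos, pos + bestLen));
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if ifx 10"));
    }
}
```

On the input if ifx 10, the first lexeme matches both IF and ID with length 2 and the higher-priority IF wins, while ifx is an ID because the identifier rule matches one character more.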
Lexical Analyzers
[Figure: pipeline. The REs for all valid tokens, R1 … Rn, go through RE ⇒ NFA, NFA ⇒ DFA and Minimize DFA; DFA simulation then reads the program as a character stream and emits a token stream (and errors).]
Handling Multiple REs
[Figure: separate NFAs for whitespace, identifier, number and keywords are combined into one minimized DFA]
• Construct one NFA for each RE
• Associate the final state of each NFA with the given RE
• Combine NFAs for all REs into one NFA
• Convert NFA to minimized DFA, associating each final DFA state with the highest priority RE of the corresponding NFA states
Using Roll Back
Consider three REs: {aa, ba, aabb} and input: aaba
• Reach state 3 with no transition on next character a
• Roll input back to position on entering state 2 (i.e., having read aa)
• Emit token for aa
• On next call to scanner, start in state 0 again with input ba
[Figure: DFA with states 0 to 6. The path 0 → 1 → 2 → 3 → 4 reads a, a, b, b; the path 0 → 5 → 6 reads b, a. States 2, 4 and 6 accept aa, aabb and ba.]
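Roll back amounts to remembering the last accepting state seen and the input position at that moment. The sketch below hard-codes a DFA for {aa, ba, aabb}; the state numbering follows the description above where it is given (states 2 and 3) and is otherwise an assumption, as are the class name RollBackScanner and the use of a String as input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner showing roll back to the last accepting position.
class RollBackScanner {
    static final int ERROR = -1;

    // Columns: a, b. Assumed numbering: 0-a->1-a->2-b->3-b->4 and 0-b->5-a->6.
    static final int[][] TRANS = {
        {1, 5},           // 0
        {2, ERROR},       // 1
        {ERROR, 3},       // 2: accepts aa
        {ERROR, 4},       // 3
        {ERROR, ERROR},   // 4: accepts aabb
        {6, ERROR},       // 5
        {ERROR, ERROR},   // 6: accepts ba
    };
    static final String[] TOKEN = {null, null, "aa", null, "aabb", null, "ba"};

    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int state = 0, i = pos;
            int lastEnd = -1;                 // input position after the last accept
            String lastToken = null;
            while (i < input.length()) {
                char c = input.charAt(i);
                int col = c == 'a' ? 0 : c == 'b' ? 1 : -1;
                if (col < 0) break;
                state = TRANS[state][col];
                if (state == ERROR) break;    // stuck: no transition on c
                i++;
                if (TOKEN[state] != null) {   // remember the longest match so far
                    lastEnd = i;
                    lastToken = TOKEN[state];
                }
            }
            if (lastToken == null)
                throw new IllegalArgumentException("Unexpected character at " + pos);
            out.add(lastToken);
            pos = lastEnd;                    // roll the input back
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("aaba"));
    }
}
```

On aaba the scanner reads aab, gets stuck in state 3, rolls back to the position after aa, emits aa, and then scans ba from state 0.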
Automatic Lexer Generators
• Input: token specification
– list of regular expressions in priority order
– associated action for each RE (generates appropriate kind of token, other bookkeeping)
• Output: lexer program
– program that reads an input stream and breaks it up into tokens according to the REs (or reports lexical error -- “Unexpected character”)
Automatic Lexer (C) Generator
Example: JLex (Java)

    %%
    digits = 0|[1-9][0-9]*
    letter = [A-Za-z]
    identifier = {letter}({letter}|[0-9_])*
    whitespace = [\ \t\n\r]+
    %%
    {whitespace} { /* discard */ }
    {digits}     { return new Token(INT, Integer.parseInt(yytext())); }
    "if"         { return new Token(IF, yytext()); }
    "while"      { return new Token(WHILE, yytext()); }
    …
    {identifier} { return new Token(ID, yytext()); }
Example Output (Java)
• Java Lexer which implements the functionality described in the language specification.
• For instance:

    case 5: { return new Token(WHILE, yytext()); }
    case 6: break;
    case 2: { return new Token(ID, yytext()); }
    case 7: break;
    case 4: { return new Token(IF, yytext()); }
    case 8: break;
    case 1: { return new Token(INT, Integer.parseInt(yytext())); }
Start States
• Mechanism that specifies state in which to start the execution of the DFA
• Declare states in the second section
– %state STATE
• Use states as prefixes of regular expressions in the third section:
– <STATE> regex {action}
• Set current state in the actions
– yybegin(STATE)
• There is a pre-defined initial state: YYINITIAL
Example
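[Figure: two start states, INITIAL and STRING. A " moves from INITIAL to STRING and a " moves back; if is matched in INITIAL and . (any character) is matched in STRING.]

    %%
    %state STRING
    %%
    <YYINITIAL> "if" { return new Token(IF, null); }
    <YYINITIAL> "\"" { yybegin(STRING); … }
    <STRING> "\""    { yybegin(YYINITIAL); … }
    <STRING> .       { … }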
Summary
• Lexical analyzer converts a text stream to tokens
• Ad-hoc lexers hard to get right, maintain
• For most languages, legal tokens are conveniently and precisely defined using regular expressions
• Lexer generators generate lexer automaton automatically from token RE's, prioritization
Summary
• To write your own Lexer:
– Describe tokens using Regular Expressions.
– Construct NFAs for those tokens.
  • If you have no ambiguities in the NFA, or you have a DFA directly from the regular expressions, you are done.
– Construct DFA from NFA using the algorithm described.
– Systematically implement the DFA using transition tables.
Reading
• IC Language spec
• JLEX manual
• CVS manual
• Links on course web home page
• Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...), Russ Cox, January 2007

Acknowledgement
The slides are based on similar content by Tim Teitelbaum, Cornell.