implementation of lexical analysis - stanford university
TRANSCRIPT
![Page 1: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/1.jpg)
1
CS143Lecture 4
Implementation of Lexical Analysis
Instructor: Fredrik KjolstadSlide design by Prof. Alex Aiken, with modifications
![Page 2: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/2.jpg)
2
Written Assignments
• WA1 assigned today
• Due in one week– 11:59pm– Electronic hand-in on Gradescope
![Page 3: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/3.jpg)
3
Tips on Building Large Systems
• KISS (Keep It Simple, Stupid!)
• Don’t optimize prematurely
• Design systems that can be tested
• It is easier to modify a working system than to get a system working
![Page 4: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/4.jpg)
Value simplicity
4
“It's not easy to write good software. […] it has a lot to do with valuing simplicity over complexity.”- Barbara Liskov
“Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.”- Brian Kernighan
“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”- Tony Hoare
“Simplicity does not precede complexity, but follows it.”- Alan Perlis
![Page 5: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/5.jpg)
5
Outline
• Specifying lexical structure using regular expressions
• Finite automata– Deterministic Finite Automata (DFAs)– Non-deterministic Finite Automata (NFAs)
• Implementation of regular expressions
![Page 6: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/6.jpg)
6
Convert Regular Expressions to Finite Automata
• High-level sketch
Regularexpressions
NFA
DFA
LexicalSpecification
Table-driven Implementation of DFA
Lexer → Regex → NFA → DFA → Tables
![Page 7: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/7.jpg)
7
Notation
• There is variation in regular expression notation
• Union: A + B ≡ A | B• Option: A + ε ≡ A?• Range: ‘a’+’b’+…+’z’ ≡ [a-z]• Excluded range:
complement of [a-z] ≡ [^a-z]
Lexer → Regex → NFA → DFA → Tables
![Page 8: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/8.jpg)
8
Regular Expressions in Lexical Specification
• Last lecture: a specification for the predicate s ∈ L(R) • But a yes/no answer is not enough!• Instead: partition the input into tokens
• We will adapt regular expressions to this goal
Lexer → Regex → NFA → DFA → Tables
![Page 9: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/9.jpg)
9
Lexical Specification → Regex in five steps
1. Write a regex for each token• Number = digit +• Keyword = ‘if’ + ‘else’ + …• Identifier = letter (letter + digit)*• OpenPar = ‘(‘• …
Lexer → Regex → NFA → DFA → Tables
![Page 10: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/10.jpg)
10
Lexical Specification → Regex in five steps
2. Construct R, matching all lexemes for all tokens
R = Keyword + Identifier + Number + … = R1 + R2 + …
(This step is done automatically by tools like flex)
Lexer → Regex → NFA → DFA → Tables
![Page 11: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/11.jpg)
11
Lexical Specification → Regex in five steps
3. Let input be x1…xnFor 1 ≤ i ≤ n check
x1…xi ∈ L(R)
4. If success, then we know that x1…xi ∈ L(Rj) for some j
5. Remove x1…xi from input and go to (3)
Lexer → Regex → NFA → DFA → Tables
![Page 12: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/12.jpg)
12
Ambiguity 1
• There are ambiguities in the algorithm
• How much input is used? What if• x1…xi ∈ L(R) and also• x1…xK ∈ L(R)
• Rule: Pick longest possible string in L(R) – Pick k if k > i– The “maximal munch”
Lexer → Regex → NFA → DFA → Tables
![Page 13: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/13.jpg)
13
Ambiguity 2
• Which token is used? What if• x1…xi ∈ L(Rj) and also• x1…xi ∈ L(Rk)
• Rule: use rule listed first– Pick j if j < k– E.g., treat “if” as a keyword, not an identifier
Lexer → Regex → NFA → DFA → Tables
![Page 14: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/14.jpg)
14
Error Handling
• What if No rule matches a prefix of input ?
• Problem: Can’t just get stuck …• Solution:
– Write a rule matching all “bad” strings– Put it last (lowest priority)
Lexer → Regex → NFA → DFA → Tables
![Page 15: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/15.jpg)
15
Summary
• Regular expressions provide a concise notation for string patterns
• Use in lexical analysis requires small extensions– To resolve ambiguities– To handle errors
• Good algorithms known– Require only single pass over the input– Few operations per character (table lookup)
Lexer → Regex → NFA → DFA → Tables
![Page 16: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/16.jpg)
16
Finite Automata
• Regular expressions = specification• Finite automata = implementation
• A finite automaton consists of– An input alphabet Σ– A set of states S– A start state n– A set of accepting states F ⊆ S– A set of transitions state →input state
Lexer → Regex → NFA → DFA → Tables
![Page 17: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/17.jpg)
17
Finite Automata
• Transitions1 →a s2
• Is readIn state s1 on input “a” go to state s2
• If end of input and in accepting state => accept
• Otherwise => reject
Lexer → Regex → NFA → DFA → Tables
![Page 18: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/18.jpg)
18
Finite Automata State Graphs
• A state
• The start state
• An accepting state
• A transitiona
Lexer → Regex → NFA → DFA → Tables
![Page 19: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/19.jpg)
19
A Simple Example
• A finite automaton that accepts only “1”
1
Lexer → Regex → NFA → DFA → Tables
0,1
0,1
0
![Page 20: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/20.jpg)
20
Another Simple Example
• A finite automaton accepting any number of 1’s followed by a single 0
• Alphabet: {0,1}
0
1
Lexer → Regex → NFA → DFA → Tables
0,1
0
![Page 21: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/21.jpg)
21
And Another Example
• Alphabet {0,1}• What language does this recognize?
0
1
0
1
0
1
Lexer → Regex → NFA → DFA → Tables
![Page 22: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/22.jpg)
22
Epsilon Moves in NFAs
• Another kind of transition: ε-movesε
• Machine can move from state A to state B without reading input
• Only exist in NFAs
A B
Lexer → Regex → NFA → DFA → Tables
![Page 23: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/23.jpg)
23
Deterministic and Nondeterministic Automata
• Deterministic Finite Automata (DFA)– Exactly one transition per input per state – No ε-moves
• Nondeterministic Finite Automata (NFA)– Can have zero, one, or multiple transitions for one
input in a given state– Can have ε-moves
Lexer → Regex → NFA → DFA → Tables
![Page 24: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/24.jpg)
24
Execution of Finite Automata
• A DFA can take only one path through the state graph– Completely determined by input
• NFAs can choose– Whether to make ε-moves– Which of multiple transitions for a single input to take
Lexer → Regex → NFA → DFA → Tables
![Page 25: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/25.jpg)
25
Acceptance of NFAs
• An NFA can get into multiple states
• Input:
0
1
0
0
1 0 0
Rule: NFA accepts if it can get to a final state
Lexer → Regex → NFA → DFA → Tables
![Page 26: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/26.jpg)
26
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of languages (regular languages)
• DFAs are faster to execute– There are no choices to consider
Lexer → Regex → NFA → DFA → Tables
![Page 27: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/27.jpg)
27
NFA vs. DFA (2)
• For a given language NFA can be simpler than DFA
01
0
0
01
0
1
0
1
NFA
DFA
• DFA can be exponentially larger than NFA
Lexer → Regex → NFA → DFA → Tables
![Page 28: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/28.jpg)
28
Convert Regular Expressions to NFA (1)
• For each kind of rexp, define an NFA– Notation: NFA for rexp M
M
• For εε
• For input aa
Lexer → Regex → NFA → DFA → Tables
![Page 29: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/29.jpg)
29
Convert Regular Expressions to NFA (2)
• For ABA B
ε
• For A + B
A
B
ε
ε
ε
ε
Lexer → Regex → NFA → DFA → Tables
![Page 30: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/30.jpg)
30
Convert Regular Expressions to NFA (3)
• For A*
A εε
ε
ε
Lexer → Regex → NFA → DFA → Tables
![Page 31: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/31.jpg)
31
Example of RegExp to NFA conversion
• Consider the regular expression(1+0)*1
• The NFA is
εε
εB
1C E0D F ε
εG
ε ε
ε
ε
A H 1I J
Lexer → Regex → NFA → DFA → Tables
![Page 32: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/32.jpg)
32
NFA to DFA: The Trick
• Simulate the NFA• Each state of DFA
= a non-empty subset of states of the NFA• Start state
= the set of NFA states reachable through ε-moves from NFA start state
• Add a transition S →a S’ to DFA iff– S’ is the set of NFA states reachable from any state in
S after seeing the input a, considering ε-moves as well
Lexer → Regex → NFA → DFA → Tables
![Page 33: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/33.jpg)
33
NFA to DFA. Remark
• An NFA may be in many states at any time
• How many different states ?
• If there are N states, the NFA must be in some subset of those N states
• How many subsets are there?– 2N - 1 = finitely many
Lexer → Regex → NFA → DFA → Tables
![Page 34: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/34.jpg)
34
NFA -> DFA Exampleε
10 1ε
ε
ε
ε
ε
ε ε
ε
A BC
D
E
FG H I J
FGHIABCD
EJGHIABCDABCDHI
0
1
0
10 1
Lexer → Regex → NFA → DFA → Tables
![Page 35: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/35.jpg)
35
Implementation
• A DFA can be implemented by a 2D table T– One dimension is “states”– Other dimension is “input symbol”– For every transition Si →a Sk define T[i,a] = k
• DFA “execution”– If in state Si and input a, read T[i,a] = k and skip to state Sk– Very efficient
stat
es
input symbols
bcd
a b
a b
b b
a b
a0 1
Lexer → Regex → NFA → DFA → Tables
![Page 36: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/36.jpg)
36
Table Implementation of a DFA
S
T
U
0
1
0
10 1
0 1
S T U
T T U
U T U
Lexer → Regex → NFA → DFA → Tables
![Page 37: Implementation of Lexical Analysis - Stanford University](https://reader034.vdocument.in/reader034/viewer/2022050313/626fabccd4236372490f9607/html5/thumbnails/37.jpg)
37
Implementation (Cont.)
• NFA -> DFA conversion is at the heart of tools such as flex
• But, DFAs can be huge
• In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
Lexer → Regex → NFA → DFA → Tables