Compiler Construction
Lent Term 2021
Lecture 2 : Lexical analysis
Timothy G. Griffin
Computer Laboratory
University of Cambridge
• Recall regular expressions
• Recall Finite Automata
• Recall NFA to DFA transformation
• What is the “lexing problem”?
• How are DFAs used to solve the lexing problem?
What problem are we solving?
Translate a sequence of characters
if m = 0 then 1 else if m = 1 then 1 else fib (m - 1) + fib (m - 2)
into a sequence of tokens
IF, IDENT “m”, EQUAL, INT 0, THEN, INT 1, ELSE, IF,
IDENT “m”, EQUAL, INT 1, THEN, INT 1, ELSE, IDENT “fib”,
LPAREN, IDENT “m”, SUB, INT 1, RPAREN, ADD,
IDENT “fib”, LPAREN, IDENT “m”, SUB, INT 2, RPAREN
implemented with some data type
type token =
  | INT of int | IDENT of string | LPAREN | RPAREN
  | ADD | SUB | EQUAL | IF | THEN | ELSE
  | …
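Regular expressions e over alphabet Σ

$$e \;\to\; \emptyset \quad|\quad \epsilon \quad|\quad a \quad|\quad e \mid e \quad|\quad e\,e \quad|\quad e^{*} \qquad (a \in \Sigma)$$

$$
\begin{aligned}
M(\emptyset) &= \{\} \\
M(\epsilon) &= \{\epsilon\} \\
M(a) &= \{a\} \\
M(e_1 \mid e_2) &= M(e_1) \cup M(e_2) \\
M(e_1 e_2) &= \{\, w_1 w_2 \mid w_1 \in M(e_1),\ w_2 \in M(e_2) \,\} \\
M(e^{0}) &= \{\epsilon\} \\
M(e^{n+1}) &= M(e\,e^{n}) \\
M(e^{*}) &= \bigcup_{n \ge 0} M(e^{n})
\end{aligned}
$$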
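The definition of M(e) can be read almost directly as a membership test. The following is a minimal sketch, not part of the slides: the type `re`, its constructor names, and the helper functions `m` and `matches` are all illustrative assumptions, and the continuation-passing matcher is one of several possible ways to decide whether w ∈ M(e).

```ocaml
(* Regular expressions over the alphabet of OCaml chars.
   Constructor names are illustrative, not taken from the slides. *)
type re =
  | Empty              (* the empty language *)
  | Eps                (* the empty string *)
  | Chr of char        (* a single character a *)
  | Alt of re * re     (* alternation of e1 and e2 *)
  | Seq of re * re     (* e1 followed by e2 *)
  | Star of re         (* zero or more repetitions *)

(* [m e cs k] holds if some prefix of [cs] is in M(e) and the
   continuation [k] accepts the remaining suffix. *)
let rec m (e : re) (cs : char list) (k : char list -> bool) : bool =
  match e with
  | Empty -> false
  | Eps -> k cs
  | Chr a -> (match cs with c :: rest when c = a -> k rest | _ -> false)
  | Alt (e1, e2) -> m e1 cs k || m e2 cs k
  | Seq (e1, e2) -> m e1 cs (fun rest -> m e2 rest k)
  | Star e1 ->
      (* Either zero iterations, or one iteration that consumes input
         followed by the star again. Requiring progress keeps the
         recursion finite. *)
      k cs
      || m e1 cs (fun rest ->
             List.length rest < List.length cs && m e rest k)

(* Is w in M(e)? *)
let matches (e : re) (w : string) : bool =
  m e (List.of_seq (String.to_seq w)) (fun rest -> rest = [])

(* The expression (a|b)*abb used later in the lecture. *)
let abb =
  Seq (Star (Alt (Chr 'a', Chr 'b')),
       Seq (Chr 'a', Seq (Chr 'b', Chr 'b')))
```

With these definitions, `matches abb "babb"` holds and `matches abb "ab"` does not. This direct approach can backtrack heavily, which is one reason the lecture goes on to compile expressions into automata.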
Regular Expression (RE) Examples
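$$M((a \mid b)^{*}abb) = \{\, abb,\ aabb,\ babb,\ aaabb,\ ababb,\ baabb,\ bbabb,\ \ldots \,\}$$

[A second example of the form M(e) = { ... } follows on the slide; its expression and strings are not recoverable from the extracted text.]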
Review of Finite Automata (FA)
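$$M = (Q, \Sigma, \delta, q_0, F)$$

$Q$ : states, $\Sigma$ : alphabet, $q_0 \in Q$ : start state, $F \subseteq Q$ : final states

For deterministic FA (DFA): $\delta(q, a) \in Q$ for $q \in Q$, $a \in \Sigma$

For nondeterministic FA (NFA): $\delta(q, a) \subseteq Q$ for $q \in Q$, $a \in (\Sigma \cup \{\epsilon\})$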
NFA Example
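An NFA accepting M(e) for a regular expression e over {a, b, c}
[State diagram: a start state and transitions labelled a, b and c. The expression e is garbled in the extracted text.]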
A bit of notation
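For deterministic FA.

$$q \xrightarrow{\epsilon} q$$
$$q_1 \xrightarrow{aw} q_3 \quad \text{if } \delta(q_1, a) = q_2 \text{ and } q_2 \xrightarrow{w} q_3$$
$$L(M) = \{\, w \mid q_0 \xrightarrow{w} q,\ q \in F \,\}$$

For nondeterministic FA.

$$q \xrightarrow{\epsilon} q$$
$$q_1 \xrightarrow{aw} q_3 \quad \text{if } q_2 \in \delta(q_1, a) \text{ and } q_2 \xrightarrow{w} q_3$$
$$q_1 \xrightarrow{w} q_3 \quad \text{if } q_2 \in \delta(q_1, \epsilon) \text{ and } q_2 \xrightarrow{w} q_3$$
$$L(M) = \{\, w \mid q_0 \xrightarrow{w} q,\ q \in F \,\}$$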
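For the deterministic case, the relation q --w--> q' and the language L(M) amount to folding δ over the characters of w. The sketch below is an illustration under assumptions: the record type `dfa`, its field names, and the functions `run` and `accepts` are hypothetical names, states are integers, and δ is made total by sending every missing move to a dead state 0. The example automaton recognises only the keyword "if".

```ocaml
(* A DFA with integer states over the alphabet of OCaml chars.
   Field names are illustrative. *)
type dfa = {
  delta : int -> char -> int;  (* the transition function *)
  start : int;                 (* the start state q0 *)
  final : int list;            (* the final states F *)
}

(* [run m q w] is the state reached from q after reading all of w. *)
let run (m : dfa) (q : int) (w : string) : int =
  Seq.fold_left m.delta q (String.to_seq w)

(* w is in L(M) exactly when reading w from q0 ends in a final state. *)
let accepts (m : dfa) (w : string) : bool =
  List.mem (run m m.start w) m.final

(* Example: a DFA for the keyword "if"; state 0 is a dead state. *)
let if_dfa = {
  delta = (fun q a ->
    match q, a with
    | 1, 'i' -> 2
    | 2, 'f' -> 3
    | _ -> 0);
  start = 1;
  final = [3];
}
```

Here `accepts if_dfa "if"` holds, while `accepts if_dfa "i"` and `accepts if_dfa "iff"` do not.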
Review of RE -> NFA
e : a regular expression.
N(e) : a nondeterministic FA accepting M(e), with start state $q_{start}$ and a single final state $q_{final}$.
The construction is done by induction on the structure of e.
Review of RE -> NFA
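N(∅) : states $q_0$ and $q_1$ with no transition between them
N(ε) : $q_0 \xrightarrow{\epsilon} q_1$
N(a) : $q_0 \xrightarrow{a} q_1$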
Review of RE -> NFA
$N(e_1 \mid e_2)$ : built from $N(e_1)$ and $N(e_2)$ [diagram not reproduced in the transcript]
Review of RE -> NFA
$N(e_1 e_2)$ : built from $N(e_1)$ followed by $N(e_2)$ [diagram not reproduced in the transcript]
Review of RE -> NFA
$N(e^{*})$ : built from $N(e)$ [diagram not reproduced in the transcript]
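Since the diagrams for the inductive cases did not survive extraction, here is a hedged sketch of one standard way to carry out such a construction (a Thompson-style construction). It is not necessarily the exact construction drawn on the slides: in particular, this version links concatenated fragments with an ε-move, so it can produce more states than a version that merges them. All type and function names (`re`, `nfa`, `fresh`, `build`, `targets`, `closure`, `accepts`) are illustrative assumptions.

```ocaml
type re = Empty | Eps | Chr of char | Alt of re * re | Seq of re * re | Star of re

(* An NFA with one start state and a single final state. Transitions are
   triples (source, label, target); the label None marks an ε-move. *)
type nfa = { start : int; final : int; trans : (int * char option * int) list }

(* Fresh state names drawn from a counter. *)
let counter = ref 0
let fresh () = let q = !counter in incr counter; q

(* N(e), by induction on the structure of e. *)
let rec build (e : re) : nfa =
  match e with
  | Empty ->
      let s = fresh () in let f = fresh () in
      { start = s; final = f; trans = [] }
  | Eps ->
      let s = fresh () in let f = fresh () in
      { start = s; final = f; trans = [ (s, None, f) ] }
  | Chr a ->
      let s = fresh () in let f = fresh () in
      { start = s; final = f; trans = [ (s, Some a, f) ] }
  | Alt (e1, e2) ->
      let n1 = build e1 in let n2 = build e2 in
      let s = fresh () in let f = fresh () in
      { start = s; final = f;
        trans = [ (s, None, n1.start); (s, None, n2.start);
                  (n1.final, None, f); (n2.final, None, f) ]
                @ n1.trans @ n2.trans }
  | Seq (e1, e2) ->
      let n1 = build e1 in let n2 = build e2 in
      { start = n1.start; final = n2.final;
        trans = [ (n1.final, None, n2.start) ] @ n1.trans @ n2.trans }
  | Star e1 ->
      let n1 = build e1 in
      let s = fresh () in let f = fresh () in
      { start = s; final = f;
        trans = [ (s, None, n1.start); (s, None, f);
                  (n1.final, None, n1.start); (n1.final, None, f) ]
                @ n1.trans }

(* Successors of state q under the given label. *)
let targets n q lbl =
  List.filter_map (fun (p, l, r) -> if p = q && l = lbl then Some r else None) n.trans

(* States reachable from qs by ε-moves only, as a sorted list. *)
let rec closure n qs =
  let qs = List.sort_uniq compare qs in
  let qs' = List.sort_uniq compare (qs @ List.concat_map (fun q -> targets n q None) qs) in
  if qs' = qs then qs else closure n qs'

(* Simulate the NFA on w by tracking the set of current states. *)
let accepts (n : nfa) (w : string) : bool =
  let move qs a = List.concat_map (fun q -> targets n q (Some a)) qs in
  let step qs a = closure n (move qs a) in
  let reached = Seq.fold_left step (closure n [ n.start ]) (String.to_seq w) in
  List.mem n.final reached
```

The `accepts` function is included only so the construction can be checked on small inputs; it simulates the NFA directly rather than converting it to a DFA.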
$N((a \mid b)^{*}abb)$
[NFA diagram: states 0 to 10 with a marked start state; the labelled transitions are a, b, a, b, b.]
Review of NFA -> DFA
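From the NFA $M = (Q, \Sigma, \delta, q_0, F)$ construct the DFA $M' = (Q', \Sigma, \delta', q_0', F')$:

$$
\begin{aligned}
Q' &= \{\, S \mid S \subseteq Q \,\} \\
\epsilon\text{-closure}(S) &= \{\, q' \in Q \mid \exists q \in S,\ q \xrightarrow{\epsilon} q' \,\} \\
\delta'(S, a) &= \epsilon\text{-closure}(\{\, q' \in \delta(q, a) \mid q \in S \,\}) \\
q_0' &= \epsilon\text{-closure}(\{ q_0 \}) \\
F' &= \{\, S \subseteq Q \mid S \cap F \neq \emptyset \,\}
\end{aligned}
$$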
How do we compute ε-closure(S)?
ε-closure(S):
    push all elements of S onto a stack
    result := S
    while stack not empty
        pop q off the stack
        for each u ∈ δ(q, ε)
            if u ∉ result
            then result := {u} ∪ result
                 push u on stack
    return result
Look familiar?
It’s just a version of
transitive closure!
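The stack-based procedure above translates fairly directly into OCaml. This is a sketch under assumptions: the ε-moves in `eps_moves` are not taken from the slides but from the usual textbook NFA layout for (a|b)*abb with states 0 to 10, chosen only so that there is something to run on, and the names `eps_moves`, `delta_eps` and `closure` are illustrative.

```ocaml
module IntSet = Set.Make (Int)

(* Assumed ε-moves, as an association list from a state to its
   ε-successors. These edges are hypothetical example data. *)
let eps_moves = [ (0, [1; 7]); (1, [2; 4]); (3, [6]); (5, [6]); (6, [1; 7]) ]

(* The ε-successors of q, or none if q has no ε-moves. *)
let delta_eps (q : int) : int list =
  try List.assoc q eps_moves with Not_found -> []

(* Push all of s, then repeatedly pop a state and add any ε-successor
   that has not been seen yet, pushing it for later exploration. *)
let closure (s : IntSet.t) : IntSet.t =
  let stack = Stack.create () in
  IntSet.iter (fun q -> Stack.push q stack) s;
  let result = ref s in
  while not (Stack.is_empty stack) do
    let q = Stack.pop stack in
    List.iter
      (fun u ->
        if not (IntSet.mem u !result) then begin
          result := IntSet.add u !result;
          Stack.push u stack
        end)
      (delta_eps q)
  done;
  !result
```

Each state is pushed at most once, because a state is only pushed at the moment it is first added to `result`, so the loop terminates after at most |Q| pops beyond the initial set.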
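$DFA(N((a \mid b)^{*}abb))$
[DFA diagram: a marked start state and transitions labelled a and b between five states, each a set of NFA states:]
{0, 1, 2, 4, 7}
{1, 2, 3, 4, 6, 7, 8}
{1, 2, 4, 5, 6, 7}
{1, 2, 4, 5, 6, 7, 9}
{1, 2, 4, 5, 6, 7, 10}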
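The state sets in this example can be reproduced by running the subset construction on a concrete NFA. The sketch below assumes a particular transition list for the NFA (the usual textbook layout for (a|b)*abb, with start state 0 and final state 10), since the edges of the diagram are not recoverable from the transcript. The names `trans`, `step`, `closure`, `delta'` and `dfa_states` are illustrative, and the closure here is written as a simple fixed point rather than with an explicit stack.

```ocaml
module IntSet = Set.Make (Int)

(* Assumed NFA transitions (source, label, target); the label None
   marks an ε-move. Start state 0, final state 10. *)
let trans : (int * char option * int) list =
  [ (0, None, 1); (0, None, 7); (1, None, 2); (1, None, 4);
    (2, Some 'a', 3); (4, Some 'b', 5); (3, None, 6); (5, None, 6);
    (6, None, 1); (6, None, 7);
    (7, Some 'a', 8); (8, Some 'b', 9); (9, Some 'b', 10) ]

(* States reachable from some state of s by one move with the given label. *)
let step (lbl : char option) (s : IntSet.t) : IntSet.t =
  List.fold_left
    (fun acc (p, l, r) ->
      if l = lbl && IntSet.mem p s then IntSet.add r acc else acc)
    IntSet.empty trans

(* Keep adding ε-successors until nothing changes. *)
let rec closure (s : IntSet.t) : IntSet.t =
  let s' = IntSet.union s (step None s) in
  if IntSet.equal s s' then s else closure s'

(* The DFA transition function on sets of NFA states. *)
let delta' (s : IntSet.t) (a : char) : IntSet.t = closure (step (Some a) s)

(* Explore, breadth first, the DFA states reachable from the closure
   of the NFA start state. *)
let dfa_states (alphabet : char list) : IntSet.t list =
  let rec go seen = function
    | [] -> List.rev seen
    | s :: todo ->
        if List.exists (IntSet.equal s) seen then go seen todo
        else go (s :: seen) (todo @ List.map (delta' s) alphabet)
  in
  go [] [ closure (IntSet.singleton 0) ]
```

Under the assumed transitions, `dfa_states ['a'; 'b']` yields the five sets {0,1,2,4,7}, {1,2,3,4,6,7,8}, {1,2,4,5,6,7}, {1,2,4,5,6,7,9} and {1,2,4,5,6,7,10}. Only reachable subsets are generated, which in practice is far fewer than all subsets of Q.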
Traditional Regular Language Problem
Given e and w, is $w \in L(e)$?
Solution : construct NFA from e, then DFA, then run
the DFA on w.
But is this a solution to the “lexing problem”?
No!
Something closer to the “lexing problem”
Given $w$ and $e_1, e_2, \ldots, e_k$,
find $(i_1, w_1), (i_2, w_2), \ldots, (i_n, w_n)$
so that $w = w_1 w_2 \cdots w_n$ and $w_j \in L(e_{i_j})$
and what else?
The expressions are ordered by priority. Why?
Is “if” a variable or a keyword? We need priority to resolve the ambiguity (so “if” is matched by the keyword RE before the identifier RE).
We need to do a longest match. Why?
Is “ifif” a variable or two “if” keywords?
Define Tokens with Regular Expressions (Finite
Automata)
Keyword: if
1 --i--> 2 --f--> 3
This FA is really shorthand for the same automaton with an explicit “dead state” 0:
1 --i--> 2 --f--> 3, 1 --(Σ-{i})--> 0, 2 --(Σ-{f})--> 0
Define Tokens with Regular Expressions (Finite
Automata)
Regular Expression / Finite Automata / Token
Keyword: if
    FA: 1 --i--> 2 --f--> 3
    Token: KEY(IF)
Keyword: then
    FA: 1 --t--> 2 --h--> 3 --e--> 4 --n--> 5
    Token: KEY(THEN)
Identifier: [a-zA-Z][a-zA-Z0-9]*
    FA: 1 --[a-zA-Z]--> 2, 2 --[a-zA-Z0-9]--> 2
    Token: ID(s)
Define Tokens with Regular Expressions (Finite
Automata)
Regular Expression Finite Automata Token
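Regular Expression / Finite Automata / Token
number: [0-9][0-9]*
    FA: 1 --[0-9]--> 2, 2 --[0-9]--> 2
    Token: NUM(n)
real: ([0-9]+ ‘.’ [0-9]*) | ([0-9]* ‘.’ [0-9]+)
    FA (states 1 to 5): 1 --[0-9]--> 2, 2 --[0-9]--> 2, 2 --.--> 3, 3 --[0-9]--> 3, 1 --.--> 4, 4 --[0-9]--> 5, 5 --[0-9]--> 5
    Token: NUM(n)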
No Tokens for “White-Space”
White-space with one-line comments starting with %
[FA diagram: states 1 to 4; transitions labelled %, [A-Za-z0-9' '], \n, \t and ' '.]
Constructing a Lexer
INPUT: an ordered list of regular expressions $e_1, e_2, \ldots, e_k$ (highest priority first, lowest last)
NFA for $e_1 \mid e_2 \mid \cdots \mid e_k$
DFA with each final state associated with the $e_i$ of highest priority.
Constructing a Lexer
(1) Keyword : then
(2) Ident : [a-z][a-z]*
(3) White-space : ' '
[DFA diagram: 1 --t--> 2:ID --h--> 3:ID --e--> 4:ID --n--> 5:THEN, 1 --' '--> 7:W, and a further identifier state 6:ID.]
State 5 could accept either an ID or the keyword “then”. The priority rule eliminates this ambiguity and associates state 5 with the keyword.
What about longest match?
[Same DFA as on the previous slide: states 1, 2:ID, 3:ID, 4:ID, 5:THEN, 6:ID, 7:W; state 0 is the dead state.]
Start in the initial state, then repeat:
(1) Read input until the dead state is reached. Emit the token associated with the last accepting state.
(2) Reset the state to the start state, and the input position to just after the emitted token.
| = current position, $ = EOF

Input            current state   last accepting state
|then thenx$     1               0
t|hen thenx$     2               2
th|en thenx$     3               3
the|n thenx$     4               4
then| thenx$     5               5
then |thenx$     0               5    EMIT KEY(THEN)
then| thenx$     1               0    RESET
then |thenx$     7               7
then t|henx$     0               7    EMIT WHITE(' ')
then |thenx$     1               0    RESET
then t|henx$     2               2
then th|enx$     3               3
then the|nx$     4               4
then then|x$     5               5
then thenx|$     6               6
then thenx$|     0               6    EMIT ID(thenx)
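The loop traced above can be sketched in OCaml as follows. This is an illustration under assumptions rather than the course's lexer: the token constructors `KEY_THEN`, `ID` and `WHITE`, the exception `Lex_error`, and the functions `delta`, `accepting` and `lex` are hypothetical names; the transition function hard-codes the seven-state DFA with dead state 0, assuming every letter not shown on a keyword edge leads to the identifier state 6; and the end of the string plays the role of the `$` marker.

```ocaml
type token = KEY_THEN | ID of string | WHITE

let is_letter c = 'a' <= c && c <= 'z'

(* Transition function of the DFA; state 0 is the dead state. *)
let delta (q : int) (c : char) : int =
  match q, c with
  | 1, 't' -> 2
  | 2, 'h' -> 3
  | 3, 'e' -> 4
  | 4, 'n' -> 5
  | 1, ' ' -> 7
  | (1 | 2 | 3 | 4 | 5 | 6), c when is_letter c -> 6
  | _ -> 0

(* Token attached to each accepting state. Priority is already resolved:
   state 5 stands for the keyword rather than an identifier. *)
let accepting (q : int) (lexeme : string) : token option =
  match q with
  | 2 | 3 | 4 | 6 -> Some (ID lexeme)
  | 5 -> Some KEY_THEN
  | 7 -> Some WHITE
  | _ -> None

(* Raised with the input position at which no token matches. *)
exception Lex_error of int

let lex (input : string) : token list =
  let n = String.length input in
  (* Run the DFA from position pos in state q, remembering in last the
     end position and state of the most recent accepting state. *)
  let rec scan q pos last =
    if pos >= n then last
    else
      let q' = delta q input.[pos] in
      if q' = 0 then last
      else
        let last' =
          if accepting q' "" <> None then Some (pos + 1, q') else last in
        scan q' (pos + 1) last'
  in
  (* Emit the longest match starting at start, then reset and continue. *)
  let rec tokens start acc =
    if start >= n then List.rev acc
    else
      match scan 1 start None with
      | None -> raise (Lex_error start)
      | Some (stop, q) ->
          let lexeme = String.sub input start (stop - start) in
          (match accepting q lexeme with
           | Some tok -> tokens stop (tok :: acc)
           | None -> raise (Lex_error start))
  in
  tokens 0 []
```

On the input of the trace, `lex "then thenx"` gives `[KEY_THEN; WHITE; ID "thenx"]`. Remembering the last accepting position is what allows the lexer to back up when the DFA runs past the end of the longest token before dying.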