implementation of lexical analysis (scanning)

30
1 Implementation of Lexical Analysis (Scanning) CS780(Prasad) L4Lexer 1 Adapted from material by: Prof. Alex Aiken and Prof. George Necula (UCB) Prof. Saman Amarasinghe (MIT) Outline Specifying lexical structure using regular expressions Recognizing tokens using finite automata Deterministic Finite Automata (DFAs) CS780(Prasad) L4Lexer 2 Non-deterministic Finite Automata (NFAs) Implementation of regular expressions RegExp => NFA => DFA => Tables Regular Expressions => Lexical Spec. 1. Write a regular expression for the lexemes of each token Number = digit + (Kleene plus) + (Kleene plus) Keyword = ‘if’ | ‘else’ | … (Union) (Union) Identifier = letter (letter | digit)* CS780(Prasad) L4Lexer 3 Identifier = letter (letter | digit)* OpenPar = ‘(’ 2. Construct R, matching all lexemes for all tokens R = Keyword | Identifier | Number | … = R 1 | R 2 | … (cont’d) 3. Let input be the sequence of characters x 1 …x n For each 1 i n, check if x 1 …x i L(R) 4. It must be that CS780(Prasad) L4Lexer 4 x 1 …x i L(R j ) for some j 5. Remove x 1 …x i from input and go to (3) Problem: There are ambiguities in the algorithm.

Upload: dohanh

Post on 12-Feb-2017

227 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Implementation of Lexical Analysis (Scanning)

1

Implementation of Lexical Analysis(Scanning)

CS780(Prasad) L4Lexer 1

Adapted from material by:Prof. Alex Aiken and Prof. George Necula (UCB)

Prof. Saman Amarasinghe (MIT)

Outline

• Specifying lexical structure using regular expressions

• Recognizing tokens using finite automataDeterministic Finite Automata (DFAs)

CS780(Prasad) L4Lexer 2

( )Non-deterministic Finite Automata (NFAs)

• Implementation of regular expressionsRegExp => NFA => DFA => Tables

Regular Expressions => Lexical Spec.

1. Write a regular expression for the lexemes of each token

• Number = digit + (Kleene plus)+ (Kleene plus)• Keyword = ‘if’ | ‘else’ | … (Union)(Union)• Identifier = letter (letter | digit)*

CS780(Prasad) L4Lexer 3

• Identifier = letter (letter | digit)*• OpenPar = ‘(’• …

2. Construct R, matching all lexemes for all tokensR = Keyword | Identifier | Number | …

= R1 | R2 | …

(cont’d)3. Let input be the sequence of characters

x1…xn• For each 1 ≤ i ≤ n, check if

x1…xi ∈ L(R)

4. It must be that

CS780(Prasad) L4Lexer 4

x1…xi ∈ L(Rj) for some j5. Remove x1…xi from input and go to (3)

Problem: There are ambiguities in the algorithm.

Page 2: Implementation of Lexical Analysis (Scanning)

2

Ambiguities

• How much input is used? What ifx1…xi ∈ L(R) and alsox1…xK ∈ L(R)

Rule: Pick longest possible string in L(R) The “maximal munch” << vs < , != vs !

CS780(Prasad) L4Lexer 5

• Which token is used? What ifx1…xi ∈ L(Rj) and alsox1…xi ∈ L(Rk)

Rule: use rule listed first (j if j < k)Treats “if” as a keyword not an identifier

Error Handling

• What ifNo rule matches a prefix of input ?

• Problem: Can get stuck …

CS780(Prasad) L4Lexer 6

• Solution:

Write a rule matching all “bad” stringsPut it last (catch-all clause)

Summary

• Regular expressions provide a concise notation for string patterns.

• Use in lexical analysis requires small extensions:

T l bi i i

CS780(Prasad) L4Lexer 7

To resolve ambiguitiesTo handle errors

• Good algorithms knownRequire only single pass over the inputFew operations per character (table lookup)

Parity Problem

1s. ofnumber even contains )(:

}1,0{*

*

ωω

ω

⇔→Σ

Σ∈=Σ

paritybooleanparity

CS780(Prasad) L4Lexer 8

EP OP

1

10 0

Finite automaton = Recognizer

Page 3: Implementation of Lexical Analysis (Scanning)

3

Basic Features

• Consumes the entire input string.• Remembers the parity of the bit string by

abstracting from the number of 1s in the string.• Finite amount of memory required for this

purpose

CS780(Prasad) L4Lexer 9

purpose.Observe that counting requires unbounded memory, while computing the parity requires very small and fixed amount of memory.

• Accepts/Rejects the input in a deterministic fashion.

Deterministic Finite State Automaton (DFA)

Q: Finite set of statesΣ: Finite Alphabet

),,,,( 0 FqQM δΣ=

CS780(Prasad) L4Lexer 10

pδ: Transition function

total function from Q x Σ to Q: Initial/Start State

F : Set of final/accepting state0q

Finite Automata State Graphs

• The start state

• State

CS780(Prasad) L4Lexer 11

• An accepting state

• A transitiona

Operation of the machine

• Read the current letter of input under the

0 0 1

0q

Input Tape

Finite Control

CS780(Prasad) L4Lexer 12

• Read the current letter of input under the tape head.

• Transit to a new state depending on the current input and the current state, as dictated by the transition function.

• Halt after consuming the entire input.

Page 4: Implementation of Lexical Analysis (Scanning)

4

• Machine configuration:

• Yields relation:

Associating Language with DFA

∗Σ∈∈ ωω , where],[ Qqq

CS780(Prasad) L4Lexer 13

• Language:

]),,([ ],[ M ωδω aqaq a

} ],[ ],[ |{ *M0

* Fqqq ∈∧Σ∈444 3444 21

a εωω

• Set of strings over {a,b} that contain bb

• Design states by partitioning Σ*

∗∗ )|()|( babbba

Example

CS780(Prasad) L4Lexer 14

• Design states by partitioning Σ .Strings containing bb q2Strings not containing bb

Strings that end in b q1Strings that do not end in b q0

Initial state: q0Final state: q2

q2

State Diagram and Table

q0 q1

ab

a

b

a

b

CS780(Prasad) L4Lexer 15

bδ a bq0 q0 q1

q1 q0 q2

q2 q2 q2

}2{},{

}2,1,0{

qFba

qqqQ

==Σ=

],1[],0[ * εqaabq a

q0 q2

Strings over {a,b} that do not contain bb

q1

ab

a

b

a

b

CS780(Prasad) L4Lexer 16

bδ a bq0 q0 q1

q1 q0 q2

q2 q2 q2

}1,0{},{

}2,1,0{

qqFba

qqqQ

==Σ=

],0[],0[ * εqbaq a

Page 5: Implementation of Lexical Analysis (Scanning)

5

Strings over {a,b} containing even number of a’s and odd number of b’s.

Σ*

Ea OaEb Ob ObEb

CS780(Prasad) L4Lexer 17

b

b

b

b

aaa a

[Oa,Ob]

[Ea,Eb] [Ea,Ob]

[Oa,Eb]

Example

• Alphabet {0,1}• What language does this recognize?

1 0

CS780(Prasad) L4Lexer 18

0 0

11

Nondeterministic Finite Automata

qi qj

qkq

qi qja a

aDFA

NFA

a

CS780(Prasad) L4Lexer 19

relationsubset function total)(:

function total :

QQQΡowQ

QQ

NFA

NFA

DFA

×Σ×⊆→Σ×

→Σ×

δδ

δ

q2

NFA State Diagram (Strings over {a,b} ending in bb)

q0 q1

a

b b

}210{ qqqQ =

b

CS780(Prasad) L4Lexer 20

δ a bq0 {q0} {q0,q1}

q1 φ {q2}

q2 φ φ

}2{},{

}2,1,0{

qFba

qqqQ

==Σ=

bbba ∗)|(

Page 6: Implementation of Lexical Analysis (Scanning)

6

],0[],0[],0[],0[ εqbqbbqabbq aaa

],1[],0[],0[],0[ εqbqbbqabbq aaa

Halts in non-accepting state after consuming the input.

CS780(Prasad) L4Lexer 21

],2[],1[],0[],0[ εqbqbbqabbq aaa

Halts in accepting state after consuming the input.

)(NFALabb ∈

Introducing ε-transitions into NFA

• A ε-transition causes the machine to h i d i i i ll

)(}){(: QPQ →∪Σ× εδ

CS780(Prasad) L4Lexer 22

change its state non-deterministically, without consuming any input.

)()( NFAsLDFAsL ⊆

Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)One transition per input per stateNo ε-moves

• Nondeterministic Finite Automata (NFA)

CS780(Prasad) L4Lexer 23

Can have multiple transitions for one input in a given stateCan have ε-moves

• Finite automata have finite memoryNeed only to encode the current state

How do we associate a language with NFA?• A DFA can take only one path through the

state graph

• NFAs can chooseWhether to make ε moves

CS780(Prasad) L4Lexer 24

Whether to make ε-moves

Or one of the multiple transitions for a single input to take

Accept if there exists a accepting computation.

Reject if all computations are non-accepting.

Page 7: Implementation of Lexical Analysis (Scanning)

7

• Every DFA is an NFA. However, does non-determinism make NFAs strictly more expressive (powerful) than DFAs?

)()( DFAsLNFAsL ⊇

CS780(Prasad) L4Lexer 25

For type 0 languages (Turning Machines) and type 3 languages (regular languages), non-determinism does not add expressive power.For context-free languages and context-sensitive languages, non-determinism does enhance the expressive power.

NFA State Diagram (a | b)* bb (a | b)*

q2q0 q1

a

b b

b ba

CS780(Prasad) L4Lexer 26

q2q0 q1

a

b b

ba

a

DFA

NFA for (a | b)* (aa | bb) (a | b)*

q2q0 q1

a

a a

b ba

CS780(Prasad) L4Lexer 27

q22q11b b

ba

NFA vs. DFA

01

0NFA

• Equivalent machines

CS780(Prasad) L4Lexer 28

0

01

0

1

0

1

DFA

Page 8: Implementation of Lexical Analysis (Scanning)

8

NFA vs. DFA

• NFAs and DFAs recognize the same set of languages (regular languages).

• DFAs are faster to execute

CS780(Prasad) L4Lexer 29

There are no choices to consider.

• For a given language, NFA can be simpler than DFA

• DFA can be exponentially larger than NFA.

Reg. Expr. to Finite Automata

• High-level sketch

Regular

NFA

CS780(Prasad) L4Lexer 30

Regularexpressions DFA

LexicalSpecification

Table-driven Implementation of DFA

Regular Expressions to NFA (1)

• For each kind of regular expr., define an NFANotation: NFA for rexp M

M

CS780(Prasad) L4Lexer 31

• For εε

• For input aa

• For φ

Regular Expressions to NFA (2)

A Bε

F A | B

• For AB

CS780(Prasad) L4Lexer 32

• For A | B

A

B

εε

ε

ε

Page 9: Implementation of Lexical Analysis (Scanning)

9

Regular Expressions to NFA (3)

A

ε

• For A*

CS780(Prasad) L4Lexer 33

A εε

ε

Thompson’s Construction

Reg. Expr. −> NFA conversion

• Consider the regular expression(1|0)*1

• The corresponding NFA is

CS780(Prasad) L4Lexer 34

10 1ε ε

ε

ε

ε

ε ε

ε

ε

A BC

D

E

FG H I J

Regular Expression −> NFA conversion

(1|0)*1(1|0)*1

CS780(Prasad) L4Lexer 35

10 1ε ε

ε

ε

ε

ε ε

ε

ε

A BC

D

E

FG H I J

NFA to DFA: The Trick• Simulate the NFA, in parallel

Michael Rabin and Dana Scott’s work

• Each state of DFA = a non-empty subset of states of the NFA

• Start state

CS780(Prasad) L4Lexer 36

= the set of NFA states reachable through ε-moves from NFA start state

• Add a transition S →a S’ to DFA iffS’ is the set of NFA states reachable from some state in S after seeing the input ‘a’, considering ε-moves as well

Page 10: Implementation of Lexical Analysis (Scanning)

10

NFA to DFA: Remark• An NFA may be in many states at any time.• How many different states ?

If there are N states, the NFA must be in some subset of those N states

• How many subsets are there?

CS780(Prasad) L4Lexer 37

2N - 1 = finitely many

• NFA −> DFA conversion is at the heart of tools such as flex. In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations. (DFA Minimization)

Myhill-Nerode’s work

NFA −> DFA Example

10 1ε ε

ε

ε

ε

ε ε

ε

A BC

D

E

FG H I J

CS780(Prasad) L4Lexer 38

ε

ABCDHI

FGHIABCD

EJGHIABCD

0

1

0

10 1

Implementation

• A DFA can be implemented by a 2D table TOne dimension is “states”Other dimension is “input symbol”For every transition Si →a Sk define T[i,a] = k

CS780(Prasad) L4Lexer 39

y i k [ ]• DFA “execution”

If in state Si and input ‘a’, read T[i,a] = k and skip to state Sk

Very efficient

Table Implementation of a DFA

S

T

U

0

1

0

10 1

CS780(Prasad) L4Lexer 40

0 1S T UT T UU T U

Page 11: Implementation of Lexical Analysis (Scanning)

11

Finite State Machine Minimization

Language over {H,T} : Strings with even number of H

CS780(Prasad) L4Lexer 41

Minimizing DFA: The Idea

• Two states s and t should be merged unless there is some way to tell them apart.

Initially, assume that all states are equivalent until proven otherwise.

CS780(Prasad) L4Lexer 42

proven otherwise.

• How can we tell if two states s and t are different? If one is accepting and the other is not.

If, on input c, s −> x and t −> y, and we already know x is different from y.

Minimizing DFA: The Algorithm• Overall, it partitions the set of states.

• Initial partition :

(Accepting States, Non-accepting states)

• Refinement :

In each pass, find a new partition for each state such

CS780(Prasad) L4Lexer 43

p , pthat

s and t are in the same new partition if and only if states s and t were in the same old partition, and, on each input c, states s and t go to states in the same partition.

Example DFA

q0 q1 q2 q3a

b

b

b

a

a,b a

CS780(Prasad) L4Lexer 44

q4 q7q5

q6

a a,b

aa,b

bb

(a u b)(a u b*)

Page 12: Implementation of Lexical Analysis (Scanning)

12

Refinement of State Partitions

• { {q0,q7} , {q1,q2,q3,q4,q5,q6} }• { {q0},{q7}, {q1,q2,q3,q4,q5,q6} }

On any transition

• { {q0},{q7}, {q1,q2,q3,q4,q5,q6} }

CS780(Prasad) L4Lexer 45

• { {q0},{q7}, {q1,q4}, {q2,q3,q5,q6} }On “a” transition

• { {q0},{q7}, {q1,q4}, {q2,q5},{q3,q6} }On “b” transition

Example DFA showing equivalent states

q0 q1 q2 q3a

b

b

b

a

a,b a

CS780(Prasad) L4Lexer 46

q4 q7q5

q6

a a,b

aa,b

bb

(a u b)(a u b*)

Example Minimum DFA

q0

a,b

CS780(Prasad) L4Lexer 47

q1,q4 q7q2,q5

q3,q6

a a,b

aa,b

b

,

b(a u b)(a u b*)

Summary

• Lexer creates tokens out of a text stream.• Tokens are defined using regular expressions.• Regular expressions can be mapped to

Nondeterministic Finite Automatons (NFA) • NFA is transformed to a DFA

By removing non determinism and

CS780(Prasad) L4Lexer 48

By removing non-determinism, andBy minimizing states.

• Executing a DFA is straightforward. • Common scanner generator tools

lex, flex in Cjlex in Java

Page 13: Implementation of Lexical Analysis (Scanning)

13

What’s Next?

Lexical Analyzer (Scanner)

Syntax Analyzer (Parser)

Token Stream

Parse Tree

Program (character stream)

49

Intermediate Code Optimizer

Code Generator

Optimized Intermediate Representation

Assembly code

Intermediate Code GeneratorIntermediate Representation

Animating Lexical Analysis

CS780(Prasad) L4Lexer 50

Lexical Analyzer in Action

f o r v a r 1 = 1 0 v a r 1 < =

CS780(Prasad) L4Lexer 51

Lexical Analyzer in Action

for ID(“var1”) eq_op Num(10) ID(“var1”) leq_op

f o r v a r 1 = 1 0 v a r 1 < =

CS780(Prasad) L4Lexer 52

Page 14: Implementation of Lexical Analysis (Scanning)

14

Lexical Analyzer in Action

f

for

o r v a r 1 = 1 0 v a r 1 < =

ID(“var1”) eq_op Num(10) ID(“var1”) leq_op

CS780(Prasad) L4Lexer 53

• Partition input program text into subsequence of characters corresponding to tokens

• Attach the corresponding attributes to the tokens• Eliminate white space and comments

Animating NFA construction

CS780(Prasad) L4Lexer 54

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (.·(0|1|2|3|4|5|6|7|8|9)*)?

24

CS780(Prasad) L4Lexer 55

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

CS780(Prasad) L4Lexer 56

ε

-

Page 15: Implementation of Lexical Analysis (Scanning)

15

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

CS780(Prasad) L4Lexer 57

ε

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

CS780(Prasad) L4Lexer 58

ε

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

012

CS780(Prasad) L4Lexer 59

ε 23456789

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

012

CS780(Prasad) L4Lexer 60

ε 23456789

-ε ε

Page 16: Implementation of Lexical Analysis (Scanning)

16

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

012

CS780(Prasad) L4Lexer 61

ε 23456789

-ε ε

ε

ε

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

012

CS780(Prasad) L4Lexer 62

ε 23456789

-ε . ε

ε

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε

012

ε

CS780(Prasad) L4Lexer 63

ε 23456789

-ε . εε

εε

ε

012

ε

012

CS780(Prasad) L4Lexer 64

ε 23456789

23456789

-ε . εε

εε

Page 17: Implementation of Lexical Analysis (Scanning)

17

ε

012

ε

012

CS780(Prasad) L4Lexer 65

ε 23456789

23456789

-ε . εε

εε

Animating Token Recognition

CS780(Prasad) L4Lexer 66

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 67

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 68

ε

0123456789

0123456789

-ε . εε

εε

Page 18: Implementation of Lexical Analysis (Scanning)

18

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 69

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 70

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 71

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 72

ε

0123456789

0123456789

-ε . εε

εε

Page 19: Implementation of Lexical Analysis (Scanning)

19

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 73

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 74

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 75

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 76

ε

0123456789

0123456789

-ε . εε

εε

Page 20: Implementation of Lexical Analysis (Scanning)

20

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 77

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 78

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 79

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 80

ε

0123456789

0123456789

-ε . εε

εε

Page 21: Implementation of Lexical Analysis (Scanning)

21

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 81

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 82

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 83

ε

0123456789

0123456789

-ε . εε

εε

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 84

ε

0123456789

0123456789

-ε . εε

εε

Page 22: Implementation of Lexical Analysis (Scanning)

22

String Matching

ε

0

ε

0

- 1 2 . 8

CS780(Prasad) L4Lexer 85

ε

0123456789

0123456789

-ε . εε

εε

Animating Closure Construction

CS780(Prasad) L4Lexer 86

NFA to DFA conversion

1. Closure

• The closure of a state is the set of states that can be reached from that state without consuming any of the input

Closure(S) is the smallest set T such that

⎟⎟⎠

⎞⎜⎜⎝

⎛= UU sedgeST ),( ε

87'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

• Algorithm

⎟⎠

⎜⎝ ∈U

Ts

g )(

(-| ε) ·(0|1|2|3|4|5|6|7|8|9)+ · (. ·(0|1|2|3|4|5|6|7|8|9)*)?

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {}T’ = {}

L4Lexer 88

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

Page 23: Implementation of Lexical Analysis (Scanning)

23

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1}T’ = {}

89

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1}T’ = {1}

90

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1}

91

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1}

92

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

Page 24: Implementation of Lexical Analysis (Scanning)

24

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1, 2}

93

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1, 2}

94

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1, 2}

95

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

ε ε . ε

ε0

123

4

ε0

123

4

S = {1}T = {1, 2}T’ = {1, 2}

96

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

'

),('

'

'

TT

sedgeTT

TT

ST

Ts

=

⎟⎟⎠

⎞⎜⎜⎝

⎛←

until

repeat

UU ε

Page 25: Implementation of Lexical Analysis (Scanning)

25

Question: What is closure(3)?

ε ε . ε

ε0

123

4

ε0

123

4

S = {3}T = ??

CS780(Prasad) L4Lexer 97

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

Question: What is closure(3)?

ε ε . ε

ε0

123

4

ε0

123

4

S = {3}T = {2, 3, 4, 8}

CS780(Prasad) L4Lexer 98

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

Animating DFAEdge Construction

CS780(Prasad) L4Lexer 99

NFA to DFA conversion

2. DFAedge

• Given a symbol and a state,what states can you reach?

CS780(Prasad) L4Lexer 100

Page 26: Implementation of Lexical Analysis (Scanning)

26

What is DFAedge({1}, 3)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {1}

CS780(Prasad) L4Lexer 101

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({1}, 3) = ??

What is DFAedge({1}, 3)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {1}

edge({1}, 3) = {}

CS780(Prasad) L4Lexer 102

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({1}, 3) = {}

What is DFAedge({1}, 3)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {1}

closure({1}) = {1,2}

CS780(Prasad) L4Lexer 103

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({1}, 3) = {}

What is DFAedge({1}, 3)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {1, 2}

edge({2}, 3) = {3}

CS780(Prasad) L4Lexer 104

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({1}, 3) = {3}

Page 27: Implementation of Lexical Analysis (Scanning)

27

What is DFAedge({1}, 3)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {1, 2}

closure({3}) ={2,3,4,8}

CS780(Prasad) L4Lexer 105

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({1}, 3) = {2,3,4,8}

Question: What is DFAedge({3}, .)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {3}

CS780(Prasad) L4Lexer 106

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({3}, .) = ??

Question: What is DFAedge({3}, .)?

ε ε . ε

ε0

123

4

ε0

123

4

d = {3}

CS780(Prasad) L4Lexer 107

-ε . εε

ε

1 4 5 84567

8

9

32 4567

8

9

76

ε

DFAedge({3}, .) = {5,6,8}

Animating DFA Construction

CS780(Prasad) L4Lexer 108

NFA to DFA conversion

Page 28: Implementation of Lexical Analysis (Scanning)

28

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 109

closure({1}) ={1, 2}ε

ε

1,2

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 110

DFAedge({1,2},-) ={2}ε

ε

1,2

2

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 111

Closure({2}) ={2}ε

ε

1,2

2

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 112

3

DFAedge({1,2},0..9) ={3}ε

ε

1,2

2

-

Page 29: Implementation of Lexical Analysis (Scanning)

29

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 113

2,3,4,8

Closure({3}) ={2,3,4,8}ε

ε

1,2

2

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 114

2,3,4,8

DFAedge({2},0..9) ={3}Closure({3}) = {2,3,4,8} ε

ε

1,2

2

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 115

2,3,4,8

DFAedge({2,3,4,8},0..9) ={3}closure({3}) = {2,3,4,8} ε

ε

1,2

2

,4,5,6

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 116

5,6,82,3,4,8

DFAedge({2,3,4,8}, .) ={5}closure({5}) = {5,6,8} ε

ε

1,2

2

,4,5,6

.

-

Page 30: Implementation of Lexical Analysis (Scanning)

30

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 117

0,1,2,3,4,5,6,7,8,95,6,8 6,7,82,3,4,8

DFAedge({5,6,8},0..9) ={7}closure({7}) = {6,7,8} ε

ε

1,2

2

,4,5,6

.

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

CS780(Prasad) L4Lexer 118

0,1,2,3,4,5,6,7,8,95,6,8 6,7,82,3,4,8

DFAedge({6,7,8}, 0..9)={7}closure({7}) ={6,7,8} ε

ε

1,2

2

,4,5,6 ,4,5,6

.

-

ε

-ε . εε1 4 5 8

ε0

123

4567

8

9

32

ε0

123

4567

8

9

76

ε

NFA

CS780(Prasad) L4Lexer 119

0,1,2,3,4,5,6,7,8,95,6,8 6,7,82,3,4,8

ε

ε

1,2

2

,4,5,6 ,4,5,6

.

-

DFA

What is the minimal DFA for the following?

CS780(Prasad) L4Lexer 120