Lexical Analysis Cheng-Chia Chen

Post on 19-Dec-2015




Outline

Introduction to lexical analyzer
Tokens
Regular expressions (RE)
Finite automata (FA)
» deterministic and nondeterministic finite automata (DFA and NFA)
» from RE to NFA
» from NFA to DFA
» from DFA to optimized DFA


Outline

Informal sketch of lexical analysis
» Identifies tokens in input string
Issues in lexical analysis
» Lookahead
» Ambiguities
Specifying lexers
» Regular expressions
» Examples of regular expressions


The Structure of a Compiler

[Figure: compiler pipeline. Source → Lexical analysis → Tokens → Parsing → Interm. Language → Code Gen. → Machine Code, with Optimization applied to the intermediate language]


Lexical Analysis

What do we want to do? Example:

    if (i == j)
        z = 0;
    else
        z = 1;

The input is just a sequence of characters:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Goal: Partition input string into substrings
» And determine the categories (tokens) to which the substrings belong


What’s a Token?

Output of lexical analysis is a stream of tokens
A token is a syntactic category
» In English: noun, verb, adjective, …
» In a programming language: Identifier, Integer, Keyword, Whitespace, …
Parser relies on the token distinctions: identifiers are treated differently than keywords


Aspects of Token

Language view: a set of strings (belonging to that token)
» [while], [identifier], [arithOp]
Pattern (grammar): a rule defining a token
» [while]: while
» [identifier]: letter followed by letters and digits
» [arithOp]: + or - or * or /
Lexeme (member): a string matched by the pattern of a token
» while, var23, count, +, *


Attributes of Tokens

Attributes are used to distinguish different lexemes in a token
» [while, ]
» [identifier, var35]
» [arithOp, +]
» [integer, 26]
» [string, “26”]
Positional information: start/end line/position
Tokens affect syntax analysis and attributes affect semantic analysis and error handling


Lexical Analyzer: Implementation

An implementation must do two things:

1. Recognize substrings corresponding to tokens

2. Return the attributes (lexeme and positional information) of the token
– The lexeme is either the substring or some data object constructed from the substring.


Example input lines:

\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

Token-lexeme pairs returned by the lexer:
» [Whitespace, “\t”]
» [if, ]
» [OpenPar, “(”]
» [Identifier, “i”]
» [Relation, “==”]
» [Identifier, “j”]
» …


Lexical Analyzer: Implementation

The lexer usually discards “uninteresting” tokens that don’t contribute to parsing.

Examples: Whitespace, Comments

Question: What happens if we remove all whitespace and all comments prior to lexing?


Lexical Analysis in FORTRAN

FORTRAN rule: Whitespace is insignificant

E.g., VAR1 is the same as VA R1

Footnote: FORTRAN whitespace rule motivated by inaccuracy of punch card operators


A terrible design!

Example: Consider
» DO 5 I = 1,25
» DO 5 I = 1.25
The first is DO 5 I = 1 , 25
The second is DO5I = 1.25
Reading left-to-right, cannot tell if DO5I is a variable or DO stmt. until after “,” is reached


Lexical Analysis in FORTRAN. Lookahead.

Two important points:

1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time

2. “Lookahead” may be required to decide where one token ends and the next token begins

» Even our simple example has lookahead issues
– i vs. if
– = vs. ==


Lexical Analysis in PL/I

PL/I keywords are not reserved

IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN

PL/I Declarations:

DECLARE (ARG1, . . ., ARGN)

Can’t tell whether DECLARE is a keyword or array reference until after the ).
» Requires arbitrary lookahead!


Review

The goal of lexical analysis is to
» Partition the input string into lexemes
» Identify the token of each lexeme

Left-to-right scan => lookahead sometimes required


Next

We still need
» A way to describe the lexemes of each token
» A way to resolve ambiguities
– Is if two variables i and f?
– Is == two equal signs = =?


Regular Expressions and Regular Languages

There are several formalisms for specifying tokens

Regular languages are the most popular
» Simple and useful theory
» Easy to understand
» Efficient implementations


Languages

Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ

(Σ is called the alphabet)


Examples of Languages

Alphabet = English characters
Language = English words
Not every string of English characters is an English word
» likes, school, …
» beee, yykk, …

Alphabet = ASCII
Language = C programs

Note: ASCII character set is different from English character set


Notation

Languages are sets of strings.

Need some notation for specifying which sets we want

For lexical analysis we care about regular languages, which can be described using regular expressions.


Regular Expressions and Regular Languages

Each regular expression is a notation for a regular language (a set of words)

If A is a regular expression then we write L(A) to refer to the language denoted by A


Atomic Regular Expressions

Single character: c
L(c) = { c } (for any c ∈ Σ)
Epsilon (empty string): ε
L(ε) = { ε }


Compound Regular Expressions

Union (or choice): A | B
L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }
Concatenation: A·B (where A and B are reg. exp.)
L(A·B) = { α·β | α ∈ L(A) and β ∈ L(B) }
Note:
» A·B (set concatenation) and α·β (string concatenation) will be abbreviated to AB and αβ, respectively.
Examples:
» if | then | else = { if, then, else }
» 0 | 1 | … | 9 = { 0, 1, …, 9 }
» Another example: (0 | 1) (0 | 1) = { 00, 01, 10, 11 }
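The two operations can be tried out on finite languages. The following is only a sketch: it models a language as a Python set of strings, and the helper names `union` and `concat` are made up for the illustration rather than taken from the slides.

```python
# Finite languages modeled as Python sets of strings.
# The helper names are illustrative assumptions, not notation from the slides.

def union(a, b):
    """L(A | B): every string that is in L(A) or in L(B)."""
    return a | b

def concat(a, b):
    """L(AB): a string of L(A) followed by a string of L(B)."""
    return {x + y for x in a for y in b}

keywords = union(union({"if"}, {"then"}), {"else"})
bits = union({"0"}, {"1"})
two_bits = concat(bits, bits)

print(sorted(keywords))  # ['else', 'if', 'then']
print(sorted(two_bits))  # ['00', '01', '10', '11']
```

Concatenating with the empty language gives the empty language, since there is no second half to pick.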


More Compound Regular Expressions

So far we do not have a notation for infinite languages

Iteration: A*

L(A*) = { ε } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ …

Examples:

0* = { ε, 0, 00, 000, … }

1 0* = { strings starting with 1 and followed by 0’s }
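Because L(A*) is usually infinite, a program can only list a bounded part of it. The sketch below enumerates the strings of A* up to a chosen length; the function name `star_up_to` and the length bound are assumptions made for this illustration.

```python
def concat(a, b):
    """L(AB) for finite languages given as sets of strings."""
    return {x + y for x in a for y in b}

def star_up_to(a, max_len):
    """Strings of L(A*) whose length is at most max_len."""
    result = {""}      # the empty string is always in L(A*)
    frontier = {""}
    while frontier:
        # Append one more copy of A; keep only new strings that are short enough.
        frontier = {s for s in concat(frontier, a) if len(s) <= max_len} - result
        result |= frontier
    return result

zeros = star_up_to({"0"}, 3)
one_then_zeros = concat({"1"}, star_up_to({"0"}, 2))

print(sorted(zeros, key=len))           # ['', '0', '00', '000']
print(sorted(one_then_zeros, key=len))  # ['1', '10', '100']
```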


Example: Keyword

» Keyword: else or if or begin …

else | if | begin | …


Example: Integers

Integer: a non-empty string of digits

( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*

Problem: reuse complicated expression
Improvement: define intermediate reg. expr.

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

number = digit digit*

Abbreviation: A+ = A A*


Regular Definitions

Names for regular expressions
» d1 = r1
» d2 = r2
» ...
» dn = rn
where each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}

Note: Recursion is not allowed.


Example

» Identifier: strings of letters or digits, starting with a letter

digit = 0 | 1 | ... | 9

letter = A | … | Z | a | … | z

identifier = letter (letter | digit) *

» Is (letter* | digit*) the same ?
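One way to explore the question is to try both expressions on a few strings. The sketch below uses Python's `re` module and assumes ASCII letters and digits; the two helper function names are invented for the example.

```python
import re

# Assumed ASCII-only definitions of letter and digit, as on the slide.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")
# The alternative from the question: only letters, or only digits.
ALTERNATIVE = re.compile(r"[A-Za-z]*|[0-9]*")

def is_identifier(s):
    return IDENTIFIER.fullmatch(s) is not None

def matches_alternative(s):
    return ALTERNATIVE.fullmatch(s) is not None

for s in ["var23", "count", "23", ""]:
    print(repr(s), is_identifier(s), matches_alternative(s))
```

On these inputs the two expressions disagree: `var23` is an identifier but mixes letters and digits, while `23` and the empty string match the alternative without being identifiers.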


Example: Whitespace

Whitespace: a non-empty sequence of blanks, newlines, CR-NL pairs (\r\n) and tabs

WS = (‘ ’ | \t | \n | \r\n)+


Example: Email Addresses

Consider [email protected]

Σ = letters ∪ { ., @ }

name = letter+

address = name ‘@’ name (‘.’ name)*


Notational Shorthands

One or more instances
» r+ = r r*
» r* = r+ | ε
Zero or one instance
» r? = r | ε
Character classes
» [abc] = a | b | c
» [a-z] = a | b | ... | z
» [ac-f] = a | c | d | e | f
» [^ac-f] = Σ – [ac-f]


Summary

Regular expressions describe many useful languages
Regular languages are a language specification
» We still need an implementation
Problem: Given a string s and a rexp R, is s ∈ L(R)?


Implementation of Lexical Analysis


Outline

Specifying lexical structure using regular expressions
Finite automata
» Deterministic Finite Automata (DFAs)
» Non-deterministic Finite Automata (NFAs)
Implementation of regular expressions
RegExp => NFA => DFA => optimized DFA => Tables


Regular Expressions in Lexical Specification

Last lecture: a specification for the predicate s ∈ L(R)
But a yes/no answer is not enough!
Instead: partition the input into lexemes

We will adapt regular expressions to this goal


Regular Expressions => Lexical Spec. (1)

1. Select a set of tokens
• Number, Keyword, Identifier, ...
2. Write a rexp for the lexemes of each token
• Number = digit+
• Keyword = if | else | …
• Identifier = letter (letter | digit)*
• OpenPar = ‘(’
• …


Regular Expressions => Lexical Spec. (2)

3. Construct R, matching all lexemes for all tokens

R = Keyword | Identifier | Number | …

= R1 | R2 | R3 | …

Facts: If s ∈ L(R) then s is a lexeme

» Furthermore s ∈ L(Ri) for some “i”

» This “i” determines the token that is reported


Regular Expressions => Lexical Spec. (3)

4. Let the input be x1…xn

(x1 ... xn are characters in the language alphabet)

• For 1 ≤ i ≤ n check

x1…xi ∈ L(R) ?

5. It must be that

x1…xi ∈ L(Rj) for some j

6. Remove x1…xi from input and go to (4)


How to Handle Spaces and Comments?

1. We could create a token Whitespace
Whitespace = (‘ ’ | ‘\n’ | ‘\r’ | ‘\t’)+
» We could also add comments in there
» An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace
2. Lexer skips spaces (preferred)
• Modify step 5 from before as follows:
It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace)
• Parser is not bothered with spaces


Ambiguities (1)

There are ambiguities in the algorithm
How much input is used? What if
– x1…xi ∈ L(R) and also
– x1…xK ∈ L(R) (with K ≠ i)

» Rule: Pick the longest possible substring

» The maximal match


Ambiguities (2)

Which token is used? What if
– x1…xi ∈ L(Rj) and also
– x1…xi ∈ L(Rk)
» Rule: use rule listed first (j if j < k)

Example:

» R1 = Keyword and R2 = Identifier

» “if” matches both.

» Treats “if” as a keyword not an identifier


Error Handling

What if no rule matches a prefix of input?
Problem: Can’t just get stuck …
Solution:
» Write a rule matching all “bad” strings
» Put it last

Lexer tools allow the writing of:

R = R1 | ... | Rn | Error

» Token Error matches if nothing else matches
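The longest-match rule, the first-listed rule and the Error fallback can be combined in a small lexer. The following is a sketch under stated assumptions, not the algorithm of any particular tool: the token names and patterns in `RULES` are illustrative, and each pattern is written so that Python's `re` returns its longest match (this does not hold for arbitrary patterns, because Python alternation picks the leftmost alternative that succeeds).

```python
import re

# Rules in priority order; names and patterns are illustrative assumptions.
RULES = [
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Relation",   re.compile(r"==")),
    ("Assign",     re.compile(r"=")),
    ("OpenPar",    re.compile(r"\(")),
    ("ClosePar",   re.compile(r"\)")),
    ("Semicolon",  re.compile(r";")),
    ("Whitespace", re.compile(r"[ \t\n\r]+")),
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = "Error", 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_len == 0:
            best_len = 1  # nothing matched: report one bad character as Error
        lexeme = text[pos:pos + best_len]
        if best_name != "Whitespace":  # the lexer skips whitespace
            tokens.append((best_name, lexeme))
        pos += best_len
    return tokens

print(tokenize("if (i == j)"))
print(tokenize("iffy"))  # [('Identifier', 'iffy')]
```

Here `if` is reported as a Keyword because Keyword is listed before Identifier, while `iffy` is an Identifier because the longer match wins.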


Summary

Regular expressions provide a concise notation for string patterns
Use in lexical analysis requires small extensions
» To resolve ambiguities
» To handle errors
Good algorithms known (next)
» Require only single pass over the input
» Few operations per character (table lookup)


Finite Automata

Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
» An input alphabet Σ
» A set of states S
» A start state n
» A set of accepting states F ⊆ S
» A set of transitions state →input state


Finite Automata

Transition
s1 →a s2
Is read
In state s1 on input “a” go to state s2
If end of input (or no transition possible)
» If in accepting state => accept
» Otherwise => reject


Finite Automata State Graphs

[Figure: legend of graph symbols]
• A state
• The start state
• An accepting state
• A transition labeled a


A Simple Example

A finite automaton that accepts only “1”

[Figure: state graph with one transition, labeled 1]


Another Simple Example

A finite automaton accepting any number of 1’s followed by a single 0
Alphabet: {0,1}

[Figure: state graph with transitions labeled 1 and 0]


And Another Example

Alphabet {0,1}
What language does this recognize?

[Figure: state graph with transitions labeled 0 and 1]


And Another Example

Alphabet still { 0, 1 }
The operation of the automaton is not completely defined by the input
» On input “11” the automaton could be in either state

[Figure: state graph with two transitions labeled 1]


Epsilon Moves

Another kind of transition: ε-moves
• Machine can move from state A to state B without reading input

A →ε B


Deterministic and Nondeterministic Automata

Deterministic Finite Automata (DFA)
» One transition per input per state
» No ε-moves
Nondeterministic Finite Automata (NFA)
» Can have multiple transitions for one input in a given state
» Can have ε-moves
Finite automata have finite memory
» Need only to encode the current state


Execution of Finite Automata

A DFA can take only one path through the state graph
» Completely determined by input
NFAs can choose
» Whether to make ε-moves
» Which of multiple transitions for a single input to take


Acceptance of NFAs

An NFA can get into multiple states

[Figure: NFA state graph with transitions labeled 0 and 1]

• Input: 1 0 1
• Rule: NFA accepts if it can get into a final state


Acceptance by a Finite Automaton

An FA (DFA or NFA) accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s


NFA vs. DFA (1)

NFAs and DFAs recognize the same set of languages (regular languages)
DFAs are easier to implement
» There are no choices to consider


NFA vs. DFA (2)

For a given language the NFA can be simpler than the DFA

[Figure: an NFA and a DFA for the same language, with transitions labeled 0 and 1]

• DFA can be exponentially larger than NFA


Operations on NFA states

ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(S): set of NFA states reachable from some NFA state s in S on ε-transitions alone
move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S
notes:
» ε-closure(S) = ∪s∈S ε-closure(s);
» ε-closure(s) = ε-closure({s});
» ε-closure(S) = ?


Computing ε-closure

Input. An NFA and a set of NFA states S.
Output. T = ε-closure(S).
begin
    push all states in S onto stack;
    T := S;
    while stack is not empty do begin
        pop t, the top element, off of stack;
        for each state u with an edge from t to u labeled ε do
            if u is not in T then begin
                add u to T;
                push u onto stack
            end
    end;
    return T
end.
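The stack-based procedure above can be sketched in Python as follows. The representation is an assumption made for this example: `eps` is a dictionary that maps a state to the set of states reachable by a single ε-move, and the sample edges are hypothetical.

```python
def epsilon_closure(states, eps):
    """Set of NFA states reachable from `states` using only epsilon-moves.

    `eps` maps a state to the states reachable by one epsilon-move
    (a representation assumed for this sketch).
    """
    closure = set(states)
    stack = list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

# Hypothetical epsilon-edges, for illustration only.
eps = {"A": {"B", "C"}, "B": {"D"}, "D": {"A"}}
print(sorted(epsilon_closure({"A"}, eps)))  # ['A', 'B', 'C', 'D']
```

The membership test before pushing is what keeps the loop from running forever on ε-cycles such as A, B, D, A in the sample data.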


Simulating an NFA

Input. An input string ended with eof and an NFA with start state s0 and final states F.
Output. The answer “yes” if the NFA accepts the input, “no” otherwise.
begin
    S := ε-closure({s0});
    c := next_symbol();
    while c != eof do begin
        S := ε-closure(move(S, c));
        c := next_symbol()
    end;
    if S ∩ F ≠ ∅ then return “yes”
    else return “no”
end.
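A direct Python transcription of this simulation might look like the sketch below. The dictionary encodings `delta` and `eps`, and the small NFA used to try it out (strings over {0,1} that end in 1), are assumptions for the example and do not come from the slides.

```python
def epsilon_closure(states, eps):
    """States reachable from `states` through epsilon-moves only."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, symbol, delta):
    """States reachable from `states` by one transition on `symbol`."""
    return {u for s in states for u in delta.get((s, symbol), ())}

def nfa_accepts(text, start, finals, delta, eps):
    current = epsilon_closure({start}, eps)
    for c in text:
        current = epsilon_closure(move(current, c, delta), eps)
    return bool(current & finals)

# Hypothetical NFA: accepts strings over {0,1} that end in 1.
EPS = {"s": {"q0"}}
DELTA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}}

print(nfa_accepts("0011", "s", {"q1"}, DELTA, EPS))  # True
print(nfa_accepts("10", "s", {"q1"}, DELTA, EPS))    # False
```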


Regular Expressions to Finite Automata

High-level sketch

Lexical Specification → Regular expressions → NFA → DFA → Optimized DFA → Table-driven Implementation of DFA


Regular Expressions to NFA (1)

For each kind of rexp, define an NFA
» Notation: NFA for rexp A
[Figure: box labeled A]
• For ε
[Figure: NFA for ε]
• For input a
[Figure: NFA with a transition labeled a]


Regular Expressions to NFA (2)

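For AB
[Figure: the NFA for A followed by the NFA for B]
• For A + B (the union, written A | B earlier)
[Figure: the NFAs for A and B side by side]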


Regular Expressions to NFA (3)

For A*
[Figure: NFA built around the NFA for A]


Example of RegExp -> NFA conversion

Consider the regular expression (1+0)*1
The NFA is

[Figure: NFA with states A to J; C →1 E, D →0 F, I →1 J; the other edges are ε-moves]


NFA to DFA




RegExp -> NFA: an Example

Consider the regular expression (1+0)*1
The NFA is

[Figure: NFA with states A to J; C →1 E, D →0 F, I →1 J; the other edges are ε-moves]


NFA to DFA. The Trick

Simulate the NFA
Each state of DFA = a non-empty subset of states of the NFA
Start state = the set of NFA states reachable through ε-moves from NFA start state
Add a transition S →a S’ to DFA iff
» S’ is the set of NFA states reachable from any state in S after seeing the input a
– considering ε-moves as well


NFA -> DFA Example

[Figure: the NFA for (1+0)*1 with states A to J, and the DFA built from it with states ABCDHI, FGABCDHI and EJGABCDHI and transitions on 0 and 1]


NFA to DFA. Remark

An NFA may be in many states at any time
How many different states?
If there are N states, the NFA must be in some subset of those N states
How many non-empty subsets are there?
» 2^N - 1 = finitely many


From an NFA to a DFA

Subset construction Algorithm.
Input. An NFA N.
Output. A DFA D with states S and transition table mv.
begin
    add ε-closure(s0) as an unmarked state to S;
    while there is an unmarked state T in S do begin
        mark T;
        for each input symbol a do begin
            U := ε-closure(move(T, a));
            if U is not in S then
                add U as an unmarked state to S;
            mv[T, a] := U
        end
    end
end.
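The subset construction can be sketched in Python and tried on the NFA for (1+0)*1 from the earlier example. Two things here are assumptions: the ε-edges in `EPS` are reconstructed from the DFA state names shown in the example, since the diagram itself is not available in the text, and the sketch skips empty target sets so that every DFA state is a non-empty subset.

```python
def epsilon_closure(states, eps):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, symbol, delta):
    return {u for s in states for u in delta.get((s, symbol), ())}

def subset_construction(start, alphabet, delta, eps):
    start_state = frozenset(epsilon_closure({start}, eps))
    dfa_states = {start_state}
    unmarked = [start_state]
    mv = {}
    while unmarked:
        t = unmarked.pop()  # taking T off the worklist marks it
        for a in alphabet:
            u = frozenset(epsilon_closure(move(t, a, delta), eps))
            if not u:
                continue  # assumption: leave mv[t, a] undefined for the empty set
            if u not in dfa_states:
                dfa_states.add(u)
                unmarked.append(u)
            mv[t, a] = u
    return start_state, dfa_states, mv

# NFA for (1+0)*1; the epsilon-edges are a reconstruction (assumption).
EPS = {"A": {"B", "H"}, "B": {"C", "D"}, "E": {"G"}, "F": {"G"},
       "G": {"A", "H"}, "H": {"I"}}
DELTA = {("C", "1"): {"E"}, ("D", "0"): {"F"}, ("I", "1"): {"J"}}

start, states, mv = subset_construction("A", "01", DELTA, EPS)
print(sorted("".join(sorted(s)) for s in states))
# ['ABCDEGHIJ', 'ABCDFGHI', 'ABCDHI']
```

Up to the order of the letters, the three sets are the DFA states ABCDHI, FGABCDHI and EJGABCDHI of the example.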


Implementation

A DFA can be implemented by a 2D table mv
» One dimension is “states”
» Other dimension is “input symbols”
» For every transition Si →a Sk define mv[i,a] = k
DFA “execution”
» If in state Si and input a, read mv[i,a] = k and skip to state Sk
» Very efficient


Table Implementation of a DFA

[Figure: DFA with states S, T, U and transitions on 0 and 1]

      0   1
S     T   U
T     T   U
U     T   U


Simulation of a DFA

Input. An input string ended with eof and a DFA with start state s0 and final states F.
Output. The answer “yes” if the DFA accepts the input, “no” otherwise.
begin
    s := s0;
    c := next_symbol();
    while c <> eof do begin
        s := mv[s, c];
        c := next_symbol()
    end;
    if s is in F then return “yes”
    else return “no”
end.
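As a sketch, the simulation loop can be run on the S/T/U transition table shown earlier. The table entries are taken from that slide; treating S as the start state and U as the only accepting state is an assumption, since the text of the slide does not say which states are final.

```python
# Transition table from the "Table Implementation of a DFA" slide.
MV = {
    ("S", "0"): "T", ("S", "1"): "U",
    ("T", "0"): "T", ("T", "1"): "U",
    ("U", "0"): "T", ("U", "1"): "U",
}
START = "S"
FINALS = {"U"}  # assumption: U is the accepting state

def dfa_accepts(text):
    s = START
    for c in text:
        s = MV[s, c]  # one table lookup per input character
    return s in FINALS

print(dfa_accepts("0101"))  # True
print(dfa_accepts("10"))    # False
```

Under that assumption the table accepts exactly the strings that end in 1. A character outside {0, 1} raises `KeyError` in this sketch; a real lexer would route it to an error rule instead.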


Implementation (Cont.)

NFA -> DFA conversion is at the heart of tools such as flex
But, DFAs can be huge
» DFA => optimized DFA: try to decrease the number of states.
» not always helpful!
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations


Time-Space Tradeoffs

RE to NFA, simulate NFA
» time: O(|r| * |x|), space: O(|r|)
RE to NFA, NFA to DFA, simulate DFA
» time: O(|x|), space: O(2^|r|)
Lazy transition evaluation
» transitions are computed as needed at run time;
» computed transitions are stored in cache for later use
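Lazy transition evaluation can be sketched as an NFA simulation that remembers each (state set, symbol) step in a dictionary. The class name `LazyDFA`, the dictionary encodings and the small NFA (strings over {0,1} ending in 1) are all assumptions made for this illustration.

```python
def epsilon_closure(states, eps):
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

class LazyDFA:
    """Builds DFA transitions only when the input actually needs them."""

    def __init__(self, start, finals, delta, eps):
        self.finals, self.delta, self.eps = finals, delta, eps
        self.start = frozenset(epsilon_closure({start}, eps))
        self.cache = {}  # (set of NFA states, symbol) -> set of NFA states

    def step(self, state, symbol):
        key = (state, symbol)
        if key not in self.cache:  # compute the transition on first use
            targets = {u for s in state for u in self.delta.get((s, symbol), ())}
            self.cache[key] = frozenset(epsilon_closure(targets, self.eps))
        return self.cache[key]

    def accepts(self, text):
        state = self.start
        for c in text:
            state = self.step(state, c)
        return bool(state & self.finals)

# Hypothetical NFA: accepts strings over {0,1} that end in 1.
EPS = {"s": {"q0"}}
DELTA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}}

lazy = LazyDFA("s", {"q1"}, DELTA, EPS)
print(lazy.accepts("0101"), len(lazy.cache))  # True 3
```

Only three transitions are ever computed for this input; the fourth character reuses a cached entry.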


DFA to optimized DFA


Motivations

Problems:
1. Given a DFA M with k states, is it possible to find an equivalent DFA M’ (i.e., L(M) = L(M’)) with fewer than k states?
2. Given a regular language A, how to find a machine with the minimum number of states?

Ex: A = L((a+b)*aba(a+b)*) can be accepted by the following NFA:

[Figure: NFA with states s, t, u, v; s →a t →b u →a v, with self-loops labeled a,b on s and on v]

By applying the subset construction, we can construct a DFA M2 with 2^4 = 16 states, of which only 6 are accessible from the initial state {s}.


Inaccessible states

A state p ∈ Q is said to be inaccessible (or unreachable) [from the initial state] if there exists no path from the initial state to it. If a state is not inaccessible, it is accessible.

Inaccessible states can be removed from the DFA without affecting the behavior of the machine.

Problem: Given a DFA (or NFA), how to find all inaccessible states?


Finding all accessible states (like ε-closure)

Input. An FA (DFA or NFA)
Output. The set of all accessible states
begin
    push all start states onto stack;
    add all start states into A;
    while stack is not empty do begin
        pop t, the top element, off of stack;
        for each state u with an edge from t to u do
            if u is not in A then begin
                add u to A;
                push u onto stack
            end
    end;
    return A
end.
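The same worklist idea can be sketched in Python. The encoding is an assumption: `edges` maps each state to its successor states with the edge labels ignored, and the sample automaton, including its unreachable states 4 and 5, is hypothetical.

```python
def accessible_states(starts, edges):
    """States reachable from any start state, following edges of any label."""
    seen = set(starts)
    stack = list(starts)
    while stack:
        t = stack.pop()
        for u in edges.get(t, ()):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

# Hypothetical automaton: states 4 and 5 cannot be reached from state 0.
EDGES = {0: {1, 2}, 1: {3}, 2: {3}, 3: {3}, 4: {0, 5}, 5: {4}}
ALL_STATES = set(EDGES)

reachable = accessible_states({0}, EDGES)
print(sorted(reachable))               # [0, 1, 2, 3]
print(sorted(ALL_STATES - reachable))  # [4, 5]
```

The inaccessible states are then simply the complement of the returned set.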


Minimization process

Minimization process for a DFA:
» 1. Remove all inaccessible states
» 2. Collapse all equivalent states
What does it mean that two states are equivalent?
» both states have the same observable behaviors, i.e.,
» there is no way to distinguish their difference, or
» more formally, we say p and q are not equivalent (or distinguishable) iff there is a string x ∈ Σ* s.t. exactly one of δ(p,x) and δ(q,x) is a final state,
» where δ(p,x) is the ending state of the path from p with x as the input.
Equivalent states can be merged to form a simpler machine.


Example:

[Figure: a DFA with states 0 to 5 over {a, b}, and the smaller machine with states 0, {1,2}, {3,4} and 5 obtained by merging equivalent states]


Quotient Construction

M = (Q, Σ, δ, s, F): a DFA.
≈: a relation on Q defined by:
p ≈ q <=> for all x ∈ Σ*, δ(p,x) ∈ F iff δ(q,x) ∈ F
Property: ≈ is an equivalence relation.
Hence it partitions Q into equivalence classes [p] = {q ∈ Q | p ≈ q} for p ∈ Q, and the quotient set Q/≈ = {[p] | p ∈ Q}.
Every p ∈ Q belongs to exactly one class [p], and p ≈ q iff [p] = [q].
Define the quotient machine M/≈ = <Q’, Σ, δ’, s’, F’> where
» Q’ = Q/≈; s’ = [s]; F’ = {[p] | p ∈ F}; and
» δ’([p],a) = [δ(p,a)] for all p ∈ Q and a ∈ Σ.


Minimization algorithm

input: a DFA
output: an optimized DFA

1. Write down a table of all pairs {p,q}, initially unmarked.

2. Mark {p,q} if p ∈ F and q ∉ F or vice versa.

3. Repeat until no more change:

3.1 if ∃ an unmarked pair {p,q} s.t. {move(p,a), move(q,a)} is marked for some a ∈ Σ, then mark {p,q}.

4. When done, p ≈ q iff {p,q} is not marked.

5. Merge all equivalent states into one class and return the resulting machine


An Example: The DFA:

        a   b
>0      1   2
 1F     3   4
 2F     4   3
 3      5   5
 4      5   5
 5F     5   5

(> marks the start state, F marks a final state)


Initial Table

1   -
2   -   -
3   -   -   -
4   -   -   -   -
5   -   -   -   -   -
    0   1   2   3   4


After step 2

1   M
2   M   -
3   -   M   M
4   -   M   M   -
5   M   -   -   M   M
    0   1   2   3   4


After first pass of step 3

1   M
2   M   -
3   -   M   M
4   -   M   M   -
5   M   M   M   M   M
    0   1   2   3   4


2nd pass of step 3. The result: 1 ≈ 2 and 3 ≈ 4.

1   M
2   M   -
3   M2  M   M
4   M2  M   M   -
5   M   M1  M1  M   M
    0   1   2   3   4
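The marking procedure can be checked mechanically on the example DFA. The Python below is a sketch of the pair-marking idea, with the function names and the dictionary encoding of the transition table chosen for this illustration; it repeats step 3 until nothing changes instead of tracking individual passes.

```python
from itertools import combinations

# The example DFA: start state 0, final states 1, 2 and 5.
STATES = [0, 1, 2, 3, 4, 5]
ALPHABET = "ab"
FINALS = {1, 2, 5}
MOVE = {0: {"a": 1, "b": 2}, 1: {"a": 3, "b": 4}, 2: {"a": 4, "b": 3},
        3: {"a": 5, "b": 5}, 4: {"a": 5, "b": 5}, 5: {"a": 5, "b": 5}}

def distinguishable_pairs(states, alphabet, finals, move):
    # Step 2: mark pairs with exactly one final state.
    marked = {frozenset(pq) for pq in combinations(states, 2)
              if (pq[0] in finals) != (pq[1] in finals)}
    changed = True
    while changed:  # step 3: repeat until no more change
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            for a in alphabet:
                succ = frozenset((move[p][a], move[q][a]))
                if len(succ) == 2 and succ in marked:
                    marked.add(frozenset((p, q)))
                    changed = True
                    break
    return marked

def equivalent_pairs(states, alphabet, finals, move):
    marked = distinguishable_pairs(states, alphabet, finals, move)
    return sorted(tuple(sorted(pq)) for pq in combinations(states, 2)
                  if frozenset(pq) not in marked)

print(equivalent_pairs(STATES, ALPHABET, FINALS, MOVE))  # [(1, 2), (3, 4)]
```

The unmarked pairs agree with the final table: states 1 and 2 can be merged, and so can states 3 and 4.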