principle of compilers lecture iii: lexical analysis—part...

42
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (1) Principle of Compilers Lecture III: Lexical Analysis—Part 1 Alessandro Artale Faculty of Computer Science – Free University of Bolzano Room: 221 [email protected] http://www.inf.unibz.it/ artale/ 2003/2004 – Second Semester

Upload: others

Post on 28-Jun-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (1)

Principle of CompilersLecture III: Lexical Analysis—Part 1

Alessandro ArtaleFaculty of Computer Science – Free University of Bolzano

Room: 221

[email protected]

http://www.inf.unibz.it/ �artale/

2003/2004 – Second Semester

Page 2: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (2)

Summary of Lecture III—Part 1

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions (RE).

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 3: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (3)

Lexical Analyzer

� Lexical Analysis is the first phase of a compiler.

� Main Task: Read the input characters and produce a sequence of Tokens that

will be processed by the Parser.

� Upon receiving a “get-next-token” command from the Parser input is read to

identify the next token.

� While looking for next token it eliminates comments and white-spaces.

� Tokens are treated as Terminal Symbols in the Grammar for the source

program.

Source Program Lexical Analyzer Parser

Symbol Table

token

get-next-token

Page 4: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (4)

Token Vs. Lexeme

� In general, a set of input strings (Lexemes) give rise to the same Token. For

example:

TOKEN LEXEME DESCRIPTION

const const language keyword

if if language keyword

relation ��� � �� �� �� � � � � � comparison operators

id position, A1, x sequence of letters and digits

num 3.14, 14, 6.02E23 numeric constant

Page 5: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (5)

Attributes for Tokens

� When a Token can be generated by different Lexemes the Lexical Analyzer

must transmit the Lexeme to the subsequent phases of the compiler.

� Such information is specified as an Attribute associated to the Token.

� Usually, the attribute of a Token is a pointer to the symbol table entry that

keeps information about the Token.

� The Token influences parsing decisions: Parser relies on the token distinc-

tions, e.g., identifiers are treated differently than keywords

� The Attribute influences the translation phase.

Page 6: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (6)

Attributes for Tokens: An Example

Example. Let us consider the following assignment statement:

��� � � � � � � �

then the following pairs

token,attribute

are passed to the Parser:

id, pointer to symbol-table entry for E

assign-op,

id, pointer to symbol-table entry for M

mult-op,

id, pointer to symbol-table entry for C�

exp-op,

num, integer value 2

.

� Some Tokens have a null attribute: the Token is sufficient to identify the

Lexeme.

� From an implementation point of view, each token is encoded as an integer

number.

Page 7: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (7)

Identifiers Vs. Keywords

� Programming languages use fixed strings to identify particular Keywords–

e.g., if, then, else, etc.

� Since keywords are just identifiers the Lexical Analyzer must distinguish

between these two possibilities.

� If keywords are reserved–not used as identifiers–we can initialize the

symbol-table with all the keywords and mark them as such.

� Then, a string is recognized as an identifier only if it is not already in the

symbol-table as a keyword.

Page 8: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (8)

Main Objective

What we want to accomplish:

� Given a way to describe Lexemes of the input language, automatically

generate the Lexical Analyzer.

This objective can be split in the following sub-problems:

1. Lexer specification language: How to represent Lexemes of the input

language.

2. The lexical analyzer mechanism: How to generate Tokens starting from

Lexemes.

3. Lexical analyzer implementation: Coding (1) + (2) + breaking inputs into

substrings corresponding to the pair Lexeme/Token.

Page 9: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (9)

Summary

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions (RE).

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 10: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (10)

Problem 1. Lexer Specification Language:Regular Expressions

� Regular Expressions are a the most popular specification formalism to

describe Lexemes and map them to Tokens.

� Example. An identifier is made by a letter followed by zero or more letters

or digits:

letter (letter�

digit)

The vertical bar

means “or”;

Parentheses are used to group sub-expressions;

The � means zero or more occurrences.

Page 11: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (11)

Regular Expressions

Each Regular Expression, say

, denotes a Language,

� � � �

. The following are

the rules to build them over an alphabet

:

1. If � � �� � �

then � is a Regular Expression denoting the language

� � �;2. If

�� �

are Regular Expressions denoting the Languages

� � � �

and

� � � �

then:

(a)

� � �

is a Regular Expression denoting� � � � � � � � �

;

(b)

� �

is a Regular Expression denoting the concatenation

� � � � � � � �

, i.e.� � � � � � � � � �� � � � � � � �and � � � � � � �

;

(c)

� �

is a Regular Expression denoting

� � � � �

, zero or more concatenations

of

� � � �

, i.e.

� � � � � � ����� � � � � � �

;

(d)

� � �

is a Regular Expression denoting

� � � �

.

Page 12: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (12)

Regular Expressions: Examples

Example. Let

� � � �� � �

.

1. The Regular Expression � � �

denotes the Language� �� � �

.

2. The Regular Expression

� � � � � � � � � �

denotes the Language

� � �� � �� � �� � � �

.

3. The Regular Expression � �

denotes the Language of all strings of zero or

more �’s,

� � �� � �� � � �� � � �

.

4. The Regular Expression

� � � � � �

denotes the Language of all strings of �’sand

’s.

Page 13: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (13)

Regular Expressions: Shorthands

Notational shorthands are introduced for frequently used constructors.

1.

, One or more instances. If

is a Regular Expression then

� �

� � � �

.

2.

, Zero or one instance. If

is a Regular Expression then

� � � � �

.

3. Character Classes. If �� �� � � � � � � �then

� �� �� � � � � � � � �, and

� � � � � � � � � �

� � �

� �.

Page 14: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (14)

Regular Definitions

� Regular Definitions are used to give names to regular Expressions and then

to re-use these names to build new Regular Expressions.

� A Regular Definition is a sequence of definitions of the form:

��� � ��

��� � ��

� � ���� � ��

Where each

� �is a distinct name and each

� �is a Regular Expression over

the extended alphabet

� � � � � � �� � � � � � � �� � �

.

� Note: Such names for Regular Expression will be often the Tokens returned

by the Lexical Analyzer. As a convention, names are printed in boldface.

Page 15: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (15)

Regular Definitions: Examples

Example 1. Identifiers are usually string of letters and digits beginning with a

letter:

� � � � �� � � � � �

. . .

� � � � � � �

� � �

� �

�� � � � � �� �� ��

� � � � � � � �� �� � � � �� � �� � � � �

Using Character Classes we can define identifiers as:

� � � � ��

� � � � � � ��

� � � � ��

� � �

Page 16: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (16)

Regular Definitions: Examples (Cont.)

Example 2. Numbers are usually strings such as 5230, 3.14, 6.45E4, 1.84E-4.

�� � � � � �� �� � �

� � � ��� � � � � � �

�� �� �� �� -

�� �� �� �� � ��

� � � � � �

�� �� �� �� - �� � �� �� � � � � � � �

�� � �� � ��� � �

� � � � � ��� �� �� �� �� -

�� �� �� �� �� �� �� �� - �� � �� �� �

Page 17: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (17)

Regular Expressions VS. Regular Grammars

� Languages captured by Regular Expressions could be captured by Regular

Grammars (Type 3 Grammars).

� Regular Expressions are a notational variant of Regular Grammars: Usually

they give a more compact representation.

� Example. The Regular Expression for numbers can be captured by a Regular

Grammar with the following Productions, where�

= num and digit is a

terminal symbol:

�� � � � � � � � �� � � �

� � � � � � � �� � � � �

���� �- �� � � �� -

� � �

���� �- �� � � � � � � �� � � ���� �- �� � � � � � �

�� � � � -� � �

� -

� � � � � ��� �� � �

��� �� � � �� � � � � � � � ��� �� �

��� �� � � � � � � � �� � � �� �� �

Page 18: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (18)

Summary

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions.

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 19: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (19)

Problem 2. The Lexical Analyzer Mechanism.Finite Automata

� We need a mechanism to recognize Regular Expressions and so associating

Tokens to Lexemes.

� While Regular Expressions are a specification language, Finite Automata

are their implementation.

� Finite Automata are recognizers for Regular Languages: Given an input

string, �, and a Language,

�, they answer “yes” if the � � �

and “no”

otherwise.

Page 20: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (20)

Deterministic Finite Automata

A Deterministic Finite Automata, DFA for short, is a tuple:� � � �� �� �� � � � � �

:

is a finite non empty set of states;

is the input symbol alphabet;

�� ��� � � �

is the Transition Function;

� � � � �

is the initial state;

�� �

is the set of final states.

Page 21: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (21)

Deterministic Finite Automata (Cont.)

To define when an Automaton accepts a string we extend the transition function,�

, to the domain

�� � �

:

� � �� � � �

� � �� � � � � � � � � �� �� � � ��� �

� � � � � � � � �A DFA accepts an input string, �, if starting from the initial state with � as input

the Automaton stops in a final state:� � � �� �� � �

, and

� � �

.

Page 22: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (22)

Transition Graphs

� A DFA can be represented by Transition Graphs where the nodes are the

states and each labeled edge represents the transition function.

� The initial state has an input arch marked start. Final states are indicated

by double circles.

� Example. DFA that accepts strings in the Language

� � � � � � � � � � � �

.

0 1 2 3start a b b

b b

a

a

a

Page 23: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (23)

Transition Table

� Transition Tables implement transition graphs.

� A Transition Table has a row for each state and a column for each input

symbol.

� The value of the cell

� � �� �� �

is the state that can be reached from state � �

with input � � .

� Example. The table implementing the previous transition graph will have

rows and

columns, let us call the table�

, then:

� � �� � � �� � � �� � � � �

� �� � � � �� � �� � � � � �

� � �� � � �� � � �� � � � �

� � �� � � �� � � �� � � � �

Page 24: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (24)

Summary

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions.

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 25: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (25)

Nondeterministic Finite Automata

A Finite Automaton is said Nondeterministic if we could have more than one

transition with a given input symbol.

A Nondeterministic Finite Automata, NFA, is a tuple:� � � �� �� �� � � � � �

:

is a finite non empty set of states;

is the input symbol alphabet;

�� ��� � � � �

is the Transition Function;

� � � � �

is the initial state;

�� �

is the set of final states.

Values in Transition Tables for NFA will be set of states.

Page 26: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (26)

Nondeterministic Finite Automata (Cont.)

Given an input string and an NFA there will be, in general, more then one path

that can be followed: An NFA accepts an input string if there is at least one path

ending in a final state.

To formally define when an NFA accepts a string we extend the transition

function,

, to the domain

��� � �

:

� � �� � � � � �

� � �� � � � �

� � � ��� � � �� � � � ��� �

� � � � � � � � �

An NFA accepts an input string, �, if starting from the initial state with � as input

the Automaton reaches a final state:

� � � � �� ��

, and � �

.

Page 27: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (27)

NFA: An Example

Example. NFA that accepts strings in the Language

� � � � � � � � � � � �.

0 1 2 3start a b b

b

a

Exercise. Check that

� � �� � � � � � �is accepted by the above NFA.

Page 28: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (28)

DFA Vs. NFA

� Both DFA and NFA are capable of recognizing all Regular Languages/Expressions.

� The main difference is a Space Vs. Time tradeoff:

– DFA are faster than NFA;

– DFA are bigger (exponentially larger) that NFA.

Page 29: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (29)

Summary

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions.

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 30: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (30)

From Regular Expressions to NFA (1)

� The algorithm that generates an NFA for a given Regular Expression (RE) is

guided by the syntactic structure of the RE.

� Given a RE, say , the Thompson’s construction generates an NFA accepting� � �

starting from the constituent subexpressions of —rules 1 and 2—and

then applying inductively rule 3.

� The NFA resulting from the Thompson’s construction has important prop-

erties: It has exactly one final state, no edge enters the start state, no edge

leaves the final state.

Page 31: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (31)

From Regular Expressions to NFA (2)

Notation: if is a RE then

� � �

is its NFA with transition graph:

� � �

Algorithm RE to NFA: Thompson’s Construction

1. For , the NFA is

� �start

Where

is the new start state and

the new final state.

Page 32: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (32)

From Regular Expressions to NFA (3)

2. For � � �

, the NFA is

� �start �

Where

is the new start state and

the new final state.

3. Suppose now that

� � � � and

� ��� �

are NFA’s then

(a) For RE � � �

, the NFA is

� � � �� ��� �

�start

Where

and�

are the new starting and accepting states, respectively.

Page 33: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (33)

From Regular Expressions to NFA (4)

3. (Cont.)

(b) For RE �� , the NFA is

� � � � �� �� �start

Where the start state of

� � � � is the start state of the new NFA, the final

state of

� ��� �

is the final state of the new NFA, the final state of

� � � � ismerged with the initial state of

� �� �.

(c) For RE � �

, the NFA is

i

� � � � fstart

Where

is the new start state and

the new final state.

Page 34: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (34)

From Regular Expressions to NFA (5)

3. (Cont.)

(d) For RE

� � � use the same NFA as

� � � � .end.

Remarks.

� Every time we insert new states we introduce a new name for them to

maintain all the states distinct.

� Even if a symbol appears many times we construct a new NFA for each

instance.

Page 35: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (35)

RE to NFA: An Example

Exercise. Build the NFA for the RE

� � � � � � � � �

.

Page 36: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (36)

Summary

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions.

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.

Page 37: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (37)

From NFA to DFA (1)

� NFA are hard to simulate with a computer program.

1. There are many possible paths for a given input string caused by the

nondeterminism;

2. The acceptance condition says that there must be at least one path ending

with a final state;

3. We may need to find all the paths before accepting/excluding a string.

� The algorithm to map an NFA to a DFA is called the Subset Construction.

� Main Idea: Each DFA state corresponds to a set of NFA states, thus encoding

all the possible states an NFA could reach after reading an input symbol.

� The algorithm makes use of the operation -closure: Given a set of states,

, -closure

� � �

returns the set of NFA states reachable from some � � �

on

-transition only.

Page 38: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (38)

From NFA to DFA (2)

Let us show the -closure algorithm.

-closure

� � �

CL T := T; /* Current Closure of T */

NOT EX := CL T; /* Not Examined States */

repeat /* Updates the Closure of T */

NEW CL :=

;

for each t in NOT EX do

Q:=

� ��� � �

;

if Q

��

CL T then

NEW CL := NEW CL

�(Q - CL T);

end

CL T := CL T

NEW CL;

NOT EX := NEW CL;

until NEW CL ==�

return CL T /* -Closure of T */

Page 39: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (39)

From NFA to DFA (3)

Before presenting the Subset Construction algorithm we give a formal definition

of the mapping NFA to DFA.

Definition. [NFA to DFA]

Let NFA =

� �� �� �� � � � � �

then the equivalent DFA =� � �� �� � �� � � � � � � �

where:

� � � � �

. With

� � � � � � � � � � �

we denote an element in

� �

, which stands for the

set of states

� � � � � � �� � � �

.

� � � � � � -� � �� � � � � � � � � �

;

� �

is the set of states in

� �

containing some element in

;

� � � � � � � � � � � � � � � � � � � �� � � � � � � ��

iff

� �� � � � � � � �� � -� � �� � � � � � � � � � � � � � � � �� ,

where

� � � � � � � � � � � � �� � � � � ��� � � � � �� � � .

Page 40: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (40)

From NFA to DFA (4)

Subset Construction() /* Subset Construction algorithm */

DS :=

� -� � �� � � � � � � � � �; /* Current Deterministic states */

NOT EX := DS; /* Not Examined States */

repeat /* DFA Construction */

NEW DS :=

;

for each T in NOT EX dofor each symbol � � �

doQ := -� � �� � � � � � �� � � �;if Q

��

DS then NEW DS := NEW DS

Q;� � � �� � � := Q; /* DFA’s Transition Function update */

endendDS := DS

NEW DS; /* DFA’s States update */

NOT EX := NEW DS;

until NEW DS ==�

return DS,

� �

/* DFA’s States and Transition Function */

Page 41: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (41)

NFA to DFA: An Example

0 1

2 3

4 5

6 7 8 9 10start � �

��

� �

� �

[0,1,2,4,7] [1,2,3,4,6,7,8]

[1,2,4,5,6,7]

[1,2,4,5,6,7,9] [1,2,4,5,6,7,10]start

��

��

Page 42: Principle of Compilers Lecture III: Lexical Analysis—Part ...tinman.cs.gsu.edu/~raj/4340/artale/slide3.pdf · Free University of Bolzano–Principles of Compilers. Lecture III–Part

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (42)

Summary of Lecture III—Part 1

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions.

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.