principle of compilers lecture iii: lexical analysis—part...

Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (1)

Principle of CompilersLecture III: Lexical Analysis—Part 1

Alessandro ArtaleFaculty of Computer Science – Free University of Bolzano

Room: 221

[email protected]

http://www.inf.unibz.it/ �artale/

2003/2004 – Second Semester


Summary of Lecture III—Part 1

� Lexemes and Tokens.

� Lexer Representation: Regular Expressions (RE).

� Implementing a Recognizer of RE’s: Automata.

– Deterministic Finite Automata (DFA).

– Nondeterministic Finite Automata (NFA).

– From Regular Expressions to NFA.

– From NFA to DFA.


Lexical Analyzer

� Lexical Analysis is the first phase of a compiler.

� Main Task: Read the input characters and produce a sequence of Tokens that

will be processed by the Parser.

� Upon receiving a “get-next-token” command from the Parser input is read to

identify the next token.

� While looking for next token it eliminates comments and white-spaces.

� Tokens are treated as Terminal Symbols in the Grammar for the source

program.

Source Program Lexical Analyzer Parser

Symbol Table

token

get-next-token


Token Vs. Lexeme

� In general, a set of input strings (Lexemes) give rise to the same Token. For

example:

TOKEN LEXEME DESCRIPTION

const const language keyword

if if language keyword

relation �� comparison operators

id position, A1, x sequence of letters and digits

num 3.14, 14, 6.02E23 numeric constant


Attributes for Tokens

� When a Token can be generated by different Lexemes the Lexical Analyzer

must transmit the Lexeme to the subsequent phases of the compiler.

� Such information is specified as an Attribute associated to the Token.

� Usually, the attribute of a Token is a pointer to the symbol table entry that

keeps information about the Token.

� The Token influences parsing decisions: Parser relies on the token distinc-

tions, e.g., identifiers are treated differently than keywords

� The Attribute influences the translation phase.


Attributes for Tokens: An Example

Example. Let us consider the following assignment statement:

��

then the following pairs

�

token,attribute

�

are passed to the Parser:

�

id, pointer to symbol-table entry for E

�

�

assign-op,

�

�

id, pointer to symbol-table entry for M

�

�

mult-op,

�

�

id, pointer to symbol-table entry for C�

�

exp-op,

�

�

num, integer value 2

�

.

� Some Tokens have a null attribute: the Token is sufficient to identify the

Lexeme.

� From an implementation point of view, each token is encoded as an integer

number.


Identifiers Vs. Keywords

� Programming languages use fixed strings to identify particular Keywords–

e.g., if, then, else, etc.

� Since keywords are just identifiers the Lexical Analyzer must distinguish

between these two possibilities.

� If keywords are reserved–not used as identifiers–we can initialize the

symbol-table with all the keywords and mark them as such.

� Then, a string is recognized as an identifier only if it is not already in the

symbol-table as a keyword.


Main Objective

What we want to accomplish:

� Given a way to describe Lexemes of the input language, automatically

generate the Lexical Analyzer.

This objective can be split in the following sub-problems:

1. Lexer specification language: How to represent Lexemes of the input

language.

2. The lexical analyzer mechanism: How to generate Tokens starting from

Lexemes.

3. Lexical analyzer implementation: Coding (1) + (2) + breaking inputs into

substrings corresponding to the pair Lexeme/Token.


Summary


� Lexer Representation: Regular Expressions (RE).







Problem 1. Lexer Specification Language:Regular Expressions

� Regular Expressions are a the most popular specification formalism to

describe Lexemes and map them to Tokens.

� Example. An identifier is made by a letter followed by zero or more letters

or digits:

letter (letter�

digit)

�

The vertical bar

�

means “or”;

Parentheses are used to group sub-expressions;

The � means zero or more occurrences.


Regular Expressions

Each Regular Expression, say

�

, denotes a Language,

� � � �

. The following are

the rules to build them over an alphabet

�

:

1. If � � ��

then � is a Regular Expression denoting the language

� � �;2. If

��

are Regular Expressions denoting the Languages

� � � �

and

� � � �

then:

(a)

� � �

is a Regular Expression denoting� � � � � � � � �

;

(b)

� �

is a Regular Expression denoting the concatenation

� � � � � � � �

, i.e.� � � � � � � � � �� and � � � � � � �

;

(c)

� �

is a Regular Expression denoting

� � � � �

, zero or more concatenations

of

� � � �

, i.e.

� � � � � � ��

;

(d)

� � �

is a Regular Expression denoting

� � � �

.


Regular Expressions: Examples

Example. Let

� � � ��

.

1. The Regular Expression � � �

denotes the Language� ��

.

2. The Regular Expression

� � � � � � � � � �

denotes the Language

� � ��

.

3. The Regular Expression � �

denotes the Language of all strings of zero or

more �’s,

� � ��

�

.

4. The Regular Expression

� � � � � �

denotes the Language of all strings of �’sand

�

’s.


Regular Expressions: Shorthands

Notational shorthands are introduced for frequently used constructors.

1.

�

, One or more instances. If

�

is a Regular Expression then

� �

� � � �

.

2.

�

, Zero or one instance. If

�

is a Regular Expression then

� � � � �

.

3. Character Classes. If �� then

� �� , and

� � � � � � � � � �

� � �

� �.


Regular Definitions

� Regular Definitions are used to give names to regular Expressions and then

to re-use these names to build new Regular Expressions.

� A Regular Definition is a sequence of definitions of the form:

��

��

� � ��

Where each

� �is a distinct name and each

� �is a Regular Expression over

the extended alphabet

� � � � � � ��

.

� Note: Such names for Regular Expression will be often the Tokens returned

by the Lexical Analyzer. As a convention, names are printed in boldface.


Regular Definitions: Examples

Example 1. Identifiers are usually string of letters and digits beginning with a

letter:

� � � � ��

. . .

� � � � � � �

� � �

� �

��

� � � � � � � ��

Using Character Classes we can define identifiers as:

� � � � ��

� � � � � � ��

� � � � ��

� � �


Regular Definitions: Examples (Cont.)

Example 2. Numbers are usually strings such as 5230, 3.14, 6.45E4, 1.84E-4.

��

� � � ��

�� -

��

� � � � � �

�� - ��

��

� � � � � �� -

�� - ��


Regular Expressions VS. Regular Grammars

� Languages captured by Regular Expressions could be captured by Regular

Grammars (Type 3 Grammars).

� Regular Expressions are a notational variant of Regular Grammars: Usually

they give a more compact representation.

� Example. The Regular Expression for numbers can be captured by a Regular

Grammar with the following Productions, where�

= num and digit is a

terminal symbol:

��

� � � � � � � ��

�

�� - �� -

� � �

�� - �� - ��

�� -� � �

� -

� � � � � ��

�

��

��


Summary


� Lexer Representation: Regular Expressions.







Problem 2. The Lexical Analyzer Mechanism.Finite Automata

� We need a mechanism to recognize Regular Expressions and so associating

Tokens to Lexemes.

� While Regular Expressions are a specification language, Finite Automata

are their implementation.

� Finite Automata are recognizers for Regular Languages: Given an input

string, �, and a Language,

�, they answer “yes” if the � � �

and “no”

otherwise.


Deterministic Finite Automata

A Deterministic Finite Automata, DFA for short, is a tuple:� � � ��

:

�

�

is a finite non empty set of states;

�

�

is the input symbol alphabet;

�

��

is the Transition Function;

� � � � �

is the initial state;

�

��

is the set of final states.


Deterministic Finite Automata (Cont.)

To define when an Automaton accepts a string we extend the transition function,�

, to the domain

��

:

� � ��

� � ��

� � � � � � � � �A DFA accepts an input string, �, if starting from the initial state with � as input

the Automaton stops in a final state:� � � ��

, and

� � �

.


Transition Graphs

� A DFA can be represented by Transition Graphs where the nodes are the

states and each labeled edge represents the transition function.

� The initial state has an input arch marked start. Final states are indicated

by double circles.

� Example. DFA that accepts strings in the Language

� � � � � � � � � � � �

.

0 1 2 3start a b b

b b

a

a

a


Transition Table

� Transition Tables implement transition graphs.

� A Transition Table has a row for each state and a column for each input

symbol.

� The value of the cell

� � ��

is the state that can be reached from state � �

with input � � .

� Example. The table implementing the previous transition graph will have

�

rows and

�

columns, let us call the table�

, then:

� � ��

� ��

� � ��

� � ��


Summary









Nondeterministic Finite Automata

A Finite Automaton is said Nondeterministic if we could have more than one

transition with a given input symbol.

A Nondeterministic Finite Automata, NFA, is a tuple:� � � ��

:

�

�

is a finite non empty set of states;

�

�

is the input symbol alphabet;

�

��

is the Transition Function;

� � � � �

is the initial state;

�

��

is the set of final states.

Values in Transition Tables for NFA will be set of states.


Nondeterministic Finite Automata (Cont.)

Given an input string and an NFA there will be, in general, more then one path

that can be followed: An NFA accepts an input string if there is at least one path

ending in a final state.

To formally define when an NFA accepts a string we extend the transition

function,

�

, to the domain

��

:

� � ��

� � ��

� � � ��

� � � � � � � � �

An NFA accepts an input string, �, if starting from the initial state with � as input

the Automaton reaches a final state:

�

� � � � ��

, and � �

.


NFA: An Example

Example. NFA that accepts strings in the Language

� � � � � � � � � � � �.

0 1 2 3start a b b

b

a

Exercise. Check that

� � �� is accepted by the above NFA.


DFA Vs. NFA

� Both DFA and NFA are capable of recognizing all Regular Languages/Expressions.

� The main difference is a Space Vs. Time tradeoff:

– DFA are faster than NFA;

– DFA are bigger (exponentially larger) that NFA.


Summary









From Regular Expressions to NFA (1)

� The algorithm that generates an NFA for a given Regular Expression (RE) is

guided by the syntactic structure of the RE.

� Given a RE, say , the Thompson’s construction generates an NFA accepting� � �

starting from the constituent subexpressions of —rules 1 and 2—and

then applying inductively rule 3.

� The NFA resulting from the Thompson’s construction has important prop-

erties: It has exactly one final state, no edge enters the start state, no edge

leaves the final state.



Notation: if is a RE then

� � �

is its NFA with transition graph:

� � �

Algorithm RE to NFA: Thompson’s Construction

1. For , the NFA is

� �start

Where

�

is the new start state and

�

the new final state.



2. For � � �

, the NFA is

� �start �

Where

�


�


3. Suppose now that

� � � � and

� ��

are NFA’s then

(a) For RE � � �

, the NFA is

�

� � � ��

�start

Where

�

and�

are the new starting and accepting states, respectively.



3. (Cont.)

(b) For RE �� , the NFA is

� � � � �� start

Where the start state of

� � � � is the start state of the new NFA, the final

state of

� ��

is the final state of the new NFA, the final state of

� � � � ismerged with the initial state of

� �� .

(c) For RE � �

, the NFA is

i

� � � � fstart

Where

�


�




3. (Cont.)

(d) For RE

� � � use the same NFA as

� � � � .end.

Remarks.

� Every time we insert new states we introduce a new name for them to

maintain all the states distinct.

� Even if a symbol appears many times we construct a new NFA for each

instance.


RE to NFA: An Example

Exercise. Build the NFA for the RE

� � � � � � � � �

.


Summary









From NFA to DFA (1)

� NFA are hard to simulate with a computer program.

1. There are many possible paths for a given input string caused by the

nondeterminism;

2. The acceptance condition says that there must be at least one path ending

with a final state;

3. We may need to find all the paths before accepting/excluding a string.

� The algorithm to map an NFA to a DFA is called the Subset Construction.

� Main Idea: Each DFA state corresponds to a set of NFA states, thus encoding

all the possible states an NFA could reach after reading an input symbol.

� The algorithm makes use of the operation -closure: Given a set of states,

�

, -closure

� � �

returns the set of NFA states reachable from some � � �

on

-transition only.


From NFA to DFA (2)

Let us show the -closure algorithm.

-closure

� � �

CL T := T; /* Current Closure of T */

NOT EX := CL T; /* Not Examined States */

repeat /* Updates the Closure of T */

NEW CL :=

�

;

for each t in NOT EX do

Q:=

� ��

;

if Q

��

CL T then

NEW CL := NEW CL

�(Q - CL T);

end

CL T := CL T

�

NEW CL;

NOT EX := NEW CL;

until NEW CL ==�

return CL T /* -Closure of T */


From NFA to DFA (3)

Before presenting the Subset Construction algorithm we give a formal definition

of the mapping NFA to DFA.

Definition. [NFA to DFA]

Let NFA =

� ��

then the equivalent DFA =� � ��

where:

�

� � � � �

. With

� � � � � � � � � � �

we denote an element in

� �

, which stands for the

set of states

� � � � � � ��

.

� � � � � � -� � ��

;

�

� �

is the set of states in

� �

containing some element in

�

;

�

� � � � � � � � � � � � � � � � � � � ��

iff

� �� -� � �� ,

where

� � � � � � � � � � � � �� .


From NFA to DFA (4)

Subset Construction() /* Subset Construction algorithm */

DS :=

� -� � �� ; /* Current Deterministic states */

NOT EX := DS; /* Not Examined States */

repeat /* DFA Construction */

NEW DS :=

�

;

for each T in NOT EX dofor each symbol � � �

doQ := -� � �� ;if Q

��

DS then NEW DS := NEW DS

�

Q;� � � �� := Q; /* DFA’s Transition Function update */

endendDS := DS

�

NEW DS; /* DFA’s States update */

NOT EX := NEW DS;

until NEW DS ==�

return DS,

� �

/* DFA’s States and Transition Function */


NFA to DFA: An Example

0 1

2 3

4 5

6 7 8 9 10start � �

�

�

�

��

� �

� �

�

�

[0,1,2,4,7] [1,2,3,4,6,7,8]

[1,2,4,5,6,7]

[1,2,4,5,6,7,9] [1,2,4,5,6,7,10]start

��

�

��

�

�

�


Summary of Lecture III—Part 1








principle of compilers lecture iii: lexical analysis—part...

Documents