principle of compilers lecture iii: lexical analysis—part...
TRANSCRIPT
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (1)
Principle of CompilersLecture III: Lexical Analysis—Part 1
Alessandro ArtaleFaculty of Computer Science – Free University of Bolzano
Room: 221
http://www.inf.unibz.it/ �artale/
2003/2004 – Second Semester
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (2)
Summary of Lecture III—Part 1
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions (RE).
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (3)
Lexical Analyzer
� Lexical Analysis is the first phase of a compiler.
� Main Task: Read the input characters and produce a sequence of Tokens that
will be processed by the Parser.
� Upon receiving a “get-next-token” command from the Parser input is read to
identify the next token.
� While looking for next token it eliminates comments and white-spaces.
� Tokens are treated as Terminal Symbols in the Grammar for the source
program.
Source Program Lexical Analyzer Parser
Symbol Table
token
get-next-token
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (4)
Token Vs. Lexeme
� In general, a set of input strings (Lexemes) give rise to the same Token. For
example:
TOKEN LEXEME DESCRIPTION
const const language keyword
if if language keyword
relation ��� � �� �� �� � � � � � comparison operators
id position, A1, x sequence of letters and digits
num 3.14, 14, 6.02E23 numeric constant
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (5)
Attributes for Tokens
� When a Token can be generated by different Lexemes the Lexical Analyzer
must transmit the Lexeme to the subsequent phases of the compiler.
� Such information is specified as an Attribute associated to the Token.
� Usually, the attribute of a Token is a pointer to the symbol table entry that
keeps information about the Token.
� The Token influences parsing decisions: Parser relies on the token distinc-
tions, e.g., identifiers are treated differently than keywords
� The Attribute influences the translation phase.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (6)
Attributes for Tokens: An Example
Example. Let us consider the following assignment statement:
��� � � � � � � �
then the following pairs
�
token,attribute
�
are passed to the Parser:
�
id, pointer to symbol-table entry for E
�
�
assign-op,
�
�
id, pointer to symbol-table entry for M
�
�
mult-op,
�
�
id, pointer to symbol-table entry for C�
�
exp-op,
�
�
num, integer value 2
�
.
� Some Tokens have a null attribute: the Token is sufficient to identify the
Lexeme.
� From an implementation point of view, each token is encoded as an integer
number.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (7)
Identifiers Vs. Keywords
� Programming languages use fixed strings to identify particular Keywords–
e.g., if, then, else, etc.
� Since keywords are just identifiers the Lexical Analyzer must distinguish
between these two possibilities.
� If keywords are reserved–not used as identifiers–we can initialize the
symbol-table with all the keywords and mark them as such.
� Then, a string is recognized as an identifier only if it is not already in the
symbol-table as a keyword.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (8)
Main Objective
What we want to accomplish:
� Given a way to describe Lexemes of the input language, automatically
generate the Lexical Analyzer.
This objective can be split in the following sub-problems:
1. Lexer specification language: How to represent Lexemes of the input
language.
2. The lexical analyzer mechanism: How to generate Tokens starting from
Lexemes.
3. Lexical analyzer implementation: Coding (1) + (2) + breaking inputs into
substrings corresponding to the pair Lexeme/Token.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (9)
Summary
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions (RE).
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (10)
Problem 1. Lexer Specification Language:Regular Expressions
� Regular Expressions are a the most popular specification formalism to
describe Lexemes and map them to Tokens.
� Example. An identifier is made by a letter followed by zero or more letters
or digits:
letter (letter�
digit)
�
The vertical bar
�
means “or”;
Parentheses are used to group sub-expressions;
The � means zero or more occurrences.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (11)
Regular Expressions
Each Regular Expression, say
�
, denotes a Language,
� � � �
. The following are
the rules to build them over an alphabet
�
:
1. If � � �� � �
then � is a Regular Expression denoting the language
� � �;2. If
�� �
are Regular Expressions denoting the Languages
� � � �
and
� � � �
then:
(a)
� � �
is a Regular Expression denoting� � � � � � � � �
;
(b)
� �
is a Regular Expression denoting the concatenation
� � � � � � � �
, i.e.� � � � � � � � � �� � � � � � � �and � � � � � � �
;
(c)
� �
is a Regular Expression denoting
� � � � �
, zero or more concatenations
of
� � � �
, i.e.
� � � � � � ����� � � � � � �
;
(d)
� � �
is a Regular Expression denoting
� � � �
.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (12)
Regular Expressions: Examples
Example. Let
� � � �� � �
.
1. The Regular Expression � � �
denotes the Language� �� � �
.
2. The Regular Expression
� � � � � � � � � �
denotes the Language
� � �� � �� � �� � � �
.
3. The Regular Expression � �
denotes the Language of all strings of zero or
more �’s,
� � �� � �� � � �� � � �
�
.
4. The Regular Expression
� � � � � �
denotes the Language of all strings of �’sand
�
’s.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (13)
Regular Expressions: Shorthands
Notational shorthands are introduced for frequently used constructors.
1.
�
, One or more instances. If
�
is a Regular Expression then
� �
� � � �
.
2.
�
, Zero or one instance. If
�
is a Regular Expression then
� � � � �
.
3. Character Classes. If �� �� � � � � � � �then
� �� �� � � � � � � � �, and
� � � � � � � � � �
� � �
� �.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (14)
Regular Definitions
� Regular Definitions are used to give names to regular Expressions and then
to re-use these names to build new Regular Expressions.
� A Regular Definition is a sequence of definitions of the form:
��� � ��
��� � ��
� � ���� � ��
Where each
� �is a distinct name and each
� �is a Regular Expression over
the extended alphabet
� � � � � � �� � � � � � � �� � �
.
� Note: Such names for Regular Expression will be often the Tokens returned
by the Lexical Analyzer. As a convention, names are printed in boldface.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (15)
Regular Definitions: Examples
Example 1. Identifiers are usually string of letters and digits beginning with a
letter:
� � � � �� � � � � �
. . .
� � � � � � �
� � �
� �
�� � � � � �� �� ��
� � � � � � � �� �� � � � �� � �� � � � �
Using Character Classes we can define identifiers as:
� � � � ��
� � � � � � ��
� � � � ��
� � �
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (16)
Regular Definitions: Examples (Cont.)
Example 2. Numbers are usually strings such as 5230, 3.14, 6.45E4, 1.84E-4.
�� � � � � �� �� � �
� � � ��� � � � � � �
�� �� �� �� -
�� �� �� �� � ��
� � � � � �
�� �� �� �� - �� � �� �� � � � � � � �
�� � �� � ��� � �
� � � � � ��� �� �� �� �� -
�� �� �� �� �� �� �� �� - �� � �� �� �
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (17)
Regular Expressions VS. Regular Grammars
� Languages captured by Regular Expressions could be captured by Regular
Grammars (Type 3 Grammars).
� Regular Expressions are a notational variant of Regular Grammars: Usually
they give a more compact representation.
� Example. The Regular Expression for numbers can be captured by a Regular
Grammar with the following Productions, where�
= num and digit is a
terminal symbol:
�� � � � � � � � �� � � �
� � � � � � � �� � � � �
�
���� �- �� � � �� -
� � �
���� �- �� � � � � � � �� � � ���� �- �� � � � � � �
�� � � � -� � �
� -
� � � � � ��� �� � �
�
��� �� � � �� � � � � � � � ��� �� �
��� �� � � � � � � � �� � � �� �� �
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (18)
Summary
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions.
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (19)
Problem 2. The Lexical Analyzer Mechanism.Finite Automata
� We need a mechanism to recognize Regular Expressions and so associating
Tokens to Lexemes.
� While Regular Expressions are a specification language, Finite Automata
are their implementation.
� Finite Automata are recognizers for Regular Languages: Given an input
string, �, and a Language,
�, they answer “yes” if the � � �
and “no”
otherwise.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (20)
Deterministic Finite Automata
A Deterministic Finite Automata, DFA for short, is a tuple:� � � �� �� �� � � � � �
:
�
�
is a finite non empty set of states;
�
�
is the input symbol alphabet;
�
�� ��� � � �
is the Transition Function;
� � � � �
is the initial state;
�
�� �
is the set of final states.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (21)
Deterministic Finite Automata (Cont.)
To define when an Automaton accepts a string we extend the transition function,�
, to the domain
�� � �
:
� � �� � � �
� � �� � � � � � � � � �� �� � � ��� �
� � � � � � � � �A DFA accepts an input string, �, if starting from the initial state with � as input
the Automaton stops in a final state:� � � �� �� � �
, and
� � �
.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (22)
Transition Graphs
� A DFA can be represented by Transition Graphs where the nodes are the
states and each labeled edge represents the transition function.
� The initial state has an input arch marked start. Final states are indicated
by double circles.
� Example. DFA that accepts strings in the Language
� � � � � � � � � � � �
.
0 1 2 3start a b b
b b
a
a
a
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (23)
Transition Table
� Transition Tables implement transition graphs.
� A Transition Table has a row for each state and a column for each input
symbol.
� The value of the cell
� � �� �� �
is the state that can be reached from state � �
with input � � .
� Example. The table implementing the previous transition graph will have
�
rows and
�
columns, let us call the table�
, then:
� � �� � � �� � � �� � � � �
� �� � � � �� � �� � � � � �
� � �� � � �� � � �� � � � �
� � �� � � �� � � �� � � � �
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (24)
Summary
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions.
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (25)
Nondeterministic Finite Automata
A Finite Automaton is said Nondeterministic if we could have more than one
transition with a given input symbol.
A Nondeterministic Finite Automata, NFA, is a tuple:� � � �� �� �� � � � � �
:
�
�
is a finite non empty set of states;
�
�
is the input symbol alphabet;
�
�� ��� � � � �
is the Transition Function;
� � � � �
is the initial state;
�
�� �
is the set of final states.
Values in Transition Tables for NFA will be set of states.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (26)
Nondeterministic Finite Automata (Cont.)
Given an input string and an NFA there will be, in general, more then one path
that can be followed: An NFA accepts an input string if there is at least one path
ending in a final state.
To formally define when an NFA accepts a string we extend the transition
function,
�
, to the domain
��� � �
:
� � �� � � � � �
� � �� � � � �
� � � ��� � � �� � � � ��� �
� � � � � � � � �
An NFA accepts an input string, �, if starting from the initial state with � as input
the Automaton reaches a final state:
�
� � � � �� ��
, and � �
.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (27)
NFA: An Example
Example. NFA that accepts strings in the Language
� � � � � � � � � � � �.
0 1 2 3start a b b
b
a
Exercise. Check that
� � �� � � � � � �is accepted by the above NFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (28)
DFA Vs. NFA
� Both DFA and NFA are capable of recognizing all Regular Languages/Expressions.
� The main difference is a Space Vs. Time tradeoff:
– DFA are faster than NFA;
– DFA are bigger (exponentially larger) that NFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (29)
Summary
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions.
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (30)
From Regular Expressions to NFA (1)
� The algorithm that generates an NFA for a given Regular Expression (RE) is
guided by the syntactic structure of the RE.
� Given a RE, say , the Thompson’s construction generates an NFA accepting� � �
starting from the constituent subexpressions of —rules 1 and 2—and
then applying inductively rule 3.
� The NFA resulting from the Thompson’s construction has important prop-
erties: It has exactly one final state, no edge enters the start state, no edge
leaves the final state.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (31)
From Regular Expressions to NFA (2)
Notation: if is a RE then
� � �
is its NFA with transition graph:
� � �
Algorithm RE to NFA: Thompson’s Construction
1. For , the NFA is
� �start
Where
�
is the new start state and
�
the new final state.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (32)
From Regular Expressions to NFA (3)
2. For � � �
, the NFA is
� �start �
Where
�
is the new start state and
�
the new final state.
3. Suppose now that
� � � � and
� ��� �
are NFA’s then
(a) For RE � � �
, the NFA is
�
� � � �� ��� �
�start
Where
�
and�
are the new starting and accepting states, respectively.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (33)
From Regular Expressions to NFA (4)
3. (Cont.)
(b) For RE �� , the NFA is
� � � � �� �� �start
Where the start state of
� � � � is the start state of the new NFA, the final
state of
� ��� �
is the final state of the new NFA, the final state of
� � � � ismerged with the initial state of
� �� �.
(c) For RE � �
, the NFA is
i
� � � � fstart
Where
�
is the new start state and
�
the new final state.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (34)
From Regular Expressions to NFA (5)
3. (Cont.)
(d) For RE
� � � use the same NFA as
� � � � .end.
Remarks.
� Every time we insert new states we introduce a new name for them to
maintain all the states distinct.
� Even if a symbol appears many times we construct a new NFA for each
instance.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (35)
RE to NFA: An Example
Exercise. Build the NFA for the RE
� � � � � � � � �
.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (36)
Summary
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions.
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (37)
From NFA to DFA (1)
� NFA are hard to simulate with a computer program.
1. There are many possible paths for a given input string caused by the
nondeterminism;
2. The acceptance condition says that there must be at least one path ending
with a final state;
3. We may need to find all the paths before accepting/excluding a string.
� The algorithm to map an NFA to a DFA is called the Subset Construction.
� Main Idea: Each DFA state corresponds to a set of NFA states, thus encoding
all the possible states an NFA could reach after reading an input symbol.
� The algorithm makes use of the operation -closure: Given a set of states,
�
, -closure
� � �
returns the set of NFA states reachable from some � � �
on
-transition only.
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (38)
From NFA to DFA (2)
Let us show the -closure algorithm.
-closure
� � �
CL T := T; /* Current Closure of T */
NOT EX := CL T; /* Not Examined States */
repeat /* Updates the Closure of T */
NEW CL :=
�
;
for each t in NOT EX do
Q:=
� ��� � �
;
if Q
��
CL T then
NEW CL := NEW CL
�(Q - CL T);
end
CL T := CL T
�
NEW CL;
NOT EX := NEW CL;
until NEW CL ==�
return CL T /* -Closure of T */
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (39)
From NFA to DFA (3)
Before presenting the Subset Construction algorithm we give a formal definition
of the mapping NFA to DFA.
Definition. [NFA to DFA]
Let NFA =
� �� �� �� � � � � �
then the equivalent DFA =� � �� �� � �� � � � � � � �
where:
�
� � � � �
. With
� � � � � � � � � � �
we denote an element in
� �
, which stands for the
set of states
� � � � � � �� � � �
.
� � � � � � -� � �� � � � � � � � � �
;
�
� �
is the set of states in
� �
containing some element in
�
;
�
� � � � � � � � � � � � � � � � � � � �� � � � � � � ��
iff
� �� � � � � � � �� � -� � �� � � � � � � � � � � � � � � � �� ,
where
� � � � � � � � � � � � �� � � � � ��� � � � � �� � � .
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (40)
From NFA to DFA (4)
Subset Construction() /* Subset Construction algorithm */
DS :=
� -� � �� � � � � � � � � �; /* Current Deterministic states */
NOT EX := DS; /* Not Examined States */
repeat /* DFA Construction */
NEW DS :=
�
;
for each T in NOT EX dofor each symbol � � �
doQ := -� � �� � � � � � �� � � �;if Q
��
DS then NEW DS := NEW DS
�
Q;� � � �� � � := Q; /* DFA’s Transition Function update */
endendDS := DS
�
NEW DS; /* DFA’s States update */
NOT EX := NEW DS;
until NEW DS ==�
return DS,
� �
/* DFA’s States and Transition Function */
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (41)
NFA to DFA: An Example
0 1
2 3
4 5
6 7 8 9 10start � �
�
�
�
��
� �
� �
�
�
[0,1,2,4,7] [1,2,3,4,6,7,8]
[1,2,4,5,6,7]
[1,2,4,5,6,7,9] [1,2,4,5,6,7,10]start
��
�
��
�
�
�
Free University of Bolzano–Principles of Compilers. Lecture III–Part 1, 2003/2004 – A.Artale (42)
Summary of Lecture III—Part 1
� Lexemes and Tokens.
� Lexer Representation: Regular Expressions.
� Implementing a Recognizer of RE’s: Automata.
– Deterministic Finite Automata (DFA).
– Nondeterministic Finite Automata (NFA).
– From Regular Expressions to NFA.
– From NFA to DFA.