Post on 19-Dec-2015
1
Lexical Analysis
Cheng-Chia Chen
2
Outline
Introduction to lexical analyzer
Tokens
Regular expressions (RE)
Finite automata (FA)
» deterministic and nondeterministic finite automata (DFA and NFA)
» from RE to NFA
» from NFA to DFA
» from DFA to optimized DFA
3
Outline
Informal sketch of lexical analysis
» Identifies tokens in input string
Issues in lexical analysis
» Lookahead
» Ambiguities
Specifying lexers
» Regular expressions
» Examples of regular expressions
4
The Structure of a Compiler
[Figure: Source → Lexical analysis → Tokens → Parsing → Interm. Language → Code Gen. → Machine Code, with Optimization applied to the intermediate language]
5
Lexical Analysis
What do we want to do? Example:
if (i == j)
    z = 0;
else
    z = 1;
The input is just a sequence of characters:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Goal: Partition input string into substrings
» And determine the categories (tokens) to which the substrings belong
6
What’s a Token?
Output of lexical analysis is a stream of tokens
A token is a syntactic category
» In English: noun, verb, adjective, …
» In a programming language: Identifier, Integer, Keyword, Whitespace, …
Parser relies on the token distinctions: identifiers are treated differently than keywords
7
Aspects of Token
Language view: a set of strings (belonging to that token)
» [while], [identifier], [arithOp]
Pattern (grammar): a rule defining a token
» [while]: while
» [identifier]: letter followed by letters and digits
» [arithOp]: + or - or * or /
Lexeme (member): a string matched by the pattern of a token
» while, var23, count, +, *
8
Attributes of Tokens
Attributes are used to distinguish different lexemes in a token
» [while, ]
» [identifier, var35]
» [arithOp, +]
» [integer, 26]
» [string, “26”]
Positional information: start/end line/position
Tokens affect syntax analysis; attributes affect semantic analysis and error handling
9
Lexical Analyzer: Implementation
An implementation must do two things:
1. Recognize substrings corresponding to tokens
2. Return the attributes (lexeme and positional information) of the token
– The lexeme is either the substring or some data object constructed from the substring.
10
Example input lines:
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
Token-lexeme pairs returned by the lexer:
» [Whitespace, “\t”]
» [if, ]
» [OpenPar, “(”]
» [Identifier, “i”]
» [Relation, “==”]
» [Identifier, “j”]
» …
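The token-lexeme pairs above can be reproduced with a small sketch built on Python's `re` module. The token names (`Keyword`, `Assign`, `Semicolon`, ...) and their patterns are assumptions made for illustration; the slides only name a few of them.

```python
import re

# Hypothetical token names and patterns for the running example.
# Order matters: earlier alternatives are tried first.
TOKEN_SPEC = [
    ("Whitespace", r"[ \t\n]+"),
    ("Keyword",    r"(?:if|else)\b"),
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Relation",   r"=="),
    ("Assign",     r"="),
    ("OpenPar",    r"\("),
    ("ClosePar",   r"\)"),
    ("Semicolon",  r";"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (token, lexeme) pairs, scanning the input left to right."""
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"no token matches at position {pos}")
        yield m.lastgroup, m.group()
        pos = m.end()

source = "\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;"
tokens = list(tokenize(source))
```

Concatenating the lexemes in `tokens` gives back `source`, which is the sense in which the lexer partitions the input string.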
11
Lexical Analyzer: Implementation
The lexer usually discards “uninteresting” tokens that don’t contribute to parsing.
Examples: Whitespace, Comments
Question: What happens if we remove all whitespace and all comments prior to lexing?
12
Lexical Analysis in FORTRAN FORTRAN rule: Whitespace is insignificant
E.g., VAR1 is the same as VA R1
Footnote: FORTRAN whitespace rule motivated by inaccuracy of punch card operators
13
A terrible design! Example
Consider
» DO 5 I = 1,25
» DO 5 I = 1.25
The first is DO 5 I = 1 , 25 (a DO statement)
The second is DO5I = 1.25 (an assignment to the variable DO5I)
Reading left-to-right, we cannot tell whether DO5I is a variable or the start of a DO stmt. until after the “,” is reached
14
Lexical Analysis in FORTRAN. Lookahead.
Two important points:
1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time
2. “Lookahead” may be required to decide where one token ends and the next token begins
» Even our simple example has lookahead issues
– i vs. if
– = vs. ==
15
Lexical Analysis in PL/I
PL/I keywords are not reserved
IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN
PL/I Declarations:
DECLARE (ARG1, ..., ARGN)
Can’t tell whether DECLARE is a keyword or an array reference until after the ).
» Requires arbitrary lookahead!
16
Review
The goal of lexical analysis is to
» Partition the input string into lexemes
» Identify the token of each lexeme
Left-to-right scan => lookahead sometimes required
17
Next
We still need
» A way to describe the lexemes of each token
» A way to resolve ambiguities
– Is if two variables i and f?
– Is == two equal signs = =?
18
Regular Expressions and Regular Languages
There are several formalisms for specifying tokens
Regular languages are the most popular
» Simple and useful theory
» Easy to understand
» Efficient implementations
19
Languages
Def. Let Σ be a set of characters. A language over Σ is a set of strings of characters drawn from Σ
(Σ is called the alphabet)
20
Examples of Languages
Alphabet = English characters
Language = English words
Not every string of English characters is an English word
» words: likes, school, …
» not words: beee, yykk, …
Alphabet = ASCII
Language = C programs
Note: ASCII character set is different from English character set
21
Notation Languages are sets of strings.
Need some notation for specifying which sets we want
For lexical analysis we care about regular languages, which can be described using regular expressions.
22
Regular Expressions and Regular Languages
Each regular expression is a notation for a regular language (a set of words)
If A is a regular expression then we write L(A) to refer to the language denoted by A
23
Atomic Regular Expressions
Single character: c
L(c) = { c } (for any c ∈ Σ)
Epsilon (empty string): ε
L(ε) = { ε }
24
Compound Regular Expressions
Union (or choice): A | B (where A and B are reg. exp.)
L(A | B) = { s | s ∈ L(A) or s ∈ L(B) }
Concatenation: AB (where A and B are reg. exp.)
L(A B) = { αβ | α ∈ L(A) and β ∈ L(B) }
Note:
» A·B (set concatenation) and α·β (string concatenation) will be abbreviated to AB and αβ, respectively.
Examples:
» if | then | else = { if, then, else }
» 0 | 1 | … | 9 = { 0, 1, …, 9 }
» Another example: (0 | 1) (0 | 1) = { 00, 01, 10, 11 }
25
More Compound Regular Expressions
So far we do not have a notation for infinite languages
Iteration: A*
L(A*) = { ε } ∪ L(A) ∪ L(AA) ∪ L(AAA) ∪ …
Examples:
0* = { ε, 0, 00, 000, … }
1 0* = { strings starting with 1 and followed by 0’s }
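Union, concatenation and iteration can be tried out directly on finite sets of strings. The sketch below is only an illustration: the helper names are hypothetical, and since L(A*) is infinite, `star_up_to` enumerates just the strings made of at most `n` pieces.

```python
def union(a, b):
    """L(A | B): strings that are in either language."""
    return a | b

def concat(a, b):
    """L(AB): every string of a followed by every string of b."""
    return {s + t for s in a for t in b}

def star_up_to(a, n):
    """Finite part of L(A*): concatenations of at most n strings of a."""
    result = {""}          # the empty string stands for ε
    layer = {""}
    for _ in range(n):
        layer = concat(layer, a)
        result |= layer
    return result

bit = union({"0"}, {"1"})
pairs = concat(bit, bit)                           # (0 | 1)(0 | 1)
zeros = star_up_to({"0"}, 3)                       # part of 0*
one_then_zeros = concat({"1"}, star_up_to({"0"}, 2))  # part of 1 0*
```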
26
Example: Keyword
» Keyword: else or if or begin …
else | if | begin | …
27
Example: Integers
Integer: a non-empty string of digits
( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*
Problem: the complicated expression has to be repeated
Improvement: define intermediate reg. expr.
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
number = digit digit*
Abbreviation: A+ = A A*
28
Regular Definitions
Names for regular expressions
» d1 = r1
» d2 = r2
» ...
» dn = rn
where each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}
Note: recursion is not allowed.
29
Example
» Identifier: strings of letters or digits, starting with a letter
digit = 0 | 1 | ... | 9
letter = A | … | Z | a | … | z
identifier = letter (letter | digit) *
» Is (letter* | digit*) the same ?
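The question can be explored with Python's `re` module. The sketch assumes the question means replacing `(letter | digit)*` by `(letter* | digit*)` inside the identifier definition; `fullmatch` is used so that the whole string must belong to the language.

```python
import re

# identifier = letter (letter | digit)*
identifier = re.compile(r"[A-Za-z]([A-Za-z]|[0-9])*")
# assumed variant: letter (letter* | digit*)
variant = re.compile(r"[A-Za-z]([A-Za-z]*|[0-9]*)")

def in_language(regex, s):
    """True if the whole string s is matched by regex."""
    return regex.fullmatch(s) is not None

samples = ["count", "x25", "var23", "a1b"]
results = {s: (in_language(identifier, s), in_language(variant, s)) for s in samples}
```

Under that assumption the two expressions differ: a string such as `var23`, which mixes letters and digits after the first letter, is matched only by the original definition.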
30
Example: Whitespace
Whitespace: a non-empty sequence of blanks, newlines, CRNL (\r\n) and tabs
WS = (\ | \t | \n | \r\n )+
31
Example: Email Addresses Consider [email protected]
Σ = letters ∪ { ., @ }
name = letter+
address = name ‘@’ name (‘.’ name)*
32
Notational Shorthands
One or more instances
» r+ = r r*
» r* = r+ | ε
Zero or one instance
» r? = r | ε
Character classes
» [abc] = a | b | c
» [a-z] = a | b | ... | z
» [ac-f] = a | c | d | e | f
» [^ac-f] = Σ – [ac-f]
33
Summary
Regular expressions describe many useful languages
Regular languages are a language specification
» We still need an implementation
Problem: Given a string s and a rexp R, is s ∈ L(R)?
34
Implementation of Lexical Analysis
35
Outline
Specifying lexical structure using regular expressions
Finite automata
» Deterministic Finite Automata (DFAs)
» Non-deterministic Finite Automata (NFAs)
Implementation of regular expressions
RegExp => NFA => DFA => optimized DFA => Tables
36
Regular Expressions in Lexical Specification
Last lecture: a specification for the predicate s ∈ L(R)
But a yes/no answer is not enough!
Instead: partition the input into lexemes
We will adapt regular expressions to this goal
37
Regular Expressions => Lexical Spec. (1)
1. Select a set of tokens
• Number, Keyword, Identifier, ...
2. Write a rexp for the lexemes of each token
• Number = digit+
• Keyword = if | else | …
• Identifier = letter (letter | digit)*
• OpenPar = ‘(’
• …
38
Regular Expressions => Lexical Spec. (2)
3. Construct R, matching all lexemes for all tokens
R = Keyword | Identifier | Number | …
= R1 | R2 | R3 | …
Facts: If s ∈ L(R) then s is a lexeme
» Furthermore s ∈ L(Ri) for some “i”
» This “i” determines the token that is reported
39
Regular Expressions => Lexical Spec. (3)
4. Let the input be x1…xn
(x1 ... xn are characters in the language alphabet)
• For 1 ≤ i ≤ n check
x1…xi ∈ L(R) ?
5. It must be that
x1…xi ∈ L(Rj) for some j
6. Remove x1…xi from input and go to (4)
40
How to Handle Spaces and Comments?
1. We could create a token Whitespace
Whitespace = (‘ ’ | ‘\n’ | ‘\r’ | ‘\t’)+
» We could also add comments in there
» An input “ \t\n 5555 ” is transformed into Whitespace Integer Whitespace
2. Lexer skips spaces (preferred)
• Modify step 5 from before as follows:
It must be that xk ... xi ∈ L(Rj) for some j such that x1 ... xk-1 ∈ L(Whitespace)
• Parser is not bothered with spaces
41
Ambiguities (1)
There are ambiguities in the algorithm
How much input is used? What if
– x1…xi ∈ L(R) and also
– x1…xK ∈ L(R) (with i ≠ K)
» Rule: Pick the longest possible substring
» The maximal match
42
Ambiguities (2)
Which token is used? What if
– x1…xi ∈ L(Rj) and also
– x1…xi ∈ L(Rk)
» Rule: use the rule listed first (j if j < k)
Example:
» R1 = Keyword and R2 = Identifier
» “if” matches both.
» Treats “if” as a keyword not an identifier
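The two disambiguation rules (longest match first, then the rule listed first) can be sketched as follows. The rule names and patterns are illustrative assumptions. Python's `re` does not guarantee the longest match for an arbitrary alternation, so the patterns here are chosen so that each one on its own returns its longest match.

```python
import re

# Rules in priority order: on a tie in length, the earlier rule wins.
RULES = [
    ("Keyword",    re.compile(r"if|else")),
    ("Identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("Number",     re.compile(r"[0-9]+")),
    ("Relation",   re.compile(r"==")),
    ("Assign",     re.compile(r"=")),
    ("Whitespace", re.compile(r"[ \t\n]+")),
]

def next_token(text, pos):
    """Return (token, lexeme, new_pos) using the maximal-match rule."""
    best_name, best_end = None, pos
    for name, regex in RULES:
        m = regex.match(text, pos)
        # Strict '>' keeps the earliest rule among equally long matches.
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    if best_name is None:
        raise ValueError(f"no rule matches at position {pos}")
    return best_name, text[pos:best_end], best_end

def lex(text):
    """Tokenize text, discarding Whitespace tokens."""
    pos, out = 0, []
    while pos < len(text):
        name, lexeme, pos = next_token(text, pos)
        if name != "Whitespace":
            out.append((name, lexeme))
    return out

result = lex("if iffy == i=1")
```

Here `if` is reported as a Keyword because both rules match two characters and Keyword is listed first, while `iffy` is an Identifier because the Identifier rule matches more input.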
43
Error Handling
What if no rule matches a prefix of input?
Problem: Can’t just get stuck …
Solution:
» Write a rule matching all “bad” strings
» Put it last
Lexer tools allow the writing of:
R = R1 | ... | Rn | Error
» Token Error matches if nothing else matches
44
Summary
Regular expressions provide a concise notation for string patterns
Use in lexical analysis requires small extensions
» To resolve ambiguities
» To handle errors
Good algorithms known (next)
» Require only single pass over the input
» Few operations per character (table lookup)
45
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
» An input alphabet Σ
» A set of states S
» A start state n
» A set of accepting states F ⊆ S
» A set of transitions state --input--> state
46
Finite Automata
Transition
s1 --a--> s2
is read
In state s1 on input “a” go to state s2
If end of input (or no transition possible)
» If in accepting state => accept
» Otherwise => reject
47
Finite Automata State Graphs
[Figure: legend of the state-graph symbols]
• A state
• The start state
• An accepting state
• A transition (labeled with its input symbol, e.g. a)
48
A Simple Example
A finite automaton that accepts only “1”
[Figure: state graph with one transition, labeled 1]
49
Another Simple Example
A finite automaton accepting any number of 1’s followed by a single 0
Alphabet: {0,1}
[Figure: state graph with transitions labeled 1 and 0]
50
And Another Example
Alphabet {0,1}
What language does this recognize?
[Figure: state graph with transitions labeled 0 and 1]
51
And Another Example
Alphabet still { 0, 1 }
The operation of the automaton is not completely defined by the input
» On input “11” the automaton could be in either state
[Figure: state graph with two transitions labeled 1]
52
Epsilon Moves
Another kind of transition: ε-moves
• Machine can move from state A to state B without reading input
[Figure: A --ε--> B]
53
Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA)
» One transition per input per state
» No ε-moves
Nondeterministic Finite Automata (NFA)
» Can have multiple transitions for one input in a given state
» Can have ε-moves
Finite automata have finite memory
» Need only to encode the current state
54
Execution of Finite Automata
A DFA can take only one path through the state graph
» Completely determined by input
NFAs can choose
» Whether to make ε-moves
» Which of multiple transitions for a single input to take
55
Acceptance of NFAs
An NFA can get into multiple states
[Figure: NFA state graph with transitions labeled 0 and 1]
• Input: 1 0 1
• Rule: NFA accepts if it can get into a final state
56
Acceptance of a Finite Automaton
An FA (DFA or NFA) accepts an input string s iff there is some path in the transition diagram from the start state to some final state such that the edge labels along this path spell out s
57
NFA vs. DFA (1) NFAs and DFAs recognize the same set of
languages (regular languages)
DFAs are easier to implement» There are no choices to consider
58
NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA
[Figure: an NFA and an equivalent DFA, with transitions labeled 0 and 1]
• DFA can be exponentially larger than NFA
59
Operations on NFA states
ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(S): set of NFA states reachable from some NFA state s in S on ε-transitions alone
move(S, c): set of NFA states to which there is a transition on input symbol c from some NFA state s in S
Notes:
» ε-closure(S) = ∪ over s ∈ S of ε-closure(s);
» ε-closure(s) = ε-closure({s});
» ε-closure(∅) = ?
60
Computing ε-closure
Input. An NFA and a set of NFA states S.
Output. E = ε-closure(S).
begin
  push all states in S onto stack; E := S;
  while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u labeled ε do
      if u is not in E do begin
        add u to E; push u onto stack
      end
  end;
  return E
end.
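The stack-based procedure translates almost line for line into Python. In this sketch the NFA's ε-edges are assumed to be given as a dictionary from a state to the states reachable by a single ε-move; both that encoding and the small example graph are hypothetical.

```python
def epsilon_closure(states, eps):
    """Return the set of states reachable from `states` on ε-edges alone.

    eps maps a state to the states one ε-move away (assumed encoding).
    """
    closure = set(states)
    stack = list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

# Hypothetical ε-edges: A -> B, A -> H, B -> C, B -> D, H -> I.
EPS = {"A": ["B", "H"], "B": ["C", "D"], "H": ["I"]}
from_a = epsilon_closure({"A"}, EPS)
from_h = epsilon_closure({"H"}, EPS)
from_nothing = epsilon_closure(set(), EPS)
```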
61
Simulating an NFA
Input. An input string x ended with eof and an NFA N with start state s0 and final states F.
Output. The answer “yes” if N accepts x, “no” otherwise.
begin
  S := ε-closure({s0});
  c := next_symbol();
  while c != eof do begin
    S := ε-closure(move(S, c));
    c := next_symbol()
  end;
  if S ∩ F != ∅ then return “yes”
  else return “no”
end.
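A minimal Python sketch of this simulation is shown below. The dictionary encoding of the NFA and the example automaton (binary strings ending in 1, with one ε-move out of the start state) are assumptions made for illustration.

```python
def eps_closure(states, eps):
    """States reachable from `states` using ε-edges only."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ()):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return closure

def move(states, c, delta):
    """States reachable from `states` by one transition on symbol c."""
    return {u for s in states for u in delta.get((s, c), ())}

def nfa_accepts(text, start, finals, delta, eps):
    """Track the set of states the NFA could be in after each symbol."""
    current = eps_closure({start}, eps)
    for c in text:
        current = eps_closure(move(current, c, delta), eps)
    return bool(current & finals)

# Hypothetical NFA for binary strings that end in 1.
DELTA = {("q0", "0"): {"q0"}, ("q0", "1"): {"q0", "q1"}}
EPS = {"s": {"q0"}}

def ends_in_one(text):
    return nfa_accepts(text, "s", {"q1"}, DELTA, EPS)
```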
62
Regular Expressions to Finite Automata
High-level sketch
[Figure: Lexical Specification → Regular expressions → NFA → DFA → Optimized DFA → Table-driven Implementation of DFA]
63
Regular Expressions to NFA (1)
For each kind of rexp, define an NFA
» Notation: NFA for rexp A
[Figure: box labeled A]
• For ε
[Figure: a single ε-transition]
• For input a
[Figure: a single transition labeled a]
64
Regular Expressions to NFA (2)
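For AB
[Figure: NFA for AB, built from the NFAs for A and B]
• For A + B
[Figure: NFA for A + B, built from the NFAs for A and B]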
65
Regular Expressions to NFA (3)
For A*
[Figure: NFA for A*, built from the NFA for A]
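The constructions for single symbols, AB, A + B and A* can be sketched in Python. The original diagrams did not survive in this transcript, so the ε-edge layout below is one common Thompson-style variant and may differ in detail from the slides. A fragment is a hypothetical triple (start, accept, edges), and an edge label of `None` stands for ε.

```python
from itertools import count

_fresh = count()  # source of fresh state numbers

def symbol(c):
    """Fragment for one input symbol: start --c--> accept."""
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, c, f)]

def concat(a, b):
    """Fragment for AB: ε-edge from A's accept state to B's start state."""
    return a[0], b[1], a[2] + b[2] + [(a[1], None, b[0])]

def union(a, b):
    """Fragment for A + B: new start and accept states around both."""
    s, f = next(_fresh), next(_fresh)
    eps = [(s, None, a[0]), (s, None, b[0]), (a[1], None, f), (b[1], None, f)]
    return s, f, a[2] + b[2] + eps

def star(a):
    """Fragment for A*: skip A entirely, or run it and come back."""
    s, f = next(_fresh), next(_fresh)
    return s, f, a[2] + [(s, None, a[0]), (s, None, f), (a[1], None, s)]

def accepts(nfa, text):
    """Simulate the fragment on text by tracking the set of current states."""
    start, accept, edges = nfa

    def closure(states):
        states, stack = set(states), list(states)
        while stack:
            t = stack.pop()
            for src, label, dst in edges:
                if src == t and label is None and dst not in states:
                    states.add(dst)
                    stack.append(dst)
        return states

    current = closure({start})
    for c in text:
        current = closure({d for s, lab, d in edges if s in current and lab == c})
    return accept in current

# (1+0)*1, the expression used in the conversion example
nfa = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
```

With this variant the NFA for (1+0)*1 ends up with ten states, and `accepts(nfa, "0101")` holds while `accepts(nfa, "10")` does not.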
66
Example of RegExp -> NFA conversion
Consider the regular expression (1+0)*1
The NFA is
[Figure: NFA with states A–J; labeled transitions C --1--> E, D --0--> F, I --1--> J]
67
NFA to DFA
70
NFA to DFA. The Trick
Simulate the NFA
Each state of DFA = a non-empty subset of states of the NFA
Start state = the set of NFA states reachable through ε-moves from NFA start state
Add a transition S --a--> S’ to DFA iff
» S’ is the set of NFA states reachable from any state in S after seeing the input a
– considering ε-moves as well
71
NFA -> DFA Example
[Figure: the NFA with states A–J and the resulting DFA with states ABCDHI, FGABCDHI and EJGABCDHI; transitions labeled 0 and 1]
72
NFA to DFA. Remark
An NFA may be in many states at any time
How many different states?
If there are N states, the NFA must be in some subset of those N states
How many non-empty subsets are there?
» 2^N - 1 = finitely many
73
From an NFA to a DFA
Subset construction Algorithm.
Input. An NFA N.
Output. A DFA D with states S and transition table mv.
begin
  add ε-closure({s0}) as an unmarked state to S;
  while there is an unmarked state T in S do begin
    mark T;
    for each input symbol a do begin
      U := ε-closure(move(T, a));
      if U is not in S then
        add U as an unmarked state to S;
      mv[T, a] := U
    end
  end
end.
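The algorithm can be sketched in Python and run on the NFA for (1+0)*1. The labeled transitions (C on 1 to E, D on 0 to F, I on 1 to J) come from the example; the ε-edges are an assumption, reconstructed so that the resulting DFA states agree with ABCDHI, FGABCDHI and EJGABCDHI from the NFA -> DFA example.

```python
def eps_closure(states, eps):
    """ε-closure of a set of NFA states, returned as a hashable frozenset."""
    closure, stack = set(states), list(states)
    while stack:
        t = stack.pop()
        for u in eps.get(t, ""):
            if u not in closure:
                closure.add(u)
                stack.append(u)
    return frozenset(closure)

def subset_construction(start, alphabet, delta, eps):
    """Return (start state, set of DFA states, transition table mv)."""
    s0 = eps_closure({start}, eps)
    states, unmarked, mv = {s0}, [s0], {}
    while unmarked:
        t = unmarked.pop()               # "mark T"
        for a in alphabet:
            moved = {v for s in t for v in delta.get((s, a), "")}
            u = eps_closure(moved, eps)
            if u not in states:
                states.add(u)
                unmarked.append(u)
            mv[t, a] = u
    return s0, states, mv

# Assumed ε-edges; each string lists the ε-successors of a state.
EPS = {"A": "BH", "B": "CD", "E": "G", "F": "G", "G": "AH", "H": "I"}
DELTA = {("C", "1"): "E", ("D", "0"): "F", ("I", "1"): "J"}

s0, dfa_states, mv = subset_construction("A", "01", DELTA, EPS)
finals = {t for t in dfa_states if "J" in t}  # J is the NFA's accepting state
```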
74
Implementation
A DFA can be implemented by a 2D table mv
» One dimension is “states”
» Other dimension is “input symbols”
» For every transition Si --a--> Sk define mv[i,a] = k
DFA “execution”
» If in state Si and input a, read mv[i,a] = k and go to state Sk
» Very efficient
75
Table Implementation of a DFA
[Figure: DFA with states S, T, U; transitions labeled 0 and 1]
      0   1
S     T   U
T     T   U
U     T   U
76
Simulation of a DFA
Input. An input string x ended with eof and a DFA D with start state s0 and final states F.
Output. The answer “yes” if D accepts x, “no” otherwise.
begin
  s := s0;
  c := next_symbol();
  while c <> eof do begin
    s := mv[s, c];
    c := next_symbol()
  end;
  if s is in F then return “yes”
  else return “no”
end.
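A table-driven simulation is only a few lines of Python. The table below copies the S/T/U transition table from the table-implementation slide. The choice of S as start state and U as the only accepting state is an assumption, since the diagram is not preserved in the transcript.

```python
# Transition table mv[state][symbol] for the three-state DFA.
MV = {
    "S": {"0": "T", "1": "U"},
    "T": {"0": "T", "1": "U"},
    "U": {"0": "T", "1": "U"},
}

def dfa_accepts(text, start="S", finals=frozenset({"U"})):
    """One table lookup per input character, then check the final state."""
    s = start
    for c in text:
        s = MV[s][c]
    return s in finals

checks = {w: dfa_accepts(w) for w in ["", "1", "10", "0101"]}
```

Under that assumption the DFA accepts exactly the binary strings that end in 1.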
77
Implementation (Cont.)
NFA -> DFA conversion is at the heart of tools such as flex
But, DFAs can be huge
» DFA => optimized DFA: try to decrease the number of states.
» not always helpful!
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations
78
Time-Space Tradeoffs
RE to NFA, simulate NFA
» time: O(|r| * |x|), space: O(|r|)
RE to NFA, NFA to DFA, simulate DFA
» time: O(|x|), space: O(2^|r|)
Lazy transition evaluation
» transitions are computed as needed at run time;
» computed transitions are stored in cache for later use
79
DFA to optimized DFA
80
Motivations
Problems:
1. Given a DFA M with k states, is it possible to find an equivalent DFA M’ (i.e., L(M) = L(M’)) with fewer than k states?
2. Given a regular language A, how to find a machine with the minimum number of states?
Ex: A = L((a+b)*aba(a+b)*) can be accepted by the following NFA:
[Figure: NFA with states s, t, u, v; transitions s --a--> t, t --b--> u, u --a--> v, and two loops labeled a,b]
By applying the subset construction, we can construct a DFA M2 with 2^4 = 16 states, of which only 6 are accessible from the initial state {s}.
81
Inaccessible states
A state p ∈ Q is said to be inaccessible (or unreachable) [from the initial state] if there exists no path from the initial state to it. If a state is not inaccessible, it is accessible.
Inaccessible states can be removed from the DFA without affecting the behavior of the machine.
Problem: Given a DFA (or NFA), how to find all inaccessible states?
82
Finding all accessible states:
(like ε-closure)
Input. An FA (DFA or NFA)
Output. The set A of all accessible states
begin
  push all start states onto stack; add all start states to A;
  while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u do
      if u is not in A do begin
        add u to A; push u onto stack
      end
  end;
  return A
end.
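As a check on the claim that only 6 of the 16 subset states are accessible, the search can be run on the subset DFA of the NFA for (a+b)*aba(a+b)*. The NFA's edges are an assumption read off the figure labels: loops on a,b at s and v, plus s on a to t, t on b to u, and u on a to v.

```python
from itertools import combinations

# Assumed NFA transitions for (a+b)*aba(a+b)*.
NFA = {
    ("s", "a"): {"s", "t"}, ("s", "b"): {"s"},
    ("t", "b"): {"u"},
    ("u", "a"): {"v"},
    ("v", "a"): {"v"}, ("v", "b"): {"v"},
}

def dfa_step(subset, c):
    """Transition of the subset DFA: union of the NFA moves on c."""
    return frozenset(u for q in subset for u in NFA.get((q, c), ()))

# All 2^4 subsets of {s, t, u, v}: the states of the full subset DFA.
ALL_SUBSETS = [frozenset(p) for r in range(5) for p in combinations("stuv", r)]

def accessible_states(starts, successors):
    """Stack-based search for every state reachable from the start states."""
    seen, stack = set(starts), list(starts)
    while stack:
        t = stack.pop()
        for u in successors(t):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

reachable = accessible_states(
    [frozenset("s")],
    lambda t: [dfa_step(t, c) for c in "ab"],
)
```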
83
Minimization process
Minimization process for a DFA:
» 1. Remove all inaccessible states
» 2. Collapse all equivalent states
What does it mean that two states are equivalent?
» both states have the same observable behaviors, i.e.,
» there is no way to distinguish their difference, or
» more formally, we say p and q are not equivalent (or distinguishable) iff there is a string x ∈ Σ* s.t. exactly one of Δ(p,x) and Δ(q,x) is a final state,
» where Δ(p,x) is the ending state of the path from p with x as the input.
Equivalent states can be merged to form a simpler machine.
84
Example:
[Figure: a DFA with states 0–5 over {a,b}, and the collapsed DFA with states 0, {1,2}, {3,4} and 5]
85
Quotient Construction
M = (Q, Σ, δ, s, F): a DFA.
≈: a relation on Q defined by:
p ≈ q <=> for all x ∈ Σ*: Δ(p,x) ∈ F iff Δ(q,x) ∈ F
(Δ(p,x) is the state reached from p on input x)
Property: ≈ is an equivalence relation. Hence it partitions Q into equivalence classes [p] = {q ∈ Q | p ≈ q} for p ∈ Q, and the quotient set Q/≈ = {[p] | p ∈ Q}.
Every p ∈ Q belongs to exactly one class [p], and p ≈ q iff [p] = [q].
Define the quotient machine M/≈ = <Q’, Σ, δ’, s’, F’> where
» Q’ = Q/≈; s’ = [s]; F’ = {[p] | p ∈ F}; and
» δ’([p],a) = [δ(p,a)] for all p ∈ Q and a ∈ Σ.
86
Minimization algorithm
input: a DFA
output: an optimized DFA
1. Write down a table of all pairs {p,q}, initially unmarked.
2. Mark {p,q} if p ∈ F and q ∉ F or vice versa.
3. Repeat until no more change:
3.1 if ∃ an unmarked pair {p,q} s.t. {move(p,a), move(q,a)} is marked for some a ∈ Σ, then mark {p,q}.
4. When done, p ≈ q iff {p,q} is not marked.
5. Merge all equivalent states into one class and return the resulting machine.
87
An Example: The DFA (> marks the start state, F marks final states):
       a   b
>0     1   2
 1F    3   4
 2F    4   3
 3     5   5
 4     5   5
 5F    5   5
88
Initial Table
1 -
2 - -
3 - - -
4 - - - -
5 - - - - -
0 1 2 3 4
89
After step 2
1 M
2 M -
3 - M M
4 - M M -
5 M - - M M
0 1 2 3 4
90
After first pass of step 3
1 M
2 M -
3 - M M
4 - M M -
5 M M M M M
0 1 2 3 4
91
2nd pass of step 3. The result: 1 ≈ 2 and 3 ≈ 4.
1 M
2 M -
3 M2 M M
4 M2 M M -
5 M M1 M1 M M
0 1 2 3 4
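The marking procedure can be replayed in Python on the example DFA (start state 0, final states 1, 2 and 5). This is a sketch of the table-filling steps 1 to 4 only; merging the equivalent states (step 5) is left out, and the function name is hypothetical.

```python
from itertools import combinations

# Transition table of the example DFA and its final states.
DELTA = {
    0: {"a": 1, "b": 2},
    1: {"a": 3, "b": 4},
    2: {"a": 4, "b": 3},
    3: {"a": 5, "b": 5},
    4: {"a": 5, "b": 5},
    5: {"a": 5, "b": 5},
}
FINAL = {1, 2, 5}

def equivalent_pairs(delta, final):
    """Return the pairs {p, q} left unmarked by the table-filling algorithm."""
    states = sorted(delta)
    # Step 2: mark pairs with exactly one final state.
    marked = {
        frozenset(pair)
        for pair in combinations(states, 2)
        if (pair[0] in final) != (pair[1] in final)
    }
    # Step 3: keep marking until a full pass changes nothing.
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            for a in delta[p]:
                succ = frozenset((delta[p][a], delta[q][a]))
                if len(succ) == 2 and succ in marked:
                    marked.add(frozenset((p, q)))
                    changed = True
                    break
    # Step 4: unmarked pairs are equivalent.
    return {(p, q) for p, q in combinations(states, 2) if frozenset((p, q)) not in marked}

result = equivalent_pairs(DELTA, FINAL)
```

The unmarked pairs are (1, 2) and (3, 4), matching the final table.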