Compiler Design, Unit 2: Lexical Analysis
UNIT 2: LEXICAL ANALYSIS
Sadique Nayeem, Asst. Professor, Dept. of CSE
Sitamarhi Institute of Technology, Sitamarhi
Lexical Analysis
As the first phase of a compiler, the lexical analyzer has three main tasks: read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis.
It is common for the lexical analyzer to interact with the symbol table as well.
Another task of the lexical analyzer is stripping out comments and whitespace (blank, newline, tab).
Another task is correlating error messages generated by the compiler with the source program.
getNextToken
Commonly, the interaction is implemented by having the parser call the lexical analyzer. The call, suggested by the getNextToken command, causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
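The getNextToken interaction can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the class name Lexer, the method get_next_token, the stand-in parse_all, and the token patterns in TOKEN_SPEC are all assumptions made for this example.

```python
import re

# Hypothetical token patterns for a tiny C-like language (assumed for this sketch).
TOKEN_SPEC = [
    ("ws",     r"[ \t\n]+"),        # whitespace: skipped, never returned
    ("number", r"\d+"),
    ("id",     r"[A-Za-z_]\w*"),
    ("op",     r"==|[=+\-*/<>]"),
    ("punct",  r"[(){};,]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the next unread input character

    def get_next_token(self):
        """Read characters until the next lexeme is identified; return its token."""
        while self.pos < len(self.text):
            m = MASTER.match(self.text, self.pos)
            if m is None:
                raise SyntaxError(f"no token pattern matches at offset {self.pos}")
            self.pos = m.end()
            if m.lastgroup != "ws":
                return (m.lastgroup, m.group())
        return ("eof", "")


def parse_all(text):
    """Stand-in for the parser: it pulls tokens on demand, one call at a time."""
    lexer, tokens = Lexer(text), []
    while True:
        tok = lexer.get_next_token()
        if tok[0] == "eof":
            return tokens
        tokens.append(tok)
```

The lexer does no work until the parser asks for a token, which mirrors the call-and-return relationship described above.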
Sometimes, lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
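The scanning stage of this cascade might look like the Python sketch below. It assumes C-style comments and, for simplicity, that no comment-like text appears inside string literals. The function name scan is hypothetical.

```python
import re


def scan(source):
    """Scanning pass: delete comments, then compact whitespace runs into one blank."""
    # Assumption: C-style comments, and no comment-like text inside string literals.
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)   # /* ... */
    no_line = re.sub(r"//[^\n]*", " ", no_block)                    # // ...
    return re.sub(r"[ \t\n]+", " ", no_line).strip()
```

For example, scan("int a; /* counter */\n\n  a = 1; // init") yields "int a; a = 1;", which is then handed to lexical analysis proper.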
All programs have:
Keywords
Operators
Identifiers
Constants (number and strings)
Punctuation marks
Token
A token is a pair consisting of a token name and an optional attribute value.
<token name, attribute value>
The token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.
The token names are the input symbols that the parser processes.
Pattern
A pattern is a description of the form that the lexemes of a token may take.
In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword. (Example: if)
For identifiers and some other tokens, the pattern is a more complex structure that is matched by many strings. (Example: age)
Lexeme
A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
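The three terms can be tied together in a short Python sketch. The keyword set, the identifier pattern, the list-based symbol table, and the function name make_token are assumptions chosen for illustration only.

```python
import re

KEYWORDS = {"if", "else", "int", "void"}              # keyword pattern: the exact characters
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")    # identifier pattern: many strings match

symbol_table = []   # hypothetical symbol table: a list of identifier lexemes


def make_token(lexeme):
    """Return a <token name, attribute value> pair for one lexeme."""
    if lexeme in KEYWORDS:
        return (lexeme, None)                          # keyword: the name alone identifies it
    if ID_PATTERN.fullmatch(lexeme):
        if lexeme not in symbol_table:
            symbol_table.append(lexeme)
        return ("id", symbol_table.index(lexeme))      # attribute: symbol-table entry
    raise ValueError(f"{lexeme!r} matches no pattern in this sketch")
```

Here the lexeme if matches only the keyword pattern, while age and any other name match the identifier pattern and share the token name id, distinguished by their attribute value.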
#include<stdio.h>
void main()
{
    printf("SIT, Sitamarhi");
}

#include<stdio.h>
void main()
{
    int a=10, b=20, c;
    c = a + b;
    printf("%d", c);
}
Examples of Tokens
GATE 2000: The number of tokens in the following C statement is:
printf("i = %d, &i = %x", i, &i);
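The tokens of this statement can be counted with a small Python sketch. The patterns below cover only what this one statement needs (a string literal without escape handling, identifiers, and a few punctuation characters). The function name tokens_of is hypothetical.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<string>"[^"\n]*")          # string literal (no escape handling in this sketch)
  | (?P<id>[A-Za-z_]\w*)           # identifier
  | (?P<punct>[(),;&])             # single-character operators and punctuation
  | (?P<ws>\s+)                    # whitespace, skipped
""", re.VERBOSE)


def tokens_of(stmt):
    """Split one statement into its lexemes, dropping whitespace."""
    out, pos = [], 0
    while pos < len(stmt):
        m = TOKEN_RE.match(stmt, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {stmt[pos]!r}")
        if m.lastgroup != "ws":
            out.append(m.group())
        pos = m.end()
    return out


stmt = 'printf("i = %d, &i = %x", i, &i);'
print(len(tokens_of(stmt)), tokens_of(stmt))
```

The whole string literal counts as one token, so the statement splits into printf, (, the string, a comma, i, a comma, &, i, ), and ; for a total of 10 tokens.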
Lexical Errors
Lexical errors are mainly spelling mistakes and the accidental insertion of characters that the language does not allow.
It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
For instance, if the string fi is encountered for the first time in a C program in the context:
fi ( a == 10 )
A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid lexeme for the token id, the lexical analyzer must return the token id to the parser and let some other phase of the compiler (probably the parser in this case) handle an error due to transposition of the letters.
Suppose a situation arises in which the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input.
The simplest recovery strategy is "panic mode" recovery. We delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left.
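Panic-mode recovery can be sketched in Python as below. The token patterns are hypothetical and deliberately leave out characters such as @ and $ so that an error can occur. The function name tokenize_with_panic_mode is also an assumption.

```python
import re

# Hypothetical patterns; '@' and '$' are deliberately not covered by any of them.
TOKEN_RE = re.compile(r"(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>[=+\-*/;()])")


def tokenize_with_panic_mode(text):
    """Return (tokens, errors); errors lists (offset, deleted characters)."""
    tokens, errors, pos = [], [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(text, pos)
        if m:
            tokens.append((m.lastgroup, m.group()))
            pos = m.end()
            continue
        # Panic mode: delete characters until some pattern matches a prefix of what is left.
        start = pos
        while pos < len(text) and not text[pos].isspace() and not TOKEN_RE.match(text, pos):
            pos += 1
        errors.append((start, text[start:pos]))
    return tokens, errors
```

On the input "a = @$ 10;" the sketch deletes the two characters @$ at offset 4, records them as one error, and resumes with the number 10.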
Other possible error-recovery actions are:
1. Delete one character from the remaining input.
2. Insert a missing character into the remaining input.
3. Replace a character by another character.
4. Transpose two adjacent characters.
Specification of Tokens
Alphabet
String
Language
Operations on Languages (U, ., *, +)
Kleene Closure and Positive Closure
Transition Table
ε-Closure
RE to ε-NFA
ε-NFA to NFA
NFA to DFA
Regular Expression
Transition Diagram
Finite Automata
NFA
DFA
ε-NFA
DFA Minimization
Regular Definitions
[Figure: RE, ε-NFA, NFA, DFA]
RE to DFA
A regular expression can be represented by its syntax tree, where the leaves correspond to operands and the interior nodes correspond to operators.
An interior node is called a cat-node, or-node, or star-node if it is labeled by the concatenation operator (dot), the union operator |, or the star operator *, respectively.
Leaves in a syntax tree are labeled by ε or by an alphabet symbol. To each leaf not labeled ε, we attach a unique integer.
We refer to this integer as the position of the leaf and also as a position of its symbol.
Construct Syntax tree
a(a|b)*#
(a|b)c*#
(a|b) (a|b)#
(a|b)*(a|b)#
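One way to build such trees mechanically is a small recursive-descent parser, sketched here in Python. The tuple shapes for nodes and the function name build_syntax_tree are assumptions for illustration. The sketch supports only single-character symbols, |, *, and parentheses, and has no ε leaves.

```python
def build_syntax_tree(regex):
    """Return nested tuples: ("leaf", symbol, position), ("or", l, r),
    ("cat", l, r), ("star", child). Leaves are numbered 1, 2, ... left to right."""
    regex = regex.replace(" ", "")   # blanks are layout only, not symbols
    pos = 0                          # index of the next unread character
    counter = 0                      # last position number handed out

    def peek():
        return regex[pos] if pos < len(regex) else None

    def parse_union():               # lowest precedence: c1 | c2
        nonlocal pos
        node = parse_concat()
        while peek() == "|":
            pos += 1
            node = ("or", node, parse_concat())
        return node

    def parse_concat():              # juxtaposition: c1 c2
        node = parse_star()
        while peek() is not None and peek() not in "|)":
            node = ("cat", node, parse_star())
        return node

    def parse_star():                # highest precedence: c1*
        nonlocal pos
        node = parse_atom()
        while peek() == "*":
            pos += 1
            node = ("star", node)
        return node

    def parse_atom():                # a symbol or a parenthesized expression
        nonlocal pos, counter
        ch = peek()
        if ch is None or ch in "|)*":
            raise ValueError(f"unexpected {ch!r} at offset {pos}")
        pos += 1
        if ch == "(":
            node = parse_union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        counter += 1
        return ("leaf", ch, counter)

    tree = parse_union()
    if pos != len(regex):
        raise ValueError(f"unexpected {regex[pos]!r} at offset {pos}")
    return tree
```

For a(a|b)*# this gives a cat-node at the root whose right child is the leaf # at position 4, and whose left child is the cat-node joining leaf a (position 1) with the star-node over a|b (positions 2 and 3).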
Functions Computed From the Syntax Tree
To construct a DFA directly from a regular expression, we construct its syntax tree and then compute four functions: nullable, firstpos, lastpos, and followpos, defined as follows. Each definition refers to the syntax tree for a particular augmented regular expression ( r ) #.
1. nullable(n) is true for a syntax-tree node n if and only if the subexpression represented by n has ε in its language. That is, the subexpression can be "made null" or the empty string, even though there may be other strings it can represent as well.
2. firstpos(n) is the set of positions in the subtree rooted at n that correspond to the first symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the first symbol of a string can come.)
3. lastpos(n) is the set of positions in the subtree rooted at n that correspond to the last symbol of at least one string in the language of the subexpression rooted at n. (That is, the positions from which the last symbol of a string can come.)
4. followpos(p), for a position p, is the set of positions q in the entire syntax tree that can immediately follow p: there is some string in the language of ( r ) # in which a symbol matched to position p is directly followed by a symbol matched to position q.
Computing nullable, firstpos, and lastpos
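For each kind of node n, the values are:
Leaf labeled ε: nullable(n) = true; firstpos(n) = Ø; lastpos(n) = Ø
Leaf with position i: nullable(n) = false; firstpos(n) = {i}; lastpos(n) = {i}
or-node n = c1 | c2: nullable(n) = nullable(c1) or nullable(c2); firstpos(n) = firstpos(c1) U firstpos(c2); lastpos(n) = lastpos(c1) U lastpos(c2)
cat-node n = c1 c2: nullable(n) = nullable(c1) and nullable(c2); firstpos(n) = if (nullable(c1)) firstpos(c1) U firstpos(c2) else firstpos(c1); lastpos(n) = if (nullable(c2)) lastpos(c1) U lastpos(c2) else lastpos(c2)
star-node n = c1*: nullable(n) = true; firstpos(n) = firstpos(c1); lastpos(n) = lastpos(c1)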
Computing followpos
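Only two kinds of node contribute to followpos. If n is a cat-node with children c1 and c2, then every position in firstpos(c2) is in followpos(i) for each i in lastpos(c1). If n is a star-node, then every position in firstpos(n) is in followpos(i) for each i in lastpos(n). The Python sketch below applies both rules in one post-order pass that also returns nullable, firstpos, and lastpos. The tuple node shapes, the function name analyze, and the choice of (a|b)*abb# as the example (inferred from the positions used in the worked transitions later in these slides) are assumptions.

```python
def analyze(node, followpos):
    """Post-order pass: return (nullable, firstpos, lastpos) and fill followpos."""
    kind = node[0]
    if kind == "eps":                         # leaf labeled ε
        return True, set(), set()
    if kind == "leaf":                        # leaf with a position
        position = node[2]
        followpos.setdefault(position, set())
        return False, {position}, {position}
    if kind == "or":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                          # rule 1: lastpos(c1) -> firstpos(c2)
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                          # rule 2: lastpos(n) -> firstpos(n)
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind {kind!r}")


# Hand-built tree for (a|b)*abb#, leaves numbered 1..6 from left to right.
a1, b2, a3, b4, b5, end6 = (("leaf", s, i) for i, s in enumerate("ababb#", start=1))
tree = ("cat", ("cat", ("cat", ("cat", ("star", ("or", a1, b2)), a3), b4), b5), end6)

followpos = {}
nullable, firstpos, lastpos = analyze(tree, followpos)
print(firstpos, followpos)
```

For this tree the root has firstpos {1,2,3}, and followpos maps 1 and 2 to {1,2,3}, 3 to {4}, 4 to {5}, 5 to {6}, and 6 to the empty set.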
Converting a Regular Expression Directly to a DFA
Step 1. Construct a syntax tree T from the augmented regular expression ( r ) #.
Step 2. Compute nullable, firstpos, lastpos, and followpos for T.
Step 3. Construct Dstates (the set of states of DFA D) and Dtran (the transition function for D) using the following procedure.
The states of D are sets of positions in T.
Initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions.
The start state of D is firstpos(n0), where node n0 is the root of T.
The accepting states are those containing the position for the endmarker symbol #.
Example: for the augmented regular expression (a|b)*abb#, the value of firstpos for the root of the tree is {1,2,3}, so this set is the start state of D.
Let us call this set of positions A.
We must compute Dtran[A, a] and Dtran[A, b].
Among the positions of A, leaf 1 and leaf 3 correspond to a, while leaf 2 corresponds to b. Thus,
Dtran[A, a] = followpos(1) U followpos(3) = {1,2,3,4} = B
Dtran[A, b] = followpos(2) = {1,2,3} = A
Dtran[B, a] = followpos(1) U followpos(3) = {1,2,3,4} = B
Dtran[B, b] = followpos(2) U followpos(4) = {1,2,3,5} = C
Dtran[C, a] = followpos(1) U followpos(3) = {1,2,3,4} = B
Dtran[C, b] = followpos(2) U followpos(5) = {1,2,3,6} = D
Dtran[D, a] = followpos(1) U followpos(3) = {1,2,3,4} = B
Dtran[D, b] = followpos(2) = {1,2,3} = A
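The marking procedure of Step 3 can be sketched in Python as follows. The followpos table and the symbol at each position are entered by hand and assume the expression (a|b)*abb#, which is inferred from the transitions above. The names build_dfa and symbol_at are hypothetical.

```python
def build_dfa(start, followpos, symbol_at, alphabet):
    """Construct Dstates and Dtran; each state is a frozenset of positions."""
    dstates = [frozenset(start)]      # discovery order; all states start unmarked
    dtran = {}
    marked = 0                        # dstates[:marked] are the marked states
    while marked < len(dstates):
        state = dstates[marked]       # mark the state, then consider its out-transitions
        marked += 1
        for symbol in alphabet:
            target = set()
            for p in state:
                if symbol_at[p] == symbol:
                    target |= followpos[p]
            target = frozenset(target)
            if not target:
                continue              # no transition on this symbol (dead state omitted)
            if target not in dstates:
                dstates.append(target)
            dtran[state, symbol] = target
    return dstates, dtran


# Assumed inputs for (a|b)*abb#: positions 1..6 and their followpos sets.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
symbol_at = {1: "a", 2: "b", 3: "a", 4: "b", 5: "b", 6: "#"}

dstates, dtran = build_dfa({1, 2, 3}, followpos, symbol_at, "ab")
names = {state: "ABCD"[i] for i, state in enumerate(dstates)}
table = {(names[s], sym): names[t] for (s, sym), t in dtran.items()}
accepting = [names[s] for s in dstates if 6 in s]   # states containing the position of #
print(table, accepting)
```

With these inputs the sketch discovers the states in the order A, B, C, D and reproduces the eight transitions listed above, with D as the only accepting state.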
[Figure: transition diagram of the DFA with states A, B, C, D]
Note: We can also minimize the resultant DFA.
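Minimization can be sketched with partition refinement: start from the groups of accepting and non-accepting states, and keep splitting a group whenever two of its states move to different groups on some symbol. The Python version below is an illustrative sketch that assumes a total transition function. The function name minimize and the table literal (copied from the transitions worked out above) are assumptions.

```python
def minimize(states, alphabet, dtran, accepting):
    """Refine the partition {accepting, non-accepting} until no group splits."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(s):
            return next(i for i, g in enumerate(partition) if s in g)

        refined = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                # States stay together only if every symbol leads to the same group.
                key = tuple(group_of(dtran[s, a]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            refined.extend(buckets.values())
        if len(refined) == len(partition):
            return refined
        partition = refined


# Transition table of the four-state DFA, with D as the accepting state.
dtran = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "B", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "D",
    ("D", "a"): "B", ("D", "b"): "A",
}
groups = minimize("ABCD", "ab", dtran, {"D"})
print(groups)
```

For this particular table every state ends up in its own group, which suggests that the directly constructed DFA is already minimal here. Other DFAs may shrink.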
Question Time
Q. Construct the DFA for each of the following regular expressions.
a(a|b)*#
(a|b)c*#
THANK YOU!