Compiler Design - Introduction
UNIT – 1
Overview of the compiler environment, pass and phase, phases of a compiler, regular expressions, lexical analyzer, the LEX tool, bootstrapping.
Compiler - Introduction
• A compiler is a computer program that translates a program in a source language into an equivalent program in a target language.
• A source program/code is a program/code written in the source language, which is usually a high-level language.
• A target program/code is a program/code written in the target language, which often is a machine language or an intermediate code.
[Figure: the compiler takes a source program as input and produces a target program as output; error messages are reported along the way.]
A language-processing system
[Pipeline: Skeletal Source Program → Preprocessor → Source Program → Compiler → Target Assembly Program → Assembler → Relocatable Object Code → Linker (together with Libraries and Relocatable Object Files) → Absolute Machine Code]
Try for example: gcc -v myprog.c
The Economy of Programming Languages
Why are there so many programming languages?
- Application domains have distinctive/conflicting needs.
Why are there new programming languages?
- Programmer training is the dominant cost.
What is a good programming language?
- There is no universally accepted metric.
Why Study Compilers?
● Build a large, ambitious software system.
● See theory come to life.
● Learn how to build programming languages.
● Learn how programming languages work.
● Learn tradeoffs in language design.
Building a compiler requires knowledge of
• programming languages (parameter passing, variable scoping, memory allocation, etc.)
• theory (automata, context-free languages, etc.)
• algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc.)
• computer architecture (assembly programming)
• software engineering.
Phases of a Compiler
Source program
→ Lexical analyzer → token stream
→ Syntax analyzer → syntax tree
→ Semantic analyzer → syntax tree
→ Intermediate code generator → intermediate representation
→ Code Optimizer → intermediate representation
→ Code generator → Target program
The Symbol Table and the Error Handler interact with all of the phases.
The Structure of a Compiler: The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis
  • Breaks up the source program into pieces and imposes a grammatical structure
  • Creates an intermediate representation of the source program
  • Determines the operations and records them in a tree structure, the syntax tree
  • Known as the front end of the compiler
– Synthesis
  • Constructs the target program from the intermediate representation
  • Takes the tree structure and translates the operations into the target program
  • Known as the back end of the compiler
Source Code → Front End → Intermediate Code → Back End → Target Code
• Three Phases:
– Linear / Lexical Analysis:
  • Left-to-right scan to identify tokens (token: a sequence of characters having a collective meaning)
– Hierarchical Analysis:
  • Grouping of tokens into meaningful collections
– Semantic Analysis:
  • Checking to ensure correctness of components
The Analysis Task For Compilation
Phase 1. Lexical Analysis
All are tokens; blanks, line breaks, etc. are scanned out:
position = initial + rate * 60 ;
(eight tokens: position, =, initial, +, rate, *, 60, ;)
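A minimal sketch of this scanning step in Python, assuming illustrative token class names and patterns (a real scanner would typically be generated from a LEX specification):

```python
import re

# Illustrative token classes and patterns; names are assumptions, not the course's LEX spec.
TOKEN_SPEC = [
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),  # identifiers
    ("NUM",    r"\d+"),                     # integer literals
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),                     # blanks and line breaks are scanned out
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return (token_class, lexeme) pairs, discarding whitespace."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]

tokens = tokenize("position = initial + rate * 60 ;")
# Eight tokens: position, =, initial, +, rate, *, 60, ;
```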
• First step: recognize words – the smallest unit above letters. For example: This is a sentence.
• Lexical analysis divides program text into “words” or “tokens”:
if (x == y) z = 1; else z = 2;
• Once words are understood, the next step is to understand sentence structure.
• Parsing = diagramming sentences; the diagram is a tree.
Phase 2. Hierarchical Analysis: Parsing or Syntax Analysis
For the previous example, we would have the parse tree:

assignment statement
├── identifier: position
├── =
└── expression
    ├── expression ── identifier: initial
    ├── +
    └── expression
        ├── expression ── identifier: rate
        ├── *
        └── expression ── number: 60

Nodes of the tree are constructed using a grammar for the language.
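The grouping of tokens into a tree can be sketched as a tiny recursive-descent parser. The grammar here (stmt → ID '=' expr, expr → term ('+' term)*, term → factor ('*' factor)*) is a hypothetical fragment chosen to match the example, not the course's full grammar:

```python
# Hedged sketch: recursive-descent parsing of  ID '=' expr  with + and *
# at the usual precedence; node shapes (op, left, right) are assumptions.
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok
    def factor():
        return eat()                      # an identifier or a number leaf
    def term():
        node = factor()
        while peek() == "*":              # '*' binds tighter than '+'
            eat(); node = ("*", node, factor())
        return node
    def expr():
        node = term()
        while peek() == "+":
            eat(); node = ("+", node, term())
        return node
    target = eat(); eat()                 # consume ID and '='
    return ("=", target, expr())

tree = parse(["position", "=", "initial", "+", "rate", "*", "60"])
# ("=", "position", ("+", "initial", ("*", "rate", "60")))
```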
Phase 3. Semantic Analysis
• Find more complicated semantic errors and support code generation
• The parse tree is augmented with semantic actions
Compressed tree:
:= (position, + (initial, * (rate, 60)))
Conversion action applied:
:= (position, + (initial, * (rate, inttofloat(60))))
• Most important activity in this phase: type checking – the legality of operands
• Many different situations:
  float = int + char ;
  A[int] = A[float] + int ;
  while (char != int) … etc.
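The type-checking and conversion action can be sketched as a walk over the tree. The symbol table contents and the int-widens-to-float rule are assumptions for illustration:

```python
# Hedged sketch of type checking with an inttofloat conversion action.
# SYMTAB entries are assumed; a real compiler fills them in from declarations.
SYMTAB = {"position": "float", "initial": "float", "rate": "float"}

def annotate(node):
    """Type-check an (op, left, right) tree; return (typed_node, type)."""
    if isinstance(node, tuple):
        op, left, right = node
        left, lt = annotate(left)
        right, rt = annotate(right)
        # Legality of operands: widen an int operand where the other side is float
        if lt == "int" and rt == "float":
            left, lt = ("inttofloat", left), "float"
        if rt == "int" and lt == "float":
            right, rt = ("inttofloat", right), "float"
        if lt != rt:
            raise TypeError(f"operand mismatch: {lt} {op} {rt}")
        return (op, left, right), lt
    if node.isdigit():
        return node, "int"                # an integer literal such as 60
    return node, SYMTAB[node]             # identifier: look up its declared type

typed, t = annotate(("+", "initial", ("*", "rate", "60")))
# typed == ("+", "initial", ("*", "rate", ("inttofloat", "60")))
```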
Lexical Analyzer
  character stream: position = initial + rate * 60
  token stream: <id,1> <=> <id,2> <+> <id,3> <*> <60>
Syntax Analyzer
  syntax tree: = (<id,1>, + (<id,2>, * (<id,3>, 60)))
Semantic Analyzer
  annotated tree: = (<id,1>, + (<id,2>, * (<id,3>, inttofloat(60))))
Intermediate Code Generator
  t1 = inttofloat(60)
  t2 = id3 * t1
  t3 = id2 + t2
  id1 = t3
Machine-Independent Code Optimizer
  t1 = id3 * 60.0
  id1 = id2 + t1
Code Generator
  LDF R2, id3
  MULF R2, R2, #60.0
  LDF R1, id2
  ADDF R1, R1, R2
  STF id1, R1
SYMBOL TABLE
  1 position …
  2 initial …
  3 rate …
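The intermediate code generation step can be sketched as a post-order walk that emits one three-address instruction per tree node. The fresh-temporary naming scheme is an assumption chosen to match the t1, t2, t3 sequence of the running example:

```python
# Hedged sketch: flatten an annotated tree into three-address code.
# Node shapes (op, left, right) and ("inttofloat", child) are assumed.
def three_address(tree, target):
    code, n = [], 0
    def gen(node):
        nonlocal n
        if isinstance(node, tuple):
            if node[0] == "inttofloat":
                v = gen(node[1])
                n += 1
                code.append(f"t{n} = inttofloat({v})")
            else:
                op, left, right = node
                lv, rv = gen(left), gen(right)   # post-order: operands first
                n += 1
                code.append(f"t{n} = {lv} {op} {rv}")
            return f"t{n}"                        # address of this node's value
        return node                               # a name or literal is its own address
    code.append(f"{target} = {gen(tree)}")
    return code

code = three_address(("+", "id2", ("*", "id3", ("inttofloat", "60"))), "id1")
# ['t1 = inttofloat(60)', 't2 = id3 * t1', 't3 = id2 + t2', 'id1 = t3']
```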
Phases and Passes
Pass:
- A pass is a physical scan over the source program.
- The portions of one or more phases are combined into a module called a pass.
- Requires an intermediate file between two passes.
- Splitting into more passes reduces memory usage.
- A single-pass compiler is faster than a two-pass compiler.
Phase:
- A phase is a logically cohesive operation that takes input in one form and produces output in another form.
- No need for any intermediate files between phases.
- Splitting into more phases reduces the complexity of the program.
- A reduction in the number of phases increases execution speed.
Lexical Analysis
if (i == j)
    z = 0;
else
    z = 1;
The compiler sees the following code as
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Token Class (or Class)
  – In English: noun, verb, adjective, …
  – In a programming language: identifiers, keywords, operators, numbers, …
• Token – a classification for a common set of strings
  – <Identifier>, <Number>, etc.
• Pattern – the rules which characterize the set of strings for a token
  – e.g. file and OS wildcards: *.* , [A-Z]
• Lexeme – the actual sequence of characters that matches a pattern and is classified by a token
  – Identifiers: x, count, …
•Token classes correspond to sets of strings.
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
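These token classes translate directly into regular expressions. A sketch using Python's `re` notation as a stand-in for the formal notation (the keyword list is abbreviated):

```python
import re

# The token classes above as regular expressions; LEX uses essentially the same operators.
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter + digit)*
integer    = re.compile(r"[0-9]+")                # non-empty string of digits
keyword    = re.compile(r"else|if|begin")         # 'else' + 'if' + 'begin' + ...
whitespace = re.compile(r"[ \n\t]+")              # non-empty blanks, newlines, tabs
```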
• Classify program substrings according to role
• Communicate tokens to the parser
Lexical Analyzer → Parser, passing <Class, String>
• An implementation must do two things:
1. Recognize substrings corresponding to tokens (the lexemes)
2. Identify the token class of each lexeme
Find the number of tokens in the following code segments:
1. printf(" Compiler Design");
2. DO I = 15.5;
3. int add(int x, int y)
   {
       return x+y;
   }
4. printf(" i = %d , $i = %p", i, &i);
Complexity in Lexical Analysis
• FORTRAN rule: whitespace is insignificant – VAR1 is the same as VA R1
• The two statements below differ only in a comma versus a period, yet the first is a DO loop and the second is an assignment to the variable DO5I; the scanner cannot decide until it reaches the , or . :
DO 5 I = 1,25
DO 5 I = 1.25
• PL/I keywords are not reserved:
IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN;
• C++ template syntax: Foo<Bar>
• C++ stream syntax: cin >> var;
• The goal of lexical analysis is to
  – Partition the input string into lexemes
  – Identify the token of each lexeme
• Left-to-right scan ⇒ lookahead is sometimes required
Regular Languages
• Lexical structure = token classes
• We must say what set of strings is in a token class – use regular languages
• Regular expressions specify regular languages
• Five constructs:
  – Two base cases: the empty string ε and single-character strings ‘c’
  – Three compound expressions: union, concatenation, iteration
• Def. The regular expressions over an alphabet Σ are the smallest set of expressions including
R = ε | ‘c’ (c ∈ Σ) | R + R | RR | R*
RE examples:
For Σ = {0,1}, find the strings represented by the following REs:
1. 1* = {ε, 1, 11, 111, …} (all strings of 1s, including the empty string)
2. (1 + 0)1 = {11, 01}
3. 0* + 1* = all strings of only 0s plus all strings of only 1s
4. (0+1)* = all strings over {0,1}
Formal Languages
Def. Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ.
•Alphabet = English characters •Language = English sentences
•Alphabet = ASCII •Language = C programs
Meaning function: L maps syntax to semantics, L(e) = M
• Why use a meaning function?
  – Makes clear what is syntax and what is semantics
  – Allows us to consider notation as a separate issue
  – Because expressions and meanings are not 1-1
• Meaning is many-to-one – never one-to-many!
Lexical Specifications
Keyword: “if” or “else” or “then” or …
Integer: a non-empty string of digits
Identifier: strings of letters or digits, starting with a letter
Whitespace: a non-empty sequence of blanks, newlines, and tabs
digit        = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits       = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num          = digits opt_fraction opt_exponent
• At least one: A+ = AA*
• Union: A | B = A + B
• Option: A? = A + ε
• Range: ‘a’+’b’+…+’z’ = [a-z]
• Excluded range: complement of [a-z] = [^a-z]
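These shorthands map directly onto the regex syntax used by Python and LEX. A quick illustrative check:

```python
import re

# Each assertion demonstrates one shorthand from the list above.
assert re.fullmatch(r"a+", "aaa")         # at least one: A+ = AA*
assert re.fullmatch(r"ab?", "a")          # option: A? matches with or without b
assert re.fullmatch(r"ab?", "ab")
assert re.fullmatch(r"[a-z]+", "lexeme")  # range
assert re.fullmatch(r"[^a-z]", "X")       # excluded range
assert not re.fullmatch(r"[^a-z]", "x")
```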
Lexical Specification Process
1. Write a regular expression for the lexemes of each token class
   • Number = digit+
   • Keyword = ‘if’ + ‘else’ + …
   • Identifier = letter (letter + digit)*
   • OpenPar = ‘(‘
2. Construct R, matching all lexemes for all tokens
   R = Keyword + Identifier + Number + …
     = R1 + R2 + …
3. Let the input be x1…xn. For 1 ≤ i ≤ n, check whether x1…xi ∈ L(R)
4. If success, then we know that x1…xi ∈ L(Rj) for some j
5. Remove x1…xi from the input and go to (3)
Resolving Ambiguities
• How much input is used? – “Maximal munch”: take the longest match
• Which token is used? – Choose the one listed first
• What if no rule matches? – Pass on to the error handler
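The specification process and both tie-breaking rules can be sketched in Python. The rule set here is a small illustrative subset, not a full language specification:

```python
import re

# R = Keyword + Identifier + Number + OpenPar, with "maximal munch" (longest
# match) and rule order breaking ties. Rule names are assumptions.
RULES = [
    ("Keyword",    r"if|else"),               # listed first, so on a tie 'if'
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),  # is a Keyword, not an Identifier
    ("Number",     r"[0-9]+"),
    ("OpenPar",    r"\("),
]

def scan(text):
    i, out = 0, []
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try every rule at position i; keep the longest match, and among
        # equally long matches the rule listed first.
        best = None
        for name, pat in RULES:
            m = re.match(pat, text[i:])
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"no rule matches at position {i}")  # error handler
        out.append((best[0], text[i:i + best[1]]))
        i += best[1]
    return out
```

On "if ifx (42", maximal munch makes "ifx" one Identifier rather than the Keyword "if" followed by "x", while rule order makes the bare "if" a Keyword.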
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  – fi (a == f(x)) …
• However, it may be able to recognize errors like:
  – d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters
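Panic mode is the simplest of these strategies. A hedged sketch, assuming a hypothetical repertoire of characters that can start a token:

```python
# Sketch of panic-mode recovery: skip characters until one that could begin
# a well-formed token. STARTERS is an assumed, illustrative character set.
import string

STARTERS = set(string.ascii_letters + string.digits + "=+*(;")

def panic_skip(text, i):
    """Advance past characters no token can start with; report what was dropped."""
    start = i
    while i < len(text) and text[i] not in STARTERS and not text[i].isspace():
        i += 1
    return i, text[start:i]   # new position and the discarded characters

pos, dropped = panic_skip("##@d = 2", 0)
# pos == 3, dropped == '##@'  -- scanning resumes at the 'd'
```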
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
  – In C, we need to look after -, = or < to decide what token to return
• We need to introduce a two-buffer scheme to handle large lookaheads safely
• Two buffers of the same size, say 4096 characters, are alternately reloaded.
• Two pointers into the input are maintained:
  – Pointer lexeme_Begin marks the beginning of the current lexeme.
  – Pointer forward scans ahead until a pattern match is found.
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    } else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    } else {
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    }
    break;
cases for the other characters;
}
Transition diagrams
• Transition diagram for relop
• Transition diagram for reserved words and identifiers
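Since the diagrams themselves are not reproduced in this transcript, here is a hedged sketch of the relop transition diagram as code. The token names and the "retract" behavior on an unconsumed lookahead are assumptions in the spirit of the standard diagram:

```python
# Sketch of the relop transition diagram: recognize <, <=, <>, >, >=, =
# starting at text[i]; return (token_name, next_position).
def relop(text, i=0):
    c = text[i] if i < len(text) else ""
    if c == "<":
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == "=":
            return ("LE", i + 2)
        if nxt == ">":
            return ("NE", i + 2)
        return ("LT", i + 1)   # retract: the lookahead character is not consumed
    if c == ">":
        if i + 1 < len(text) and text[i + 1] == "=":
            return ("GE", i + 2)
        return ("GT", i + 1)   # retract here as well
    if c == "=":
        return ("EQ", i + 1)
    return (None, i)           # no relop starts here

# relop("<=") -> ("LE", 2); relop("<a") -> ("LT", 1)
```

Each if-branch corresponds to a state of the diagram; returning without consuming the lookahead plays the role of the diagram's starred "retract" states.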