Compiler Design - Introduction
UNIT – 1
Overview of the compiler environment, pass and phase, phases of a compiler, regular expressions, lexical analyzer, the LEX tool, bootstrapping.
Compiler - Introduction
• A compiler is a computer program that translates a program in a source language into an equivalent program in a target language.
• A source program/code is a program/code written in the source language, which is usually a high-level language.
• A target program/code is a program/code written in the target language, which often is a machine language or an intermediate code.
[Figure: the compiler takes a source program as input and produces a target program as output; error messages are reported along the way.]
A language-processing system
[Pipeline: Skeletal Source Program → Preprocessor → Source Program → Compiler → Target Assembly Program → Assembler → Relocatable Object Code → Linker (together with Libraries and Relocatable Object Files) → Absolute Machine Code]
Try for example: gcc -v myprog.c
The Economy of Programming Languages
Why are there so many programming languages?
- Application domains have distinctive/conflicting needs.
Why are there new programming languages?
- Programmer training is the dominant cost.
What is a good programming language?
- There is no universally accepted metric.
Why Study Compilers?
● Build a large, ambitious software system.
● See theory come to life.
● Learn how to build programming languages.
● Learn how programming languages work.
● Learn tradeoffs in language design.
Building a compiler requires knowledge of
• programming languages (parameter passing, variable scoping, memory allocation, etc.)
• theory (automata, context-free languages, etc.)
• algorithms and data structures (hash tables, graph algorithms, dynamic programming, etc.)
• computer architecture (assembly programming)
• software engineering.
Phases of a Compiler
Source program
→ Lexical analyzer → token stream
→ Syntax analyzer → syntax tree
→ Semantic analyzer → syntax tree
→ Intermediate code generator → intermediate representation
→ Code Optimizer → intermediate representation
→ Code generator → Target program
The Symbol Table and the Error Handler interact with all of the phases.
The Structure of a Compiler: The Analysis-Synthesis Model of Compilation
• There are two parts to compilation:
– Analysis
  • Breaks up the source program into pieces and imposes a grammatical structure
  • Creates an intermediate representation of the source program
  • Determines the operations and records them in a tree structure, the syntax tree
  • Known as the front end of the compiler
– Synthesis
  • Constructs the target program from the intermediate representation
  • Takes the tree structure and translates the operations into the target program
  • Known as the back end of the compiler
Source Code → Front End → Intermediate Code → Back End → Target Code
• Three Phases:
– Linear / Lexical Analysis:
  • Left-to-right scan to identify tokens (token: a sequence of characters having a collective meaning)
– Hierarchical Analysis:
  • Grouping of tokens into meaningful collections
– Semantic Analysis:
  • Checking to ensure correctness of components
The Analysis Task For Compilation
Phase 1. Lexical Analysis
All are tokens; blanks, line breaks, etc. are scanned out:
position = initial + rate * 60 ;
(eight tokens: position, =, initial, +, rate, *, 60, ;)
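A minimal sketch of this scanning step in Python, assuming illustrative token class names and patterns (a real scanner would typically be generated from a LEX specification):

```python
import re

# Illustrative token classes and patterns; names are assumptions, not the course's LEX spec.
TOKEN_SPEC = [
    ("ID",     r"[A-Za-z_][A-Za-z0-9_]*"),  # identifiers
    ("NUM",    r"\d+"),                     # integer literals
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),                     # blanks and line breaks are scanned out
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Return (token_class, lexeme) pairs, discarding whitespace."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(text)
            if m.lastgroup != "SKIP"]

tokens = tokenize("position = initial + rate * 60 ;")
# Eight tokens: position, =, initial, +, rate, *, 60, ;
```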
• First step: recognize words – the smallest unit above letters. For example: This is a sentence.
• Lexical analysis divides program text into “words” or “tokens”:
if (x == y) z = 1; else z = 2;
• Once words are understood, the next step is to understand sentence structure.
• Parsing = diagramming sentences; the diagram is a tree.
Phase 2. Hierarchical Analysis: Parsing or Syntax Analysis
For the previous example, we would have the parse tree:

assignment statement
├── identifier: position
├── =
└── expression
    ├── expression ── identifier: initial
    ├── +
    └── expression
        ├── expression ── identifier: rate
        ├── *
        └── expression ── number: 60

Nodes of the tree are constructed using a grammar for the language.
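The grouping of tokens into a tree can be sketched as a tiny recursive-descent parser. The grammar here (stmt → ID '=' expr, expr → term ('+' term)*, term → factor ('*' factor)*) is a hypothetical fragment chosen to match the example, not the course's full grammar:

```python
# Hedged sketch: recursive-descent parsing of  ID '=' expr  with + and *
# at the usual precedence; node shapes (op, left, right) are assumptions.
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def eat():
        nonlocal pos
        tok = tokens[pos]; pos += 1
        return tok
    def factor():
        return eat()                      # an identifier or a number leaf
    def term():
        node = factor()
        while peek() == "*":              # '*' binds tighter than '+'
            eat(); node = ("*", node, factor())
        return node
    def expr():
        node = term()
        while peek() == "+":
            eat(); node = ("+", node, term())
        return node
    target = eat(); eat()                 # consume ID and '='
    return ("=", target, expr())

tree = parse(["position", "=", "initial", "+", "rate", "*", "60"])
# ("=", "position", ("+", "initial", ("*", "rate", "60")))
```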
Phase 3. Semantic Analysis
• Find more complicated semantic errors and support code generation
• The parse tree is augmented with semantic actions
Compressed tree:
:= (position, + (initial, * (rate, 60)))
Conversion action applied:
:= (position, + (initial, * (rate, inttofloat(60))))
• Most important activity in this phase: type checking – the legality of operands
• Many different situations:
  float = int + char ;
  A[int] = A[float] + int ;
  while (char != int) … etc.
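The type-checking and conversion action can be sketched as a walk over the tree. The symbol table contents and the int-widens-to-float rule are assumptions for illustration:

```python
# Hedged sketch of type checking with an inttofloat conversion action.
# SYMTAB entries are assumed; a real compiler fills them in from declarations.
SYMTAB = {"position": "float", "initial": "float", "rate": "float"}

def annotate(node):
    """Type-check an (op, left, right) tree; return (typed_node, type)."""
    if isinstance(node, tuple):
        op, left, right = node
        left, lt = annotate(left)
        right, rt = annotate(right)
        # Legality of operands: widen an int operand where the other side is float
        if lt == "int" and rt == "float":
            left, lt = ("inttofloat", left), "float"
        if rt == "int" and lt == "float":
            right, rt = ("inttofloat", right), "float"
        if lt != rt:
            raise TypeError(f"operand mismatch: {lt} {op} {rt}")
        return (op, left, right), lt
    if node.isdigit():
        return node, "int"                # an integer literal such as 60
    return node, SYMTAB[node]             # identifier: look up its declared type

typed, t = annotate(("+", "initial", ("*", "rate", "60")))
# typed == ("+", "initial", ("*", "rate", ("inttofloat", "60")))
```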
Lexical Analyzer
  character stream: position = initial + rate * 60
  token stream: <id,1> <=> <id,2> <+> <id,3> <*> <60>
Syntax Analyzer
  syntax tree: = (<id,1>, + (<id,2>, * (<id,3>, 60)))
Semantic Analyzer
  annotated tree: = (<id,1>, + (<id,2>, * (<id,3>, inttofloat(60))))
Intermediate Code Generator
  t1 = inttofloat(60)
  t2 = id3 * t1
  t3 = id2 + t2
  id1 = t3
Machine-Independent Code Optimizer
  t1 = id3 * 60.0
  id1 = id2 + t1
Code Generator
  LDF R2, id3
  MULF R2, R2, #60.0
  LDF R1, id2
  ADDF R1, R1, R2
  STF id1, R1
SYMBOL TABLE
  1 position …
  2 initial …
  3 rate …
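The intermediate code generation step can be sketched as a post-order walk that emits one three-address instruction per tree node. The fresh-temporary naming scheme is an assumption chosen to match the t1, t2, t3 sequence of the running example:

```python
# Hedged sketch: flatten an annotated tree into three-address code.
# Node shapes (op, left, right) and ("inttofloat", child) are assumed.
def three_address(tree, target):
    code, n = [], 0
    def gen(node):
        nonlocal n
        if isinstance(node, tuple):
            if node[0] == "inttofloat":
                v = gen(node[1])
                n += 1
                code.append(f"t{n} = inttofloat({v})")
            else:
                op, left, right = node
                lv, rv = gen(left), gen(right)   # post-order: operands first
                n += 1
                code.append(f"t{n} = {lv} {op} {rv}")
            return f"t{n}"                        # address of this node's value
        return node                               # a name or literal is its own address
    code.append(f"{target} = {gen(tree)}")
    return code

code = three_address(("+", "id2", ("*", "id3", ("inttofloat", "60"))), "id1")
# ['t1 = inttofloat(60)', 't2 = id3 * t1', 't3 = id2 + t2', 'id1 = t3']
```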
Phases and Passes
Pass:
- A pass is a physical scan over the source program.
- The portions of one or more phases are combined into a module called a pass.
- Requires an intermediate file between two passes.
- Splitting into more passes reduces memory usage.
- A single-pass compiler is faster than a two-pass compiler.
Phase:
- A phase is a logically cohesive operation that takes input in one form and produces output in another form.
- No need for any intermediate files between phases.
- Splitting into more phases reduces the complexity of the program.
- A reduction in the number of phases increases execution speed.
Lexical Analysis
if (i == j)
    z = 0;
else
    z = 1;
The compiler sees the following code as
\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Token Class (or Class)
  – In English: noun, verb, adjective, …
  – In a programming language: identifiers, keywords, operators, numbers, …
• Token – a classification for a common set of strings
  – <Identifier>, <Number>, etc.
• Pattern – the rules which characterize the set of strings for a token
  – e.g. file and OS wildcards: *.* , [A-Z]
• Lexeme – the actual sequence of characters that matches a pattern and is classified by a token
  – Identifiers: x, count, …
•Token classes correspond to sets of strings.
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” or “begin” or …
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
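These token classes translate directly into regular expressions. A sketch using Python's `re` notation as a stand-in for the formal notation (the keyword list is abbreviated):

```python
import re

# The token classes above as regular expressions; LEX uses essentially the same operators.
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter + digit)*
integer    = re.compile(r"[0-9]+")                # non-empty string of digits
keyword    = re.compile(r"else|if|begin")         # 'else' + 'if' + 'begin' + ...
whitespace = re.compile(r"[ \n\t]+")              # non-empty blanks, newlines, tabs
```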
• Classify program substrings according to role
• Communicate tokens to the parser
Lexical Analyzer → Parser, passing <Class, String>
• An implementation must do two things:
1. Recognize substrings corresponding to tokens (the lexemes)
2. Identify the token class of each lexeme
Find the number of tokens in the following code segments:
1. printf(" Compiler Design");
2. DO I = 15.5;
3. int add(int x, int y)
   {
       return x+y;
   }
4. printf(" i = %d , $i = %p", i, &i);
Complexity in Lexical Analysis
• FORTRAN rule: whitespace is insignificant – VAR1 is the same as VA R1
• The two statements below differ only in a comma versus a period, yet the first is a DO loop and the second is an assignment to the variable DO5I; the scanner cannot decide until it reaches the , or . :
DO 5 I = 1,25
DO 5 I = 1.25
• PL/I keywords are not reserved:
IF ELSE THEN THEN = ELSE; ELSE ELSE = THEN;
• C++ template syntax: Foo<Bar>
• C++ stream syntax: cin >> var;
• The goal of lexical analysis is to
  – Partition the input string into lexemes
  – Identify the token of each lexeme
• Left-to-right scan ⇒ lookahead is sometimes required
Regular Languages
• Lexical structure = token classes
• We must say what set of strings is in a token class – use regular languages
• Regular expressions specify regular languages
• Five constructs:
  – Two base cases: the empty string ε and single-character strings ‘c’
  – Three compound expressions: union, concatenation, iteration
• Def. The regular expressions over an alphabet Σ are the smallest set of expressions including
R = ε | ‘c’ (c ∈ Σ) | R + R | RR | R*
RE examples:
For Σ = {0,1}, find the strings represented by the following REs:
1. 1* = {ε, 1, 11, 111, …} (all strings of 1s, including the empty string)
2. (1 + 0)1 = {11, 01}
3. 0* + 1* = all strings of only 0s plus all strings of only 1s
4. (0+1)* = all strings over {0,1}
Formal Languages
Def. Let Σ be a set of characters (an alphabet). A language over Σ is a set of strings of characters drawn from Σ.
•Alphabet = English characters •Language = English sentences
•Alphabet = ASCII •Language = C programs
Meaning function: L maps syntax to semantics, L(e) = M
• Why use a meaning function?
  – Makes clear what is syntax and what is semantics
  – Allows us to consider notation as a separate issue
  – Because expressions and meanings are not 1-1
• Meaning is many-to-one – never one-to-many!
Lexical Specifications
Keyword: “if” or “else” or “then” or …
Integer: a non-empty string of digits
Identifier: strings of letters or digits, starting with a letter
Whitespace: a non-empty sequence of blanks, newlines, and tabs
digit        = '0' + '1' + '2' + '3' + '4' + '5' + '6' + '7' + '8' + '9'
digits       = digit+
opt_fraction = ('.' digits) + ε
opt_exponent = ('E' ('+' + '-' + ε) digits) + ε
num          = digits opt_fraction opt_exponent
• At least one: A+ = AA*
• Union: A | B = A + B
• Option: A? = A + ε
• Range: ‘a’+’b’+…+’z’ = [a-z]
• Excluded range: complement of [a-z] = [^a-z]
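These shorthands map directly onto the regex syntax used by Python and LEX. A quick illustrative check:

```python
import re

# Each assertion demonstrates one shorthand from the list above.
assert re.fullmatch(r"a+", "aaa")         # at least one: A+ = AA*
assert re.fullmatch(r"ab?", "a")          # option: A? matches with or without b
assert re.fullmatch(r"ab?", "ab")
assert re.fullmatch(r"[a-z]+", "lexeme")  # range
assert re.fullmatch(r"[^a-z]", "X")       # excluded range
assert not re.fullmatch(r"[^a-z]", "x")
```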
Lexical Specification Process
1. Write a regular expression for the lexemes of each token class
   • Number = digit+
   • Keyword = ‘if’ + ‘else’ + …
   • Identifier = letter (letter + digit)*
   • OpenPar = ‘(‘
2. Construct R, matching all lexemes for all tokens
   R = Keyword + Identifier + Number + …
     = R1 + R2 + …
3. Let the input be x1…xn. For 1 ≤ i ≤ n, check whether x1…xi ∈ L(R)
4. If success, then we know that x1…xi ∈ L(Rj) for some j
5. Remove x1…xi from the input and go to (3)
Resolving Ambiguities
• How much input is used? – “Maximal munch”: take the longest match
• Which token is used? – Choose the one listed first
• What if no rule matches? – Pass on to the error handler
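The specification process and both tie-breaking rules can be sketched in Python. The rule set here is a small illustrative subset, not a full language specification:

```python
import re

# R = Keyword + Identifier + Number + OpenPar, with "maximal munch" (longest
# match) and rule order breaking ties. Rule names are assumptions.
RULES = [
    ("Keyword",    r"if|else"),               # listed first, so on a tie 'if'
    ("Identifier", r"[A-Za-z][A-Za-z0-9]*"),  # is a Keyword, not an Identifier
    ("Number",     r"[0-9]+"),
    ("OpenPar",    r"\("),
]

def scan(text):
    i, out = 0, []
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try every rule at position i; keep the longest match, and among
        # equally long matches the rule listed first.
        best = None
        for name, pat in RULES:
            m = re.match(pat, text[i:])
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise SyntaxError(f"no rule matches at position {i}")  # error handler
        out.append((best[0], text[i:i + best[1]]))
        i += best[1]
    return out
```

On "if ifx (42", maximal munch makes "ifx" one Identifier rather than the Keyword "if" followed by "x", while rule order makes the bare "if" a Keyword.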
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  – fi (a == f(x)) …
• However, it may be able to recognize errors like:
  – d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence.
Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters
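Panic mode is the simplest of these strategies. A hedged sketch, assuming a hypothetical repertoire of characters that can start a token:

```python
# Sketch of panic-mode recovery: skip characters until one that could begin
# a well-formed token. STARTERS is an assumed, illustrative character set.
import string

STARTERS = set(string.ascii_letters + string.digits + "=+*(;")

def panic_skip(text, i):
    """Advance past characters no token can start with; report what was dropped."""
    start = i
    while i < len(text) and text[i] not in STARTERS and not text[i].isspace():
        i += 1
    return i, text[start:i]   # new position and the discarded characters

pos, dropped = panic_skip("##@d = 2", 0)
# pos == 3, dropped == '##@'  -- scanning resumes at the 'd'
```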
Input buffering
• Sometimes the lexical analyzer needs to look ahead some symbols to decide which token to return
  – In C, we need to look after -, = or < to decide what token to return
• We need to introduce a two-buffer scheme to handle large lookaheads safely
• Two buffers of the same size, say 4096 characters, are alternately reloaded.
• Two pointers into the input are maintained:
  – Pointer lexeme_Begin marks the beginning of the current lexeme.
  – Pointer forward scans ahead until a pattern match is found.
switch (*forward++) {
case eof:
    if (forward is at end of first buffer) {
        reload second buffer;
        forward = beginning of second buffer;
    } else if (forward is at end of second buffer) {
        reload first buffer;
        forward = beginning of first buffer;
    } else {
        /* eof within a buffer marks the end of input */
        terminate lexical analysis;
    }
    break;
cases for the other characters;
}
Transition diagrams
• Transition diagram for relop
• Transition diagram for reserved words and identifiers
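Since the diagrams themselves are not reproduced in this transcript, here is a hedged sketch of the relop transition diagram as code. The token names and the "retract" behavior on an unconsumed lookahead are assumptions in the spirit of the standard diagram:

```python
# Sketch of the relop transition diagram: recognize <, <=, <>, >, >=, =
# starting at text[i]; return (token_name, next_position).
def relop(text, i=0):
    c = text[i] if i < len(text) else ""
    if c == "<":
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt == "=":
            return ("LE", i + 2)
        if nxt == ">":
            return ("NE", i + 2)
        return ("LT", i + 1)   # retract: the lookahead character is not consumed
    if c == ">":
        if i + 1 < len(text) and text[i + 1] == "=":
            return ("GE", i + 2)
        return ("GT", i + 1)   # retract here as well
    if c == "=":
        return ("EQ", i + 1)
    return (None, i)           # no relop starts here

# relop("<=") -> ("LE", 2); relop("<a") -> ("LT", 1)
```

Each if-branch corresponds to a state of the diagram; returning without consuming the lookahead plays the role of the diagram's starred "retract" states.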