
2.1 INTRODUCTION

    A compiler (or more generally, translator) is a program that translates a program written in one language into another. The different stages/phases of a compiler are categorized as follows:

1. Syntax analysis (scanning and parsing)
2. Semantic analysis (determining what a program should do)
3. Optimization (improving the performance of a program as indicated by some metric, typically execution speed and/or space requirements)
4. Code generation (generation and output of an equivalent program in some target language, often the instruction set of a CPU)

Syntax analysis or parsing is a process of matching the structure of sentences of the language in accordance with a given grammar. Here, all the elements of the language constructs are in linear representation, which means that each element in the sentence restricts the next element. This representation is applicable to both the sentence and the computer program. Each grammar of a language can generate an infinite number of sentences (in linear representation). Although a finite-size grammar is simple, it specifies a structure to generate an infinite number of sentences. Once a grammar is specified, the sentences generated from this grammar are said to be in the language supported by this grammar. The input for the syntax analyser is the stream of tokens generated by the scanner. The scanner program reads the source program character by character, forms the tokens, and sends them to the parser. The tokens are specified by a simple structure. For example, a letter followed by zero or more letters or digits defines a Pascal identifier.

    This chapter will enable the reader to

• understand the need for a compiler
• understand grammar and language theory
• distinguish between the generations of computer languages
• describe the evolution of computer languages
• detail the stages of a compiler



A recognizer or a parser, discussed in Chapter 4, could be developed to recognize the sentences generated by this grammar. This is normally called syntax checking, which checks whether the language used has been written according to the grammar. Following syntax checking, the source language under consideration is represented in an intermediate form to be used by the code generator to generate the final machine code representing the source code. This chapter outlines the process of compilation. Let us first discuss the theory of languages, and then computer languages.

Following the process of parsing, the semantics (meaning) associated with the language is checked to some extent. For example, checking whether integer data can be added to real data is done in the semantic analyser. If the syntax and semantic analysis phases are successful, an intermediate code (IC) representing the source code will be generated. Then, the IC will be subjected to an optional optimization process, and it will be converted into machine code. All the phases involved in the compiler are discussed briefly in this chapter.

Before designing the compiler, it is important to understand the structure of the language. A language is always characterized by a grammar (a set of rules). In order to design a compiler, the recognizer of the language should be fed with the grammar (set of rules). A variety of computer languages have evolved; some have survived, but some vanished within a short period of time. The strength of a computer language depends on its expressive power. However, there is always a need for a trade-off between the power of the language and the complexity of its implementation.

    2.2 THEORY OF COMPUTER LANGUAGES

Language, in general, is used for communication. A language is a set of words coined according to some rules for the purpose of communication. If the formation and use of the words follow strict rules, it is called a formal language. Otherwise, it is simply a natural language, that is, the way we normally speak. Words are coined by juxtaposing characters in the alphabet of a language. For example, the letters a to z form the alphabet of the English language. Combinations of these letters, such as Raman, Sita, eat, live, and world, are words of the English language. Languages evolved over time by categorizing words as nouns, verbs, and so on, and by applying certain guidelines/rules for their formation. The formation of words and construction of languages (strings of words) were based on human needs, and they evolved over time. Ease of use and expressiveness are some features expected by the user of a language. Ultimately, the purpose of communication through a given language is to convey the message in a clear and precise manner.

    2.2.1 Natural Languages vs Formal Languages

    The term natural language defines a means of communication (for example, English, French, or Tamil) shared by a group of individuals. Natural languages do not have much usage restriction. As long as the user is able to communicate without misunderstanding, it is all right. Natural languages are understood by us because we possess intelligence. Any kind of incomplete/incorrect (in terms of syntax or semantics) information communicated


    between two persons may also be interpreted properly because of a certain context that is already established between them.

    Consider the following sentence.

The cht caught the mouse.

This sentence is clearly understood as

    The cat caught the mouse.

    Here, a context is already established between the words cat and mouse, and hence, even when the word cat is misspelled as cht, the human eye is able to interpret the appropriate meaning of the sentence. This is due to human intelligence. On the other hand, there is no guarantee that the misspelled sentence will be understood by a different group of people unless there is a context established among all of them.

    Let us see another example of a formal language where a semantic (or meaning) is associated with the formation of the sentence. Consider the two sentences:

    The mouse caught the cat. and The cat caught the mouse.

Both the sentences are grammatically correct. However, the first sentence is not semantically acceptable. Hence, there is a need for a formal language, which has a different level of interpretation. The formal language is always associated with the set of rules that we refer to by the term grammar. Therefore, for every formal language, there is a grammar. In other words, each grammar will generate its own language. The complexity of the grammar determines the power of each language.

Everyone feels comfortable with his/her own mother tongue owing to the long association with that language. The human system gets trained well with the repeated usage of the vocabulary and style of the language. People conversing in their mother tongue are able to communicate well. When communicating with people who speak in other languages, they, in general, do not feel the same level of comfort. This is because the translation process occurs both while speaking and while processing the received information.

    We can conclude that people speaking formal languages need to have knowledge of the language, that is, they must know the set of rules associated with the language and must be able to apply it during the process of recognition or understanding of the language. At the same time, recognizing a natural language is also a complex process. This is because, apart from having knowledge about the language, inference has to be made when the rules are not followed strictly, which needs an additional processing mechanism. Hence, formal language needs rules; informal language, in addition to rules, supports deviation from the usage of the rules, making the recognition process difficult.

    Humans need to communicate with computers to use them efficiently. The language used by the computer is called machine language or machine code. The code that reaches the processor, which is the main part of the computer, consists of a series of 0s and 1s known as binary code. The characteristics of human language vary widely. There is a need for a language that can balance the simplicity of the machine language and the complexity of the human language. Hence, a programming language is meant for the


    user to communicate with the computer. Basically, the computer or the machine works using a machine code, which is difficult for humans to understand. Hence, languages with restrictions (enforced by rules or grammar) were developed, which can be understood by humans with some extra effort. The code written in this type of language is transformed into machine code so that the processor can process it. Now, let us turn our attention to the relationship between grammar and language.

    2.2.2 Language and Grammar

    Language is a collection of sentences. Each sentence consists of words to communicate the intended message to others. Hence, languages have components at the following three levels:

1. Symbols or character sets
2. Words
3. Sentences

    As seen in Section 2.2.1, there is always a close relationship between language and grammar. In other words, every language is governed by grammar, and every grammar produces a language. In primary schools where grammar is taught, we learn the set of rules by which the sentences of a language are coined. Exercises are given to check whether the given sentences conform to the grammar. Over time, we get familiar with the ways of constructing the language obeying the grammar rules. For computer languages, there is a need for a strict formal grammar by which the sentence formation specification is given.

In a similar manner, computer languages have sentences, and these sentences possess structure; the sentences consist of words, which we call tokens. Each token, in addition to carrying a piece of information, contributes to the meaning of the whole sentence. The tokens cannot be broken down any further. Hence, a grammar is a finite set of rules, which may generate languages. What is the size of the language one can speak? Of course, we can say there is a finite set of rules in the grammar book of a language. However, this grammar can produce a language of infinite size. How is this possible? The grammar that we specify consists of rewriting rules or recursive rules.
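For instance (an illustrative rule, not from the text), the single recursive rule

S → aS | a

generates the sentences a, aa, aaa, and so on without end; the recursion on S is what lets a finite rule set describe a language of infinite size.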

The set of rewriting rules serves as a basis for formal grammar. Such rewriting systems have a long history among mathematicians, most notably the extensive study made by Chomsky in 1959. He laid the foundation for almost all the formal languages, parsers, and a considerable part of compiler construction. Since formal languages are a branch of mathematics, it is necessary to introduce the notations for representing the elements involved in forming the languages.

    Before introducing the formal notation for the parts of the language and grammar, let us consider an example:

Murugan eats mango.

In this sentence, the words Murugan, eats, and mango are the subject, verb, and object,

    respectively. One English grammar rule for constructing a simple sentence is as follows:

sentence → subject verb object

Some examples of subjects are Raman, Murugan, and Sita. The words eat and buy are

    examples of verbs. The words mango and apple are examples of objects. With the given


    rule (part of grammar) and various examples/representatives of the subject, verb, and object, we can construct many sentences as listed here:

Raman eats apple.
Sita buys mango.
The boy throws a ball.

    For the third sentence, the grammar and its derivation are given in Fig. 2.1. Here, the sentence (S) consists of a subject phrase (SP) and a verb phrase (VP). The subject phrase is in turn split into an article and a noun. The verb phrase consists of a verb and an object. The other elements in the grammar are self-explanatory.

Fig. 2.1 Parts of a sentence and grammar elements (parse tree for "The boy throws a ball": S → SP VP; SP → Article Noun; VP → Verb Object; Object → NP; NP → Article Noun)

There are various examples for each part of the grammar. Grammar rules can be written in different forms. With different combinations of this finite set of rewriting rules, theoretically, a language of infinite size can be produced. If we use proper notations, it will be convenient to represent the grammar for a language.

2.2.3 Notations and Conventions

To express the grammar in an unambiguous and consistent manner, notations will be helpful. Table 2.1 shows the various elements in the language construction process, with their notations and examples.

    Using the notations, a generative grammar (G) can be formally defined as G = (VN, VT, R, S) such that

1. VN and VT are finite sets of symbols (the non-terminals and terminals, respectively)


2. VN ∩ VT = ∅
3. R is a set of pairs (P, Q) such that
   (a) P ∈ (VN ∪ VT)+
   (b) Q ∈ (VN ∪ VT)*
4. S ∈ VN

R is the rewriting rule, represented in the form P ::= Q or P → Q. We say P produces Q.

Backus Naur form In Backus Naur form, the rewriting rule R is represented in the form

<P> ::= <Q>

Table 2.1 Grammar notations and conventions

| S. no. | Element name | Notation | Example |
|---|---|---|---|
| 1. | Symbol, alphabet, or character set | | {a, b, ..., z} in English; {0, 1, ..., 9} in numeric language |
| 2. | Grammar | G | G = (VT, VN, R, S) |
| 3. | Set of terminals | VT | VT = {a, b, c, ...}; lower case letters at the beginning of the alphabet |
| 4. | Set of non-terminals | VN | VN = {A, B, C, ...}; upper case letters at the beginning of the alphabet |
| 5. | Set of rewriting or production rules | R | R = {R1, R2, ..., Rn} |
| 6. | Start symbol (any one of the non-terminals) | S | expr |
| 7. | Grammar symbol | W, X, Y, Z | Upper case representation of w, x, y, and z |
| 8. | String of terminals | w, x, y | w = abba |
| 9. | String of grammar symbols | α, β, γ | α = XYZ |
| 10. | Empty set | ∅ | |
| 11. | Variable or identifier | id | int area, base, height; here area, base, and height are the variables or identifiers |
| 12. | Operators (arithmetic/relational/Boolean) | op | + − * / ∣ & < = |
| 13. | Character constant | cconst | 'a', 'p', ... |
| 14. | Numeric constant | nconst | 10, 20.345, 147, ... |
| 15. | String constant | sconst | "hello", "home", ... |
| 16. | Constant | const | 'a', 10, "hello" |


We say P produces Q. In the BNF notation, all the non-terminals are embedded between angular brackets < >. However, throughout our discussion, we will use the grammar representation P → Q.
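For example, the simple-sentence rule from Section 2.2.2 reads the same way in both notations:

sentence → subject verb object
<sentence> ::= <subject> <verb> <object>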

    2.2.4 Hierarchy of Formal Languages

There is more than one way to write a grammar for generating a given language. To manage the complexity of writing the grammar and to keep the power of the language intact, Chomsky described four levels (0 through 3) of hierarchical grammars and, in turn, of languages. The objective of restricting the unmanageability of phrase structure grammars while keeping as much of their generative power as possible has led to the following hierarchy of grammars:

1. Type 0: unrestricted or phrase-structured grammar
2. Type 1: context-sensitive grammar (CSG)
3. Type 2: context-free grammar (CFG)
4. Type 3: regular grammar (RG)

Type 0 grammar is unrestricted grammar, which has the form P → Q, where P and Q have no restrictions in their form except that P has to be non-empty. The part before the → is called the left-hand side (LHS); the part after it is called the right-hand side (RHS). The other types are derived by restricting the form of the rules in this grammar. Each of these restrictions allows the resulting grammars to be more easily understood, implemented, and manipulated, but makes them gradually less powerful. However, they are still very useful; in fact, more useful than even type 0 in terms of compilers. In other words, type 0 through type 3, in that order, are progressively easier to understand and represent in computer terms but restrict the way the language can be constructed. For example, type 3 grammar is a restricted grammar. However, it can be used to represent the structure of the words, strings, or tokens in a language.

Type 2 grammar is more powerful than type 3. It is used to represent the structure of the sentences of the language, which is not possible with type 3 grammar. Type 2 grammar is CFG, where the syntax alone can be specified, not the semantics or the meaning associated with the language. To quote a few real-world examples, consider the following two sentences:

    Rama eats mango. Mango eats Rama.

Both sentences will be accepted by a recognizer written based on type 2 grammar. However, these sentences will be distinguished by a recognizer written based on type 1 or CSG, which has a flexible way of representing the grammar for expressing the language powerfully. Table 2.2 compares these four formal grammars for their power of expressing the language and their ease of implementation.

    These four types of grammar play a role in describing the formal language.


    2.3 DESIGN OF A LANGUAGE

Natural languages were not designed overnight. They have taken their own time to evolve. In addition, they have inherited many entities from other languages to make them more expressive and useful. In terms of computers and compilers, the design of a language depends on many factors and has evolved over the years, taking some applications into consideration. Recent language design stems from the requirement that languages should interoperate with many other languages and be recognized on more than one platform. Recent scenarios also dictate the incorporation of software engineering principles into the design and implementation of computer languages. A language is expected to have many important features in terms of both the users and the compilers used.

2.3.1 Features of a Good Language

A programming language is a sequence of strings used to convey the user's message to the computer to execute some sequence of operations/instructions for obtaining a solution to the given problem. It is a collection of programming language constructs such as keywords, variables, constants, and operators coined according to the grammar of the language.

From the user's perspective, the following is the list of expectations:

1. Easy to understand
2. Expressive power
3. Interoperability
4. Good turnaround time
5. Portability
6. Automatic error recovery

Table 2.2 Comparison of grammar types and the generated languages

| Grammar | Left-hand side of production rule | Right-hand side of production rule | Example | Implementation complexity | Power of grammar |
|---|---|---|---|---|---|
| Type 0 or unrestricted grammar | Any grammar symbols | Any grammar symbols | | Very difficult | Highly expressive |
| Type 1 or CSG | A limited combination of terminals and non-terminals | Any grammar symbols | αAβ → αγβ | Difficult | Medium expressive power |
| Type 2 or CFG | Only one non-terminal | Any grammar symbols | A → γ | Easy | Restricted power |
| Type 3 or regular or restricted | Only one non-terminal | Very limited combination of terminals and non-terminals | A → Aa ∣ b or A → aA ∣ b | Very easy | Very restricted power |


7. Good error reporting
8. Efficient memory usage
9. Provision of a good run-time environment

(a) Support for virtual machine
(b) Support for concurrent operation
(c) Support for unblocked operations

10. Garbage collection
11. Ability to interface with foreign functions
12. Ability to model real-world problems
13. Ability to expose the functions for usage in other languages

This list is not exhaustive. The expectations of the user increase over time with different combinations of these expectations. The question is whether it is possible to meet all of them. Hence, the language and compiler developer has to keep track of these expectations in the development process.

2.3.2 Representation of Languages

A language is a sequence or string of words or tokens. The structure or the format of the sequence is dictated by the grammar of the language. A sentence of the programming language may take different structures. The words in the language may take different formats. The structure of the language and the format of the tokens have to be specified. The structure of the language has slightly more complex specifications compared to the format of the tokens. For example, an identifier in a programming language normally begins with a letter or underscore followed by zero or more occurrences of letters, digits, or underscores. No special symbols are allowed in an identifier. Hence, there are several restrictions applied in the formation of the tokens. Are these restrictions dictated by the language designer or the compiler developer? This question is answered by two factors:

1. The programming language designer is satisfied with this restricted format for certain types of language constructs. That means no elaborate format is required to represent tokens such as identifiers, constants, and operators.

    2. The compiler developers desire a restricted grammar for easy implementation as far as possible.

Hence, type 3 grammar is sufficient to define the format of the tokens in a programming language.

On the other hand, the restricted grammar is not able to support structures such as parenthesis-balanced expressions and nested control statements in forming the sentences of the language. Hence, a higher level grammar may be considered. It may be type 2, type 1, or type 0. The factors to be considered in the selection of the grammar are influenced by the ease with which it can be implemented and its capability to support the requirements of the language designer. Many researchers, compiler developers, and language designers have agreed on type 2 as the grammar for specifying the structure of the language, compromising between the power of the grammar and the ease of its implementation.


Following the decision of selecting the grammar, the modules associated with the recognition of these programming language constructs are to be identified. They are scanners and parsers. A scanner scans the source program character by character and identifies the tokens conforming to type 3 grammar or RG. A parser takes the sequence of tokens as input and parses them. We will list the phases of the compiler shortly in this chapter; the first phase is the scanner, and the second phase is the parser, which is written to recognize the language whose structure is specified by type 2 grammar or CFG.

Type 2 grammar is syntactic grammar, which specifies only the syntax or the structure of the language, not the meaning associated with its representations. Higher level grammars are possible for language specification, but with increased complexity of implementation. Compiler developers do not take the risk of implementing higher level grammars because of the many uncertainties arising in the specification and recognition of the language. Hence, in terms of compilers, only type 3 and type 2 grammars are of interest to the programming language developer and the compiler writer.

    2.3.3 Grammar of a Language

    As discussed in Section 2.2, the formal definition of a grammar is that it is a collection of a finite set of terminals, finite set of non-terminals, finite set of rewriting or production rules, and a start symbol.

A source program consists of zero or more statements. If we use the notation S for a statement and Ss for statements, the program can be conveniently represented as follows:

Ss → S Ss | ε

where ε stands for the empty symbol. S may be any valid single statement.

S → Se | Sd

where Se and Sd are the executable statement and the declarative statement, respectively.

Example 2.1 An arithmetic expression is formed by operators applied on operands. Operands have the format of factors, terms, and expressions. The following set of production rules represents the grammar for an assignment statement.

1. Se → id = expr
2. expr → expr + term
3. expr → term
4. term → term * fact
5. term → fact
6. fact → (expr)
7. fact → id
8. fact → const

The set of rules given in Example 2.1 constitutes a grammar for an assignment statement of the form lvalue = rvalue. Here, the lvalue represents the LHS of an assignment, which refers to a memory location. The rvalue refers to the RHS of an assignment statement, which refers to a value: the value of an identifier, a constant, or the evaluation of an expression. In this grammar, Se is the start symbol, which is also a non-terminal. The words expr, term, and fact are the other non-terminals of the grammar. The words id and const are the terminals. The symbols =, +, and * are the assignment, additive, and multiplicative operators, respectively. The operators ( and ) give priority to the embedded


    expression to be parsed first. The | is the alternative grammar specification operator, which is a form of selection operator.
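As a quick illustration (this derivation is not part of the original example), the statement area = base * height can be derived from this grammar by applying rules 1, 3, 4, 5, 7, and 7 in that order:

Se → id = expr → id = term → id = term * fact → id = fact * fact → id = id * fact → id = id * id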

In a similar manner, the grammar for any language construct can be specified. A detailed description of all these grammar symbols is given in Chapter 4. In Section 2.4, we discuss how closely grammars and languages are tied to computer systems.

    2.4 EVOLUTION OF COMPILERS

The development of the compiler did not happen with a sharp boundary in time. Owing to the growth of digital electronics, its affordable prices, and its potential use in different applications, people started using computers aggressively. However, with the limitations of the hardware and its interface with the external world, researchers could not think of developing a compiler till the late 1950s. After the usage of assembly language with processor architectures, the programmers gradually felt the need for an easier language to deal with the digital hardware, which opened the door for compiler development.

2.4.1 History of Compilers

Although programming languages were thought of before the development of compilers, they could not be developed for want of resources. Alonzo Church and Stephen Cole Kleene developed lambda calculus in the 1930s, which was the world's first programming language. However, it was intended to model computation rather than being a means for programmers to describe programs for a computer system. In the early 1950s, Grace Murray Hopper coined the term compiler. At that time, compilation was called automatic programming.

FORmula TRANslation (FORTRAN), developed by a team of IBM researchers led by John Backus from 1954 to 1957, was the first programming language, and it was very popular for scientific applications. Following the success of FORTRAN, a committee was formed to develop a universal computer language; the committee developed an algorithmic language called ALGOL58. A team led by John McCarthy of the Massachusetts Institute of Technology (MIT) developed the LISt Processing (LISP) programming language. LISP was based on the lambda calculus. Programmers were successful in using LISP for list processing with a focus on intelligent processing. After 1960, the development of compilers gained momentum.

    The important events in the history of programming language theory are as follows:

1. The basic levels of formal languages (and their associated grammars) were proposed by Noam Chomsky in the 1950s. This classification was later called the Chomsky hierarchy of languages in finite automata theory and in the field of compiler design.

    2. Ole-Johan Dahl and Kristen Nygaard developed a language called Simula in the 1960s. It was the first object-oriented programming (OOP) language. In addition, it introduced the concept of co-routines.

    3. The following were the developments in the 1970s:


    (a) The object-oriented language Smalltalk was developed by a team of scientists at Xerox PARC led by Alan Kay.

(b) Sussman and Steele developed the Scheme programming language, a LISP dialect incorporating lexical scoping, a unified namespace, and elements from the Actor model.

    (c) Logic programming and Prolog were developed, allowing computer programs to be expressed as mathematical logic.

(d) In 1977, Backus brought out the limitations of the then current state of industrial languages in his ACM Turing Award lecture, presenting his new proposal by highlighting the features of function-level programming languages.

(e) In 1980, Robin Milner introduced the Calculus of Communicating Systems (CCS). C.A.R. Hoare brought out the communicating sequential processes (CSP) model. These formed a strong foundation for the representation of finite state machines for any kind of process development application, including concurrent operation.

4. The following were the developments in the 1980s:
(a) Bertrand Meyer created the design by contract methodology and incorporated it into the Eiffel programming language.

5. The following were the developments in the 1990s:
(a) Gregor Kiczales, Jim Des Rivieres, and Daniel G. Bobrow published the book titled The Art of the Metaobject Protocol, which deals with LISP and its extension.

(b) Philip Wadler introduced the use of programming templates for structured programs written in functional programming languages.

In addition to these, many other languages were evolving over time, and some of them were general-purpose languages. Some languages are meant for specific purposes. For example, the Ada language has the following features:

1. It is structured and statically typed.
2. It is imperative.
3. It is an OOP language.
4. It is extended from Pascal and other languages.
5. During 1977-1983, Jean Ichbiah and his team designed Ada for use in the US Department of Defense.

Critical applications, such as avionics, were well supported by the Ada language since Ada was believed to be a reliable language. Ada has undergone many revisions, and its latest version is Ada-2011.

In a similar manner, the COmmon Business-Oriented Language (COBOL) is purely meant for business applications, focusing more on data and designed for data processing. The standard edition of COBOL was released in 1960. It is still used in many maintenance applications. The recent version, COBOL 2002, supports OOP.


In 1964, John George Kemeny and Thomas Eugene Kurtz at Dartmouth College in New Hampshire, USA, designed a very simple language called Beginner's All-purpose Symbolic Instruction Code (BASIC), which was very popular till the late 1980s.

    Popular and general-purpose languages such as C, C++, Java, and C# were introduced in the following timeline:

    1. C was invented in 1972 by Dennis Ritchie at the Bell Telephone Laboratories.

2. In 1979, Bjarne Stroustrup at Bell Labs developed a language that was an enhancement of the C programming language and named it C with Classes; it was renamed C++ in 1983.

    3. In 1995, James Gosling at Sun Microsystems developed a popular language called Java. It is designed for supporting internetworking applications. In 2010, Oracle took over Sun Microsystems.

4. Microsoft developed a language called C# (pronounced 'see sharp') to work under the .NET platform. Similar to Java, C# is designed to produce a common intermediate language and can be ported to multiple platforms. It has many features: it is object-oriented, component-oriented, and generic.

5. Hypertext markup language (HTML) is a markup language used to present messages on the Internet.

6. Extensible markup language (XML) is a plain ASCII text language used to represent messages to be transported across computer systems running on any platform.

    2.4.2 Development of Compilers

In the compiler development process, we have to identify the requirements specifications. Before doing so, we have to understand the structure of the language and the purpose for which it is being developed. Having studied the anatomy of the language, it is required to analyse the various issues associated with the development of the compiler. Typical issues from the perspective of a language developer are as follows:

1. Location of the source code (from keyboard, file, or socket)
2. Types of data supported
   (a) Basic data types (Boolean, character, integer, real, etc.)
   (b) Qualified data types (short, long, signed, unsigned, etc.)
   (c) Derived data types (record, 1D array, 2D array, file, pointer, etc.)
3. Types of constants (Boolean, char, string, etc.)
4. Representation of variables
5. Size of each data type supported
6. Scope of variables (static, dynamic)
7. Lifetime of variables (local, global, external)
8. Interface with other compiled codes
9. Error reports that can be produced

    10. Decision about the operating environment (Microsoft, Unix variants)


11. Ability to support parallel processing and, if supported, the method of separating the units that are to be run in parallel

    The typical issues associated with compilers are as follows:

1. How to read the source code
2. How to represent the source code
3. How to separate the tokens
4. What data structures can be used for storing variable information
5. How to store them in memory (code, stack, or heap areas)
6. How to manage the storage during run-time
7. How to prepare the errors linked with multiple lines
8. To what extent semantic checking can be done
9. What IC is to be preferred
10. Where to introduce the optimization process
11. Mapping of IC to machine code
12. Interface with the host operating system for any parallel processing support

Having gone through the various issues on both sides, that is, those of the language developer and the compiler developer, the various modules have to be identified. Then the function of each module has to be outlined. The interface between the functions should be decided so that the development is scalable. The development of a compiler is not a one-time process; it has to be upgradable. Assign each module to the right team, where the team's expertise matches the requirements. For example, members who are involved in code generation must have a thorough knowledge of the machine architecture. Divide the modules into sub-modules so that they are manageable. Each team can then work concurrently, with proper interaction at the appropriate time.

The compiler has evolved very slowly, working with languages ranging from the very simple to the more recent sophisticated ones. Initially, the different phases (discussed in Section 2.5) were tested by writing their output to secondary storage and reading it back before proceeding to the next phase.

In addition, the development of each phase is monotonous work, and many tools have been developed and are available in the market (some of them open source) for implementing the phases with ease. Hence, the possibility of exploiting these tools must be studied. Use of such tools enables rapid compiler development.

    2.5 STAGES OF COMPILATION

We have seen in Chapter 1 that a compiler is system software that translates source code into target code. Figure 2.2 shows the surface-level, or user-level, view of the compiler.

The input is the source code. It can be in any traditional programming language, such as FORTRAN or COBOL. The target code is the object code for a particular machine architecture. The design of the compiler is broadly divided into two parts: the front end and the back end. The front end of the compiler focuses on analysing the source code. It scans the


source code character by character and separates the source code into a list of tokens associated with their attributes. Following the scanning process, it checks the source code for its structure as described by the grammar of the source code. If it is not successful, it reports the error to the user and terminates the compilation process. Otherwise, it produces an IC. It is called an IC because the back end of the compiler makes use of this code for further processing.

    The back end of the compiler takes the IC as an input and produces the machine code or translates it into any other target code as specified by the designer. Generally, the front end of the compiler is called the analysis part of the compiler, and the back end of the compiler is called the synthesis part of the compiler. The structure of the compiler is shown in Fig. 2.3.

These two parts, the front and back ends of the compiler, are further divided into different phases, as shown in Fig. 2.4.

    Each stage has its own significance and techniques associated with it. The functions of each phase are outlined in the following sections.

Fig. 2.2 User-level view of a compiler: source code → compiler → target code, with error messages reported to the user

Fig. 2.3 Structure of a compiler: source code → compiler front end → IC → compiler back end (loop optimization, register allocation, code generation, code scheduling) → machine code


Fig. 2.4 Stages of compiler design: source program → lexical analyser (scanner) → tokens → syntax analyser (parser) → parse tree → semantic analyser → abstract syntax tree → intermediate code (IC) generator → non-optimized IC → IC optimizer → optimized IC → target code generator → target machine code


    2.5.1 Lexical Analysis

A lexical analyser or scanner is the first phase of the compiler. As the name implies, the scanner scans the source code character by character, delimited by white-space characters, operators, and punctuators, and separates the tokens.

For example, consider the source code segment D = A + B * C. As programmers, we know that A, B, C, and D are variables or identifiers and that =, +, and * are operators. The function of the scanner is to scan through this programming statement and separate the tokens. Here, the delimiters are the operators. The output of the scanner program is shown in Table 2.3.

Example 2.2 Consider the statement Area = 1/2 * base * height. The scanner program in this example uses the white-space character and operator as delimiters and separates the tokens as follows:

id: Area, base, height
op: =, /, *
nconst: 1, 2

Note that during the scanning process, the scanner program separates the tokens, assigns each token its type, and stores its lexeme value in a place holder. Hence, a token in a source code is represented as a pair <token type, lexeme value>.
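A minimal C sketch of this pair representation (the names Token, type, lexeme, and MAX_LEXEME are illustrative, not from the text):

/* A token represented as a <token type, lexeme value> pair. */
#define MAX_LEXEME 64

struct Token {
    int  type;                /* token type code, e.g., id, op, nconst */
    char lexeme[MAX_LEXEME];  /* the characters that form the token    */
};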

Example 2.3 Consider the source code statement given in Example 2.2 and present the output as a sequence of tokens in a suitable format.

As done in Example 2.2, the scanner program uses the white space and operators as delimiting characters and produces the token sequence as follows:

<id, Area> <op, => <nconst, 1> <op, /> <nconst, 2> <op, *> <id, base> <op, *> <id, height>

Will there be so many token types in the source code that they become unmanageable? Let us analyse a source program and its types of tokens. Any programming language will have a limited number of keywords. These keywords can be given type numbers in the order (1, ..., n), where n is the number of keywords. Beyond these, there are only a few token types, such as identifiers, constants, operators, and punctuators. Each token will be represented by its token type and its lexeme value.

Table 2.3 Output of the scanner program

| Token no. | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| Token type | id | op | id | op | id | op | id |
| Token or lexeme value | D | = | A | + | B | * | C |

Note: id stands for the identifiers or variables; op stands for the operators in the source code.


If a scanner program is implemented in the C language, one may use the following strategy for assigning the token numbers.

/* Token-type codes: one number per keyword, followed by codes for
   the generic token classes (identifiers, operators, constants). */
#define char 1
#define int 2
#define float 3
...
#define id 28
#define op 29
#define nConst 30
#define sConst 31
...

Example 2.4 Consider the following program segment and annotate its token output sequence with the token representations:

    int A, B;

    float C;

    C = A * B;
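The annotated sequence itself is not reproduced in this extract; a plausible reading, writing keywords and punctuators by their own names, would be

<int> <id, A> <,> <id, B> <;> <float> <id, C> <;> <id, C> <op, => <id, A> <op, *> <id, B> <;>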

In an actual implementation, each operator will also be assigned a numerical value. The lexeme value will be replaced with the location of the identifier. During the parsing process, each token with its attribute is passed to the parser, and the necessary routines will be called to store the identifier in the symbol table. The symbol table is implemented with a suitable data structure to hold information about the identifiers in the source code.

What is the format for the tokens? How are they specified? They are specified by regular expressions, RG, or type 3 grammar, which will be dealt with in detail in Chapter 3. The separated tokens along with their attributes will be passed to the next phase of the compiler, called the parser or the syntax analyser.
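For instance, the identifier format described in Section 2.3.2 (a letter or underscore followed by zero or more letters, digits, or underscores) corresponds to the regular expression [A-Za-z_][A-Za-z0-9_]*.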

2.5.2 Syntactic Analysis

The syntax analyser, also called the parser, is responsible for reading the sentences of the language and checking their structure against the grammar. There are various structures supported in a language. They are broadly classified into executable and non-executable statements. Executable statements are statements that will be executed by the processor during run-time.

    For example, the statement C = A + B is an executable one, where the values of the variables A and B are added and assigned to the variable C during run-time. On the other hand, consider a statement int A, B, C in a C-like language; it is not executed during run-time. Instead, the variables are processed during compile time, and the information about these variables is stored in a symbol table to be referred to during run-time. These statements are called declarative statements.


    There are different types of statements in these two categories:

1. Declarative statement
2. Assignment statement (Sa)
   (a) lvalue = expr
3. Control statement
   (a) Selective statement
       (i) if statement (Sif)
       (ii) if-then-else statement (Sie)
       (iii) switch case (Ssc)
   (b) Iterative statement
       (i) for statement (Sfor)
       (ii) while statement (Swhile)
       (iii) repeat while or do-while statement (Sdw)
   (c) goto statement (Sgo)
4. IO statement (Sio)

St represents a statement of type t, such as if, if-else, while, or do-while. Each statement has its own syntax and function. For example, the syntax for the if-else statement is

if expr statement1 else statement2

Again, each statement can be of any one of these two categories (declarative or executable). How do we write the grammar? As we have seen in the beginning of this chapter, a grammar is represented by four tuples: a finite set of terminals, a finite set of non-terminals, a finite set of production rules, and a start symbol. Each statement is specified by one or more production rules.

Example 2.5 Using the notation Se for the executable statement in general, we can write the syntax for the various types of executable statements.

Se → Sa | Sif | Sie | Ssc | Sfor | Swhile | Sdw | Sgo

The syntax for each statement is

Sa → id = expr
Sif → if expr Se
Sie → if expr Se else Se
Ssc → switch expr Scase
Sfor → for exprinit exprcheck exprincrdecr Se
Swhile → while expr Se
Sdw → do Se while expr
Sgo → goto L
Scase → Scase case expr Se | ε


where init refers to the initial condition of the expression, check refers to the condition checking for the continuation/termination of the loop, and incrdecr refers to the increment/decrement of the expression used to check the bounding condition of the loop.

    All statements and their production rules or the rewriting rules are self-explanatory. The typical production rules for the arithmetic expression are given in Example 2.1.
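As an illustration of how such a rule drives a parser, the following is a minimal recursive-descent sketch in C for the if-else rule; match(), parse_expr(), parse_stmt(), and the token codes are assumed helpers, not part of the text.

/* Assumed helpers: match(t) consumes the current token if its type
   is t; parse_expr() and parse_stmt() parse an expr and an Se. */
void match(int type);
void parse_expr(void);
void parse_stmt(void);

enum { TOK_IF = 100, TOK_ELSE = 101 };  /* illustrative token codes */

/* Sie -> if expr Se else Se */
void parse_if_else(void)
{
    match(TOK_IF);    /* the 'if' keyword        */
    parse_expr();     /* the controlling expr    */
    parse_stmt();     /* the 'then' statement Se */
    match(TOK_ELSE);  /* the 'else' keyword      */
    parse_stmt();     /* the 'else' statement Se */
}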

Among the four formal grammars (type 0 through type 3), type 3 grammar is suitable for specifying the structure of the tokens since it is simple and powerful enough to represent them. However, it is not sufficient to specify the structure of a sentence. Hence, it is preferred to use type 2 grammar or CFG. Should we use type 1 or type 0 grammar? Of course, we can use type 1 or type 0 grammar without loss of precision. However, it is relatively more difficult and complex to write the recognizer for type 1 or type 0 than for type 2 grammars. In addition, type 0 and type 1 grammars are not required to specify the language structure at the syntactic level. Hence, in terms of compilers, we are restricted to using only type 3 and type 2 grammars.

    Recently, attempts are being made to develop recognizers for languages described by type 1 grammar. So far, we have focused on specifying the grammar for the language. Let us turn our attention to developing the recognizer or parser for this language.

The objective of the parser is to scan through the source code by calling the scanner routine and to check whether the sequence of tokens is in the order specified by the grammar. If it is successful, it produces the IC; else, it reports an error.

By notation, the CFG rule is denoted by

A → α

where α stands for a string of grammar symbols, each of which is either a terminal or a non-terminal.

For example, the if-then-else statement is denoted by

Sie → if expr Se else Se

Mapping the rule in terms of terminals and non-terminals, we get

N → T N N T N

Representing each terminal or non-terminal (grammar symbol) as Xi, we get

N → X1 X2 X3 X4 X5, where Xi stands for the grammar symbol representing either a terminal or a non-terminal.

Using the notations, this is A → α, where A stands for a non-terminal (N) and α stands for the string of grammar symbols X1 X2 X3 X4 X5. Hence, any CFG rule can be presented in this form.

    The parser is broadly classified into two categories:

1. Top-down parser
2. Bottom-up parser


    If a given language conforms to the grammar of the language or if the grammar is able to produce the given language, we call it successful parsing.

The top-down parser expands the start symbol of the grammar and, successively, all the non-terminals in the RHS of the rules. At every point of expansion or derivation of a production, if the final string obtained matches the given source code, the parsing is said to be complete.

Example 2.6 Consider the statement if (a < 10) c = a + b else c = a − b for parsing. In this example, there is one Boolean expression and two assignment statements. Both assignment statements are executable statements denoted by Se. In addition, the Boolean expression is derivable from the rule for expression. In Example 2.1, we had the rules for arithmetic expressions, which can be extended to Boolean expressions also. The derivation steps are

Sie → if expr Se else Se → if (a < 10) c = a + b else c = a − b

This has the analogy of the derivation A → X1 X2 X3 X4 X5.

In Example 2.6, we have started from the non-terminal Sie, which is also the start symbol for the if-else statement. At every production step, a non-terminal is replaced by one of the RHS values of the production rules. Finally, the string that we get is the sentence of the language. In the derivation process, the last string is if (a < 10) c = a + b else c = a − b. All the other intermediate steps are called sentential forms of the language. In other words, the sentential form with only terminals is called a sentence of the language.

    So far, we have seen how a given sentence can be parsed using the top-down parser. Alternatively, we can use the bottom-up parser. As the name implies, we can consider the given sentence of the language, scan left to right, and identify the appropriate RHS of any one of the given grammar rules (we call it handle) and replace it by the LHS of that rule. Again we find the handle and replace it with the non-terminal in the LHS of the grammar rule. This process is repeated until we get the start symbol of the grammar.

However, identifying the exact handle is the issue in the bottom-up parser. The process of identifying the exact handle will be discussed in detail in Chapter 4 on parsing techniques. Consider the grammar for the arithmetic expression:

1. expr → expr + term
2. expr → term
3. term → term * fact
4. term → fact
5. fact → (expr)
6. fact → id
7. fact → const

In this grammar, the appropriate handles are all the RHS items. In general, a handle is nothing but the α in a rule of the form A → α. Here, α is the string of grammar symbols that will be found in the sentential form during the parsing process and will be replaced by A.

Example 2.7 Consider the expression A + B. Here, both A and B are variables, or identifiers for some memory locations. We denote them by id. Hence, the expression is internally represented by id + id.


If we scan this sentence from left to right, we find that id is the RHS of rule 6. Hence, it is replaced by fact. The reduction steps are

id + id ⇒ fact + id ⇒ term + id ⇒ expr + id ⇒ expr + fact ⇒ expr + term ⇒ expr

expr is the start symbol for the expression.

    We have taken the simplest example for doing the parsing. In the real programming envi-ronment, there are varieties of expressions, and different combinations of operands are possible.

Example 2.8 Consider an arithmetic expression A + B * C. As in the previous example, it is internally represented as id + id * id. The parsing process is

id + id * id ⇒ fact + id * id ⇒ term + id * id ⇒ expr + id * id ⇒ expr + fact * id ⇒ expr + term * id ⇒ expr + term * fact ⇒ expr + term ⇒ expr

expr is the start symbol for the expression.

In Examples 2.7 and 2.8, for the given language, bottom-up parsing is successful.

Let us write down the rules for the arithmetic expression in a slightly different and valid form as follows:

1. expr → expr + expr
2. expr → expr * expr
3. expr → (expr)
4. expr → id

Let us work on the example given in Example 2.7. A + B is mapped to id + id, and the bottom-up parsing is

id + id ⇒ expr + id ⇒ expr + expr ⇒ expr

which is the start symbol. For the example A + B * C, it is mapped to id + id * id, and the parsing is

id + id * id ⇒ expr + id * id ⇒ expr + expr * id ⇒ expr * id ⇒ expr * expr ⇒ expr

which is the start symbol.

Here, the process of parsing is successful. However, the meaning associated with the semantic actions of this parsing is not preserved. In evaluating an arithmetic expression, we give precedence to an expression of the form E * E over one of the form E + E. In this example, it has happened the other way: expr + expr has been reduced first, before expr * expr. One solution is to write the grammar rules in a different order. However, this is not practical in a design environment involving a larger number of rules. Therefore, this kind of grammar (ambiguous grammar) must be modified. Elaborate discussions on using ambiguous grammar are presented in Chapter 4.


    2.5.3 Semantic Analysis

The CFG discussed in Section 2.5.2 is able to check the structure of the language. It does not take care of what operands are worked on by operators, and how. How does the system operate on two operands of different types? One of the issues is the size of the operands. For example, consider the following code segment:

    int i,j;

    short int si,sj;

    i = 10;

    j = BIGINTEGER;

    si = 10; // Case 1: No loss of data

    sj = j; // Case 2: Loss of data

As operands of larger-sized data (e.g., 4 bytes) are assigned to smaller-sized data (e.g., 2 bytes), there are chances of loss of data depending on the value of the data. However, the burden of checking the values of each data item cannot be put on the programmer. Hence, a good compiler is expected to analyse this kind of statement and report errors.

Consider a situation where an evaluation of the arithmetic expression 5/2 is to be carried out. It gives the answer 2, since both operands are integers. Hence, integer division truncates the remainder. However, in many situations, the programmer is interested in working with mixed data types. For example, consider 5.0/2. Is it 2.5 or 2? It depends on how the compiler works. Most machine architectures support operations on operands of similar data types. If this is not the case, the compiler has to check the operands for type compatibility.
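A small illustrative C program makes the difference visible:

#include <stdio.h>

int main(void)
{
    printf("%d\n", 5 / 2);    /* both operands int: prints 2 */
    printf("%f\n", 5.0 / 2);  /* 2 is promoted to 2.0: prints 2.500000 */
    return 0;
}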

Consider the example of operating with numeric elements such that one operand is of the integer type and the other is of the real type. Following the type compatibility checking, size compatibility checking has to be done. In such cases, it is preferred by the programmer and the language designer that both data types be unified to the higher-level type. If one operand is an integer and the other is real, then the integer data will be converted into real data (which has the larger data size). This process is called data promotion. It eliminates the loss of data.

Example 2.9 For the following code segment, data conversion takes place to promote the lower-sized data to the equivalent compatible peer-sized data for the operation.

    int a, b;

    float fa;

    a = 10;

fa = 12.5;

    b = a + fa;

There are three executable and two declarative statements in this example. Consider the third assignment statement. Here, integer data is added to real data. In the bottom-up parsing process, the compiler has to reduce the sentence of the language (the assignment statement) to the start symbol of the assignment statement. Since the expression on the RHS of the rule has mixed operands, type conversion has to be carried out. Let us use the suffix notations i and f to denote integer and real (float), respectively.


b = a + fa is mapped to idi = idi + idf, and using the rules given in Example 2.1:

idi = idi + idf ⇒ idi = facti + idf ⇒ idi = facti + factf ⇒ idi = termi + factf ⇒ idi = expri + factf ⇒ idi = expri + termf ⇒ idi = (convertInt2Float) expri + termf
// There is no error till this step.
⇒ idi = exprf
// This step shows the error message.

The expression with the real value attribute on the RHS is to be assigned to the identifier with the integer value attribute. Here, either an error or a warning message has to be displayed.

However, if a type cast operator is provided in the assignment statement of the previous example, the warning or error message need not be shown, and it will be successfully parsed.

Example 2.10 If the third assignment statement in Example 2.9 is modified as b = (int) a + fa; then the process will be as follows:

idi = idi + idf ⇒ idi = facti + idf ⇒ idi = facti + factf ⇒ idi = termi + factf ⇒ idi = expri + factf ⇒ idi = expri + termf ⇒ idi = (convertInt2Float) expri + termf ⇒ idi = (convertFloat2Int) exprf

In Example 2.10, the function convertInt2Float converts the data from integer to float type, while the function convertFloat2Int reduces the data from float to integer type. From a compiler's perspective, the first kind of conversion, inserted by the compiler itself, is called implicit type conversion; the second, requested by the programmer's cast, is called explicit type conversion.
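In C source terms, the two conversions of Examples 2.9 and 2.10 look as follows (a minimal sketch):

    int main(void)
    {
        int a = 10;
        float fa = 12.5f;

        int b1 = a + fa;        /* implicit: a is promoted to float, and the float
                                   result is truncated to int; compilers typically warn */
        int b2 = (int)(a + fa); /* explicit: the cast documents that the
                                   truncation is intentional */
        return b1 + b2;
    }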

Since we are using a CFG, a separate phase performing semantic analysis is required. If we used a CSG, these issues would be handled as part of the parsing process itself. However, implementing a language specified by a CSG is cumbersome, and most programming languages can be restricted so that a CFG suffices; the semantic analyser then becomes a supplementary module of the compiler. In recent languages, even explicit type conversion routines show warnings to alert the user. Writing a compiler is therefore not only a technique but also an art of studying the various conveniences of programmers and assisting them.

    Similarly, there are many semantic issues in the design of a compiler. They are dealt with in detail in Chapter 6.

    2.5.4 Intermediate Code Generation

Once syntax and semantic analysis have been carried out successfully, the source code is represented in a form suitable for the subsequent phases. We have already discussed that lexical analysis, syntax analysis, and semantic analysis are parts of the front end of the compiler, and that the front end transforms the source code into an IC. Why is the compiler designed to generate IC and not machine code directly?

    There are several reasons for generating IC before creating machine code.


Modular Design The compiler could be written in monolithic form, where there would be no flexibility for making changes. With an IC, the front and back ends of the compiler can instead be isolated: the analysis part generates the IC, and the synthesis part works only on the IC, with no reference to the source code. At the same time, the front end does not depend on the back end, that is, on the target machine.

Refer to Figs 2.5 and 2.6. In the first case, that is, without an IC, each source language is mapped to each target language. If the number of source languages is m and the number of target languages is n, then m * n compilers have to be developed. For a large number of source and machine languages, this effort becomes unmanageable. On the other hand, if an IC is used, the number of translators needed for m source languages and n target languages is only m + n (m front ends and n back ends). For example, four source languages targeting three machines require 12 compilers without an IC, but only seven translators with one.

Keeping these factors in mind, Sun Microsystems introduced the concept of byte code, an IC with a standard format, when releasing the Java language, which is popular worldwide. Byte code is supported on almost all platforms: to convert byte code into machine code, special software called the Java virtual machine (JVM) is available on each target machine. For these reasons, byte code developed using the Java language on one platform can be ported to any other platform without difficulty.

Fig. 2.5 Compilations without IC (each of the m source codes is translated directly into each of the n machine codes)

Fig. 2.6 Compilations with IC (each of the m source codes is translated into a common intermediate code, which is then translated into each of the n machine codes)


Following Sun's Java, Microsoft introduced a special language called C# (C Sharp) to work in an Internet-like environment on the .NET platform. It has a common language runtime (CLR) environment, which is usable by different programming languages.

Advantages IC with standardized features has played a major role in the Internet era. It supports the following features:

1. Consistent and continuous programming model
2. Develop once and run anywhere
3. Simplified deployment
4. Wide platform reach
5. Programming language integration
6. Simplified code reuse
7. Interoperability

In what format can the IC be generated? Many forms of IC have been used over the years. A few of them are as follows:

1. Syntax trees
2. Three-address code
   (a) Quadruple
   (b) Triple
   (c) Indirect triple
3. Any valid and usable IC

The syntax tree represents the source code in a tree-like structure. For example, consider an arithmetic expression A + B * C, which is mapped to id + id * id and represented in the form of a tree. The syntax tree is the tree representation of the source code after syntax checking is completed. Figure 2.7 shows the syntax tree of the expression. This syntax tree is also called an expression tree, labelled tree, or operator tree. Here, the interior nodes are operators and the leaf nodes are operands. In addition, this tree is a binary tree.

Fig. 2.7 Syntax tree for id + id * id (the root is +, its left child is the leaf A, and its right child is * with leaf children B and C)
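As a sketch, a syntax tree node can be realized in C as a small record, and the tree of Fig. 2.7 can be built bottom-up (the names node and mknode are illustrative):

    #include <stdlib.h>

    typedef struct node
    {
        const char *label;          /* operator symbol or operand name */
        struct node *left, *right;  /* NULL for leaf (operand) nodes */
    } node;

    static node *mknode(const char *label, node *left, node *right)
    {
        node *n = malloc(sizeof(node));
        if (n == NULL)
            exit(1);
        n->label = label;
        n->left = left;
        n->right = right;
        return n;
    }

    int main(void)
    {
        /* A + B * C: '*' binds tighter, so it becomes the right child of '+' */
        node *expr = mknode("+",
                            mknode("A", NULL, NULL),
                            mknode("*", mknode("B", NULL, NULL),
                                        mknode("C", NULL, NULL)));
        return expr == NULL;
    }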


If we traverse the tree in different orders, we obtain different expressions:

Preorder traversal gives the prefix expression: + A * B C
Inorder traversal gives the infix expression: A + B * C
Postorder traversal gives the postfix expression: A B C * +

If we process the postfix expression, we get the three-address code as follows:

    t1 = B * C
    t2 = A + t1

Why is it called three-address code? Two addresses are used for storing the two operands in the RHS, and the third address is used to store the result. The details of the various three-address codes are covered in Chapter 6.
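To see the mechanics, here is a sketch in C that turns a postfix string over one-letter operands into three-address code using a stack of names (the function and variable names are illustrative):

    #include <stdio.h>
    #include <ctype.h>
    #include <string.h>

    /* Translate a postfix expression with one-letter operands into
       three-address code, keeping operand/temporary names on a stack. */
    static void postfix_to_tac(const char *postfix)
    {
        char stack[32][8];   /* each entry holds a name such as "B" or "t1" */
        int top = 0, ntemp = 0;

        for (const char *p = postfix; *p; p++)
        {
            if (isalpha((unsigned char)*p))      /* operand: push its name */
            {
                sprintf(stack[top++], "%c", *p);
            }
            else                                 /* operator: pop two, emit */
            {
                char temp[8];
                sprintf(temp, "t%d", ++ntemp);
                printf("%s = %s %c %s\n", temp, stack[top - 2], *p, stack[top - 1]);
                top -= 2;
                strcpy(stack[top++], temp);
            }
        }
    }

    int main(void)
    {
        postfix_to_tac("ABC*+");   /* prints t1 = B * C, then t2 = A + t1 */
        return 0;
    }

Running it on "ABC*+" prints exactly the two statements shown above.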

    2.5.5 Code Optimization

Optimization is the process of improving some objective function. From a compiler's point of view, the objective is to reduce the time and/or space requirements of the generated code. Recent compilers provide optimization options. During the development of application code, optimization is normally disabled; this mode is called debug mode. For product delivery, the code is built in release mode after being subjected to many optimization processes (for example, gcc builds with -O0 and -g for debugging, and with -O2 for release). The internal architecture varies from one processor to another, and the compiler developer working on the front end need not be concerned with it. Some processors handle certain types of code much better than others. In addition, some processors have special performance-enhancing features, but to use them the code must be arranged in a compatible way: some processors are designed to work well with strings, some with numbers. For the same operation, two processors may support different instructions, with different timing and memory requirements. Programs that take these issues into account can obtain more performance than those that do not.

In addition, users work on different platforms with complex business requirements, and compilers must handle all of these to be usable. Optimization can be done at various levels; even programmers can take care of these issues while developing the source code.

Example 2.11 Consider the following nested for loops, in which a loop-invariant product (say, x = a * b) is computed inside the inner loop:

    for(i = 0; i < N; i++)
    {
        for(j = 0; j < M; j++)
        {
            x = a * b; // neither a nor b changes inside the loops
            // Some statements
        }
    }

Since a * b does not depend on i or j, the multiplication can be moved out of the loops and computed only once:

    x = a * b;
    for(i = 0; i < N; i++)
    {
        for(j = 0; j < M; j++)
        {
            // Some statements
        }
    }

Now the gain in computational time is (N * M - 1) multiplications.

From the software development point of view, the user cannot concentrate on all these issues, since he or she focuses only on the business logic. Hence, a good compiler is expected to identify the scope for optimization and achieve it. Optimization mostly occurs in the following areas:

1. Source code
2. Intermediate code
3. Machine code

Optimization of the source code does not come under the purview of the compiler; the compiler can optimize either the IC or the machine code. However, working with machine code is cumbersome, since it is in binary form. In addition, any non-optimized code from the early stages of compilation accumulates in the later stages. Hence, it is preferable to optimize the code at the IC level for the following reasons:

1. Understanding IC is easier than understanding machine code.
2. Existing optimization techniques can be deployed more directly on IC than on machine code.

The process of optimization can be done at the code and loop levels. Code-level optimization takes care of the following (case 3 is sketched after this list):

1. Reduction in cost by using appropriate equivalent operations
2. Elimination of repeated calculations
3. Elimination of common sub-expression evaluations
4. Identification and removal of unreachable code
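As a small sketch of case 3, common sub-expression elimination rewrites the code so that (b + c) is evaluated only once (the variable names are illustrative):

    int cse_demo(int b, int c, int d, int e)
    {
        /* Before: the sub-expression (b + c) is evaluated twice:
               x = (b + c) * d;
               y = (b + c) - e;                                   */

        /* After: it is computed once and reused. */
        int t = b + c;
        int x = t * d;
        int y = t - e;
        return x + y;
    }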

Loop optimization focuses on the control of execution of the statements inside a loop. Statistics from programming environments suggest that most real-world programs spend about 80 per cent of their execution inside loops. If the focus is on loop optimization, the efficiency of execution can therefore be improved considerably. Optimization is dealt with in detail in Chapter 7. After the optimization process, code generation can be initiated.

    2.5.6 Code Generation

The final phase of the compiler is code generation. It has two views. On one hand, the process of generating machine code from source code looks very complex. On the other hand, if a standard format for the IC is defined and templates are made available for the translation process, it reduces to a mapping of the IC to machine code. The requirement here is familiarity with the complete instruction set of the processor, in order to decide which code is efficient for the given IC. As discussed in Chapter 6, every construct of the programming language can be brought into a few IC categories, and these can be converted into machine language using specialized instructions meant for the purpose.


Example 2.12 Consider an arithmetic expression D = A + B * C. It is mapped internally to id = id + id * id. The ICs generated are the following:

1. T1 = B * C  // parsed IC for the subexpression B * C
2. T2 = A + T1 // makes use of the previous value T1 and A
3. D = T2      // assigns the computed value to the result variable D

    If we translate each IC into machine code directly, we will get the machine code as

    T1 = B * C MOV R1, B T2 = A + T1 MOV R1, A

    MOV R2, C MOV R2, T1

    ADD R1, R2 ADD R1, R2

    MOV T1, R1 MOV T2, R1

    D = T2 MOV R1, T2

    MOV D, R1

Code generation is influenced by many factors, as listed here:

1. Register allocation
2. Register scheduling
3. Code selection
4. Addressing modes
5. Instruction format
6. Power of instructions
7. Optimization at the machine code level
8. Back patching

The developer concentrating on code generation should be expert enough to take care of all these issues. Chapter 8 discusses code generation in detail.

    2.5.7 Symbol Table Management

The purpose of a programming language is to get some work done by the computer. Initially, compilers focused only on the evaluation of arithmetic expressions involving operators and operands. The operands are of different types and sizes, and they need different types of storage area. The symbols or variables are the core part of the expressions; hence, the identifiers are required during run-time. The questions to be answered are as follows:

1. Where are they stored?
2. When are they stored?
3. How are they accessed?
4. What is their format?
5. What data structure can be used?

When variables are declared, two phases (scanning and parsing) process these statements. For example, consider two variables a and b, which are declared as integers.

    int a, b;

During the scanning operation, only the tokens are separated: the token int, which is uniquely identified as a keyword, followed by the list of variables a and b. Only during the parsing process is the relationship between the int keyword and the variable list a, b established. At this point, a procedure has to be invoked to store the variable, or symbol, of the program. The location for storing these variables is decided by their scope. If a variable is static, it is stored in the heap memory of the system.


If the variables are dynamic, they are stored on the stack. Each language supports different types of scope and lifetime of variables. Thus, the symbol tables are created during the parsing process and used throughout the later parts of the compilation process.

    Each symbol is associated with a minimum set of attributes as listed here.

1. Name
2. Type
3. Size
4. Location

Since every symbol has a set of attributes, it can be realized as a record of information; since there are multiple variables in a source code, the table can be realized as a set of records. Any suitable data structure for a set can be used. Examples of such data structures are as follows:

1. Array of records
2. Linked list of records
3. Tree of records (binary search tree, B-tree, etc.)
4. Hash data structure, and so on

    Each data structure has its own advantages and disadvantages. It can be selected depending on the type of language.

Example 2.13 Consider a program segment having the declarative statements:

    int principal;
    float rate, interest;

A typical data structure, and the simplest, for holding the information of these variables is

    #define NAMESIZE 30
    #define MAXSYMBOLS 50

    typedef struct symbol
    {
        char name[NAMESIZE]; /* identifier name */
        int type;            /* type code, e.g., 1 for int, 2 for float */
        int size;            /* size of the data in bytes */
        void *location;      /* pointer to the storage allocated for the value */
    } symbol;

    symbol symbols[MAXSYMBOLS];

    The simplest data structure is an array of records of symbol information (Table 2.4).

    This symbol table is created during the syntax and semantic analysis phase and is referred to by other phases in the compilation process.

Table 2.4 Data structure

    Name        Type    Size    Value/location
    principal   1       4       Pointer to memory
    rate        2       8       Pointer to memory
    interest    2       8       Pointer to memory
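A sketch of insert and lookup routines over the array of Example 2.13 might look as follows (linear search; nsymbols, lookup, and insert are illustrative names):

    #include <string.h>

    static int nsymbols = 0;               /* entries currently in use */

    int lookup(const char *name)           /* returns index, or -1 if absent */
    {
        for (int i = 0; i < nsymbols; i++)
            if (strcmp(symbols[i].name, name) == 0)
                return i;
        return -1;
    }

    int insert(const char *name, int type, int size)
    {
        if (nsymbols >= MAXSYMBOLS || lookup(name) >= 0)
            return -1;                     /* table full, or symbol redeclared */
        strncpy(symbols[nsymbols].name, name, NAMESIZE - 1);
        symbols[nsymbols].name[NAMESIZE - 1] = '\0';
        symbols[nsymbols].type = type;
        symbols[nsymbols].size = size;
        symbols[nsymbols].location = NULL; /* filled in when storage is allocated */
        return nsymbols++;
    }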


    A detailed description of symbol table management is given in Chapter 5.

    2.5.8 Error Management

For any beginner in a programming environment, it is a nightmare to write an error-free program. Earlier compilers were not friendly enough to handle error reporting properly. Detecting, locating, and reporting errors, and recovering from them, are the most important tasks of the error-management module. Some desirable characteristics of error-management modules are as follows:

1. The compiler should not crash on encountering an error.
2. The compiler should report an error and recover from it, so that it can proceed with the remaining lines of code.
3. The reported error must be meaningful.
4. The error-correction module should not change the meaning of the intended operations.

Some editors have built-in knowledge of the language profile and guide the developer, assisting him/her to type without mistakes. In most integrated development environments (IDEs), the editor helps by matching parentheses, flagging undeclared variables, and colouring the different types of programming constructs.

Errors are broadly classified into static (compile-time) errors and dynamic (run-time) errors. It is essential to report as many errors as possible in the earlier stages; this helps in saving resources. In Java-like languages, compile-time and run-time problems are clearly classified into errors and exceptions, respectively.

    During the compilation process, errors may occur in more than one phase. Some of the errors are as follows:

1. Lexical errors
   (a) Misspelling
   (b) Juxtaposing of characters
2. Syntax errors (CFG errors)
   (a) Unbalanced parentheses
   (b) Undeclared variables
   (c) Missing punctuation operators
3. Semantic errors (CSG errors)
   (a) Truncation of results
   (b) Unreachable code
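Following this classification, each line of the short (deliberately incorrect) C fragment below triggers one class of error:

    int main(void)
    {
        itn x;            /* lexical: 'itn' is a misspelling of the keyword 'int' */
        int y = (2 * (3;  /* syntax: unbalanced parentheses                       */
        short s = 100000; /* semantic: the value is truncated when stored         */
        return 0;
    }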

All errors that arise beyond these stages of the compilation process surface as run-time errors, since they have nothing to do with the source code.

Hence, the compiler translates the programming language into machine language. An open research issue would be to come up with ways of translating any language into another, thus eliminating the need to learn a foreign language.

SUMMARY

Languages are used for communication. A natural language defines a means of communication (for example, English, French, or Tamil) shared by a group of individuals without much restriction on their usage. On the other hand, a formal language has defined rules for the usage of the language.


The grammar of a language specifies the rules by which the constructs of the language can be formed.

    The components of languages are symbols, words, and sentences.

For every formal language, there exists a finite set of rules or grammar. Every finite grammar has a language associated with it.

    Notations are a convenient means of representing the components of a language and grammar.

    Chomsky has categorized the languages into four: (a) type 0 or unrestricted grammar, (b) type 1 or CSG, (c) type 2 or CFG, and (d) type 3 or RG.

Any programming language must be designed in such a way that it is easy and powerful enough for the user and practically possible to implement.

    Type 3 or RG is used to specify the tokens of the computer language, and type 2 or CFG is used to specify the structure of the sentence of the languages.

When a language whose structure closely resembles the computer system is developed, it is fast to execute but inflexible. When it is closer to human language, it is slow to execute, since there is a lot of overhead associated with translating human-like language to machine language.

Owing to the speed of modern hardware and the requirement of flexible languages in the business environment, it is preferred to write programs in high-level (3GL or 4GL) languages only, which necessitates a translator/compiler for converting them into machine language.

    Scripting languages are popular, as they are easy to understand and use.

    Special and unique languages are being developed to work with the Internet across the globe (HTML and XML).

It would be interesting to communicate with the computer by speaking to it. Research is underway on using computers to analyse and recognize speech.

    Programming languages have been evolving since the year 1950, with several capabilities being added over time. From FORTRAN to Java and C#, several useful features have been added.

Compiler development must relieve the programmer of the burden of remembering and checking syntax, semantic issues, and so on.

    Development of the compiler is broadly divided into analysis phase (front end) and synthesis phase (back end).

The various phases of a compiler are lexical analysis (scanner), syntax analysis (parser), semantic analysis, intermediate code generation, code optimization, and code generation.

    Tools are available for developing these phases to ease compiler development.

We have broadly outlined in this chapter programming language theory, the construction of its recognizer, that is, the compiler, and the stages of the compiler. Chapter 3 deals in detail with the first phase of the compiler (the scanner or lexical analyser).

OBJECTIVE TYPE QUESTIONS

1. Pick the odd one out:
   (a) Scanner  (b) Parser  (c) Intermediate code  (d) Code generator
2. Can we specify the format of the token using CFG?
   (a) Yes  (b) No
3. What grammar is preferable to specify the syntax of a language?
   (a) Type 0  (b) Type 1  (c) Type 2  (d) Type 3
4. What is the best grammar for specifying a natural language?
   (a) Type 0  (b) Type 1  (c) Type 2  (d) Type 3
5. Rules are related to
   (a) data  (b) information  (c) knowledge  (d) intelligence
6. Inference is related to
   (a) data  (b) information  (c) knowledge  (d) intelligence
7. What is the size of the language that a recursive grammar can generate?
   (a) Infinite  (b) Finite  (c) All of these  (d) None of these


8. Pick the odd one out:
   (a) Finite set of terminals  (b) Finite set of non-terminals  (c) Finite set of production rules  (d) Sequence of words
9. Which one has the same level of capabilities for a given language, and what is your choice of grammar from the implementation point of view?
   (a) Type 3  (b) Type 2  (c) All of these  (d) None of these
10. Which of the following grammars is more expressive?
   (a) Type 0  (b) Type 1  (c) Type 2  (d) Type 3
11. lvalue in an assignment statement always refers to a location in memory.
   (a) True  (b) False
12. Which of these languages is preferred for fast response?
   (a) 1GL  (b) 2GL  (c) 3GL  (d) 4GL
13. Compiled languages are inferior to interpreted languages with respect to response time.
   (a) Yes  (b) No
14. Is the memory requirement of a compiled language huge compared to an interpreted language?
   (a) Yes  (b) No
15. JVM stands for
   (a) Java virtual memory  (b) Java virtual model  (c) Java virtual machine  (d) Java virtual method
16. COBOL stands for
   (a) COmmon Based Object Language  (b) COmmon Basic Object Language  (c) COmmon Basic-Oriented Language  (d) COmmon Business-Oriented Language
17. The output of the syntax analyser/semantic analyser is
   (a) parse tree  (b) intermediate code  (c) sequence of tokens  (d) none of these
18. The representation of token type is by
   (a) integer  (b) float  (c) double  (d) char
19. When operands are of two types, one having 2 bytes of representation and the other 4 bytes, what is the size of the resulting answer in general?
   (a) 2 bytes  (b) 4 bytes
20. Without creating IC, the compiler cannot generate machine code.
   (a) True  (b) False

REVIEW QUESTIONS

1. Identify the following sentences as formal or informal language:
   (a) God is Love.
   (b) Find a watch in the market whose rate is less than 500.
   (c) Select pay details for the employees for the month of January 2010.
2. List out the characters in a language of your choice. Find out how many two-character words can be coined by using these characters.
3. Specify the rules for question 2.
4. Write down the rule for the format of an identifier in the C language.
5. Write down the grammar for type 0 grammar.
6. Write down the CFG for a Boolean expression.
7. Compare the features of CFG with CSG.
8. Refer to any book on programming in C# and identify the keywords; represent them in token pairs.
9. How do you select a high-level language for a given problem?
10. Give the choice of interpreted and compiled languages.
11. Enumerate the generations of languages.
12. When do you use the FORTRAN language?
13. For a business environment, which high-level language do you prefer?
14. You have to write a device driver (say, for a mouse). Do you prefer assembly language or C language? Justify.
15. Java is a platform-independent language: True or False? Justify your answer.


16. Under what circumstances will you use HTML and XML?
17. What are the features of the PHP language?
18. When do you use GUI-based applications?
19. What are the required steps in recognizing human speech? Search for some speech recognition articles before answering.
20. What are the advantages of compiled languages over interpreted languages?
21. What are the features of a good language?
22. What are the features of a good compiler?
23. List out some source languages that you can feed to a compiler.
24. List out the target languages that a compiler can generate.
25. How are the sequences of tokens represented?
26. Give an example for the while statement.
27. Give the grammar for a nested while statement. Refer to Example 2.5 for a single while statement.
28. Give the grammar for working with nested if-else statements. Do you find any difficulties in matching them?

EXERCISES

1. Using English grammar, give the derivation for the following sentences:
   (a) The cat drinks milk.
   (b) The monkey climbs a tree.
   (c) The player kicked the ball.
2. Consider a grammar with the following rules and the set of terminal symbols {0, 1}:
   A → 0A    A → A1    A → 1    S → 0S
   Describe the set of binary strings generated by this grammar.
3. List out any five strings derivable from the following grammars:
   (a) S → 0A1A, A → 1B, B → 1
   (b) S → 010, A → 0
   (c) A → 0A1, B → 0B
4. Parse the string bbba using the following grammar:
   (a) S → bS    B → C
   (b) S → ba    B → ba
   (c) S → A     B → aCC
   (d) A → a     C → CCa
   (e) A → Bb    C → b
5. Draw the derivation/production steps for the following arithmetic expressions using the grammar for the arithmetic expression in the text:
   (a) p + q
   (b) p * q + r
   (c) (p + q)
   (d) (p + q) * r
6. Identify the start symbol, sentential form, handle, and sentence in the derivation steps obtained in the previous question.
7. A binary language consisting of four 0s followed by eight 1s followed by four 0s is to be generated for a frame delimiter in a communication link. Write down the grammar for this.
8. Consider a code segment of the C language:

       int a, b;
       a = 10;
       b = 20;
       if (a = 20)
           a = a + 1;
       else
           a = a - 1;

   Is this syntactically and semantically right? Write down the value of the variable a after the execution of this code segment.
9. For the problem given in question 8, if the code segment in the if statement is slightly modified as

       if (a == 20)
           a = a + 1;
       else
           a = a - 1;

   what would be the value of a after the code gets executed? Compare the result with the previous question and comment.
10. There are four programming languages, namely, BASIC, C, Pascal, and FORTRAN, for which compilers have to be developed to work with three machine architectures: Intel, Motorola, and Zilog processors.
   (a) List out how many front ends and back ends of the compiler you have to work with.
   (b) Repeat the question for a compiler without using the IC, where the source code will be directly converted into machine code (of course, it is possible).
11. Write down a grammar for representing the following numbers:
   (a) 230
   (b) 145.23
   (c) 424
   (d) 667e40
12. Draw the syntax tree for the arithmetic expression 1 + 5 * (4 + 6).
13. Consider the following sentences and identify the scope for optimization:

       Pay1 = FixedPay + NoOfWorkingHours * Rate
       Pay2 = Pay1 + NoOfWorkingHours * Rate * bonusRateForHours

   These computations are to be carried out for all the employees of an organization. Comment on the gain in execution time if the codes are properly optimized.
14. In a processor, a register is to be initialized to zero. We have the following two codes to realize it. Which one would you prefer and why?
   (a) MOV A, 00   // store the value 0 into the register
   (b) XRA A       // A <- A XOR A
15. Generate the machine code for the following source code segment:

       TotalAmount = Principal + Principal * InterestRate

Answers to Objective Type Questions
1. (c)   2. (a)   3. (c)   4. (a)   5. (c)   6. (d)   7. (a)   8. (d)   9. (a)   10. (a)
11. (a)  12. (a)  13. (b)  14. (a)  15. (c)  16. (d)  17. (b)  18. (a)  19. (b)  20. (b)