topic 2: lexical analysis - uamarantxa.ii.uam.es/~modonnel/compilers/02_lexicalanalysis.pdf1...
TRANSCRIPT
Compilers
Topic 2: Lexical Analysis
Mick O'Donnell : [email protected]
2.1. Introduction
Introduction
• The Role of the Lexical Analyser
Source Code → Lexical Analyser → Syntactic Analyser → Semantic Analyser (the FRONT END)
• Also known as tokeniser or scanner.
• In Spanish, called analizador morfológico.
• Purpose: translation of the source code into a sequence of symbols.
• The symbols identified by the lexical analyser will be considered terminal symbols in the grammar used by the syntactic analyser.
Lexical Analyser
Introduction
“begin
int A;
A := 100;
A := A+A;
output A
end”
(reserved-word,begin)
(type, int)(<id>,A)(<symb>,;)
(<id>,A)(<mult-symb>,:=)
(<cons int>,100)(<symb>,;)
(<id>,A)(<mult-symb>,:=)
(<id>,A)(<symb>,+)(<id>,A)(<symb>,;)
(reserved-word,output)(<id>,A)
(reserved-word,end)
• Other tasks:
• Identification of lexical errors,
• e.g., starting an identifier with a digit where the language does not allow this: 2abc
• Deletion of white-space:
• Usually, the function of white-space is only to separate tokens.
• Exceptions: languages where whitespace indicates code blocks, e.g., Python:
if 1 == 2:
    print 1
print 2
• Deletion of comments: not relevant to execution of program.
What are Symbols?
• How do we determine what are the symbols of a given language?
• Case: Assume we have a language with assignment operator :=
• The ‘assignment statement’ has syntax:
STATEMENT → ID ASSIGNOP EXPR ‘;’
• The rule for ASSIGNOP could be:
ASSIGNOP → ‘:=’
…meaning ‘:=’ is a symbol, and thus a unit of lexical analysis.
• However, the rule might have been:
ASSIGNOP → ‘:’ ‘=’
…meaning ‘:’ and ‘=’ are two symbols for lexical analysis.
Drawing the border between symbols
A := 1 + 2
• General Rules:
• A symbol is a sequence of characters that cannot be separated from each other by white space.
• Symbols can be separated from other symbols by white space.
• With A := 1 + 2
• ‘:=’ can be separated from ‘A’ and ‘1’
• BUT ‘:’ cannot be separated from ‘=’
• Thus ‘:=’ should be treated as a symbol.
What token labels to use?
• To determine which token labels we assign to symbols, we first need to derive the syntactic grammar of the language.
• THEN, we extract out the terminal symbols of this grammar, which become the token labels in lexical analysis.
• This ensures that the labels assigned in lexical analysis are what we need in syntactic analysis.
• For example, we might assign the label “reserved_word” to both “begin” and “end”.
• But it is clear we cannot use such a label in parsing:
Program -> reserved_word Statement* reserved_word
• … would allow “end A=1 begin” as a program.
• Each token label has to reflect the different roles that the token class can serve in a program.
Determining the Token set
1 : <program> ::= begin <dcl train> ; <stm train> end
2 : <dcl train> ::= <declaration>
3 : | <declaration> ; <dcl train>
4 : <stm train> ::= <statement>
5 : | <statement> ; <stm train>
6 : <declaration>::= <mode> <idlist>
7 : <mode> ::= bool
8 : | int
9 : | ref <mode>
10 : <idlist> ::= <id>
11 : | <id> , <idlist>
12 : <statement> ::= <asgt stm>
13 : | <cond stm>
14 : | <loop stm>
15 : | <transput stm>
15 : | <case stm>
16 : | call <id>
17 : <asgt stm> ::= <id> := <exp>
18 : <cond stm> ::= if <exp> then <stm train> fi
19 : | if <exp> then <stm train> else <stm train> fi
Identifying the scope of the lexical analysis in the grammar of the language
20 : <loop stm> ::= while <exp> do <stm train> end
21 : | repeat <stm train> until <exp>
22 : <transput stm> ::= input <id>
23 : | output <exp>
24 : <exp> ::= <factor>
25 : | <exp> + <factor>
26 : | <exp> - <factor>
27 : | - <exp>
28 : <factor> ::= <primary>
29 : | <factor> * <primary>
30 : <primary> ::= <id>
31 : | <constant>
32 : | ( <exp> )
33 : | ( <compare> )
34 : <compare> ::= <exp> = <exp>
35 : | <exp> <= <exp>
36 : | <exp> > <exp>
Topic 2
One and Two Pass Lexical Analysis
• Identifies symbols and immediately assigns a token label to each symbol:
One Pass Lexical Analyser
“begin
int A;
A := 100;
A := A+A;
print A
end”
(begin,begin)(type, int) (id,A)
(semic,;) (id,A) (eqsgn,:=)
(int,100)(semic,;) (id,A)
(eqsgn,:=) (id,A) (symb,+) (id,A)
(semic,;)
(print,print) (id,A)
(end,end)
• In a two-pass lexical analyser:
• First pass groups characters into symbols
• Second pass assigns token labels to symbols
Two Pass Lexical Analysis
“begin
int A;
A := 100;
A := A+A;
print A
end”
(begin,begin)(type, int) (id,A)
(semic,;) (id,A) (eqsgn,:=)
(int,100)(semic,;) (id,A)
(eqsgn,:=) (id,A) (symb,+) (id,A)
(semic,;)
(print,print) (id,A)
(end,end)
“begin” “int” “A” “;” “A”
“:=” “100” “;” “A” “:=”
“A” “+” “A” “;” “print”
“A” “end”
• Most programming languages are designed such that the code can be segmented into tokens without any knowledge at all of the meaning of the token.
• Simple rules are adhered to:
• White-space ends a symbol
• Multiple white-space ignored
• identifiers contain only alphanumeric chars or _
• identifiers never start with a number
• a symbol starting with a number IS a number: 1, 34, 10.0
• Some chars are always a symbol by themselves: } { ; ( ) ,
• Mathematical chars can be solo or followed by ‘=’:
• =, >, <, +, -, /, *
• ==, >=, <=, +=, -=, /=, *=
• The first char of the symbol tells us which group it is in
• Identifier rules:
• Java: Consists of Unicode letters, _, $, 0-9.
Cannot start with 0-9.
• C: Consists of a-z A-Z 0-9 _
Cannot start with 0-9 or _
• Exceptions:
• Lisp:
• Identifier consists of a-z A-Z 0-9 _ + - * / @ $ = < > . etc.
• No restriction on starting char
• If char sequence can be interpreted as a number, it is
• Else it is an ‘identifier’
• E.g., ‘1+’ is an ‘identifier’
‘+1’ is a number
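The Lisp rule above can be sketched in Python. This is an illustration only: `classify` is a made-up helper, and using `float()` as the "can it be interpreted as a number?" test is an assumption, not Lisp's actual reader algorithm.

```python
def classify(symbol):
    # Lisp-style rule: if the character sequence can be
    # interpreted as a number, it IS a number; else it is an identifier.
    try:
        float(symbol)
        return 'number'
    except ValueError:
        return 'identifier'

print(classify('+1'))  # number
print(classify('1+'))  # identifier
```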
Topic 3
Methods of Lexical Analysis
Three main Approaches:
1) Ad-Hoc Coding : code is written to recognise each type of token.
2) Regular expressions: e.g.,
• float: “[0-9]*\.[0-9]+”
• Id: “[a-zA-Z_][a-zA-Z_0-9]*”
3) Context free grammar, e.g.,
Token :- Id | Int | Literal | …
Id :- Alfa | Alfa Id2
Id2 :- Alfa | Digit | Alfa Id2|Digit Id2
…
Lexical Analyser: Using grammars
Approaches to Lexical Analysis
Topic 2.1
Ad Hoc Coding of Lexical Analysis: Recognising Symbols
• Common approach (1):
• Human writes code to recognise the tokens of the source language:
Lexical Analyser
Two Pass Lexical Analysis with ad-hoc code
def tokenise():
    symbolList = []
    while not eof():
        # process next chars until end of symbol
        # add symbol to symbolList
        . . .
    return symbolList
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'whitespace': ...
            'alpha': ...
            'digit': ...
            etc.
    return symbolList
def type(char):
    if char in "a-zA-Z_": return 'alpha'
    if char in "0-9": return 'digit'
    if char in " \t\n": return 'whitespace'
    if char in "{};,": return 'sepchar'
    if char in "><=+-/*": return 'mathchar'
def tokenise():
    symbolList = []
    while not eof():
        case type(nextc):
            'alpha':  # alpha includes here '_'
                symbol = "" + getc()
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                symbolList.append(symbol)
            'whitespace': getc()
            'digit': ...
            ...
. . .
            'mathchar':  # = > < + - * /
                symbol = "" + getc()
                if nextc == '=':
                    symbol += getc()
                symbolList.append(symbol)
            'sepchar':  # { } ; ,
                symbol = "" + getc()
                symbolList.append(symbol)
            default: print("ERROR: Unknown Char: " + getc())
Numbers:
• Formats: 1, 34, 34.001, .0
• Procedure:
1) Read digits until we reach a nondigit
2) If nextchar is “.”, then read digits until we reach a nondigit
'digit':
    symbol = "" + getc()
    while nextc in "0123456789":
        symbol += getc()
    if nextc == ".":
        symbol += getc()
        while nextc in "0123456789":
            symbol += getc()
    symbolList.append(symbol)
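The branches above can be assembled into one runnable sketch. This is an illustration under stated assumptions: it scans a string instead of using `getc()`/`nextc`, and ‘:’ is added to the math-char set so that ‘:=’ is recognised (the slides' character list omits it).

```python
ALPHA = set('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_')
DIGITS = set('0123456789')
SEP = set('{};,()')
MATH = set('><=+-/*:')   # ':' added so ':=' is handled as a math-char pair

def tokenise(text):
    symbols, i, n = [], 0, len(text)
    while i < n:
        c = text[i]
        if c in ' \t\n':                       # whitespace: skip
            i += 1
        elif c in ALPHA:                       # identifier / reserved word
            j = i + 1
            while j < n and text[j] in ALPHA | DIGITS:
                j += 1
            symbols.append(text[i:j]); i = j
        elif c in DIGITS:                      # number, optionally with '.'
            j = i + 1
            while j < n and text[j] in DIGITS:
                j += 1
            if j < n and text[j] == '.':
                j += 1
                while j < n and text[j] in DIGITS:
                    j += 1
            symbols.append(text[i:j]); i = j
        elif c in MATH:                        # math char, optionally followed by '='
            j = i + 2 if i + 1 < n and text[i + 1] == '=' else i + 1
            symbols.append(text[i:j]); i = j
        elif c in SEP:                         # always a symbol by itself
            symbols.append(c); i += 1
        else:
            raise ValueError('Unknown char: ' + c)
    return symbols

print(tokenise("A := 100;"))  # ['A', ':=', '100', ';']
```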
Topic 2.2
Ad Hoc Coding of Lexical Analysis: Assigning Token Labels
• Second Stage: assigning token labels to symbols
1. Reserved words matched by comparison (or hash lookup):
If Symbol in RESERVED_WORDS: Token = symbol
2. Use regular expressions for user-supplied symbols:
• Int : “[0-9]+”
• Float : “[0-9]*\.[0-9]+”
• Id : “[a-zA-Z_][a-zA-Z0-9_]*”
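A sketch of this second stage in Python. The contents of RESERVED_WORDS and the fallback label 'symb' are illustrative assumptions; the three regular expressions follow the slides.

```python
import re

RESERVED_WORDS = {'begin', 'end', 'if', 'then', 'fi', 'while', 'do'}  # sample set

INT_RE   = re.compile(r'[0-9]+')
FLOAT_RE = re.compile(r'[0-9]*\.[0-9]+')
ID_RE    = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*')

def label(symbol):
    # 1. Reserved words matched by comparison (hash lookup)
    if symbol in RESERVED_WORDS:
        return symbol
    # 2. Regular expressions for user-supplied symbols
    if FLOAT_RE.fullmatch(symbol):
        return 'float'
    if INT_RE.fullmatch(symbol):
        return 'int'
    if ID_RE.fullmatch(symbol):
        return 'id'
    return 'symb'                 # e.g. ':=', ';', '+'

print(label('begin'), label('100'), label('34.001'), label('A'))
```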
Two Pass Lexical Analyser
Topic 2.2
Ad Hoc Coding of Lexical Analysis:
Single Pass Approach
The code shown earlier was for recognising symbols.
• Symbols of different types were recognised in different parts.
• We can use this to simplify token labelling
• The specific code used to identify the symbol knows if it is a number, alphanumerical, mathematical or separator.
• Thus, we can use this code to assign token label as well
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha': ...
            'digit': ...
            'sepchar': ...
            'mathchar': ...
        symbolList.append([token, symbol])
    return symbolList
Single Pass Lexical Analyser
def tokenise():
    symbolList = []
    while not eof():
        while nextc in WHITE_SPACE_CHARS: getc()
        symbtype = type(nextc)
        symbol = "" + getc()
        case symbtype:
            'alpha':  # alpha includes here '_'
                while type(nextc) in ['alpha', 'digit']:
                    symbol += getc()
                if symbol in RESERVED_WORDS:
                    token = symbol
                else:
                    token = 'id'
            ...
        symbolList.append([token, symbol])
    return symbolList
Topic 2.3
Lexical Analysis using Regular Expressions
• The previous section looked at lexical analysis informally, just in terms of a computer program written by hand to recognise the tokens of a language.
• The rules of lexical syntax are only represented implicitly in the code. One has to interpret the code to see that an identifier must start with an alpha char.
• In earlier days, this was sufficient.
• However, there are problems with this approach:
• Portability: a change in syntax requires editing of the source code (it may be better to state the lexical structure in an external data file, requiring no need to edit the source code)
• Difficult to prove that the tokenising code actually conforms to the specification of the language – does it do as it should?
• This section will explore lexical analysis from a more formal perspective
Lexical analysis using regular expressions and grammars
• One approach is to describe the tokens in terms of regular expressions
• A program can read in these regular expressions and generate code to perform the tokenisation
• FLEX and LEX
Lexical analysis using regular expressions
• LEX and YACC (Yet Another Compiler Compiler) are often used to build a compiler quickly
• One does not write the lexical analyser directly, just the patterns to recognise tokens
Standard Compiler Architecture
Lexical rules → Lex → MyLexAnalyser
Syntactic rules → Yacc → MySynAnalyser
Source Code → MyLexAnalyser → MySynAnalyser
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+              { printf("(integer, '%s')", yytext); }
{DIGIT}+"."{DIGIT}*   { printf("(float, '%s')", yytext); }
if|then|begin|end|procedure|function   { printf("(%s, '%s')", yytext, yytext); }
{ID}                  printf("(id, '%s')", yytext);
"+"|"-"|"*"|"/"       printf("(mathop, '%s')", yytext);
[ \t\n]+              /* eat up whitespace */
.                     printf("Unrecognized character: %s\n", yytext);
Flex input for a simple tokeniser
• Flex generates C code to proceed char by char through the input text.
• When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns.
• If it finds more than one match, it takes the one matching the most text.
• Thus “23.56” will be recognised as a float, not an integer:
{DIGIT}+              { printf("(integer, '%s')", yytext); }
{DIGIT}+"."{DIGIT}*   { printf("(float, '%s')", yytext); }
• Once the match is determined, the text corresponding to the match is made available in the global character pointer yytext.
• The action corresponding to the matched pattern is then executed.
• After recognising a token, input scanning for all patterns restarts from that point (any partially matched pattern is discarded).
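The longest-match strategy can be emulated in plain Python. This is a sketch of the idea, not Flex's generated code; the rule set mirrors the Flex input above, and on a length tie the earlier rule wins, as in Flex.

```python
import re

# Ordered (pattern, label) pairs, mirroring the Flex rules above.
RULES = [
    (re.compile(r'[0-9]+\.[0-9]*'), 'float'),
    (re.compile(r'[0-9]+'), 'integer'),
    (re.compile(r'[a-z][a-z0-9]*'), 'id'),
    (re.compile(r'[+\-*/]'), 'mathop'),
    (re.compile(r'[ \t\n]+'), None),          # whitespace: no token
]

def scan(text):
    tokens, i = [], 0
    while i < len(text):
        best, best_label = None, None
        for pattern, lbl in RULES:
            m = pattern.match(text, i)
            # longest match wins; on a tie, the earlier rule wins (as in Flex)
            if m and (best is None or m.end() > best.end()):
                best, best_label = m, lbl
        if best is None:
            raise ValueError('Unrecognized character: ' + text[i])
        if best_label is not None:
            tokens.append((best_label, best.group()))
        i = best.end()
    return tokens

print(scan("23.56"))  # [('float', '23.56')]
```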
Flex processing
Topic 2.4
Lexical Analysis using CFGs
(Context Free Grammars)
Using Grammars for token recognition
• More formally, one can describe the possible tokens of a language using a context-free grammar:
<Token> ::= <Id> | <ReservedWd> | <Number> | ...
<Id> ::= <Letter> | <Letter><IdC>
<IdC> ::= <Letter> | <Digit> | <Letter><IdC> | <Digit><IdC>
• This has the advantage that both the syntactic description of the language and its lexical description are in the same formalism.
• Processing of such grammars is however slower than for regular expressions.
Using a CFG
Using Grammars for token recognition
• However, some context free grammars can be automatically translated into right-regular grammars, where rules are of two forms ( ‘a’ is a terminal; ‘A’ is a nonterminal ):
• A → a
• A → aB
• A grammar in such a form can then be represented as a deterministic finite automaton (DFA), which allows efficient processing of the input.
• A DFA is a finite state machine where, for each pair of state and input symbol, there is one and only one transition to a next state.
Deterministic Finite Automata
To derive a right regular grammar from full context free grammar:
1. For each rule whose RHS starts with a nonterminal, replace the nonterminal with its expansion(s)
E.g., starting from:
Number ::= Integer
Integer ::= IntegerSS | - IntegerSS
IntegerSS ::= digit | digit IntegerSS
…replacing Integer in the Number rule:
Number ::= IntegerSS | - IntegerSS
IntegerSS ::= digit | digit IntegerSS
…then replacing the leading IntegerSS by its expansions:
Number ::= digit | digit IntegerSS | - IntegerSS
IntegerSS ::= digit | digit IntegerSS
Deriving a right-regular grammar
To derive a right regular grammar from full context free grammar:
1. For each rule whose RHS starts with a nonterminal, replace the nonterminal with its expansion(s)
2. At the end of replacements, eliminate any rule which cannot be reached from the START symbol.
e.g., assume the grammar has start symbol Token:
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | …
We only preserve nonterminals referenced in this rule, or referenced in the nonterminals it contains, etc.
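Step 2 amounts to a reachability pass over the grammar. A minimal sketch, assuming a dict-of-alternatives representation (the sample grammar here is a toy, not the full token grammar):

```python
def reachable_rules(grammar, start):
    # grammar: dict mapping nonterminal -> list of alternatives,
    # each alternative a list of symbols (terminals or nonterminals)
    seen, stack = {start}, [start]
    while stack:
        for alt in grammar[stack.pop()]:
            for sym in alt:
                if sym in grammar and sym not in seen:
                    seen.add(sym)
                    stack.append(sym)
    # keep only rules reachable from the start symbol
    return {nt: alts for nt, alts in grammar.items() if nt in seen}

g = {
    'Token':  [['Id'], ['Number']],
    'Id':     [['letter'], ['letter', 'IdC']],
    'IdC':    [['letter'], ['digit']],
    'Number': [['digit']],
    'Unused': [['x']],            # unreachable: will be eliminated
}
print(sorted(reachable_rules(g, 'Token')))  # ['Id', 'IdC', 'Number', 'Token']
```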
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
<Id> ::= <Letter> | <Letter> <IdC>
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<Number> ::= <Integer> | <Real>
<Integer> ::= <IntegerSS> | - <IntegerSS>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
<Real> ::= <FixedPoint> | <FixedPoint> <Exponent>
<FixedPoint> ::= <Integer> . <IntegerSS> | . <IntegerSS> | <Integer> .
<Exponent> ::= E <Integer>
<StringLit> ::= "" | " <CharSeq> "
<CharSeq> ::= <Character> | <Character> <CharSeq>
<CharLit> ::= ' <Character> '
<SS> ::= + | - | * | / | = | < | > | ( | )   /* Simple Symbol */
<MS> ::= =+ | != | <= | >= | ++ | --         /* Multiple Symbol */
<Character> ::= <Letter> | <Digit> | <SS> | ! | . | , | b | \' | \" | \n
<Letter> ::= A | B | ... | Z | a | b | ... | z
<Digit> ::= 0 | 1 | ... | 9
A sample CFG for tokens
<Token> ::= <Letter>
| <Letter> <IdC>
| <ReservedWd>
| <Digit>
| <Digit> <IntegerSS>
| - <IntegerSS>
| <Digit> . <IntegerSS>
| <Digit> <IntegerSS> . <IntegerSS>
| - <IntegerSS> . <IntegerSS>
| . <IntegerSS>
| <Digit> . | <Digit> <IntegerSS> .
| - <IntegerSS> .
…
| ""
| " <CharSeq> "
| ' <Character> '
| + | - | * | / | = | < | > | ( | ) | =+ | != | <= | >= | ++ | --
<IdC> ::= <Letter> | <Digit> | <Letter> <IdC> | <Digit> <IdC>
<IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
Same grammar converted to a RRG (almost)
RRG converted to a DFA
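As a minimal illustration of the idea (not the full token DFA), the right-regular rules for <IntegerSS> correspond to a two-state automaton that can be run as a table lookup; the state names and table encoding are assumptions for this sketch:

```python
# DFA for <IntegerSS> ::= <Digit> | <Digit> <IntegerSS>
# States: 'S' (start), 'D' (accepting: one or more digits seen)
TRANSITIONS = {('S', 'digit'): 'D', ('D', 'digit'): 'D'}
ACCEPTING = {'D'}

def accepts(text):
    state = 'S'
    for ch in text:
        kind = 'digit' if ch.isdigit() else 'other'
        state = TRANSITIONS.get((state, kind))
        if state is None:      # no transition defined: reject
            return False
    return state in ACCEPTING

print(accepts('123'), accepts('12a'))  # True False
```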
A more complex example
Complex Grammar of ASPLE (lexical and syntactic)
(Rules 1–36 are as listed earlier; the grammar continues:)
37 : <constant> ::= <Boolean constant>
38 : | <int constant>
39 : <Boolean constant> ::= true
40 : | false
41 : <int constant> ::= <number>
42 : <number> ::= <digit>
43 : | <number> <digit>
44 : <id> ::= <letter>
45 : | <letter><rest id>
46 : <rest id> ::= <alphanumeric>
47 : | <alphanumeric><rest id>
46 : <digit> ::= 0 | 1 | ... | 9
47 : <letter> ::= a | b | ... | z | A | B | ... | Z
48 : <case stm> ::= case ( <exp> ) <constant case train> esac
49 : <constant case train> ::= <constant case>
50 : | <constant case> <constant case train>
51 : <constant case> ::= <int constant> : <stm train>
52 : <alphanumeric> ::= <digit>
53 : | <letter>
54 : <procedures> ::= <procedure> <procedures>
55 : |
56 : <procedure> ::= procedure <id> begin <stm train> end
• We create a new nonterminal to represent the tokens we wish to recognise:
<Token> ::= <Id> | <ReservedWd> | <Number> | <StringLit> | <CharLit> | <SS> | <MS>
• We then derive from this an RRG.
<Token> ::= begin | ; | end | bool | int | ref | ,
| call | := | if | then | fi | else
| while | do | repeat | until | input
| output | + | - | * | ( | ) | = | <= | >
| case | esac | : | procedure
| true | false | 0 | ··· | 9 | 0 <int constant> | ···
| 9 <int constant>
| A | ··· | Z | a | ··· | z
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>
<int constant> ::= 0 | ··· | 9 | 0 <int constant> | ···
| 9 <int constant>
<rest id> ::= A | ··· | Z | a | ··· | z | 0 | ··· | 9
| 0 <rest id> | ··· | 9 <rest id>
| A <rest id> | ··· | Z <rest id>
| a <rest id> | ··· | z <rest id>
The Right-regular grammar
Graph associated to the grammar
[Transition diagram omitted. From the start state: the reserved words and symbols (begin, end, bool, int, ref, call, if, then, fi, else, while, do, repeat, until, input, output, case, esac, procedure, true, false, ;, ,, +, -, *, (, ), =, :=, <=, >, :) lead directly to the accepting state; a digit 0,...,9 leads to the <int constant> state, which loops on digits; a letter A,...,Z, a,...,z leads to the <rest id> state, which loops on letters and digits; both states have λ-transitions to the accepting state.]
• ASPLE just provides a few data types, but there are others which are also very common:
• Real numbers, e.g. 3.45, .44, -5., 3.45E2, .44E-2, -5.E123
<real> ::= <fixed point>
         | <fixed point><exponent>
<integer> ::= <int constant>
            | - <int constant>
<fixed point> ::= <integer> . <int constant>
                | . <int constant>
                | <integer> .
<exponent> ::= E <integer>
• Character and strings, e.g. "", "hello world", 'a', ..., 'z'
<literal> ::= "" | " <string> "
<character> ::= ' <symbol> '
<string> ::= <symbol> | <symbol><string>
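The <real> grammar above corresponds to a single regular expression. A sketch in Python, following the grammar as given (fixed point with optional sign, optional E<integer> exponent); the name REAL_RE is of course an assumption:

```python
import re

# fixed point: digits '.' optional-digits, or '.' digits;
# optional leading '-'; optional exponent E followed by an
# (optionally negative) integer
REAL_RE = re.compile(r'-?([0-9]+\.[0-9]*|\.[0-9]+)(E-?[0-9]+)?$')

for s in ['3.45', '.44', '-5.', '3.45E2', '.44E-2', '-5.E123']:
    print(s, bool(REAL_RE.match(s)))   # all True
```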
Other patterns
Topic 3
Semantic Actions in Lexical Analysis
• The compiler can delegate some semantic tasks to the lexical analyser:
• Storing the information about the identifiers in the symbols table.
• Calculating the numeric values (in binary code) for each numeric constant.
• etc.
• These tasks vary according to:
• The objectives of the translators / interpreters.
• The division of tasks between the different components in the translator / interpreter.
Lexical analyser: Semantic actions
Previous concepts
• These actions are sometimes expressed by inserting actions between the symbols in the rules. For instance:
<id>::= actionf0 <letter> actionf1 actionf2 <rest id> actionf3
<rest id>::= <letter> actionf1 actionf2 <rest id>
| <digit> actionf1 actionf2 <rest id>
| λ
where
• actionf0 might be: initialise a counter
• actionf1 add 1 to the counter
• actionf2 copy the character which has just been recognised inside a buffer.
• actionf3 add to the buffer an end-of-string mark. Check that the number of characters is not higher than the maximum length allowed. If this happens, notify the error. Otherwise, insert the identifier in the symbols table and return a pointer to the element inside the table.
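In code form, the four actions interleave with the recognition loop roughly as follows. MAX_ID_LENGTH and the dict-based symbol table are illustrative assumptions, not part of the grammar:

```python
MAX_ID_LENGTH = 32          # assumed maximum identifier length
symbol_table = {}           # maps identifier -> symbol-table entry

def scan_id(text, i):
    """Recognise <id> starting at position i, running the semantic
    actions f0-f3 inline as the grammar symbols are matched."""
    count = 0                        # actionf0: initialise a counter
    buffer = []
    while i < len(text) and (text[i].isalnum() or text[i] == '_'):
        count += 1                   # actionf1: add 1 to the counter
        buffer.append(text[i])       # actionf2: copy the character into a buffer
        i += 1
    # actionf3: terminate the string, check the length, and insert
    # the identifier in the symbol table, returning its entry
    name = ''.join(buffer)
    if count > MAX_ID_LENGTH:
        raise ValueError('identifier too long: ' + name)
    entry = symbol_table.setdefault(name, {'name': name})
    return entry, i

print(scan_id('abc1 := 2', 0))  # ({'name': 'abc1'}, 4)
```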
Semantic actions
• Another example:
<integer>::= actiong0 <int constant>
| actiong0 -<int constant> actiong2
<int constant>::= <digit> actiong1
| <digit> actiong1 <int constant>
• where
• actiong0 can be the initialisation of an integer variable
value←0
• actiong1 performs the calculation of the value of the number which has been read so far:
value ← (10 * value) + value(digit)
• actiong2 changes the sign of the value calculated:
value← -value
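The same actions in runnable form, a sketch in which value(digit) is realised as a table lookup and the function scans a string directly:

```python
DIGIT_VALUE = {str(d): d for d in range(10)}

def scan_integer(text, i):
    """Compute the value of <integer> while scanning it (actions g0-g2)."""
    value = 0                        # actiong0: initialise the value
    negative = text[i] == '-'
    if negative:
        i += 1
    while i < len(text) and text[i].isdigit():
        # actiong1: value <- (10 * value) + value(digit)
        value = 10 * value + DIGIT_VALUE[text[i]]
        i += 1
    if negative:
        value = -value               # actiong2: change the sign
    return value, i

print(scan_integer('-123;', 0))  # (-123, 4)
```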
Three main Approaches:
1) Ad-Hoc Coding
2) Regular expressions: e.g.,
• float: “[0-9]*\.[0-9]+”
• Id: “[a-zA-Z_][a-zA-Z_0-9]*”
3) Context free grammar, converted to RRG and DFA
Token :- Id | Int | Literal | …
Id :- Alfa | Alfa Id2
Id2 :- Alfa | Digit | Alfa Id2 | Digit Id2
…
Lexical Analyser: Summary