compiler construction in4020 – lecture 2
![Page 1: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/1.jpg)
Compiler construction in4020 – lecture 2
Koen Langendoen
Delft University of Technology
The Netherlands
![Page 2: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/2.jpg)
Summary of lecture 1
• compiler is a structured toolbox
• front-end: program text → annotated AST
• back-end: annotated AST → executable code
• lexical analysis: program text → tokens
  • token specifications
  • implementation by hand
[Figure: compiler structure: program in some source language → front-end (analysis) → semantic representation → back-end (synthesis) → executable code for target machine]
![Page 3: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/3.jpg)
Quiz
2.7 What does the regular expression a?* mean? And a** ?
Are these expressions erroneous?
Are they ambiguous?
![Page 4: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/4.jpg)
Overview
• Generating a lexical analyzer
  • generic methods
  • specific tool lex
[Figure: program text → lexical analysis → tokens → syntax analysis → AST → context handling → annotated AST; a scanner generator derives the lexical analyzer from a token description]
![Page 5: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/5.jpg)
Token description
• (f)lex: scanner generator for UNIX
  • token description → C code
• format of the lex input file:
definitions      (regular descriptions)
%%
rules            (regular expressions + actions)
%%
user code        (auxiliary C-code)
![Page 6: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/6.jpg)
Lex description to recognize integers
• an integer is a non-empty sequence of digits, optionally followed by a letter denoting the base class (b for binary and o for octal).
  • base → [bo]
  • integer → digit+ base?
• rule = expr + action
• {} signal application of a description
%{
#include "lex.h"
%}
base [bo]
digit [0-9]
%%
{digit}+{base}?   { return INTEGER; }
%%
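For comparison, the same token can be recognized by hand. The following is a minimal sketch, not what lex generates; the function name `match_integer` and its "length of matched prefix" convention are assumptions made for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns the length of the longest prefix of s
 * that matches  digit+ base?  (base is 'b' or 'o'), or 0 if none. */
size_t match_integer(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;   /* digit+ */
    if (n == 0) return 0;                        /* need at least one digit */
    if (s[n] == 'b' || s[n] == 'o') n++;         /* base? */
    return n;
}
```

Calling `match_integer("101b;")` consumes the three digits and the base letter and stops at the semicolon.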
![Page 7: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/7.jpg)
Lex: resulting C-code
• char yytext[]; /* token representation */
• int yylex(void); /* returns type of next token */
• wrapper function to add token attributes
%%
\n {line_number++;}
%%
void get_next_token(void) {
Token.class = yylex();
if (Token.class == 0) {
Token.class = EOF;
Token.repr = "<EOF>";
return;
}
Token.pos.line_number = line_number;
Token.repr = strdup(yytext);
}
![Page 8: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/8.jpg)
Automatic generation
[Figure: the scanner generator turns the token description into a finite state automaton that performs lexical analysis. Example automaton: S0 --digit--> S1, S1 --digit--> S1, S0 --‘.’--> S2, S1 --‘.’--> S2, S2 --digit--> S3, S3 --digit--> S3]
![Page 9: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/9.jpg)
Finite-state automaton
• Recognize input character by character
• Transfer between states
• FSA
  • initial state S0
  • set of accepting states
  • transition function: State × Char → State
[Figure: S0 --‘i’--> S1 --‘f’--> S2]
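The three ingredients of an FSA map directly onto code. The sketch below encodes the small automaton for the keyword ‘if’ (S0, then S1 after ‘i’, then S2 after ‘f’); the explicit `ERR` state and the function names are assumptions added for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum state { S0, S1, S2, ERR };   /* ERR: assumed sink for missing transitions */

/* Transition function: State x Char -> State. */
static enum state next_state(enum state s, char ch) {
    if (s == S0 && ch == 'i') return S1;
    if (s == S1 && ch == 'f') return S2;
    return ERR;
}

/* Feeds the input to the FSA character by character, starting in S0. */
bool accepts_if(const char *input) {
    assert(input != NULL);
    enum state s = S0;
    for (size_t i = 0; input[i] != '\0' && s != ERR; i++)
        s = next_state(s, input[i]);
    return s == S2;               /* S2 is the only accepting state */
}
```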
![Page 10: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/10.jpg)
FSA examples
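• integral_number → [0-9]+
[Figure: S0 --digit--> S1, S1 --digit--> S1; S1 accepting]
• fixed_point_number → [0-9]* ‘.’ [0-9]+
[Figure: S0 --digit--> S0, S0 --‘.’--> S2, S2 --digit--> S3, S3 --digit--> S3; S3 accepting]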
![Page 11: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/11.jpg)
Concurrent recognition
• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• recognize both tokens in one pass
[Figure: the separate automata for integral_number (S0, S1) and fixed_point_number (S0, S2, S3)]
![Page 12: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/12.jpg)
Concurrent recognition
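• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• naïve approach: merge initial states
[Figure: merged automaton, ambiguous on digit in S0: S0 --digit--> S0, S0 --digit--> S1, S1 --digit--> S1, S0 --‘.’--> S2, S2 --digit--> S3, S3 --digit--> S3]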
![Page 13: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/13.jpg)
Concurrent recognition
• integral_number → [0-9]+
• fixed_point_number → [0-9]* ‘.’ [0-9]+
• correct approach: share common prefix transitions
[Figure: S0 --digit--> S1, S1 --digit--> S1, S0 --‘.’--> S2, S1 --‘.’--> S2, S2 --digit--> S3, S3 --digit--> S3]
![Page 14: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/14.jpg)
FSA implementation:transition table
• concurrent recognition of integers and fixed point numbers

state   digit   dot   other   recognized token
S0      S1      S2    -
S1      S1      S2    -       integer
S2      S3      -     -
S3      S3      -     -       fixed point

[Figure: S0 --digit--> S1, S1 --digit--> S1, S0 --‘.’--> S2, S1 --‘.’--> S2, S2 --digit--> S3, S3 --digit--> S3]
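A transition table like this one can drive a generic scanner loop. The sketch below is one possible encoding in C; the names `scan`, `char_class` and `NONE`, and the choice to return the longest token seen so far, are assumptions for this illustration rather than part of the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { S0, S1, S2, S3, NONE };            /* NONE encodes the '-' entries */
enum { DIGIT, DOT, OTHER };               /* character classes (columns)  */
enum token { NO_TOKEN, INTEGER, FIXED_POINT };

static const int next[4][3] = {
    /*          digit  dot   other */
    /* S0 */ {  S1,    S2,   NONE },
    /* S1 */ {  S1,    S2,   NONE },
    /* S2 */ {  S3,    NONE, NONE },
    /* S3 */ {  S3,    NONE, NONE },
};
static const enum token accepts[4] = { NO_TOKEN, INTEGER, NO_TOKEN, FIXED_POINT };

static int char_class(char ch) {
    if (ch >= '0' && ch <= '9') return DIGIT;
    return ch == '.' ? DOT : OTHER;
}

/* Scans the longest token at the start of s; stores its length in *len. */
enum token scan(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    enum token last = NO_TOKEN;
    int state = S0;
    *len = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        state = next[state][char_class(s[i])];
        if (state == NONE) break;          /* no transition: stop scanning */
        if (accepts[state] != NO_TOKEN) {  /* remember last accepting point */
            last = accepts[state];
            *len = i + 1;
        }
    }
    return last;
}
```

Remembering the last accepting position means that an input such as `7.` still yields the integer `7`, even though the automaton has already moved on to S2.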
![Page 15: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/15.jpg)
FSA exercise (6 min.)
• draw an FSA to recognize integers
  base → [bo]
  integer → digit+ base?
• draw an FSA to recognize the regular expression (a|b)*bab
![Page 17: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/17.jpg)
Answers
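• integer
[Figure: S0 --digit--> S1, S1 --digit--> S1, S1 --[bo]--> S2; accepting states S1 and S2]
• (a|b)*bab
[Figure: S0 --a--> S0, S0 --b--> S1, S1 --b--> S1, S1 --a--> S2, S2 --a--> S0, S2 --b--> S3, S3 --a--> S2, S3 --b--> S1; accepting state S3]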
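One way to check a hand-drawn automaton for (a|b)*bab is to encode it as a table and run it on a few strings. The sketch below assumes the four-state deterministic automaton in which S1 means "just saw b", S2 "just saw ba" and S3 "just saw bab"; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'. */
static const int delta[4][2] = {
    /* S0 */ { 0, 1 },
    /* S1 */ { 2, 1 },
    /* S2 */ { 0, 3 },
    /* S3 */ { 2, 1 },
};

/* True if s consists only of a's and b's and ends in "bab". */
bool matches_bab_suffix(const char *s) {
    assert(s != NULL);
    int state = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] != 'a' && s[i] != 'b') return false;  /* outside alphabet */
        state = delta[state][s[i] == 'b'];
    }
    return state == 3;                                  /* S3 is accepting */
}
```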
![Page 18: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/18.jpg)
Break
![Page 19: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/19.jpg)
Automatic generation: description → FSA
• start with initial set (S0) of all token descriptions to be recognized
• for each character (ch)
  • find the set (Sch) of descriptions that can start with ch
  • extend the FSA with transition (S0, ch, Sch)
• repeat adding transitions (to Sch) until no new set is generated
![Page 20: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/20.jpg)
Dotted items
• keeping track of matched characters in a token description: T → α ● β
  • T: token; α β: its regular expression
  • α: part already matched by the input
  • β: part still to be matched
![Page 21: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/21.jpg)
Types of dotted items
• shift item: dot in front of a basic pattern
  • if → ● ‘i’ ‘f’
  • if → ‘i’ ● ‘f’
  • identifier → ● [a-z] [a-z0-9]*
• reduce item: dot at the end
  • if → ‘i’ ‘f’ ●
  • identifier → [a-z] [a-z0-9]* ●
• non-basic item: dot in front of repeated pattern or parenthesis
  • identifier → [a-z] ● [a-z0-9]*
![Page 23: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/23.jpg)
Character moves
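• shift items: T → α ● c β, T → α ● [class] β, T → α ● . β
• a character move on input character c advances the dot over the matching basic pattern:
  • T → α ● c β  ⇒  T → α c ● β
  • T → α ● [class] β  ⇒  T → α [class] ● β   (if c is in class)
  • T → α ● . β  ⇒  T → α . ● β   (any c)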
![Page 24: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/24.jpg)
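ε moves
• T → α ● (R)? β  ⇒  T → α (R)? ● β  and  T → α (● R)? β
• T → α (R ●)? β  ⇒  T → α (R)? ● β
• T → α ● (R)* β  ⇒  T → α (R)* ● β  and  T → α (● R)* β
• T → α (R ●)* β  ⇒  T → α (R)* ● β  and  T → α (● R)* β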
![Page 25: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/25.jpg)
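ε moves
• T → α ● (R)+ β  ⇒  T → α (● R)+ β
• T → α (R ●)+ β  ⇒  T → α (R)+ ● β  and  T → α (● R)+ β
• T → α ● (R1|R2|…) β  ⇒  T → α (● R1|R2|…) β,  T → α (R1|● R2|…) β,  …
• T → α (R1 ●|R2|…) β  ⇒  T → α (R1|R2|…) ● β   (likewise for R2, …)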
![Page 26: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/26.jpg)
FSA construction
• a state corresponds to a set of basic items
• a character move yields a new set
• expand non-basic items into basic items using ε moves
• see if the resulting set was produced before, if not introduce a new state
• add transition
![Page 27: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/27.jpg)
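Example FSA construction
• tokens
  • integer: I → (D)+
  • fixed-point: F → (D)* ‘.’ (D)+
• initial state: the items with the dot at the start
  • I → ● (D)+
  • F → ● (D)* ‘.’ (D)+
• ε moves expand these into the basic items of S0
  • I → (● D)+
  • F → (● D)* ‘.’ (D)+
  • F → (D)* ● ‘.’ (D)+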
![Page 28: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/28.jpg)
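Example FSA construction
• character moves
S0: I → (● D)+ ; F → (● D)* ‘.’ (D)+ ; F → (D)* ● ‘.’ (D)+
S1: I → (D ●)+ ; F → (D ●)* ‘.’ (D)+ ; expanded by ε moves to I → (D)+ ● ; I → (● D)+ ; F → (● D)* ‘.’ (D)+ ; F → (D)* ● ‘.’ (D)+
S2: F → (D)* ‘.’ ● (D)+ ; expanded by ε moves to F → (D)* ‘.’ (● D)+
S3: F → (D)* ‘.’ (D ●)+ ; expanded by ε moves to F → (D)* ‘.’ (D)+ ● ; F → (D)* ‘.’ (● D)+
transitions: S0 --D--> S1, S1 --D--> S1, S0 --‘.’--> S2, S1 --‘.’--> S2, S2 --D--> S3, S3 --D--> S3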
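The construction for the tokens I and F can be sketched as a worklist over item sets. In the sketch below each basic item is one bit of a bitmask, and the combined effect of a character move followed by ε moves is precomputed by hand in the `move` table; that encoding, the item numbering and all names are assumptions made to keep the example short, not the way a scanner generator has to do it.

```c
#include <assert.h>
#include <stddef.h>

enum { DIGIT, DOT, NCLASS };
enum { NITEMS = 6, MAXSTATES = 1 << NITEMS };

/* Basic items, numbered 0..5 (● marks the dot):
 *   0: I -> (● D)+              1: I -> (D)+ ●             (reduce)
 *   2: F -> (● D)* '.' (D)+     3: F -> (D)* ● '.' (D)+
 *   4: F -> (D)* '.' (● D)+     5: F -> (D)* '.' (D)+ ●    (reduce)
 * move[i][c]: bitmask of basic items reached from item i by a character
 * move on c followed by all applicable epsilon moves (worked out by hand). */
static const unsigned move[NITEMS][NCLASS] = {
    { 0x03, 0 }, { 0, 0 }, { 0x0C, 0 }, { 0, 0x10 }, { 0x30, 0 }, { 0, 0 },
};

unsigned states[MAXSTATES];      /* item set of each generated state */
int trans[MAXSTATES][NCLASS];    /* target state, or -1 for no transition */

/* Builds the FSA from the initial item set; returns the number of states. */
int build_fsa(unsigned initial) {
    assert(initial != 0);
    int nstates = 0;
    states[nstates++] = initial;
    for (int s = 0; s < nstates; s++) {           /* states not yet processed */
        for (int c = 0; c < NCLASS; c++) {
            unsigned target = 0;
            for (int i = 0; i < NITEMS; i++)
                if (states[s] & (1u << i)) target |= move[i][c];
            trans[s][c] = -1;
            if (target == 0) continue;            /* no item can shift c */
            int t = 0;
            while (t < nstates && states[t] != target) t++;  /* seen before? */
            if (t == nstates) states[nstates++] = target;    /* new state */
            trans[s][c] = t;
        }
    }
    return nstates;
}
```

Starting from the item set of S0 (items 0, 2 and 3, bitmask `0x0D`), `build_fsa` produces four states numbered in the order in which they are discovered.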
![Page 29: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/29.jpg)
Exercise FSA construction (7 min.)
• draw the FSA (with item sets) for recognizing an identifier:
  identifier → letter (letter_or_digit_or_und* letter_or_digit+)?
• extend the above FSA to recognize the keyword ‘if’ as well.
  if → ‘i’ ‘f’
![Page 31: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/31.jpg)
Answers
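• item sets (L = letter, LD = letter or digit, U = underscore, LDU = letter, digit or underscore)
S0: ID → ● L ((LDU)* (LD)+)?
S1: ID → L ((LDU)* (LD)+)? ● ; ID → L ((● LDU)* (LD)+)? ; ID → L ((LDU)* (● LD)+)?
S2: ID → L ((● LDU)* (LD)+)? ; ID → L ((LDU)* (● LD)+)?
S3, S4: states added for the keyword ‘if’: S0 --‘i’--> S3 --‘f’--> S4
transitions: S0 --L--> S1; S1, S2, S3, S4 --LD--> S1 (except S3 --‘f’--> S4); S1, S2, S3, S4 --U--> S2
accepting states S1 and S4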
![Page 32: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/32.jpg)
Transition table compression
state   ‘i’   ‘f’   L    D    U    recognized token
S0      S3    S1    S1   -    -
S1      S1    S1    S1   S1   S2   identifier
S2      S1    S1    S1   S1   S2
S3      S1    S4    S1   S1   S2
S4      S1    S1    S1   S1   S2   keyword if
![Page 33: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/33.jpg)
Transition table compression
• redundant rows
• empty transitions
[Figure: row displacement: the rows of S0 to S4 are shifted and overlaid in a single array]
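Row displacement can be sketched as follows: every row gets an offset into one shared array, chosen so that its non-empty entries land on free slots, and a parallel array records which row owns each slot. The first-fit placement and the names `entry`, `owner`, `base`, `pack` and `lookup` are assumptions for this illustration; the table is the one for identifiers and the keyword ‘if’.

```c
#include <assert.h>

enum { NSTATES = 5, NCHARS = 5, EMPTY = -1, PACKED_MAX = NSTATES * NCHARS };

/* Columns: 'i', 'f', L, D, U; entries are state numbers, EMPTY for '-'. */
static const int table[NSTATES][NCHARS] = {
    /* S0 */ { 3, 1, 1, EMPTY, EMPTY },
    /* S1 */ { 1, 1, 1, 1, 2 },
    /* S2 */ { 1, 1, 1, 1, 2 },
    /* S3 */ { 1, 4, 1, 1, 2 },
    /* S4 */ { 1, 1, 1, 1, 2 },
};

int entry[PACKED_MAX];   /* packed transitions */
int owner[PACKED_MAX];   /* row that owns each slot, EMPTY if free */
int base[NSTATES];       /* displacement of each row */

/* True if every non-empty entry of the row lands on a free slot at offset d. */
static int fits(int row, int d) {
    for (int c = 0; c < NCHARS; c++)
        if (table[row][c] != EMPTY && owner[d + c] != EMPTY) return 0;
    return 1;
}

/* Packs all rows with first-fit; returns the number of slots used. */
int pack(void) {
    int used = 0;
    for (int i = 0; i < PACKED_MAX; i++) owner[i] = EMPTY;
    for (int r = 0; r < NSTATES; r++) {
        int d = 0;
        while (!fits(r, d)) d++;    /* row r always fits at d <= r * NCHARS */
        base[r] = d;
        for (int c = 0; c < NCHARS; c++) {
            if (table[r][c] == EMPTY) continue;
            entry[d + c] = table[r][c];
            owner[d + c] = r;
            if (d + c + 1 > used) used = d + c + 1;
        }
    }
    return used;
}

/* Looks up a transition in the packed representation. */
int lookup(int state, int ch) {
    assert(state >= 0 && state < NSTATES && ch >= 0 && ch < NCHARS);
    int slot = base[state] + ch;
    return (slot < PACKED_MAX && owner[slot] == state) ? entry[slot] : EMPTY;
}
```

With this table the gain is small, because only S0 has empty transitions; sharing the identical rows of S1, S2 and S4 before packing would save considerably more.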
![Page 34: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/34.jpg)
Summary: generating a lexical analyzer
• tool: lex
  • token descriptions + actions
  • wrapper interface
• FSA construction
  • dotted items
  • character moves
  • ε moves
![Page 35: Compiler construction in4020 – lecture 2](https://reader036.vdocument.in/reader036/viewer/2022070404/56813c27550346895da5a141/html5/thumbnails/35.jpg)
Homework
• study sections 2.1.10 – 2.1.12
  • lexical identification of tokens
  • symbol tables
  • macro processing
• print handout lecture 3 [blackboard]
• find a partner for the “practicum”
  • register your group
• send e-mail to [email protected]