Winter 2012-2013 Compiler Principles Lexical Analysis (Scanning) Mayer Goldberg and Roman Manevich Ben-Gurion University

Upload: gilles

Post on 25-Feb-2016


Page 1: Winter  2012-2013 Compiler  Principles Lexical  Analysis (Scanning)

Winter 2012-2013
Compiler Principles

Lexical Analysis (Scanning)

Mayer Goldberg and Roman Manevich
Ben-Gurion University

Page 2: General stuff

Topics taught by me:
    Lexical analysis (scanning)
    Syntax analysis (parsing)
    …
    Dataflow analysis
    Register allocation

Slides will be available from the web site after the lecture

Request: please mute mobiles, tablets, super-cool squeaking devices

Page 3: Today

Understand the role of lexical analysis

Lexical analysis theory

Implementing a modern scanner

Page 4: Role of lexical analysis

First part of the compiler front-end

Convert a stream of characters into a stream of tokens
    Split text into the most basic meaningful strings
    Simplify input for syntax analysis

[Figure: compiler pipeline] High-level Language (scheme) → Lexical Analysis → Syntax Analysis (Parsing) → AST, Symbol Table, etc. → Inter. Rep. (IR) → Code Generation → Executable Code

Page 5: From scanning to parsing

program text: 5 + (7 * x)
    ↓ Lexical Analyzer
token stream: num + ( num * id )
    ↓ Parser
valid: Abstract Syntax Tree, otherwise: syntax error

Grammar:
    E → id
    E → num
    E → E + E
    E → E * E
    E → ( E )

Abstract Syntax Tree: root + with children num and *, where * has children num and x

Page 6: JavaScript example

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];

      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }

Identify basic units in this code

Page 8: JavaScript example

Identify basic units in the code above, by category:
    keyword
    numeric literal
    operator
    string literal
    punctuation
    identifier
    whitespace

Page 9: Scanner output

Stream of tokens for the code above, in the format LINE: ID(value)

    1: VAR
    1: ID(currOption)
    1: EQ
    1: INT_LITERAL(0)
    1: SEMI
    3: FUNCTION
    3: ID(choose)
    3: LP
    3: ID(id)
    3: RP
    3: LCB
    ...

Page 10: What is a token?

Lexeme: substring of the original text constituting an identifiable unit
    Identifiers, values, reserved words, …

Token: record type storing
    Kind
    Value (when applicable)
    Start-position/end-position
    Any information that is useful for the parser

Different for different languages
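
The record described above can be sketched as a small Python data class. This is only an illustration: the field names (`kind`, `value`, `line`, `start`, `end`) and the printed `LINE: ID(value)` format are assumptions modeled on the scanner-output slide, not a definition taken from the course.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: kind, optional value, and position."""
    kind: str                    # e.g. "ID", "INT_LITERAL", "SEMI"
    value: Optional[str] = None  # lexeme text, only when applicable
    line: int = 0                # source line of the lexeme
    start: int = 0               # offset of the first character
    end: int = 0                 # offset one past the last character

    def __str__(self) -> str:
        # Print in the LINE: KIND(value) style used on the scanner-output slide.
        body = self.kind if self.value is None else f"{self.kind}({self.value})"
        return f"{self.line}: {body}"


tokens = [
    Token("VAR", line=1, start=0, end=3),
    Token("ID", "currOption", line=1, start=4, end=14),
]
for tok in tokens:
    print(tok)
```

Keeping the position fields on every token lets a parser report errors against the original text even though it never sees the characters themselves.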

Page 11: C++ example 1

Splitting text into tokens can be tricky
How should the code below be split?

    vector<vector<int>> myVector

One >> operator, or two tokens >, > ?

Page 12: C++ example 2

Splitting text into tokens can be tricky
How should the code below be split?

    vector<vector<int> > myVector

Two tokens: >, >

Page 13: Example tokens

Type          Examples
Identifier    x, y, z, foo, bar
NUM           42
FLOATNUM      -3.141592654
STRING        “so long, and thanks for all the fish”
LPAREN        (
RPAREN        )
IF            if
…

Page 14: Separating tokens

Type            Examples
Comments        /* ignore code */
                // ignore until end of line
White spaces    \t \n

Lexemes are recognized but get consumed rather than transmitted to the parser:
    if
    i f
    i/*comment*/f

Page 15: Preprocessor directives in C

Type                  Examples
Include directives    #include<foo.h>
Macros                #define THE_ANSWER 42

Page 16: Designing a scanner

Define each type of lexeme
    Reserved words: var, if, for, while
    Operators: < = ++
    Identifiers: myFunction
    Literals: 123 “hello”
    Annotations: @SuppressWarnings

But how do we define lexemes of unbounded length?

Page 17: Designing a scanner

Define each type of lexeme
    Reserved words: var, if, for, while
    Operators: < = ++
    Identifiers: myFunction
    Literals: 123 “hello”
    Annotations: @SuppressWarnings

But how do we define lexemes of unbounded length?
    Regular expressions

Page 18: Regular languages refresher

Formal languages
    Alphabet = finite set of letters
    Word = sequence of letters
    Language = set of words

Regular languages are defined equivalently by
    Regular expressions
    Finite-state automata

Page 19: Regular expressions

Empty string: ε
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*

Shorthand: R+ stands for R R*
Scope: (R)

Example: (0* 1*) | (1* 0*)
    What is this language?

Page 20: Exercise 1 - Question

Language of Java identifiers
    Identifiers start with either an underscore ‘_’ or a letter
    Continue with either underscore, letter, or digit

Page 21: Exercise 1 - Answer

Language of Java identifiers
    Identifiers start with either an underscore ‘_’ or a letter
    Continue with either underscore, letter, or digit

(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*

Using shorthand macros
    First = _|a|b|…|z|A|…|Z
    Next = First|0|…|9
    R = First Next*
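
The `First Next*` answer can be tried out with Python's `re` module. This is a sketch of the simplified definition used in the exercise, not of the full Java rule (real Java identifiers also admit `$` and Unicode letters); the names `FIRST`, `NEXT` and `is_identifier` are made up for the example.

```python
import re

# Macros from the exercise, written as character classes.
FIRST = r"[_a-zA-Z]"       # underscore or letter
NEXT = r"[_a-zA-Z0-9]"     # First, or a digit
IDENTIFIER = re.compile(FIRST + NEXT + "*")  # R = First Next*


def is_identifier(word: str) -> bool:
    """True if the whole word is in the language First Next*."""
    return IDENTIFIER.fullmatch(word) is not None


for word in ["_tmp", "x1", "myFunction", "1abc", ""]:
    print(repr(word), is_identifier(word))
```

`fullmatch` is used rather than `match` because membership in the language concerns the whole word, not a prefix of it.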

Page 22: Exercise 2 - Question

Language of rational numbers in decimal representation (no leading, ending zeros)
    0
    123.757
    .933333
    Not 007
    Not 0.30

Page 23: Exercise 2 - Answer

Language of rational numbers in decimal representation (no leading, ending zeros)

    Digit = 1|2|…|9
    Digit0 = 0|Digit
    Num = Digit Digit0*
    Frac = Digit0* Digit
    Pos = Num | .Frac | 0.Frac | Num.Frac
    PosOrNeg = (ε|-)Pos
    R = 0 | PosOrNeg
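
As a quick check, the macros above can be transcribed one-to-one into Python's `re` syntax and run against the examples from the question. The variable names mirror the macro names; the only changes are that the decimal point is escaped and that `(ε|-)` is written as `-?`.

```python
import re

DIGIT = "[1-9]"
DIGIT0 = "[0-9]"
NUM = DIGIT + DIGIT0 + "*"        # Num  = Digit Digit0*
FRAC = DIGIT0 + "*" + DIGIT       # Frac = Digit0* Digit
POS = "(?:" + "|".join([
    NUM,                          # Num
    r"\." + FRAC,                 # .Frac
    r"0\." + FRAC,                # 0.Frac
    NUM + r"\." + FRAC,           # Num.Frac
]) + ")"
R = re.compile("(?:0|-?" + POS + ")")   # R = 0 | (ε|-)Pos


def is_rational(word: str) -> bool:
    """True if the whole word is in the language R."""
    return R.fullmatch(word) is not None


for word in ["0", "123.757", ".933333", "007", "0.30"]:
    print(word, is_rational(word))
```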

Page 24: Exercise 3 - Question

Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …

Page 25: Exercise 3 - Answer

Equal number of opening and closing parentheses: [ⁿ]ⁿ = [], [[]], [[[]]], …
    Not regular
    Context-free

Grammar:
    S ::= [] | [S]
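
Although no finite automaton recognizes this language, the grammar translates directly into a recursive check. The sketch below follows the two productions literally; the function name `matches_s` is invented for the example.

```python
def matches_s(word: str) -> bool:
    """Recognize the grammar S ::= [] | [S]."""
    if word == "[]":                      # production S ::= []
        return True
    if len(word) > 2 and word[0] == "[" and word[-1] == "]":
        return matches_s(word[1:-1])      # production S ::= [S]
    return False


for word in ["[]", "[[[]]]", "[[]", "[][]", ""]:
    print(repr(word), matches_s(word))
```

The recursion depth grows with the nesting level, which is exactly the unbounded memory a finite automaton lacks.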

Page 26: Finite automata

[Figure: automaton diagram annotated with its start state, transitions (labeled a, b, b, c), and accepting state]

An automaton is defined by states and transitions

Page 27: Automaton running example

[Figure: automaton with start state and transitions labeled a, b, b, c]

Words are read left-to-right

Input word: a b c

Page 30: Automaton running example

[Figure: automaton with start state and transitions labeled a, b, b, c]

Words are read left-to-right

Input word: a b c
    Word accepted

Page 31: Word outside of language

[Figure: automaton with start state and transitions labeled a, b, b, c]

Input word: b b c

Page 32: Word outside of language

Missing transition means non-acceptance

[Figure: automaton with start state and transitions labeled a, b, b, c]

Input word: b b c

Page 33: Exercise - Question

What is the language defined by the automaton below?

[Figure: automaton with start state and transitions labeled a, b, b, c]

Page 34: Exercise - Answer

What is the language defined by the automaton below?
    a b* c
    Generally: all paths leading to accepting states

[Figure: automaton with start state and transitions labeled a, b, b, c]
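
Running a deterministic automaton is a table lookup per letter. The sketch below encodes one possible three-state automaton for a b* c; the state names `q0`, `q1`, `q2` are assumptions, since the slide's diagram is not recoverable from the transcript. A missing table entry means non-acceptance.

```python
# Transition table for a b* c (hypothetical state names).
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",   # loop on b
    ("q1", "c"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}


def accepts(word: str) -> bool:
    """Read the word left-to-right; reject on a missing transition."""
    state = START
    for letter in word:
        state = TRANSITIONS.get((state, letter))
        if state is None:
            return False
    return state in ACCEPTING


for word in ["abc", "ac", "abbbc", "bbc", "ab"]:
    print(word, accepts(word))
```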

Page 35: Non-deterministic automata

Allow multiple transitions from a given state labeled by the same letter

[Figure: NFA with start state and transitions labeled a, a, b, c, b, c]

Page 36: NFA run example

Input word: a b c

[Figure: NFA with start state and transitions labeled a, a, b, c, b, c]

Page 37: NFA run example

Maintain set of states

Input word: a b c

[Figure: NFA with start state and transitions labeled a, a, b, c, b, c]

Page 39: NFA run example

Accept word if any of the states in the set is accepting

Input word: a b c

[Figure: NFA with start state and transitions labeled a, a, b, c, b, c]
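
The "maintain a set of states" idea can be sketched in a few lines. The NFA below is hypothetical (it accepts abc and acb and has two a-transitions out of the start state); it is not a reconstruction of the slide's diagram.

```python
# Hypothetical NFA: delta maps (state, letter) to a SET of successor states.
DELTA = {
    ("q0", "a"): {"q1", "q3"},   # two transitions on the same letter
    ("q1", "b"): {"q2"},
    ("q2", "c"): {"q5"},
    ("q3", "c"): {"q4"},
    ("q4", "b"): {"q6"},
}
START = "q0"
ACCEPTING = {"q5", "q6"}


def nfa_accepts(word: str) -> bool:
    """Track every state the NFA could be in after each letter."""
    current = {START}
    for letter in word:
        current = {t for s in current for t in DELTA.get((s, letter), set())}
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)


for word in ["abc", "acb", "ab", "abb"]:
    print(word, nfa_accepts(word))
```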

Page 40: NFA+ε automata

ε transitions can “fire” without reading the input

[Figure: NFA+ε with start state, transitions labeled a, b, c, and an ε transition]

Page 41: NFA+ε run example

Input word: a b c

[Figure: NFA+ε with start state, transitions labeled a, b, c, and an ε transition]

Page 42: NFA+ε run example

Now the ε transition can non-deterministically take place

Input word: a b c

[Figure: NFA+ε with start state, transitions labeled a, b, c, and an ε transition]

Page 46: NFA+ε run example

Input word: a b c
    Word accepted

[Figure: NFA+ε with start state, transitions labeled a, b, c, and an ε transition]

Page 47: Reg-exp vs. automata

Regular expressions are declarative
    Offer a compact way for humans to define a regular language
    Don’t offer a direct way to check whether a given word is in the language

Automata are operative
    Define an algorithm for deciding whether a given word is in a regular language
    Not a natural notation for humans

Page 48: From reg. exp. to automata

Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression

Proof: by induction on the structure of the regular expression
    For each sub-expression R we build an automaton with exactly one start state and one accepting state
    Start state has no incoming transitions
    Accepting state has no outgoing transitions

Page 49: From reg. exp. to automata

Theorem: there is an algorithm to build an NFA+ε automaton for any regular expression

Proof: by induction on the structure of the regular expression

[Figure: automaton schema with one start state and one accepting state]

Page 50: Base cases

R = ε
    [Figure: automaton for the empty string]

R = a
    [Figure: automaton with a single transition labeled a]

Page 51: Construction for R1 | R2

[Figure: new start state branching to the automata for R1 and R2]

Page 52: Construction for R1 R2

[Figure: automaton for R1 followed by the automaton for R2]

Page 53: Construction for R*

[Figure: automaton for R with transitions allowing zero or more repetitions]
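
Since the construction figures did not survive in this transcript, here is a sketch of the usual Thompson-style construction, which may differ in detail from the slides. Each helper returns a triple (start, accepting, edges) with one start state that has no incoming transitions and one accepting state that has no outgoing transitions, as the proof requires. All function names, and the use of `None` as the ε label, are choices made for this example.

```python
from itertools import count

EPS = None          # label used for ε transitions
_fresh = count()    # supplies fresh state numbers


def letter(a):
    s, f = next(_fresh), next(_fresh)
    return s, f, [(s, a, f)]


def union(n1, n2):
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    s, f = next(_fresh), next(_fresh)
    return s, f, e1 + e2 + [(s, EPS, s1), (s, EPS, s2), (f1, EPS, f), (f2, EPS, f)]


def concat(n1, n2):
    (s1, f1, e1), (s2, f2, e2) = n1, n2
    return s1, f2, e1 + e2 + [(f1, EPS, s2)]


def star(n):
    s1, f1, e1 = n
    s, f = next(_fresh), next(_fresh)
    return s, f, e1 + [(s, EPS, s1), (f1, EPS, f), (s, EPS, f), (f1, EPS, s1)]


def closure(states, edges):
    """All states reachable from `states` using only ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def accepts(nfa, word):
    start, final, edges = nfa
    current = closure({start}, edges)
    for ch in word:
        moved = {dst for src, label, dst in edges if src in current and label == ch}
        current = closure(moved, edges)
    return final in current


# The earlier example (0* 1*) | (1* 0*), built bottom-up.
r = union(concat(star(letter("0")), star(letter("1"))),
          concat(star(letter("1")), star(letter("0"))))
print(accepts(r, "0011"), accepts(r, "1100"), accepts(r, "010"))
```

Every operator adds at most two states, which is where the linear bound on the number of states comes from.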

Page 54: From NFA+ε to DFA

Construction requires O(n) states for a reg-exp of length n
Running an NFA+ε with n states on a string of length m takes O(m·n²) time
Solution: determinization via subset construction
    Number of states worst-case exponential in n
    Running time O(m)

Page 55: Subset construction

For an NFA+ε with states M = {s1, …, sk}
Construct a DFA with one state per set of states of the corresponding NFA
    M’ = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
Simulate transitions between individual states for every letter

    NFA+ε: s1 --a--> s2, s4 --a--> s7
    DFA:   [s1,s4] --a--> [s2,s7]

Page 56: Subset construction

For an NFA+ε with states M = {s1, …, sk}
Construct a DFA with one state per set of states of the corresponding NFA
    M’ = { [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], … }
Extend macro states by states reachable via ε transitions

    NFA+ε: s1 --ε--> s4
    DFA:   [s1,s2] becomes [s1,s2,s4]
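
Both steps (simulating letter transitions and extending macro states along ε transitions) fit in a short worklist algorithm. The sketch below builds only the macro states reachable from the start; the tiny NFA at the bottom is an assumption that combines the two fragments shown on the slides (s1 --a--> s2, s4 --a--> s7, s1 --ε--> s4), with s7 arbitrarily chosen as accepting.

```python
from collections import deque

EPS = None  # label used for ε transitions


def eps_closure(states, delta):
    """Extend a set of NFA states by everything reachable via ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)


def determinize(start, accepting, delta, alphabet):
    """Subset construction; DFA states are frozensets of NFA states."""
    d_start = eps_closure({start}, delta)
    d_delta, d_accepting = {}, set()
    queue, seen = deque([d_start]), {d_start}
    while queue:
        macro = queue.popleft()
        if macro & accepting:
            d_accepting.add(macro)
        for a in alphabet:
            moved = {t for s in macro for t in delta.get((s, a), ())}
            target = eps_closure(moved, delta)
            if not target:
                continue  # no transition: the DFA rejects here
            d_delta[(macro, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return d_start, d_accepting, d_delta


# Hypothetical NFA+ε assembled from the fragments on the slides.
delta = {("s1", EPS): {"s4"}, ("s1", "a"): {"s2"}, ("s4", "a"): {"s7"}}
d_start, d_acc, d_delta = determinize("s1", {"s7"}, delta, "a")
print(sorted(d_start), sorted(d_delta[(d_start, "a")]))
```

Exploring only reachable macro states often keeps the DFA far smaller than the worst-case exponential bound.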

Page 57: Scanning challenges

Regular expressions allow us to define the language of all sequences of tokens

Automata theory provides an algorithm for checking membership of words
    But we are interested in splitting the text, not just deciding on membership
    How do we determine lexemes?
    How do we handle ambiguities: lexemes matching more than one token?

Page 58: Separating lexemes

ID = (a+b+…+z) (a+b+…+z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?

Page 59: Separating lexemes

ID = (a+b+…+z) (a+b+…+z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?

[Figure: automaton with transitions labeled a-z, a-z, 1 and accepting states ID and ONE]

Page 60: Maximal munch

ID = (a+b+…+z) (a+b+…+z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?

Solution: find the longest matching lexeme
    Keep reading text until the automaton leaves the accepting state
    Return the token corresponding to the accepting state
    Reset: go back to the start state and continue reading input from there
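
A minimal sketch of maximal munch for the ID/ONE example follows. It uses the common variant that remembers the last accepting position seen and falls back to it when the automaton gets stuck; the state names (`start`, `in_id`, `one`) and the `(kind, lexeme)` pair representation are assumptions for the example.

```python
import string

TOKEN_OF = {"in_id": "ID", "one": "ONE"}   # accepting state -> token kind


def step(state, ch):
    """DFA for ID = [a-z][a-z]* and ONE = 1; None means missing transition."""
    if state == "start":
        if ch in string.ascii_lowercase:
            return "in_id"
        if ch == "1":
            return "one"
    elif state == "in_id" and ch in string.ascii_lowercase:
        return "in_id"
    return None


def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_accept = None               # (end position, kind) of longest match
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break                    # automaton is stuck: stop reading
            i += 1
            if state in TOKEN_OF:
                last_accept = (i, TOKEN_OF[state])
        if last_accept is None:
            raise ValueError(f"no token matches at position {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end                        # reset: restart from the start state
    return tokens


print(scan("abb1"))
```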

Page 61: Handling ambiguities

ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?

[Figure: NFA with transitions labeled a-z, a-z, i, f and accepting states ID and IF]

Page 62: Handling ambiguities

ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?

[Figure: DFA with transitions labeled a-z\i, i, f, a-z\f, a-z, a-z; accepting states labeled ID, ID, IF ID, ID]

Page 63: Handling ambiguities

ID = (a+b+…+z) (a+b+…+z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?

Solution: break the tie using the order of definitions
    Output: ID(if)

[Figure: DFA with transitions labeled a-z\i, i, f, a-z\f, a-z, a-z; accepting states labeled ID, ID, IF ID, ID]

Page 64: Handling ambiguities

IF = if
ID = (a+b+…+z) (a+b+…+z)*
Input: if
Matches both tokens
What should the scanner output?

Solution: break the tie using the order of definitions
    Output: IF

Conclusion: list keyword token definitions before the identifier definition

[Figure: DFA with transitions labeled a-z\i, i, f, a-z\f, a-z, a-z; accepting states labeled ID, ID, IF ID, ID]
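
The effect of definition order can be demonstrated with an ordered rule list. The sketch below uses Python's `re` per rule rather than a single DFA, so it illustrates the tie-breaking policy only; the helper name `first_token` is invented for the example. The longest lexeme wins, and among equally long lexemes the earlier rule wins.

```python
import re


def first_token(rules, text):
    """Return (kind, lexeme) for the longest match; earlier rules win ties."""
    best = None
    for kind, pattern in rules:
        m = re.match(pattern, text)
        # Strictly longer matches replace the current best, so on a tie
        # the rule listed first is kept.
        if m and m.end() > 0 and (best is None or m.end() > len(best[1])):
            best = (kind, m.group())
    return best


ids_first = [("ID", r"[a-z]+"), ("IF", r"if")]
keywords_first = [("IF", r"if"), ("ID", r"[a-z]+")]

print(first_token(ids_first, "if"))        # ID is listed first
print(first_token(keywords_first, "if"))   # IF is listed first
print(first_token(keywords_first, "iffy")) # longest match still wins
```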

Page 65: Implementing scanners in practice

Page 66: Implementing scanners

Manual construction of automata + determinization is
    Very tedious
    Error-prone
    Non-incremental

Fortunately there are tools that automatically generate code from a specification, for most languages
    C: Lex, Flex
    Java: JLex, JFlex

Page 67: Using JFlex

Define tokens (and states)
Run JFlex to generate a Java implementation
Usually MyScanner.nextToken() will be called in a loop by the parser

[Figure] MyScanner.lex (regular expressions) → JFlex → MyScanner.java
         Stream of characters → MyScanner.java → Tokens

Page 68: Common format for reg-exps

Basic patterns           Matching
x                        The character x
.                        Any character, usually except a new line
[xyz]                    Any of the characters x, y, z

Repetition operators
R?                       An R or nothing (= optionally an R)
R*                       Zero or more occurrences of R
R+                       One or more occurrences of R

Composition operators
R1R2                     An R1 followed by R2
R1|R2                    Either an R1 or R2

Grouping
(R)                      R itself

Page 69: Escape characters

What is the expression for one or more + symbols?
    (+)+ won’t work
    (\+)+ will

A backslash \ before an operator turns it into a standard character
    \*, \?, \+, …

Newline: \n or \r\n depending on OS
Tab: \t
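
The same escaping rule can be observed in Python's `re` module, used here as a stand-in for a scanner generator's pattern syntax (which differs in other details). The helper `compiles` is made up for the example.

```python
import re


def compiles(pattern: str) -> bool:
    """True if `pattern` is a well-formed regular expression for Python's re."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


PLUS_RUN = re.compile(r"(\+)+")   # one or more literal + symbols

print(compiles(r"(+)+"))                       # unescaped +: rejected
print(PLUS_RUN.fullmatch("+++") is not None)   # escaped +: matches
print(re.escape("+"))                          # escaping done by the library
```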

Page 70: Shorthands

Use names for expressions
    letter = a | b | … | z | A | B | … | Z
    letter_ = letter | _
    digit = 0 | 1 | 2 | … | 9
    id = letter_ (letter_ | digit)*

Use a hyphen to denote a range
    letter = a-z | A-Z
    digit = 0-9

Page 71: Catching errors

What if the input doesn’t match any token definition?

Trick: add a “catch-all” rule that matches any character and reports an error
    Add it after all other rules
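
A small sketch of the catch-all trick, again using Python's `re` instead of a generated scanner. The rule set (`NUM`, `ID`, `WS`) is hypothetical; the point is only that `ERROR` comes last, so it fires exactly when no other rule matches at the current position. Note that Python's alternation picks the first alternative that matches rather than the longest one, which is harmless here because the rules start with disjoint characters.

```python
import re

RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("WS", r"[ \t\n]+"),
    ("ERROR", r"."),      # catch-all: must stay after all other rules
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in RULES), re.DOTALL)


def scan(text):
    """Return (tokens, errors); whitespace is consumed, bad characters reported."""
    tokens, errors = [], []
    for m in MASTER.finditer(text):
        kind = m.lastgroup
        if kind == "ERROR":
            errors.append((m.start(), m.group()))
        elif kind != "WS":
            tokens.append((kind, m.group()))
    return tokens, errors


print(scan("x1 = 42"))
```

Reporting the error and continuing, rather than stopping at the first bad character, lets the scanner surface several problems in one run.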

Page 72: Next lecture: parsing