winter 2012-2013 compiler principles lexical analysis (scanning)

Winter 2012-2013 Compiler Principles

Lexical Analysis (Scanning)

Mayer Goldberg and Roman Manevich
Ben-Gurion University

2

General stuff
Topics taught by me
  Lexical analysis (scanning)
  Syntax analysis (parsing)
  …
  Dataflow analysis
  Register allocation
Slides will be available from the web-site after the lecture
Request: please mute mobiles, tablets, super-cool squeaking devices

3

Today
Understand role of lexical analysis
Lexical analysis theory
Implementing modern scanner

4

Role of lexical analysis
First part of compiler front-end
Convert stream of characters into stream of tokens
  Split text into most basic meaningful strings
Simplify input for syntax analysis

[Figure: compiler pipeline. High-level language (Scheme) → Lexical analysis → Syntax analysis (parsing) → AST, symbol table, etc. → Intermediate representation (IR) → Code generation → Executable code]

5

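From scanning to parsing
Program text: 5 + (7 * x)
Lexical analyzer produces the token stream: num + ( num * id )
Parser checks the token stream against the grammar:
  E → id
  E → num
  E → E + E
  E → E * E
  E → ( E )
Valid input yields an abstract syntax tree (+ at the root, with children num and *; * has children num and x); otherwise the parser reports a syntax error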

6

JavaScript example
Identify basic units in this code

var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];

  for (i = 0; i < menu.length; i++) {
    currOption = menu[i];
    var elt = document.getElementById(currOption);
    if (currOption == id && elt.style.display == "none") {
      elt.style.display = "block";
    } else {
      elt.style.display = "none";
    }
  }
}


8

JavaScript example
Identify basic units in this code
[Figure: the same code with each unit highlighted by kind: keyword, identifier, operator, numeric literal, string literal, punctuation, whitespace]

9

Scanner output

var currOption = 0;
// Choose content to display in lower pane.
function choose ( id ) {
  var menu = ["about-me", "publications", "teaching", "software", "activities"];

1: VAR
1: ID(currOption)
1: EQ
1: INT_LITERAL(0)
1: SEMI
3: FUNCTION
3: ID(choose)
3: LP
3: ID(id)
3: RP
3: LCB
...

Stream of tokens
Format: LINE: ID(value)

10

What is a token?
Lexeme: substring of original text constituting an identifiable unit
  Identifiers, values, reserved words, …
Record type storing:
  Kind
  Value (when applicable)
  Start-position/end-position
  Any information that is useful for the parser
Different for different languages
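
A minimal sketch of such a record type, in Python. The field names (kind, value, line, start, end) are assumptions made for illustration; the slides only say which pieces of information are stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    kind: str             # token kind, e.g. "ID", "INT_LITERAL", "SEMI"
    value: Optional[str]  # lexeme text when applicable, otherwise None
    line: int             # 1-based line on which the lexeme starts
    start: int            # offset of the first character of the lexeme
    end: int              # offset one past the last character

    def __str__(self) -> str:
        # Follows the "LINE: ID(value)" layout of the scanner-output slide.
        if self.value is None:
            return f"{self.line}: {self.kind}"
        return f"{self.line}: {self.kind}({self.value})"


tok = Token(kind="ID", value="currOption", line=1, start=4, end=14)
print(tok)  # 1: ID(currOption)
```

What else goes into the record depends on what the parser needs, which is why it differs between languages.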

11

C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?

vector<vector<int>> myVector

One >> operator token, or two > tokens?

12

C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?

vector<vector<int> > myVector

>, > two tokens

Example tokens

Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    -3.141592654
STRING      "so long, and thanks for all the fish"
LPAREN      (
RPAREN      )
IF          if
…

13

14

Separating tokens

Type          Examples
Comments      /* ignore code */
              // ignore until end of line
White spaces  \t \n

Lexemes are recognized but get consumed rather than transmitted to parser
  if
  i f
  i/*comment*/f

15

Preprocessor directives in C

Type                Examples
Include directives  #include<foo.h>
Macros              #define THE_ANSWER 42


17

Designing a scanner
Define each type of lexeme
  Reserved words: var, if, for, while
  Operators: < = ++
  Identifiers: myFunction
  Literals: 123 "hello"
  Annotations: @SuppressWarnings
But how do we define lexemes of unbounded length?
  Regular expressions

18

Regular languages refresher
Formal languages
  Alphabet = finite set of letters
  Word = sequence of letters
  Language = set of words
Regular languages defined equivalently by
  Regular expressions
  Finite-state automata

19

Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
  Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
  What is this language?
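
One way to explore the example is to try words against it. The sketch below uses Python's re module as a stand-in for the slide notation (spaces in the slide expression are assumed to be insignificant); the helper name in_language is made up for illustration.

```python
import re

# The slide's example expression, written without the cosmetic spaces.
PATTERN = re.compile(r"(0*1*)|(1*0*)")


def in_language(word: str) -> bool:
    # fullmatch requires the whole word to match, not just a prefix.
    return PATTERN.fullmatch(word) is not None


for w in ["", "0011", "1100", "0", "010", "101"]:
    print(repr(w), in_language(w))
```

The first four words are accepted and the last two are rejected, which suggests the answer: zeros followed by ones, or ones followed by zeros.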

20

Exercise 1 - Question
Language of Java identifiers
  Identifiers start with either an underscore '_' or a letter
  Continue with either underscore, letter, or digit

21

Exercise 1 - Answer
Language of Java identifiers
  Identifiers start with either an underscore '_' or a letter
  Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*
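
The macro form translates almost directly into a practical regex dialect. A small sketch in Python, using character classes for First and Next; it follows the simplified definition on the slide (real Java identifiers also admit `$` and Unicode letters).

```python
import re

FIRST = r"[_a-zA-Z]"       # First = _ | a | … | z | A | … | Z
NEXT = r"[_a-zA-Z0-9]"     # Next  = First | 0 | … | 9
IDENTIFIER = re.compile(FIRST + NEXT + "*")  # R = First Next*


def is_identifier(word: str) -> bool:
    return IDENTIFIER.fullmatch(word) is not None


print(is_identifier("_tmp1"))   # True
print(is_identifier("9lives"))  # False
```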

22

Exercise 2 - Question
Language of rational numbers in decimal representation (no leading, ending zeros)
  0
  123.757
  .933333
  Not 007
  Not 0.30
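
The answer slide is not part of this text, so the following is only one possible reading of the exercise, checked against the five examples above. It assumes that a single 0 before the decimal point is allowed (so 0.5 would be accepted); the slides do not settle that case.

```python
import re

INTEGER = r"(0|[1-9][0-9]*)"   # no leading zeros, but a lone 0 is fine
FRACTION = r"\.[0-9]*[1-9]"    # fractional part may not end with 0
RATIONAL = re.compile(rf"{INTEGER}({FRACTION})?|{FRACTION}")


def is_rational(word: str) -> bool:
    return RATIONAL.fullmatch(word) is not None


for w in ["0", "123.757", ".933333", "007", "0.30"]:
    print(w, is_rational(w))
```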

24

Exercise 3 - Question
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

25

Exercise 3 - Answer
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …
Not regular
Context-free
Grammar:
  S ::= [] | [S]
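
The grammar can be checked by a recognizer that follows its two productions directly. A sketch in Python; the function name matches_s is made up. Note that it needs recursion (unbounded memory), which a finite automaton does not have.

```python
def matches_s(word: str) -> bool:
    """Return True if word can be derived from S ::= [] | [S]."""
    if word == "[]":
        return True  # production S ::= []
    # Production S ::= [S]: strip one outer pair and check the inside.
    if len(word) > 2 and word[0] == "[" and word[-1] == "]":
        return matches_s(word[1:-1])
    return False


print(matches_s("[[[]]]"))  # True
print(matches_s("[[]"))     # False
```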

26

Finite automata
An automaton is defined by states and transitions
[Figure: automaton with a start state, transitions labeled a, b, b, c, and an accepting state]

27

Automaton running example
Words are read left-to-right
[Figure: step-by-step run of the automaton (transitions labeled a, b, b, c) on the input tape c b a]
Word accepted

31

Word outside of language
Missing transition means non-acceptance
[Figure: run of the same automaton on the input tape c b b]

33

Exercise - Question
What is the language defined by the automaton below?
[Figure: the automaton from the previous slides, with transitions labeled a, b, b, c]

34

Exercise - Answer
What is the language defined by the automaton below?
  a b* c
  Generally: all paths leading to accepting states
[Figure: the automaton from the previous slides, with transitions labeled a, b, b, c]
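
Running a deterministic automaton is a short loop over the input. The sketch below uses a hypothetical three-state automaton for a b* c; the exact shape of the automaton drawn on the slides is not recoverable from this text, so the state numbers and the transition table are assumptions.

```python
from typing import Dict, Set, Tuple

START = 0
ACCEPTING: Set[int] = {2}
# Transition table of a hypothetical DFA for a b* c.
DELTA: Dict[Tuple[int, str], int] = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}


def accepts(word: str) -> bool:
    state = START
    for letter in word:  # words are read left-to-right
        if (state, letter) not in DELTA:
            return False  # missing transition means non-acceptance
        state = DELTA[(state, letter)]
    return state in ACCEPTING


print(accepts("abbc"))  # True
print(accepts("bbc"))   # False
```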

35

Non-deterministic automata
Allow multiple transitions from given state labeled by same letter
[Figure: NFA with transitions labeled a, a, b, c, b, c]

36

NFA run example
Maintain set of states
Accept word if any of the states in the set is accepting
[Figure: step-by-step run of the NFA (transitions labeled a, a, b, c, b, c) on the input tape c b a]
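
The set-of-states idea can be sketched as follows. The NFA here is hypothetical (two a-transitions out of the start state, one branch accepting abc and the other acb); it is not a reconstruction of the automaton on the slides.

```python
from typing import Dict, Set, Tuple

START = 0
ACCEPTING: Set[int] = {3, 6}
# Hypothetical NFA: state 0 has two transitions labeled a.
DELTA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {1, 4},
    (1, "b"): {2},
    (2, "c"): {3},
    (4, "c"): {5},
    (5, "b"): {6},
}


def accepts(word: str) -> bool:
    current = {START}  # all states the NFA could be in right now
    for letter in word:
        following: Set[int] = set()
        for state in current:
            following |= DELTA.get((state, letter), set())
        current = following
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)


print(accepts("abc"))  # True
print(accepts("acb"))  # True
print(accepts("ab"))   # False
```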

40

NFA+Є automata
Є transitions can "fire" without reading the input
[Figure: NFA with transitions labeled a, b, c and one Є transition]

41

NFA+Є run example
Now Є transition can non-deterministically take place
[Figure: step-by-step run of the NFA+Є (transitions labeled a, b, c, Є) on the input tape c b a]
Word accepted
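
With Є transitions, the set of current states is extended by everything reachable through Є edges before and after each letter. A sketch on a hypothetical NFA+Є for a b* c; the automaton and the EPS marker are assumptions, not the slide's figure.

```python
from typing import Dict, Set, Tuple

EPS = ""  # marker used here for Є transitions
START = 0
ACCEPTING: Set[int] = {3}
DELTA: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {1},
    (1, "b"): {1},
    (1, EPS): {2},  # can fire without reading input
    (2, "c"): {3},
}


def eps_closure(states: Set[int]) -> Set[int]:
    closure, stack = set(states), list(states)
    while stack:
        for target in DELTA.get((stack.pop(), EPS), set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(word: str) -> bool:
    current = eps_closure({START})
    for letter in word:
        moved: Set[int] = set()
        for state in current:
            moved |= DELTA.get((state, letter), set())
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


print(accepts("abbc"))  # True
print(accepts("ab"))    # False
```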

47

Reg-exp vs. automata
Regular expressions are declarative
  Offer compact way to define a regular language by humans
  Don't offer direct way to check whether a given word is in the language
Automata are operative
  Define an algorithm for deciding whether a given word is in a regular language
  Not a natural notation for humans

48

From reg. exp. to automata
Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
Proof: by induction on the structure of the regular expression
  For each sub-expression R we build an automaton with exactly one start state and one accepting state
  Start state has no incoming transitions
  Accepting state has no outgoing transitions


50

Base cases

R = Є
R = a
[Figure: the automata for the two base cases]

51

Construction for R1 | R2

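[Figure: automaton for R1 | R2 built from the automata for R1 and R2]

52

Construction for R1 R2

[Figure: automaton for R1 R2 built from the automata for R1 and R2]

53

Construction for R*

[Figure: automaton for R* built from the automaton for R]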
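
The inductive construction can be sketched in code. The figures are not recoverable from this text, so the wiring below is the usual Thompson-style one and should be read as an assumption; the tuple encoding of expressions ("eps", "sym", "alt", "cat", "star") and all names are made up for illustration.

```python
from itertools import count

EPS = None  # label used here for Є transitions


class Builder:
    def __init__(self):
        self.ids = count()
        self.delta = {}  # (state, label) -> set of target states

    def edge(self, src, label, dst):
        self.delta.setdefault((src, label), set()).add(dst)

    def build(self, r):
        """Return (start, accept) of an NFA+Є for expression r."""
        # Fresh start/accept per sub-expression: the start gets no incoming
        # edges and the accept gets no outgoing edges at this level.
        start, accept = next(self.ids), next(self.ids)
        kind = r[0]
        if kind == "eps":
            self.edge(start, EPS, accept)
        elif kind == "sym":
            self.edge(start, r[1], accept)
        elif kind == "alt":
            for sub in r[1:]:
                s, a = self.build(sub)
                self.edge(start, EPS, s)
                self.edge(a, EPS, accept)
        elif kind == "cat":
            s1, a1 = self.build(r[1])
            s2, a2 = self.build(r[2])
            self.edge(start, EPS, s1)
            self.edge(a1, EPS, s2)
            self.edge(a2, EPS, accept)
        elif kind == "star":
            s, a = self.build(r[1])
            self.edge(start, EPS, s)       # enter R
            self.edge(start, EPS, accept)  # or skip it (zero occurrences)
            self.edge(a, EPS, s)           # repeat R
            self.edge(a, EPS, accept)      # or leave
        else:
            raise ValueError(f"unknown expression kind: {kind}")
        return start, accept


def accepts(r, word):
    b = Builder()
    start, accept = b.build(r)

    def closure(states):
        result, stack = set(states), list(states)
        while stack:
            for t in b.delta.get((stack.pop(), EPS), set()):
                if t not in result:
                    result.add(t)
                    stack.append(t)
        return result

    current = closure({start})
    for letter in word:
        moved = set()
        for s in current:
            moved |= b.delta.get((s, letter), set())
        current = closure(moved)
    return accept in current


# a b* c
ABC = ("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
print(accepts(ABC, "abbc"))  # True
print(accepts(ABC, "ab"))    # False
```

Each sub-expression contributes two new states, so the number of states is linear in the size of the expression.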

54

From NFA+Є to DFA
Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on string of length m takes O(m·n²) time
Solution: determinization via subset construction
  Number of states worst-case exponential in n
  Running time O(m)

55

Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
  M'={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual states for every letter
  NFA+Є: s1 --a--> s2 and s4 --a--> s7
  DFA: [s1,s4] --a--> [s2,s7]

56

Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
  M'={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Extend macro states by states reachable via Є transitions
  NFA+Є: s1 --Є--> s4
  DFA: [s1,s2] becomes [s1,s2,s4]
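
A compact sketch of the subset construction, combining both steps (simulating letters and extending by Є). It only builds the macro states reachable from the start, rather than all subsets. The tiny NFA+Є at the bottom is assembled from the fragments shown on the slides (s1 --a--> s2, s4 --a--> s7, s1 --Є--> s4); treating s1 as the start state and s7 as accepting is an assumption.

```python
EPS = None  # label used here for Є transitions


def eps_closure(delta, states):
    closure, stack = set(states), list(states)
    while stack:
        for target in delta.get((stack.pop(), EPS), set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def determinize(delta, start, accepting, alphabet):
    start_set = frozenset(eps_closure(delta, {start}))
    dfa_delta = {}
    seen, work = {start_set}, [start_set]
    while work:
        macro = work.pop()
        for letter in alphabet:
            moved = set()
            for state in macro:  # simulate each individual state
                moved |= delta.get((state, letter), set())
            target = frozenset(eps_closure(delta, moved))
            if not target:
                continue  # keep the transition missing
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                work.append(target)
    dfa_accepting = {m for m in seen if m & accepting}
    return start_set, dfa_delta, dfa_accepting


nfa = {("s1", "a"): {"s2"}, ("s4", "a"): {"s7"}, ("s1", EPS): {"s4"}}
start_set, dfa_delta, dfa_accepting = determinize(nfa, "s1", {"s7"}, "a")
print(sorted(start_set))                    # ['s1', 's4']
print(sorted(dfa_delta[(start_set, "a")]))  # ['s2', 's7']
```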

57

Scanning challenges
Regular expressions allow us to define the language of all sequences of tokens
Automata theory provides an algorithm for checking membership of words
  But we are interested in splitting the text, not just deciding on membership
How do we determine lexemes?
How do we handle ambiguities: lexemes matching more than one token?


59

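Separating lexemes
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton with an a-z transition from the start state to accepting state ID, which loops on a-z, and a 1 transition from the start state to accepting state ONE]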

60

Maximal munch
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
Solution: find longest matching lexeme
  Keep reading text until automaton leaves accepting state
  Return token corresponding to accepting state
  Reset: go back to start state and continue reading input from there
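
The steps above can be sketched on the ID/ONE automaton. State numbers and function names are assumptions. The sketch remembers the last position at which the automaton was in an accepting state, a slightly more general form of "until the automaton leaves the accepting state" that also works when accepting states are passed through on the way to a longer match.

```python
from typing import List, Optional, Tuple

# States: 0 = start, 1 = accepting for ID, 2 = accepting for ONE.
ACCEPTING = {1: "ID", 2: "ONE"}


def step(state: int, ch: str) -> Optional[int]:
    if state in (0, 1) and "a" <= ch <= "z":
        return 1
    if state == 0 and ch == "1":
        return 2
    return None  # missing transition


def scan(text: str) -> List[Tuple[str, str]]:
    tokens = []
    pos = 0
    while pos < len(text):
        state, i = 0, pos   # reset to the start state
        last_accept = None  # (end offset, token kind) of longest match
        while i < len(text):
            following = step(state, text[i])
            if following is None:
                break
            state, i = following, i + 1
            if state in ACCEPTING:
                last_accept = (i, ACCEPTING[state])
        if last_accept is None:
            raise ValueError(f"no token matches at offset {pos}")
        end, kind = last_accept
        tokens.append((kind, text[pos:end]))
        pos = end  # continue reading input from there
    return tokens


print(scan("abb1"))  # [('ID', 'abb'), ('ONE', '1')]
```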

61

Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: NFA with one branch for ID (a-z, then a loop on a-z) and one branch for IF (i, then f)]

62


Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
  Output: ID(if)
[Figure: DFA obtained by determinization, with transitions labeled a-z\i, i, f, a-z\f and a-z; the state reached on "if" is labeled with both IF and ID, the other accepting states are labeled ID]

64

Handling ambiguities
IF = if
ID = (a|b|…|z) (a|b|…|z)*
Input: if
Matches both tokens
What should the scanner output?
Solution: break tie using order of definitions
  Output: IF
Conclusion: list keyword token definitions before identifier definition
[Figure: the same DFA; the state reached on "if" is labeled with both IF and ID]
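
Longest match plus order of definitions can be sketched with an ordered rule list. This uses Python's re module per rule instead of one combined DFA, which is a simplification; the WS rule and all names are assumptions added so the example can scan more than one word.

```python
import re
from typing import List, Tuple

# Rules in priority order: the keyword is listed before the identifier.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z]*")),
    ("WS", re.compile(r"[ \t\n]+")),
]


def scan(text: str) -> List[Tuple[str, str]]:
    tokens = []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier rule when two lengths tie.
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        if best_kind is None:
            raise ValueError(f"no token matches at offset {pos}")
        if best_kind != "WS":  # whitespace is consumed, not emitted
            tokens.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(scan("if iffy"))  # [('IF', 'if'), ('ID', 'iffy')]
```

Swapping the first two rules makes the scanner return ID for the input if, which is the situation on the earlier slide.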

65

Implementing scanners in practice

66

Implementing scanners
Manual construction of automata + determinization is
  Very tedious
  Error-prone
  Non-incremental
Fortunately there are tools that automatically generate code from a specification for most languages
  C: Lex, Flex
  Java: JLex, JFlex

67

Using JFlex
Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser

[Figure: regular expressions (MyScanner.lex) → JFlex → MyScanner.java; MyScanner.java turns the stream of characters into tokens]

68

Common format for reg-exps

Basic patterns         Matching
x                      The character x
.                      Any character, usually except a new line
[xyz]                  Any of the characters x, y, z

Repetition operators
R?                     An R or nothing (= optionally an R)
R*                     Zero or more occurrences of R
R+                     One or more occurrences of R

Composition operators
R1R2                   An R1 followed by R2
R1|R2                  Either an R1 or R2

Grouping
(R)                    R itself

69

Escape characters
What is the expression for one or more + symbols?
  (+)+ won't work
  (\+)+ will
Backslash \ before an operator turns it into a standard character
  \*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t
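
The same escaping rule can be tried out in Python's re module, used here only as a convenient stand-in for the scanner generator's pattern syntax, which may differ in details.

```python
import re

try:
    re.compile(r"(+)+")  # unescaped: + has nothing to repeat
    unescaped_ok = True
except re.error:
    unescaped_ok = False

plus_run = re.compile(r"(\+)+")  # escaped: one or more + symbols

print(unescaped_ok)                            # False
print(plus_run.fullmatch("+++") is not None)   # True
```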

70

Shorthands
Use names for expressions
  letter = a | b | … | z | A | B | … | Z
  letter_ = letter | _
  digit = 0 | 1 | 2 | … | 9
  id = letter_ (letter_ | digit)*
Use hyphen to denote a range
  letter = a-z | A-Z
  digit = 0-9

71

Catching errors
What if input doesn't match any token definition?
Trick: add a "catch-all" rule that matches any character and reports an error
  Add after all other rules
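
A sketch of the catch-all trick with an ordered rule list in Python. The NUM, ID and WS rules and the error-message format are assumptions for illustration; the point is the last rule, which matches any single character and therefore only wins when no earlier rule matches.

```python
import re
from typing import List, Tuple

RULES = [
    ("NUM", re.compile(r"[0-9]+")),
    ("ID", re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")),
    ("WS", re.compile(r"\s+")),
    ("ERROR", re.compile(r".", re.DOTALL)),  # catch-all, listed last
]


def scan(text: str) -> Tuple[List[Tuple[str, str]], List[str]]:
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        best_kind, best_len = None, 0
        for kind, pattern in RULES:
            m = pattern.match(text, pos)
            # Longest match wins; on a tie the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_kind, best_len = kind, len(m.group())
        lexeme = text[pos:pos + best_len]
        if best_kind == "ERROR":
            errors.append(f"unexpected character {lexeme!r} at offset {pos}")
        elif best_kind != "WS":
            tokens.append((best_kind, lexeme))
        pos += best_len  # the catch-all guarantees best_len >= 1
    return tokens, errors


tokens, errors = scan("x1 # 42")
print(tokens)  # [('ID', 'x1'), ('NUM', '42')]
print(errors)  # ["unexpected character '#' at offset 3"]
```

Reporting the error and moving on lets the scanner find further errors in the same run instead of stopping at the first one.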

72

Next lecture: parsing
