EDAN65: Compilers, Lecture 02 Regular expressions and scanning Görel Hedin Revised: 2017-08-29


Page 1: Title (source: fileadmin.cs.lth.se/cs/Education/EDAN65/2017/lectures/L02.pdf)

EDAN65: Compilers, Lecture 02
Regular expressions and scanning
Görel Hedin
Revised: 2017-08-29

Page 2: Course overview

[Figure: overview of the compiler phases and the data passed between them.
Phases: lexical analyzer (scanner), syntactic analyzer (parser), semantic analyzer, intermediate code generator, optimizer, target code generator.
Data: source code (text), tokens, AST (abstract syntax tree), attributed AST, intermediate code, target code.
Specifications: regular expressions (scanner), context-free grammar (parser), attribute grammar (semantic analyzer).
Execution: interpreter, virtual machine, machine; runtime system with stack, heap, code and data, objects, activation records, garbage collection.
Marked as "This lecture": regular expressions and the scanner.]

Page 3: Analyzing program text

program text:  sum = sum + k
tokens:        ID EQ ID PLUS ID
parse tree:    [Figure: tree over the tokens with the nodes AssignStmt, Exp, Add, Exp, Exp]

This lecture: from program text to tokens.

Page 4: Recall: Generating the compiler

[Figure: each compiler phase is generated from a specification.
Regular expressions -> scanner generator -> lexical analyzer (scanner)
Context-free grammar -> parser generator -> syntactic analyzer (parser)
Attribute grammar -> attribute evaluator generator -> semantic analyzer
Data flow: text -> scanner -> tokens -> parser -> tree -> semantic analyzer]

We will use a scanner generator called JFlex.

Page 5: Some typical tokens

Kind            Token    Example lexemes        Regular expression (JFlex syntax)
Reserved words  IF       if                     "if"
(keywords)      THEN     then                   "then"
                FOR      for                    "for"
Identifiers     ID       B  alpha  k10          [A-Za-z][A-Za-z0-9]*
Literals        INT      123  0  99  2016       [0-9]+
                FLOAT    3.1416  0.2            [0-9]+ "." [0-9]+
                STRING   "Hello"  ""  "100%"    \" [^\"]* \"
                CHAR     'A'  'c'  '%'          \' [^\'] \'
Operators       PLUS     +                      "+"
                INCR     ++                     "++"
                NE       !=                     "!="
Separators      SEMI     ;                      ";"
                COMMA    ,                      ","
                LPAREN   (                      "("

Page 6: Formal languages

• An alphabet, Σ, is a set of symbols (nonempty and finite).
• A string is a sequence of symbols (each string is finite).
• A formal language, L, is a set of strings (can be infinite).

• We would like to have rules or algorithms for defining a language, that is, for deciding whether a certain string over the alphabet belongs to the language or not.

Page 7: Example: Languages over binary numbers

Suppose we have the alphabet Σ = {0, 1}

Example languages:
• The set of all possible combinations of zeros and ones:
  L0 = {0, 1, 00, 01, 10, 11, 000, ...}
• All binary numbers without unnecessary leading zeros:
  L1 = {0, 1, 10, 11, 100, 101, 110, 111, 1000, ...}
• All binary numbers with two digits:
  L2 = {00, 01, 10, 11}
• ...
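The three example languages can be written as regular expressions, and deciding membership then becomes a full-string match. The following is a minimal sketch using java.util.regex; the class name BinaryLanguages and the method member are made up for illustration, and the three patterns are one possible way of writing L0, L1 and L2, not taken from the slides.

```java
import java.util.regex.Pattern;

class BinaryLanguages {
    // L0: any nonempty combination of zeros and ones.
    static final Pattern L0 = Pattern.compile("[01]+");
    // L1: binary numbers without unnecessary leading zeros.
    static final Pattern L1 = Pattern.compile("0|1[01]*");
    // L2: binary numbers with exactly two digits.
    static final Pattern L2 = Pattern.compile("[01][01]");

    /** Decides whether the string s belongs to the given language. */
    static boolean member(Pattern language, String s) {
        return language.matcher(s).matches(); // matches() requires the whole string to match
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "10", "01", "000", "1000"}) {
            System.out.printf("%-5s L0=%b L1=%b L2=%b%n", s,
                    member(L0, s), member(L1, s), member(L2, s));
        }
    }
}
```

For instance, the string 01 belongs to L0 and L2 but not to L1, because of its leading zero.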

Page 8: Example: Languages over UNICODE

Here, the alphabet Σ is the set of UNICODE characters.

Example languages:
• All possible Java keywords: {"class", "import", "public", ...}
• All possible lexemes corresponding to Java tokens.
• All possible lexemes corresponding to Java whitespace.
• All binary numbers
• ...

Page 9: Example: Languages over Java tokens

Here, the alphabet Σ is the set of Java tokens.

Example languages:
• All syntactically correct Java programs
• All that are syntactically incorrect
• All that are compile-time correct
• All that terminate
  (But this language cannot be computed. Termination is undecidable: it is not possible to construct an algorithm that decides, for any string, whether it is a terminating program or not.)
• ...

Page 10: Defining languages using rules

Increasingly powerful:
• Regular expressions (for tokens)
• Context-free grammars (for syntax trees)
• Attribute grammars (context-free grammar + extra rules for further restricting the language)

Page 11: Regular expressions (core notation)

RE       read                  is called
a        a                     symbol
M | N    M or N                alternative
M N      M followed by N       concatenation
ε        the empty string      epsilon
M*       zero or more M        repetition (Kleene star)
(M)

where a is a symbol in the alphabet (e.g., {0, 1} or UNICODE) and M and N are regular expressions.

Each regular expression defines a language over the alphabet (a set of strings that belong to the language).

Priorities: M | N P* means M | (N (P*))

Page 12: Example

a | b c*

means

{a, b, bc, bcc, bccc, ...}
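The priorities can be checked experimentally. java.util.regex uses the same precedence for these three operators (repetition binds tightest, then concatenation, then alternative), so the sketch below compares the expression as written with its fully parenthesized form and with a different grouping. The class and field names are made up for illustration.

```java
import java.util.regex.Pattern;

class PriorityExample {
    // The example as written, relying on operator priorities.
    static final Pattern AS_WRITTEN = Pattern.compile("a|bc*");
    // The same expression with the priorities made explicit: a | (b (c*)).
    static final Pattern EXPLICIT = Pattern.compile("a|(b(c*))");
    // A different grouping, which defines a different language.
    static final Pattern OTHER_GROUPING = Pattern.compile("(a|b)c*");

    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a", "b", "bcc", "ac", "abc"}) {
            System.out.printf("%-4s asWritten=%b explicit=%b otherGrouping=%b%n", s,
                    matches(AS_WRITTEN, s), matches(EXPLICIT, s), matches(OTHER_GROUPING, s));
        }
    }
}
```

The string ac is the interesting case: it is not in the language of a | b c*, but it would be if the alternative were grouped first.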

Page 13: Regular expressions (extended notation)

Core RE  read                  is called
a        a                     symbol
M | N    M or N                alternative
M N      M followed by N       concatenation
ε        the empty string      epsilon
M*       zero or more M        repetition (Kleene star)
(M)

Extended RE   read                                 means
M+            at least one ...                     M M*
M?            optional ...                         ε | M
[aou]         one of ... (a character class)       a | o | u
[a-zA-Z]                                           a | b | ... | z | A | B | ... | Z
[^0-9]        not ...                              one character, but not any one of those listed
(Appel notation: ~[0-9])
"a+b"         the string ...                       a\+b

Page 14: Exercise

Write a regular expression that defines the language of all decimal numbers, like

3.14   0.75   4711   0   ...

But not numbers lacking an integer part. And not numbers with a decimal point but lacking a fractional part. So not numbers like

17.   .236   .

Leading and trailing zeros are allowed. So the following are ok:

007   008.00   0.0   1.700

a) Use the extended notation.
b) Then translate the expression to the core notation.
c) Then write an expression that disallows unnecessary leading zeros (in the extended notation).

Page 15: Solution

a) [0-9]+ ("." [0-9]+)?

b) (0 | ... | 9) (0 | ... | 9)* (ε | ("." ((0 | ... | 9) (0 | ... | 9)*)))

c) (0 | [1-9] [0-9]*) ("." [0-9]+)?
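Solutions a) and c) can be tried out on the strings from the exercise. The sketch below transcribes them to java.util.regex syntax, where the literal point is written \. instead of "."; the class and method names are made up for illustration.

```java
class DecimalNumbers {
    // Solution a): leading zeros allowed.
    static final String EXTENDED = "[0-9]+(\\.[0-9]+)?";
    // Solution c): no unnecessary leading zeros in the integer part.
    static final String NO_LEADING_ZEROS = "(0|[1-9][0-9]*)(\\.[0-9]+)?";

    static boolean isDecimal(String s) {
        return s.matches(EXTENDED); // String.matches tests the whole string
    }

    static boolean isStrictDecimal(String s) {
        return s.matches(NO_LEADING_ZEROS);
    }

    public static void main(String[] args) {
        String[] samples = {"3.14", "0.75", "4711", "0", "17.", ".236", ".", "007", "008.00", "0.0", "1.700"};
        for (String s : samples) {
            System.out.printf("%-7s a)=%b c)=%b%n", s, isDecimal(s), isStrictDecimal(s));
        }
    }
}
```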

Page 16: Escaped characters

Use backslash to escape metacharacters and non-printing control characters.

Metacharacters
\+
\*
\(
\)
\|
\\
...

Non-printing control characters
\n   newline
\r   return
\t   tab
\f   formfeed
...
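The extended notation and the escapes can be explored with java.util.regex, which supports the same operators for these cases. The sketch below compares pairs of expressions on a handful of sample strings; agreeing on samples is of course not a proof of equivalence. The class ExtendedNotation and the method same are made up for illustration, and Pattern.quote is used here as the Java counterpart of the quoted-string notation "a+b".

```java
import java.util.regex.Pattern;

class ExtendedNotation {
    /** Returns true if the two expressions agree (match or not) on every sample string. */
    static boolean same(String re1, String re2, String... samples) {
        for (String s : samples) {
            if (Pattern.matches(re1, s) != Pattern.matches(re2, s)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(same("a+", "aa*", "", "a", "aaa", "b"));        // M+ versus M M*
        System.out.println(same("a?", "(|a)", "", "a", "aa"));             // M? versus epsilon | M
        System.out.println(same("[aou]", "a|o|u", "a", "o", "u", "e"));    // character class
        // The quoted string "a+b" versus the escaped form a\+b.
        System.out.println(same(Pattern.quote("a+b"), "a\\+b", "a+b", "aab", "ab"));
        // Without quoting or escaping, + is the repetition operator.
        System.out.println(Pattern.matches("a+b", "aab"));
    }
}
```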

Page 17: Some typical tokens

Kind            Name     Example lexemes      Regular expression
Reserved words  IF       if                   "if"
(keywords)      THEN     then                 "then"
                FOR      for                  "for"
Identifiers     ID       B  alpha  k10        [A-Za-z]([A-Za-z0-9])*
Literals        INT      123  0  99           [0-9]+
                FLOAT    3.1416  0.2          [0-9]+ "." [0-9]+
                CHAR     'A'  'c'             \' [^\'] \'
                STRING   "Hello"  ""  "j"     \" [^\"]* \"
Operators       PLUS     +                    "+"
                INCR     ++                   "++"
                NE       !=                   "!="
Separators      SEMI     ;                    ";"
                COMMA    ,                    ","
                LPAREN   (                    "("

Page 18: Some typical non-tokens

Non-token          Example lexemes              Regular expression (JFlex syntax)
WHITESPACE         blank, tab, newline, return  " " | \t | \n | \r
ENDOFLINECOMMENT   // comment                   "//" [^\n\r]* ([\n\r])?

Non-tokens are also recognized by the scanner, just like tokens. But they are not sent on to the parser.

(The newline/return ending an end-of-line comment is optional in order to allow a file to end with an end-of-line comment, without an extra newline/return.)

Page 19: JFlex: A scanner generator

Generating a scanner for a language lang

[Figure: lang.jflex (scanner specification with regular expressions) is given to jflex.jar (the scanner generator), which produces LangScanner.java. At run time, the characters of Program.lang are read by LangScanner.java, which delivers tokens to LangParser.java.]

Page 20: A JFlex specification

    package lang;        // the generated scanner will belong to the package lang
    import lang.Token;   // Our own class for tokens
    ...

    // ignore whitespace
    " " | \t | \n | \r | \f   { /* ignore */ }

    // tokens
    "if"        { return new Token("IF"); }
    "="         { return new Token("ASSIGN"); }
    "<"         { return new Token("LT"); }
    "<="        { return new Token("LE"); }
    [a-zA-Z]+   { return new Token("ID", yytext()); }
    ...

Rules and lexical actions
Each rule has the form:

    regular-expression { lexical action }

The lexical action consists of arbitrary Java code. It is run when a regular expression is matched. The method yytext() returns the lexeme (the token value).

What rules are used when scanning "a<b"?

Page 21: Ambiguities?

    package lang;        // the generated scanner will belong to the package lang
    import lang.Token;   // Class for tokens
    ...

    // ignore whitespace
    " " | \t | \n | \r | \f   { /* ignore */ }

    // tokens
    "if"        { return new Token("IF"); }
    "="         { return new Token("ASSIGN"); }
    "<"         { return new Token("LT"); }
    "<="        { return new Token("LE"); }
    [a-zA-Z]+   { return new Token("ID", yytext()); }
    ...

Are the token definitions ambiguous?
Which rules match "<="?
Which rules match "if"?
Which rules match "ifff"?
Which rules match "xyz"?

Page 22: Extra rules for resolving ambiguities

Longest match
If one rule can be used to match a token, but there is another rule that will match a longer token, the latter rule will be chosen. This way, the scanner will match the longest token possible.

Rule priority
If two rules can be used to match the same sequence of characters, the first one takes priority.
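The two disambiguation rules can be sketched directly in Java, without a generator. The code below is not how JFlex works internally (JFlex builds a DFA); it simply tries every rule at the current position with java.util.regex, keeps the longest match, and lets the earlier rule win on a tie. The class RuleScanner, the record Rule (Java 16 or later) and the NAME(lexeme) token format are made up for illustration; the rules mirror the JFlex specification shown earlier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleScanner {
    /** A rule pairs a token name with a regular expression. */
    record Rule(String name, Pattern pattern) {}

    // Rules in priority order: earlier rules win when two matches have the same length.
    static final List<Rule> RULES = List.of(
            new Rule("WHITESPACE", Pattern.compile("[ \\t\\n\\r\\f]")),
            new Rule("IF", Pattern.compile("if")),
            new Rule("ASSIGN", Pattern.compile("=")),
            new Rule("LT", Pattern.compile("<")),
            new Rule("LE", Pattern.compile("<=")),
            new Rule("ID", Pattern.compile("[a-zA-Z]+")));

    /** Scans the whole input and returns the tokens written as NAME(lexeme). */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            Rule best = null;
            int bestEnd = pos;
            for (Rule rule : RULES) {
                Matcher m = rule.pattern().matcher(input).region(pos, input.length());
                // A strictly longer match replaces the current best (longest match);
                // an equally long match does not, so the earlier rule is kept (rule priority).
                if (m.lookingAt() && m.end() > bestEnd) {
                    best = rule;
                    bestEnd = m.end();
                }
            }
            if (best == null) {
                tokens.add("ERROR(" + input.charAt(pos) + ")");
                pos++;
                continue;
            }
            if (!best.name().equals("WHITESPACE")) { // non-tokens are recognized but not returned
                tokens.add(best.name() + "(" + input.substring(pos, bestEnd) + ")");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"a<b", "<=", "if", "ifff", "xyz"}) {
            System.out.println(s + " -> " + scan(s));
        }
    }
}
```

On "if", both IF and ID match two characters, so rule priority selects IF. On "ifff", ID matches four characters and IF only two, so longest match selects ID.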

Page 23: Implementation of scanners

Observation:
Regular expressions are equivalent to finite automata (finite-state machines). (They can recognize the same class of formal languages: the regular languages.)

Overall approach:
• Translate each token regular expression to a finite automaton. Label the final state with the token.
• Merge all the automata.
• The resulting automaton will in general be nondeterministic.
• Translate the nondeterministic automaton to a deterministic automaton.
• Implement the deterministic automaton, either using switch statements or a table.

A scanner generator automates this process.

Page 24: Construct an automaton for each token regexp

[Figure: legend showing a state, a transition labeled a, the start state, and a final state, followed by one automaton per regular expression:
"if": transitions i and f, ending in a final state labeled IF.
[0-9]+: a transition 0-9 to a final state labeled INT, which has a 0-9 transition to itself.
" " | \n | \t: one transition for each of " ", \n and \t, each ending in a final state labeled WHITESPACE.
[a-zA-Z]+: a transition a-zA-Z to a final state labeled ID, which has an a-zA-Z transition to itself.]

Page 25: Merge the start states of the automata

[Figure: the automata from the previous page with their start states merged into one. From the common start state: i followed by f leads to IF; 0-9 leads to INT (with a 0-9 loop); " ", \n and \t lead to WHITESPACE; a-zA-Z leads to ID (with an a-zA-Z loop).]

Is the new automaton deterministic?

Page 26: Deterministic finite automata

In a deterministic finite automaton each transition is uniquely determined by the input.

[Figure: state 1 with two transitions labeled a, one to state 2 and one to state 3.]
Nondeterministic, since if we read a when in state 1, we don't know if we should go to state 2 or 3.

[Figure: state 1 with an epsilon transition to state 2.]
Nondeterministic, since when we are in state 1, we don't know if we should stay there, or go to state 2 without reading any input. (Epsilon denotes the empty string.)

[Figure: state 1 with a transition labeled a to state 2 and a transition labeled b to state 3.]
Deterministic, since from state 1, the next input determines if we go to state 2 or 3.

Page 27: DFA versus NFA

Deterministic Finite Automaton (DFA)
A finite automaton is deterministic if
– all outgoing edges from any given state have disjoint character sets
– there are no epsilon edges
Can be implemented efficiently.

Non-deterministic Finite Automaton (NFA)
An NFA may have
– two outgoing edges with overlapping character sets
– epsilon edges

Every DFA is also an NFA. Every NFA can be translated to an equivalent DFA.

Page 28: Translating an NFA to a DFA

Simulate the NFA
– keep track of a set of current NFA states
– follow ε edges to extend the current set (take the closure)

Construct the corresponding DFA
– Each such set of NFA states corresponds to one DFA state.
– If any of the NFA states is final, the DFA state is also final, and is marked with the corresponding token.
– If there is more than one token to choose from, select the token that is defined first (rule priority).

(Minimize the DFA for efficiency)

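Page 29: Example

[Figure, NFA: start state 1. On i, state 1 goes to state 2; on f, state 2 goes to state 3, which is final and labeled IF. On a-z, state 1 also goes to state 4, which is final and labeled ID and has an a-z transition to itself.]

[Figure, DFA: start state 1. On i, state 1 goes to state {2,4}, which is final and labeled ID; on a-h and j-z, state 1 goes to state 4 (ID). On f, state {2,4} goes to state {3,4}, which is final and labeled IF; on a-e and g-z, state {2,4} goes to state 4. On a-z, both state {3,4} and state 4 go to state 4.]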
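The translation can be sketched in Java for an NFA that recognizes the keyword if and identifiers over a-z: 1 goes to 2 on i, 2 goes to 3 (IF) on f, 1 goes to 4 (ID) on a-z, and 4 goes to 4 on a-z. This NFA has no ε edges, so the closure step is trivial and is left out here; a full implementation would need it. The class SubsetConstruction and its method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SubsetConstruction {
    // NFA edges: state -> (character -> set of target states).
    static final Map<Integer, Map<Character, Set<Integer>>> NFA = new HashMap<>();

    static void edge(int from, char lo, char hi, int to) {
        for (char c = lo; c <= hi; c++) {
            NFA.computeIfAbsent(from, k -> new HashMap<>())
               .computeIfAbsent(c, k -> new TreeSet<>())
               .add(to);
        }
    }

    static {
        edge(1, 'i', 'i', 2);
        edge(2, 'f', 'f', 3); // state 3 is final: IF
        edge(1, 'a', 'z', 4); // state 4 is final: ID
        edge(4, 'a', 'z', 4);
    }

    /** All NFA states reachable from any state in the set by reading c. */
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int s : states) {
            result.addAll(NFA.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
        }
        return result;
    }

    /** Builds the DFA: each key is a set of NFA states, mapped to its outgoing transitions. */
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(Set<Integer> start) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.add(start);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            if (dfa.containsKey(current)) continue; // already processed
            Map<Character, Set<Integer>> out = new TreeMap<>();
            dfa.put(current, out);
            for (char c = 'a'; c <= 'z'; c++) {
                Set<Integer> target = move(current, c);
                if (!target.isEmpty()) {
                    out.put(c, target);
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    /** Token of a DFA state; IF is defined before ID, so it wins by rule priority. */
    static String kind(Set<Integer> dfaState) {
        if (dfaState.contains(3)) return "IF";
        if (dfaState.contains(4)) return "ID";
        return null; // not a final state
    }

    public static void main(String[] args) {
        build(Set.of(1)).forEach((state, out) ->
                System.out.println(state + " kind=" + kind(state) + " targets=" + new java.util.HashSet<>(out.values())));
    }
}
```

Running the construction from the start set {1} yields four DFA states: {1}, {2,4}, {3,4} and {4}.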

Page 30: Error handling

[Figure: the DFA from the previous page, with the states renumbered 1, 2 (ID), 3 (IF) and 4 (ID), extended with a state 0 labeled ERROR. Every state has a transition labeled ^a-z to state 0.]

• Add a "dead state" (state 0), corresponding to erroneous input.
• Add transitions to the "dead state" for all erroneous input.
• Generate an "ERROR token" when the dead state is reached.

Page 31: Implementation alternatives for DFAs

Table-driven
– Represent the automaton by a table.
– Additional table to keep track of final states and token kinds.
– A global variable keeps track of the current state.

Switch statements
– Each state is implemented as a switch statement.
– Each case implements a state transition as a jump (to another switch statement).
– The current state is represented by the program counter.

Page 32: Table-driven implementation

state  ...  +    ...  a    ...  e    f    g    ...  h    i    j    ...  z    ...  final  kind
0      0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    true   ERROR
1      0    5    0    4    4    4    4    4    4    4    2    4    4    4    0    false
2      0    0    0    4    4    4    3    4    4    4    4    4    4    4    0    true   ID
3      0    0    0    4    4    4    4    4    4    4    4    4    4    4    0    true   IF
4      0    0    0    4    4    4    4    4    4    4    4    4    4    4    0    true   ID
5      0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    true   PLUS

[Figure: the DFA that the table represents. State 1 is the start state; i leads to state 2 (ID), a-h and j-z lead to state 4 (ID), and + leads to state 5 (PLUS). From state 2, f leads to state 3 (IF) and a-e and g-z lead to state 4. From states 3 and 4, a-z leads to state 4.]
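The table can be written down in Java as a two-dimensional array indexed by state and character. The following is a minimal sketch under some assumptions: only ASCII characters get a column, everything not set explicitly defaults to the dead state 0, and a null entry in KIND stands for a non-final state. The class TableDrivenDfa and the method classify are made up for illustration; classify runs the automaton over a whole string rather than splitting input into tokens.

```java
class TableDrivenDfa {
    // EDGES[state][character] = next state. Java initializes int arrays to 0, the dead state.
    static final int[][] EDGES = new int[6][128];
    // Token kind per state; null marks the non-final start state 1.
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID", "PLUS"};

    static {
        for (char c = 'a'; c <= 'z'; c++) {
            for (int state = 1; state <= 4; state++) {
                EDGES[state][c] = 4; // any letter continues (or starts) an identifier
            }
        }
        EDGES[1]['i'] = 2; // first letter of "if"
        EDGES[2]['f'] = 3; // second letter of "if"
        EDGES[1]['+'] = 5;
    }

    /** Runs the DFA over the whole string and returns the kind of the state it ends in. */
    static String classify(String s) {
        int state = 1; // start state
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            state = ch < 128 ? EDGES[state][ch] : 0;
        }
        return KIND[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"if", "i", "ifx", "abc", "+", "++", "a1"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```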

Page 33: Scanner implementation, design

[Figure: the Parser calls the Scanner, and the Scanner calls the File.
Scanner: Token nextToken()
File: char nextChar()
Token: int kind(), String value()]

Page 34: Scanner implementation, sketch

Idea: Scan the next token by
• starting in the start state
• scanning characters until we reach a final state
• returning a new token

    Token nextToken() {
      int state = 1; // start state
      while (!isFinal[state]) {
        char ch = file.readChar();
        state = edges[state][ch];
      }
      return new Token(kind[state]);
    }

Needs to be extended with handling of:
• longest match
• end of file
• non-tokens (like whitespace)
• token values (like the identifier name)

Page 35: Extend to longest match, design

[Figure: the Parser calls the Scanner, the Scanner calls the PushbackFile, and the PushbackFile reads from the File.
Scanner: Token nextToken()
PushbackFile: char readChar(), void pushback(String)
File: char readChar()
Token: int kind(), String value()]

Idea:
• When a token is matched, don't stop scanning.
• When the error state is reached, return the last token matched.
• Push read characters that are unused back into the file, so they can be scanned again.
• Use a PushbackFile to accomplish this.

Page 36: Extend to handle longest match, sketch

    Token nextToken() {
      int state = 1;
      String str = "";
      int lastFinalState = 0;
      String lastTokenValue = "";
      while (state != 0) {
        char ch = pushbackfile.readChar();
        str = str + ch;
        state = edges[state][ch];
        // The error state 0 is marked final in the table, but it is not a matched token.
        if (state != 0 && isFinal[state]) {
          lastFinalState = state;
          lastTokenValue = str;
        }
      }
      pushbackfile.pushback(str.substring(lastTokenValue.length()));
      return new Token(kind[lastFinalState], lastTokenValue);
    }

    // In Java, StringBuilder would be more efficient

• When a token is matched (a final state reached), don't stop scanning.
• Keep track of the currently scanned string, str.
• Keep track of the latest matched token (lastFinalState, lastTokenValue).
• Continue scanning until we reach the error state.
• Restore the input stream using PushbackFile.
• Return the latest matched token.
• (or return the ERROR token if there was no latest matched token)
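The longest-match idea can be turned into a small runnable scanner. The sketch below makes several simplifying assumptions: the input is a String, so resetting an index plays the role of the PushbackFile; the DFA is the one for if, identifiers over a-z and +, written as a transition function instead of a table; an unmatched character is reported as a one-character ERROR token; and end of input gives an EOF token. The class LongestMatchScanner, its nested record Token (Java 16 or later) and the method scanAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class LongestMatchScanner {
    record Token(String kind, String value) {}

    // Token kind per DFA state; null marks a non-final state. State 0 is the error state.
    static final String[] KIND = {"ERROR", null, "ID", "IF", "ID", "PLUS"};

    private final String input;
    private int pos = 0; // index of the next unread character

    LongestMatchScanner(String input) {
        this.input = input;
    }

    /** Transition function of the example DFA (lowercase letters and '+'). */
    static int edge(int state, char ch) {
        boolean letter = ch >= 'a' && ch <= 'z';
        switch (state) {
            case 1:
                if (ch == '+') return 5;
                if (ch == 'i') return 2;
                return letter ? 4 : 0;
            case 2:
                if (ch == 'f') return 3;
                return letter ? 4 : 0;
            case 3:
            case 4:
                return letter ? 4 : 0;
            default:
                return 0;
        }
    }

    Token nextToken() {
        if (pos >= input.length()) return new Token("EOF", "");
        int state = 1;
        int lastFinalState = 0;
        int lastEnd = pos; // end of the latest matched token
        int i = pos;
        while (state != 0 && i < input.length()) {
            state = edge(state, input.charAt(i));
            i++;
            if (state != 0 && KIND[state] != null) { // a token is matched, but keep scanning
                lastFinalState = state;
                lastEnd = i;
            }
        }
        if (lastFinalState == 0) {
            lastEnd = pos + 1; // no match: report one character as an ERROR token
        }
        Token token = new Token(KIND[lastFinalState], input.substring(pos, lastEnd));
        pos = lastEnd; // characters read past lastEnd are "pushed back"
        return token;
    }

    /** Scans the whole text, including the final EOF token. */
    static List<Token> scanAll(String text) {
        LongestMatchScanner scanner = new LongestMatchScanner(text);
        List<Token> tokens = new ArrayList<>();
        Token t;
        do {
            t = scanner.nextToken();
            tokens.add(t);
        } while (!t.kind().equals("EOF"));
        return tokens;
    }

    public static void main(String[] args) {
        scanAll("if+iff").forEach(System.out::println);
    }
}
```

Non-tokens such as whitespace are not handled in this sketch; a blank would come out as an ERROR token.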

Page 37: Handling End-of-file (EOF) and non-tokens

EOF
– construct an explicit EOF token when the EOF character is read

Non-tokens (Whitespace & Comments)
– view as tokens of a special kind
– scan them as normal tokens, but don't create token objects for them
– loop in next() until a real token has been found

Errors
– construct an explicit ERROR token to be returned when no valid token can be found.

Page 38: Specifying EOF and ERROR in JFlex

    package lang;        // the generated scanner will belong to the package lang
    import lang.Token;   // Class for tokens
    ...

    // ignore whitespace
    " " | \t | \n | \r | \f   { /* ignore */ }

    // tokens
    "if"        { return new Token("IF"); }
    "="         { return new Token("ASSIGN"); }
    "<"         { return new Token("LT"); }
    "<="        { return new Token("LE"); }
    [a-zA-Z]+   { return new Token("ID", yytext()); }
    ...
    <<EOF>>     { return new Token("EOF"); }
    [^]         { return new Token("ERROR"); }

<<EOF>> is a special regular expression in JFlex, matching end of file.

[^] means any character. Due to rule priority, this will match any character not matched by previous rules.

Page 39: Example scanner generators

tool                author                generates
lex                 Schmidt, Lesk, 1975   C code
flex ("fast lex")   Paxson, 1987          C code
jlex                                      Java code
jflex                                     Java code
...

Page 40: Limitations of regular expressions for scanning

• Nested comments?
• Layout-sensitive syntax?
• Context-sensitive token definitions? For example, multi-language documents.

• Two mechanisms in scanner generators for workarounds:
  – Lexical actions: do more than create a token, e.g., count nesting levels of comments.
  – Lexical states: switch between different sets of token definitions.
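Counting nesting levels is the part that a regular expression cannot do. As a rough illustration of what a lexical action would have to keep track of, the hand-written sketch below skips a nested comment using a depth counter. It is plain Java, not JFlex code, and it assumes /* and */ as the comment delimiters; the class NestedComments and the method skipNestedComment are made up for illustration.

```java
class NestedComments {
    /**
     * Skips a possibly nested comment that starts at index start.
     * Returns the index just after the matching closing delimiter,
     * or -1 if the comment is not terminated.
     */
    static int skipNestedComment(String text, int start) {
        if (!text.startsWith("/*", start)) {
            throw new IllegalArgumentException("no comment at index " + start);
        }
        int depth = 0; // current nesting level
        int i = start;
        while (i < text.length()) {
            if (text.startsWith("/*", i)) {
                depth++;
                i += 2;
            } else if (text.startsWith("*/", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    return i; // the outermost comment is closed
                }
            } else {
                i++;
            }
        }
        return -1; // reached end of input inside a comment
    }

    public static void main(String[] args) {
        String text = "/* a /* b */ c */ x";
        int end = skipNestedComment(text, 0);
        System.out.println("comment: " + text.substring(0, end));
    }
}
```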

Page 41: Lexical states

• Some tokens are difficult or impossible to define with regular expressions.
• Lexical states (sets of token rules) give the possibility to switch token sets (DFAs) during scanning.
• Useful for multi-line comments, HTML, scanning multi-language documents, etc.
• Supported by many scanner generators (including JFlex).

[Figure: two lexical states, each with its own token set. LEXSTATE1 contains T1, T2, T3, T4, ... and LEXSTATE2 contains T5, T6, T7, ...]

Page 42: Example: multi-line comments

Would like to scan the complete comment as one token:

    /*
    int m() {
      return 15 / 3 * 4 * 2;
    }
    */

Writing an ordinary regular expression for this is difficult:

    "/*"((\*+[^/*])|([^*]))*\**"*/"

Can be solved easily with lexical states:

[Figure: two token sets. The default token set contains ID, "if", "then", "/*", ... and the token set used inside a comment contains "*/" and [^].]

However, some scanner generators, like JFlex, have the special operator upto (~) that can be used instead:

    "/*" ~"*/" { /* Comment */ }
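To see that the ordinary regular expression for multi-line comments behaves as intended, it can be transcribed to java.util.regex syntax and tried on a few strings. In the transcription below the quoted strings "/*" and "*/" become /\* and \*/, and the rest is copied operator by operator. The class CommentRegex and the method isComment are made up for illustration.

```java
import java.util.regex.Pattern;

class CommentRegex {
    // Transcription of  "/*"((\*+[^/*])|([^*]))*\**"*/"  to java.util.regex syntax.
    static final Pattern COMMENT = Pattern.compile("/\\*((\\*+[^/*])|([^*]))*\\**\\*/");

    /** True if the whole string is exactly one multi-line comment. */
    static boolean isComment(String s) {
        return COMMENT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String comment = "/*\nint m() {\n  return 15 / 3 * 4 * 2;\n}\n*/";
        System.out.println(isComment(comment));          // a complete comment
        System.out.println(isComment("/* a */ b */"));   // continues after the first */
        System.out.println(isComment("/* unterminated"));
    }
}
```

The body of the expression never consumes a * that is directly followed by /, which is why the match cannot run past the first closing delimiter.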

Page 43: Course overview

[Figure: the same overview of compiler phases as on page 2, with regular expressions and the scanner marked "This lecture", the following part marked "Next lecture", and two parts labeled A1.]

Page 44: Summary questions

• What is a formal language?
• What is a regular expression?
• What is meant by an ambiguous lexical definition?
• Give some typical examples of ambiguities and how they may be resolved.
• What is a lexical action?
• Give an example of how to construct an NFA for a given lexical definition.
• Give an example of how to construct a DFA for a given NFA.
• What is the difference between a DFA and an NFA?
• Give an example of how to implement a DFA in Java.
• How is rule priority handled in the implementation? Longest match? EOF? Whitespace? Errors?
• What are lexical states? When are they useful?

You can start on Assignment 1 now. But you will have to wait until the next lecture for the parts about parsing.