Lexical Analysis, I

COMP 412, Fall 2017

Copyright 2017, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

[Figure: source code → Front End → IR → Optimizer → IR → Back End → target code]

Chapter 2 in EaC2e


Adjusted Calendar

Lab 1, Adjusted Schedule
Code Check 1         Monday, September 11, 2017
Code Check 2         Monday, September 18, 2017
Due Date for Code    Monday, September 25, 2017
Last Day for Code    Monday, October 2, 2017

Midterm Exam         Wednesday, October 18 @ 7 PM (unchanged)

Lab 3, Adjusted Schedule
Lab 3 Available      Friday, October 20, 2017
Due Date for Code    Wednesday, November 15, 2017
Last Day for Code    Wednesday, November 22, 2017

The Front End

Scanner looks at every character
•  Converts stream of chars to stream of classified words:
–  <category, lexeme>
–  Sometimes call this pair a "token"
•  Efficiency & scalability matter

Parser looks at every token
•  Determines if the stream of tokens forms a sentence in the source language
•  Fits tokens to some syntactic model, or grammar, for the source language

[Figure: inside the Front End, a stream of characters enters the Scanner (microsyntax), which passes a stream of tokens to the Parser (syntax), followed by Semantic Elaboration, producing IR annotations]
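As a rough illustration of "stream of chars to stream of classified words", the sketch below turns a string into <category, lexeme> pairs. The categories and patterns in `TOKEN_SPEC` are hypothetical, chosen only for this example, and the code relies on Python's `re` module (a backtracking engine) rather than the DFA-based scanners developed in this lecture.

```python
import re

# Hypothetical microsyntax for a tiny language. These categories and patterns
# are illustrative assumptions, not the token set used in the course.
TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("IDENT", r"[A-Za-z][A-Za-z0-9]*"),
    ("OP", r"[-+*/=]"),
    ("SKIP", r"[ \t]+"),
]
# One alternation with a named group per category.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(text):
    """Return the list of (category, lexeme) pairs found in text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":          # whitespace is recognized but dropped
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(scan("x = y + 42"))
```

The parser would then consume these pairs one at a time, looking only at the category unless it needs the spelling.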

The Front End

Why separate scanning & parsing?
•  Primary rationale is efficiency
•  Scanner identifies & classifies words by their spelling
–  Abstracts spelling into category
•  Parser constructs derivations
•  Parsing is harder than scanning

Modern view (less widely held)
•  Scanner-less parsers are gaining popularity, because they eliminate one more set of tools
–  Maybe we can afford the overhead
–  A little more involved (SGLR parsers)

Implementation Strategies

How do we automate the construction of scanners & parsers?

Scanner
•  Specify syntax with regular expressions (REs)
•  Construct finite-automaton & scanner from the RE

Parser
•  Specify syntax with context-free grammars (CFGs)
•  Construct push-down automaton & parser from the CFG

How Does Class Relate to Regex Libraries?

•  You will learn how to "compile" REs to a DFA & implement a DFA
–  Execution cost is guaranteed O(1) per input character, independent of the expression
•  You will have deeper understanding of their power & their use

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. …

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay careful attention to how the engine will execute a given RE, and write the RE in a certain way in order to produce bytecode that runs faster. Optimization isn't covered in this document, because it requires that you have a good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not all possible string processing tasks can be done using regular expressions. There are also tasks that can be done with regular expressions, but the expressions turn out to be very complicated. In these cases, you may be better off writing Python code to do the processing; while Python code will be slower than an elaborate regular expression, it will also probably be more understandable.

From Python 2.7.10 documentation, emphasis added

Big Picture

In Lecture 2, we saw some ambiguity in defining "positive integer"
•  Is 001 a positive integer? What about 00?
•  The automata are precise specifications, but the words are not

We need a better notation for specifying microsyntax than these transition diagrams.

[Figure: Tasteful Positive Integer (forbids 001). DFA with states s0, s2, s3 and edges labeled 0, 1…9, and 0…9. Transitions to the error state se, on any character, are implicit from every state.]

[Figure: Tasteless Positive Integer (allows 001). DFA with states s0, s2 and edges labeled 0…9. Transitions to the error state se, on any character, are implicit from every state.]
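To make the difference between the two automata concrete, here is a minimal sketch of both as Python dictionaries. The exact edges are an assumption: the diagrams did not survive in this transcript, so the transitions below are inferred from the state names and from the RE 0|[1…9][0…9]* that these slides give for the tasteful integer. The helper `make_recognizer` is a hypothetical name.

```python
ERROR = "se"
DIGITS = "0123456789"


def make_recognizer(delta, start, finals):
    """Build an accept/reject function from a partial transition map."""
    def accepts(word):
        state = start
        for ch in word:
            # Any missing entry is the implicit transition to the error state.
            state = delta.get((state, ch), ERROR)
        return state in finals
    return accepts


# Tasteless (assumed edges): s0 -[0-9]-> s2, s2 -[0-9]-> s2
tasteless_delta = {("s0", d): "s2" for d in DIGITS}
tasteless_delta.update({("s2", d): "s2" for d in DIGITS})
tasteless = make_recognizer(tasteless_delta, "s0", {"s2"})

# Tasteful (assumed edges): s0 -0-> s2, s0 -[1-9]-> s3, s3 -[0-9]-> s3
tasteful_delta = {("s0", "0"): "s2"}
tasteful_delta.update({("s0", d): "s3" for d in "123456789"})
tasteful_delta.update({("s3", d): "s3" for d in DIGITS})
tasteful = make_recognizer(tasteful_delta, "s0", {"s2", "s3"})

for word in ["0", "00", "001", "100"]:
    print(word, tasteless(word), tasteful(word))
```

Both recognizers agree on words such as 0 and 100; they differ only on words with redundant leading zeros, which is exactly the ambiguity in the phrase "positive integer".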

Regular Expressions

We need a better notation for specifying microsyntax
("better" ⇒ both formal and constructive)

Regular Expressions over an Alphabet Σ
•  If x ∈ Σ, then x is an RE denoting the set {x} or the language L = {x}
•  If x and y are REs then
–  xy is an RE denoting L(x)L(y) = { pq | p ∈ L(x) and q ∈ L(y) }
–  x | y is an RE denoting L(x) ∪ L(y)
–  x* is an RE denoting $L(x)^* = \bigcup_{0 \le k < \infty} L(x)^k$ (Kleene Closure)
➝  Set of all strings that are zero or more concatenations of x
–  x+ is an RE denoting $L(x)^+ = \bigcup_{1 \le k < \infty} L(x)^k$ (Positive Closure)
➝  Set of all strings that are one or more concatenations of x (or xx*)
•  ε is an RE denoting the set {ε}, which contains only the empty string

Many RE-based systems support additional notation and operators. Those added features build on alternation, concatenation, and closure, plus, perhaps, logical complement or negation. Complement is easy and efficient, if we think of the underlying DFA. (We will revisit this issue.)

Regular Expressions

How do these operators help?

Regular Expressions over an Alphabet Σ
•  If x is in Σ, then x is an RE denoting the set {x} or the language L = {x}
➝  The spelling of any letter in the alphabet is an RE
•  If x and y are REs then
–  xy is an RE denoting L(x)L(y) = { pq | p ∈ L(x) and q ∈ L(y) }
➝  If we concatenate letters, the result is an RE, so we can spell words
–  x | y is an RE denoting L(x) ∪ L(y)
➝  Any finite list of words can be written as an RE, (w0 | w1 | w2 | … | wn)
–  x* is an RE denoting $L(x)^* = \bigcup_{0 \le k < \infty} L(x)^k$
–  x+ is an RE denoting $L(x)^+ = \bigcup_{1 \le k < \infty} L(x)^k$
➝  We can use closure to write finite descriptions of infinite, but countable, sets
•  ε is an RE denoting the set {ε}, which contains only the empty string
➝  ε is sometimes useful for writing more concise REs

The operators are concatenation, alternation, and closure.
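The three operators can be read directly as operations on sets of strings. The sketch below does that for finite languages; because a true closure is infinite, `closure` only builds the union up to a bound `max_k`, which is an assumption made so the result fits in memory. The function names are hypothetical.

```python
def concat(lx, ly):
    """L(x)L(y): every string of lx followed by every string of ly."""
    return {p + q for p in lx for q in ly}


def alternate(lx, ly):
    """L(x) | L(y): set union."""
    return lx | ly


def closure(lx, max_k, min_k=0):
    """Union of L(x)^k for min_k <= k <= max_k, a finite slice of the closure.

    min_k=0 approximates the Kleene closure, min_k=1 the positive closure.
    """
    result = set()
    power = {""}                  # L(x)^0 contains only the empty string
    for k in range(max_k + 1):
        if k >= min_k:
            result |= power
        power = concat(power, lx)  # L(x)^(k+1)
    return result


letters = alternate({"a"}, {"b"})
print(sorted(concat(letters, {"c"})))
print(sorted(closure({"a"}, 3)))
print(sorted(closure({"a"}, 3, min_k=1)))
```

With the bound in place, the identity x+ = xx* can be checked on small cases: `closure(lx, 3, min_k=1)` equals `concat(lx, closure(lx, 2))`.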

Regular Expressions

Let the notation [a…z] be shorthand for the RE (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z)

Examples
Tasteless positive integer      [0…9][0…9]*  or  [0…9]+
Tasteful positive integer       0 | [1…9][0…9]*
Identifier (Algol-like lang)    ([a…z]|[A…Z]) ([a…z]|[A…Z]|[0…9])*
Decimal number                  0 | [1…9][0…9]* . [0…9]*
Real number                     ((0 | [1…9][0…9]*) | (0 | [1…9][0…9]* . [0…9]*)) E [0…9][0…9]*

Each of these REs corresponds to a DFA.
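For readers who want to try the examples, the sketch below restates four of them in Python `re` syntax, where [0…9] becomes `[0-9]`. Two assumptions are made for the decimal number: the integer part is grouped explicitly, and the period is escaped so that it matches only a literal '.'. The dictionary `PATTERNS` and the helper `matches` are hypothetical names.

```python
import re

PATTERNS = {
    "tasteless_int": r"[0-9]+",
    "tasteful_int": r"0|[1-9][0-9]*",
    "identifier": r"[a-zA-Z][a-zA-Z0-9]*",
    # Assumption: group the integer part, and treat '.' as a literal period.
    "decimal": r"(0|[1-9][0-9]*)\.[0-9]*",
}


def matches(name, word):
    """True if the whole word is in the language of the named pattern."""
    return re.fullmatch(PATTERNS[name], word) is not None


for word in ["001", "907", "x1", "1x", "3.14"]:
    print(word, {name: matches(name, word) for name in PATTERNS})
```

`re.fullmatch` is used so that the entire word must match, which mirrors DFA acceptance; `re.match` would accept any word with a matching prefix.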

What Is The Point?

Why do we care about regular expressions in the context of a compiler?
•  We use REs to specify the mapping of words to parts of speech
–  An identifier is ([a…z]|[A…Z]) ([a…z]|[A…Z]|[0…9])*  (we typically add some special characters, e.g., _ # $ @)
–  Keywords are specified by their spellings, e.g., if, then, else
•  We use tools derived from automata theory to construct scanners directly from the REs
–  Automatic construction reduces the time & cost of scanner construction
–  Derivation from a formal notation eliminates implementation errors
–  Resulting scanners are both efficient (O(n)) and fast (low constant overhead)
•  RE-derived scanners are widely used
–  Compilers, text editors
–  Input checking in many contexts
–  Software to filter or block URLs

A Digression on Time

In COMP 412, we will talk about a lot of "times"
•  Design time, implementation time, compile time, run time, …
•  In practice, the issue of when something happens is one that causes a great deal of confusion among students of compiler construction
–  Design time and build time happen long before the compiler runs
➝  Costs incurred at design or implementation time do not increase compile time
–  Compile time happens every time the user invokes the compiler
➝  Users are, appropriately, sensitive to compile time
➝  Costs incurred at compile time do not increase run time
–  Run-time costs affect actual application performance
➝  One critical goal for compilation is to keep run time to a minimum, which means reducing the overhead introduced by translation

As we look at strategies for generating scanners & parsers, keep in mind that generation costs are incurred at implementation time (the "meta" issue)

[Margin notes: Small # of builds; Billions of compiles; many per compile]

Automatic Scanner Construction

Goals
•  Simplify the construction of robust, efficient scanners
•  Develop techniques that have widespread applicability
•  Understand the underlying theory & practice

[Figure: at design & build times, specifications written as regular expressions feed a Scanner Generator (e.g., lex, flex), which passes knowledge to the Scanner; at compile time, the Scanner converts source code into a stream of <word, category> pairs]

1. We write REs at design time
2. Tools generate the scanner at build time
3. When the compiler runs, it uses the generated scanner to convert source code into a stream of tokens.

Automatic Scanner Construction

Scanner Generator
•  May encode its knowledge in tables that drive a "skeleton scanner"
–  Skeleton scanner interprets the tables to simulate the DFA
•  Every scanner uses the same skeleton
•  Scanner generator builds the DFA from the RE, & converts it to a table

[Figure: specifications (as REs) enter the Scanner Generator, which produces Tables; the Skeleton Scanner, driven by the Tables, converts source code into <word, category> pairs. Knowledge encoded in tables to drive skeleton.]

See §2.5.1

Automatic Scanner Construction

Scanner Generator
•  May encode its knowledge of the recognizer directly into code
–  Transitions are compiled into conditional logic
•  Produces a scanner that has very low overhead per character
•  Scanner generator builds the DFA from the RE, & emits code for it

[Figure: specifications (as REs) enter the Scanner Generator, which produces the Scanner; the Scanner converts source code into <word, category> pairs. Knowledge embedded in generated program text.]

See §2.5.2
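A hand-written sketch of what "transitions compiled into conditional logic" might look like for the register-name RE r[0…9][0…9]* used in this lecture. This is an illustration only, not the output of any particular scanner generator; each DFA state becomes a region of code, and the function name `is_register_name` is hypothetical.

```python
def is_register_name(word):
    """Direct-coded recognizer for r[0-9][0-9]*: no table, only branches."""
    i = 0
    n = len(word)

    # State s0: the only legal character is 'r'.
    if i >= n or word[i] != "r":
        return False
    i += 1

    # State s1: the only legal characters are digits.
    if i >= n or not ("0" <= word[i] <= "9"):
        return False
    i += 1

    # State s2 (final): stay here on digits; anything else is an error.
    while i < n:
        if not ("0" <= word[i] <= "9"):
            return False
        i += 1
    return True


print([w for w in ["r17", "r", "ra", "r007"] if is_register_name(w)])
```

Compared with a table-driven scanner, there is no table lookup per character; the price is that changing the language means regenerating the code rather than swapping a table.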

Example from Lecture 2

Recognizer for an ILOC register name (allow redundant zeros)

[Figure: Recognizer for r[0…9][0…9]*. DFA with states s0, s1, s2 and edges s0 → s1 on r, s1 → s2 on 0…9, s2 → s2 on 0…9. Transitions to the error state se, on any character, are implicit from every state.]

Rules for DFA Operation
•  Start in state s0 & make transitions on each input character
•  DFA accepts a word x if and only if x leaves the DFA in a final state
•  If the DFA encounters a character with no specified transition, it moves to se & stays in that state
•  r17 takes it through s0, s1, s2, s2 and it accepts
•  r takes it through s0, s1 and it fails
•  ra takes it through s0, s1, se and it fails

We will use the RE for a register name as a continuing example.
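The three traces above can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary encoding of the transitions and the function name `trace` are assumptions made for illustration, with any missing entry standing for the implicit move to se.

```python
DIGITS = "0123456789"

# Explicit transitions of the register-name DFA; everything else goes to se.
DELTA = {("s0", "r"): "s1"}
DELTA.update({("s1", d): "s2" for d in DIGITS})
DELTA.update({("s2", d): "s2" for d in DIGITS})
FINAL = {"s2"}


def trace(word):
    """Return (states visited, accepted?) for one input word."""
    state = "s0"
    visited = [state]
    for ch in word:
        state = DELTA.get((state, ch), "se")   # se also maps only to se
        visited.append(state)
    return visited, state in FINAL


for word in ["r17", "r", "ra"]:
    print(word, trace(word))
```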

Example

To be useful, the DFA must be executable

Skeleton Scanner

char ⇽ next character
state ⇽ s0
while (char ≠ EOF) {
    state ⇽ δ[state, char]
    char ⇽ next character
}
if (state is a final state)
    then report success
    else report failure

Transition Table (δ)

δ     r     0,1,2,3,4,5,6,7,8,9   Any Other
s0    s1    se                    se
s1    se    s2                    se
s2    se    s2                    se
se    se    se                    se

For each character, the skeleton scanner does a table lookup and reads the next character, both of which should be O(1) operations, so the cost is O(1) per character.

This skeleton scanner is simplified. See Figure 2.14 in §2.5.1 of EaC2e.

Character classifier maps any character into one of the 3 classes: {r}, {0…9}, {all others}
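A runnable sketch of the skeleton scanner with its table and character classifier, written in Python for concreteness. The integer encoding of states and columns is an assumption; the table entries are those of the transition table above.

```python
def classify(ch):
    """Map a character to a table column: 0 = 'r', 1 = digit, 2 = any other."""
    if ch == "r":
        return 0
    if "0" <= ch <= "9":
        return 1
    return 2


S0, S1, S2, SE = 0, 1, 2, 3

DELTA = [
    # r   digit other
    [S1, SE, SE],   # s0
    [SE, S2, SE],   # s1
    [SE, S2, SE],   # s2
    [SE, SE, SE],   # se
]
FINAL = {S2}


def recognize(text):
    """Table-driven skeleton: one classification and one lookup per character."""
    state = S0
    for ch in text:                 # the end of the string plays the role of EOF
        state = DELTA[state][classify(ch)]
    return state in FINAL


print([w for w in ["r17", "r", "ra", "r0"] if recognize(w)])
```

Classifying characters into three columns keeps the table at 4 × 3 entries instead of one column per possible character.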

Example

To capture and classify the lexeme, we add a little work to each state

Skeleton Scanner

char ⇽ next character
state ⇽ s0
lexeme ⇽ null string
while (char ≠ EOF) {
    lexeme ⇽ lexeme || char
    state ⇽ δ[state, char]
    char ⇽ next character
}
if (state is a final state) then {
    category ⇽ f(state)
    return <lexeme, category>
}
else report failure

Transition Table (δ)

δ     r     0,1,2,3,4,5,6,7,8,9   Any Other
s0    s1    se                    se
s1    se    s2                    se
s2    se    s2                    se
se    se    se                    se

Still O(1)

Example

To capture the register number, we would need state-specific actions

Skeleton Scanner

char ⇽ next character
state ⇽ s0
lexeme ⇽ null string
while (char ≠ EOF) {
    lexeme ⇽ lexeme || char
    state ⇽ δ[state, char]
    if (state = s1)
        then n ⇽ 0
    else if (state = s2)
        then n ⇽ n * 10 + char - '0'
    char ⇽ next character
}
if (state is a final state) then {
    category ⇽ f(state)
    return <lexeme, category>
}
else report failure

The action for a state runs before the next character is read, so that char still holds the digit that caused the transition.

Transition Table (δ)

δ     r     0,1,2,3,4,5,6,7,8,9   Any Other
s0    s1    se                    se
s1    se    s2                    se
s2    se    s2                    se
se    se    se                    se

[Figure: the register-name DFA, s0 → s1 on r, s1 → s2 on 0…9, s2 → s2 on 0…9, annotated "Initialize n" at s1 and "Accumulate n" at s2]

Still O(1)
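The state-specific actions can be sketched in Python as follows. The category name "register", the return shape, and the helper `next_state` (which plays the role of δ) are assumptions made for illustration.

```python
def next_state(state, ch):
    """The transition function of the register-name DFA, written as code."""
    if state == "s0" and ch == "r":
        return "s1"
    if state in ("s1", "s2") and "0" <= ch <= "9":
        return "s2"
    return "se"


def scan_register(text):
    """Return (category, lexeme, number) for a register name, else None."""
    state = "s0"
    lexeme = ""
    n = None
    for ch in text:
        lexeme += ch
        state = next_state(state, ch)
        # State-specific actions, run on entry to the new state.
        if state == "s1":
            n = 0                                 # initialize n
        elif state == "s2":
            n = n * 10 + (ord(ch) - ord("0"))     # accumulate n
    if state == "s2":                             # s2 is the only final state
        return ("register", lexeme, n)
    return None


print(scan_register("r17"))
print(scan_register("r007"))
print(scan_register("ra"))
```

Each action is a constant amount of work, so the cost per character stays O(1).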

More Complex REs

What about a more complex language?
•  r[0…9][0…9]* allows arbitrary register numbers (e.g., r000 or r999)
•  What if we want to limit the register name to r0 through r31?

Write a tighter specification into the RE
•  r( (0|1|2) ([0…9] | ε) | (4|5|6|7|8|9) | (3|30|31) )
•  r0 | r1 | r2 | r3 | … | r31 | r00 | r01 | r02 | … | r09  (non-standard use of …, but the meaning is clear)

Each of these REs can be converted to a DFA
•  The DFA has the same O(1) cost per transition
•  The DFA takes one transition per input character
•  The DFA uses the same skeleton scanner
The added complexity is in the RE, not in the scanner †

With a scanner generated from an RE, using a more complex RE incurs no additional compile time.

† recall the Python documentation
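As a quick check of which names the tighter RE admits, the sketch below restates both register REs in Python `re` syntax, with ε written as an empty alternative. The candidate list and the helper `accepted` are hypothetical. Note that this uses Python's backtracking engine, so it checks the language, not the O(1)-per-character cost of a DFA.

```python
import re

# r( (0|1|2) ([0-9] | ε) | (4|...|9) | (3|30|31) ), with ε as an empty branch.
TIGHT = re.compile(r"r((0|1|2)([0-9]|)|(4|5|6|7|8|9)|(3|30|31))")
LOOSE = re.compile(r"r[0-9][0-9]*")


def accepted(pattern, names):
    """Set of names that the pattern matches in full."""
    return {name for name in names if pattern.fullmatch(name)}


candidates = (
    [f"r{i}" for i in range(40)]        # r0 .. r39
    + [f"r0{i}" for i in range(10)]     # r00 .. r09
    + ["r000", "r999"]
)

print(sorted(accepted(LOOSE, candidates) - accepted(TIGHT, candidates)))
```

The tight RE accepts r0 through r31 plus the two-digit forms r00 through r09, and rejects the rest of the candidates, while the loose RE accepts every candidate.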

More Complex REs

The DFA for r( (0|1|2) ([0…9] | ε) | (4|5|6|7|8|9) | (3|30|31) )
•  Accepts a more constrained set of register names
•  Same cost per input character
•  More states ⇒ more rows in the transition table ⇒ more memory

[Figure: DFA with states s0 through s6 and edges labeled r, 0,1,2, 3, 4…9, 0,1, and 0…9. Transitions to the error state se, on any character, are implicit from every state.]

Automata Theory Moment

Earlier, we said we would revisit logical complement of an RE or a DFA. To complement a DFA:
•  Make non-final states into final states (including the error state se)
•  Make final states into non-final states

The DFA then accepts any string that the original did not accept ⇒ its complement
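A minimal sketch of complementing the register-name DFA r[0…9][0…9]* by swapping final and non-final states. It assumes the DFA is complete, that is, the error state se is an explicit state, so that se becomes final in the complement. The names `delta`, `accepts`, and `COMPLEMENT_FINAL` are hypothetical.

```python
DIGITS = "0123456789"
STATES = {"s0", "s1", "s2", "se"}
FINAL = {"s2"}


def delta(state, ch):
    """Total transition function: every (state, character) pair has a target."""
    if state == "s0" and ch == "r":
        return "s1"
    if state in ("s1", "s2") and ch in DIGITS:
        return "s2"
    return "se"


def accepts(word, finals):
    """Run the DFA on word and test membership in the given final set."""
    state = "s0"
    for ch in word:
        state = delta(state, ch)
    return state in finals


# Complement: same states and transitions, final and non-final swapped.
COMPLEMENT_FINAL = STATES - FINAL

for word in ["", "r", "r17", "ra", "17"]:
    print(repr(word), accepts(word, FINAL), accepts(word, COMPLEMENT_FINAL))
```

Because the transitions are untouched, the complement costs the same per character as the original; only the final test changes.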

More Complex REs

The DFA for r( (0|1|2) ([0…9] | ε) | (4|5|6|7|8|9) | (3|30|31) )

δ        r     0,1   2     3     4…9   Any Others
s0       s1    se    se    se    se    se
s1       se    s2    s2    s5    s4    se
s2       se    s3    s3    s3    s3    se
s3,s4    se    se    se    se    se    se    (compressed 2 states, as well)
s5       se    s6    se    se    se    se
s6       se    se    se    se    se    se
se       se    se    se    se    se    se

This table runs without change in the same skeleton scanner as the first table
•  To change the language, just change the table
•  Still O(1) cost per character

Notice that the character classifier has many more divisions than did the earlier one. Still, it should be implementable as a function with O(1) cost. (see §2.5)
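To see the table at work, the sketch below loads it into the same kind of skeleton loop. Two points are assumptions: the set of final states {s2, s3, s4, s5, s6} is inferred from the RE because the diagram did not survive in this transcript, and the compressed row s3,s4 is represented by letting both states share one row. The names `run`, `classify`, and `ALL_ERROR` are hypothetical.

```python
def classify(ch):
    """Six character classes: r | 0,1 | 2 | 3 | 4..9 | any other."""
    if ch == "r":
        return 0
    if ch in "01":
        return 1
    if ch == "2":
        return 2
    if ch == "3":
        return 3
    if "4" <= ch <= "9":
        return 4
    return 5


ALL_ERROR = ["se"] * 6
DELTA = {
    #       r     0,1   2     3     4-9   other
    "s0": ["s1", "se", "se", "se", "se", "se"],
    "s1": ["se", "s2", "s2", "s5", "s4", "se"],
    "s2": ["se", "s3", "s3", "s3", "s3", "se"],
    "s3": ALL_ERROR,      # s3 and s4 share one row (the compressed row)
    "s4": ALL_ERROR,
    "s5": ["se", "s6", "se", "se", "se", "se"],
    "s6": ALL_ERROR,
    "se": ALL_ERROR,
}
FINALS = {"s2", "s3", "s4", "s5", "s6"}   # assumption, inferred from the RE


def run(delta, finals, text, start="s0"):
    """Skeleton loop: only the table and the classifier know the language."""
    state = start
    for ch in text:
        state = delta[state][classify(ch)]
    return state in finals


print([f"r{i}" for i in range(40) if not run(DELTA, FINALS, f"r{i}")])
```

The loop in `run` has the same shape as the earlier skeleton scanner; only the table and the classifier changed.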