Winter 2012-2013 Compiler Principles
Lexical Analysis (Scanning)
Mayer Goldberg and Roman Manevich, Ben-Gurion University

General stuff
Topics taught by me
  Lexical analysis (scanning)
  Syntax analysis (parsing)
  …
  Dataflow analysis
  Register allocation
Slides will be available from the web site after the lecture
Request: please mute mobiles, tablets, super-cool squeaking devices

Today
Understand role of lexical analysis
Lexical analysis theory
Implementing modern scanner

Role of lexical analysis
First part of compiler front-end
Convert stream of characters into stream of tokens
  Split text into most basic meaningful strings
Simplify input for syntax analysis
[Figure: compiler pipeline from High-level Language (Scheme) through Lexical Analysis, Syntax Analysis (Parsing), AST / Symbol Table etc., Inter. Rep. (IR), and Code Generation to Executable Code]
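
From scanning to parsing
program text: 5 + (7 * x)
  goes through the Lexical Analyzer
token stream: num + ( num * id )
  goes through the Parser, which reports valid or syntax error
Grammar:
  E → id
  E → num
  E → E + E
  E → E * E
  E → ( E )
Abstract Syntax Tree: + at the root with children num and *; * has children num and x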

JavaScript example
Identify basic units in this code

    var currOption = 0;
    // Choose content to display in lower pane.
    function choose ( id ) {
      var menu = ["about-me", "publications", "teaching", "software", "activities"];
      for (i = 0; i < menu.length; i++) {
        currOption = menu[i];
        var elt = document.getElementById(currOption);
        if (currOption == id && elt.style.display == "none") {
          elt.style.display = "block";
        } else {
          elt.style.display = "none";
        }
      }
    }

JavaScript example
Identify basic units in this code
Kinds of basic units: keyword, numeric literal, operator, string literal, punctuation, identifier, whitespace

Scanner output
For the JavaScript example, the scanner produces a stream of tokens, each written as LINE: ID(value)
  1: VAR
  1: ID(currOption)
  1: EQ
  1: INT_LITERAL(0)
  1: SEMI
  3: FUNCTION
  3: ID(choose)
  3: LP
  3: ID(id)
  3: RP
  3: LCB
  ...

What is a token?
Lexeme: substring of original text constituting an identifiable unit
  Identifiers, values, reserved words, …
Record type storing:
  Kind
  Value (when applicable)
  Start-position/end-position
  Any information that is useful for the parser
Different for different languages
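
The record described above could be sketched as follows. This is only an illustration in Python: the field names (`kind`, `value`, `line`, `start`, `end`) and the printing format are assumptions chosen to mirror the LINE: ID(value) notation of the scanner output slide, not something prescribed by the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    kind: str                    # e.g. "ID", "INT_LITERAL", "VAR"
    value: Optional[str] = None  # lexeme text, when applicable
    line: int = 0                # line number, useful for error messages
    start: int = 0               # offset of the first character of the lexeme
    end: int = 0                 # offset just past the last character

    def __str__(self) -> str:
        text = self.kind if self.value is None else f"{self.kind}({self.value})"
        return f"{self.line}: {text}"


tok = Token("ID", "currOption", line=1, start=4, end=14)
print(tok)                                    # 1: ID(currOption)
print(Token("VAR", line=1, start=0, end=3))   # 1: VAR
```

Keeping the value optional reflects the "when applicable" remark: keywords and punctuation carry no value, identifiers and literals do.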

C++ example 1
Splitting text into tokens can be tricky
How should the code below be split?
    vector<vector<int>> myVector
One >> operator token, or two tokens >, >?

C++ example 2
Splitting text into tokens can be tricky
How should the code below be split?
    vector<vector<int> > myVector
Two tokens: >, >

Example tokens
Type        Examples
Identifier  x, y, z, foo, bar
NUM         42
FLOATNUM    -3.141592654
STRING      "so long, and thanks for all the fish"
LPAREN      (
RPAREN      )
IF          if
…

Separating tokens
Type          Examples
Comments      /* ignore code */
              // ignore until end of line
White spaces  \t \n
Lexemes are recognized but get consumed rather than transmitted to parser
  if
  i f
  i/*comment*/f

Preprocessor directives in C
Type                Examples
Include directives  #include<foo.h>
Macros              #define THE_ANSWER 42

Designing a scanner
Define each type of lexeme
  Reserved words: var, if, for, while
  Operators: < = ++
  Identifiers: myFunction
  Literals: 123 "hello"
  Annotations: @SuppressWarnings
But how do we define lexemes of unbounded length?
  Regular expressions

Regular languages refresher
Formal languages
  Alphabet = finite set of letters
  Word = sequence of letters
  Language = set of words
Regular languages defined equivalently by
  Regular expressions
  Finite-state automata

Regular expressions
Empty string: Є
Letter: a
Concatenation: R1 R2
Union: R1 | R2
Kleene-star: R*
Shorthand: R+ stands for R R*
Scope: (R)
Example: (0* 1*) | (1* 0*)
  What is this language?

Exercise 1 - Question
Language of Java identifiers
  Identifiers start with either an underscore ‘_’ or a letter
  Continue with either underscore, letter, or digit

Exercise 1 - Answer
Language of Java identifiers
  Identifiers start with either an underscore ‘_’ or a letter
  Continue with either underscore, letter, or digit
(_|a|b|…|z|A|…|Z)(_|a|b|…|z|A|…|Z|0|…|9)*
Using shorthand macros
  First = _|a|b|…|z|A|…|Z
  Next = First|0|…|9
  R = First Next*
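
As a quick check, the First/Next macros can be written with Python's `re` module. This is a sketch of the simplified identifier language from the exercise; the function name `is_identifier` is made up for illustration, and character classes stand in for the long unions.

```python
import re

FIRST = r"[_a-zA-Z]"       # First = _|a|…|z|A|…|Z
NEXT = r"[_a-zA-Z0-9]"     # Next  = First|0|…|9
IDENT = re.compile(FIRST + NEXT + "*")   # R = First Next*


def is_identifier(word: str) -> bool:
    # fullmatch requires the whole word to be in the language
    return IDENT.fullmatch(word) is not None


print(is_identifier("myFunction"))  # True
print(is_identifier("_x1"))         # True
print(is_identifier("9lives"))      # False
```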

Exercise 2 - Question
Language of rational numbers in decimal representation (no leading, ending zeros)
  0
  123.757
  .933333
  Not 007
  Not 0.30

Exercise 2 - Answer
Language of rational numbers in decimal representation (no leading, ending zeros)
  Digit = 1|2|…|9
  Digit0 = 0|Digit
  Num = Digit Digit0*
  Frac = Digit0* Digit
  Pos = Num | .Frac | 0.Frac | Num.Frac
  PosOrNeg = (Є|-)Pos
  R = 0 | PosOrNeg
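
The macros above translate almost literally into Python's `re` syntax, which makes it easy to try the examples from the question. This is a sketch: the dot is escaped because `.` is an operator in this notation, and `is_rational` is a name invented here.

```python
import re

DIGIT = r"[1-9]"                    # Digit  = 1|2|…|9
DIGIT0 = r"[0-9]"                   # Digit0 = 0|Digit
NUM = rf"{DIGIT}{DIGIT0}*"          # Num    = Digit Digit0*
FRAC = rf"{DIGIT0}*{DIGIT}"         # Frac   = Digit0* Digit
POS = rf"(?:{NUM}|\.{FRAC}|0\.{FRAC}|{NUM}\.{FRAC})"
R = re.compile(rf"(?:0|-?{POS})")   # R = 0 | (Є|-)Pos


def is_rational(word: str) -> bool:
    return R.fullmatch(word) is not None


for word in ["0", "123.757", ".933333", "007", "0.30"]:
    print(word, is_rational(word))
```

The first three words are accepted and the last two rejected, matching the question slide.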

Exercise 3 - Question
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …

Exercise 3 - Answer
Equal number of opening and closing parentheses: [^n ]^n = [], [[]], [[[]]], …
Not regular
Context-free
Grammar: S ::= [] | [S]
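
Although no finite automaton recognizes this language, the grammar maps directly onto a small recursive function. The sketch below (the names `matches_s` and `parse_s` are invented for illustration) follows the two productions one to one.

```python
def matches_s(text: str) -> bool:
    """Recognize the language of S ::= [] | [S] by recursive descent."""

    def parse_s(pos: int) -> int:
        # Returns the position just after one S starting at pos, or -1 on failure.
        if pos >= len(text) or text[pos] != "[":
            return -1
        if pos + 1 < len(text) and text[pos + 1] == "]":
            return pos + 2                 # S ::= []
        inner = parse_s(pos + 1)           # S ::= [S]
        if inner != -1 and inner < len(text) and text[inner] == "]":
            return inner + 1
        return -1

    return parse_s(0) == len(text)


print(matches_s("[[[]]]"))  # True
print(matches_s("[[]"))     # False
```

The recursion depth plays the role of the unbounded counter that a finite set of states cannot provide.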

Finite automata
An automaton is defined by states and transitions
[Figure: automaton with a start state, transitions labeled a, b, b, c, and an accepting state]

Automaton running example
Words are read left-to-right
[Figure, four steps: the automaton with transitions labeled a, b, b, c consumes the input a b c one letter at a time and stops in the accepting state: word accepted]

Word outside of language
Missing transition means non-acceptance
[Figure: the same automaton reading the input b b c, which gets stuck on a missing transition]

Exercise - Question
What is the language defined by the automaton below?
[Figure: automaton with transitions labeled a, b, b, c]

Exercise - Answer
What is the language defined by the automaton below?
  a b* c
  Generally: all paths leading to accepting states
[Figure: automaton with transitions labeled a, b, b, c]
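
Running a deterministic automaton amounts to a table lookup per letter. The sketch below encodes one possible DFA for a b* c; the state names are invented, and the table is not necessarily the exact automaton drawn on the slide, only one that defines the same language.

```python
# Transition table of a DFA for a b* c (hypothetical state names).
TRANSITIONS = {
    ("start", "a"): "after_a",
    ("after_a", "b"): "after_a",
    ("after_a", "c"): "done",
}
ACCEPTING = {"done"}


def dfa_accepts(word: str) -> bool:
    state = "start"
    for letter in word:                        # words are read left-to-right
        state = TRANSITIONS.get((state, letter))
        if state is None:                      # missing transition: reject
            return False
    return state in ACCEPTING


print(dfa_accepts("abc"))   # True
print(dfa_accepts("bbc"))   # False
```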

Non-deterministic automata
Allow multiple transitions from given state labeled by same letter
[Figure: NFA with transitions labeled a, a, b, c, b, c]

NFA run example
Maintain set of states
Accept word if any of the states in the set is accepting
[Figure, four steps: the NFA with transitions labeled a, a, b, c, b, c reads the input a b c while tracking the set of current states]
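
The set-of-states idea can be sketched in a few lines. The NFA below is hypothetical (its states and edges are made up so that the start state has two a transitions); only the simulation loop is the point.

```python
# Hypothetical NFA: two a-transitions leave s0, so a run must track a set of states.
NFA = {
    ("s0", "a"): {"s1", "s2"},
    ("s1", "b"): {"s3"},
    ("s3", "c"): {"s4"},
    ("s2", "c"): {"s5"},
    ("s5", "b"): {"s6"},
}
ACCEPTING = {"s4", "s6"}


def nfa_accepts(word: str) -> bool:
    current = {"s0"}
    for letter in word:
        # Follow every transition labeled with this letter from every current state.
        current = {t for s in current for t in NFA.get((s, letter), set())}
    # Accept if any state in the final set is accepting.
    return bool(current & ACCEPTING)


print(nfa_accepts("abc"))  # True
print(nfa_accepts("acb"))  # True
print(nfa_accepts("ab"))   # False
```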

NFA+Є automata
Є transitions can “fire” without reading the input
[Figure: automaton with transitions labeled a, b, c and one Є transition]

NFA+Є run example
Now Є transition can non-deterministically take place
[Figure, six steps: the NFA+Є reads the input a b c, taking the Є transition along the way: word accepted]
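
One common way to run such an automaton is to extend the current set of states with everything reachable through Є transitions after each step. The sketch below assumes a made-up NFA+Є for a b* c and uses the empty string as the Є label; neither choice comes from the slides.

```python
EPS = ""  # label used for Є transitions in this sketch

# Hypothetical NFA+Є for a b* c.
DELTA = {
    ("q0", "a"): {"q1"},
    ("q1", "b"): {"q1"},
    ("q1", EPS): {"q2"},
    ("q2", "c"): {"q3"},
}
ACCEPTING = {"q3"}


def eps_closure(states):
    # All states reachable from `states` using only Є transitions.
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in DELTA.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def accepts(word: str) -> bool:
    current = eps_closure({"q0"})
    for letter in word:
        moved = {t for s in current for t in DELTA.get((s, letter), set())}
        current = eps_closure(moved)
    return bool(current & ACCEPTING)


print(accepts("abc"))  # True
print(accepts("ab"))   # False
```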

Reg-exp vs. automata
Regular expressions are declarative
  Offer compact way to define a regular language by humans
  Don’t offer direct way to check whether a given word is in the language
Automata are operative
  Define an algorithm for deciding whether a given word is in a regular language
  Not a natural notation for humans

From reg. exp. to automata
Theorem: there is an algorithm to build an NFA+Є automaton for any regular expression
Proof: by induction on the structure of the regular expression
  For each sub-expression R we build an automaton with exactly one start state and one accepting state
  Start state has no incoming transitions
  Accepting state has no outgoing transitions
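
Base cases
R = Є  [Figure: automaton for the empty string]
R = a  [Figure: automaton with a single a transition from the start state]

Construction for R1 | R2
[Figure: automaton built from the automata for R1 and R2 side by side]

Construction for R1 R2
[Figure: automaton built from the automaton for R1 followed by the automaton for R2]

Construction for R*
[Figure: automaton built around the automaton for R]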
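
The figures for these constructions did not survive, so the sketch below uses the textbook Thompson-style construction, which may differ in detail from the slides. Regular expressions are given as nested tuples (an encoding invented here), every sub-expression gets a fresh start and accepting state, and `matches` simulates the result to check it.

```python
from itertools import count

EPS = None  # label for Є transitions in this sketch


def build(regex, edges, fresh):
    """Return (start, accept) for regex, appending (src, label, dst) to edges."""
    start, accept = next(fresh), next(fresh)
    kind = regex[0]
    if kind == "eps":                       # R = Є
        edges.append((start, EPS, accept))
    elif kind == "sym":                     # R = a
        edges.append((start, regex[1], accept))
    elif kind == "alt":                     # R1 | R2
        for sub in regex[1:]:
            s, a = build(sub, edges, fresh)
            edges += [(start, EPS, s), (a, EPS, accept)]
    elif kind == "cat":                     # R1 R2
        s1, a1 = build(regex[1], edges, fresh)
        s2, a2 = build(regex[2], edges, fresh)
        edges += [(start, EPS, s1), (a1, EPS, s2), (a2, EPS, accept)]
    elif kind == "star":                    # R*
        s, a = build(regex[1], edges, fresh)
        edges += [(start, EPS, s), (a, EPS, s), (start, EPS, accept), (a, EPS, accept)]
    else:
        raise ValueError(f"unknown regex node: {kind}")
    return start, accept


def closure(states, edges):
    result, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for src, label, dst in edges:
            if src == s and label is EPS and dst not in result:
                result.add(dst)
                stack.append(dst)
    return result


def matches(regex, word):
    edges = []
    start, accept = build(regex, edges, count())
    current = closure({start}, edges)
    for letter in word:
        moved = {d for s, label, d in edges if s in current and label == letter}
        current = closure(moved, edges)
    return accept in current


A_BSTAR_C = ("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))
print(matches(A_BSTAR_C, "abbc"))  # True
print(matches(A_BSTAR_C, "ab"))    # False
```

Each regex node contributes exactly two states, which is where the linear bound on the number of states comes from.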

From NFA+Є to DFA
Construction requires O(n) states for a reg-exp of length n
Running an NFA+Є with n states on string of length m takes O(m·n^2) time
Solution: determinization via subset construction
  Number of states worst-case exponential in n
  Running time O(m)

Subset construction
For an NFA+Є with states M={s1,…,sk}
Construct a DFA with one state per set of states of the corresponding NFA
  M’={ [], [s1], [s1,s2], [s2,s3], [s1,s2,s3], …}
Simulate transitions between individual states for every letter
  NFA+Є: s1 -a-> s2 and s4 -a-> s7
  DFA: [s1,s4] -a-> [s2,s7]
Extend macro states by states reachable via Є transitions
  NFA+Є: s1 -Є-> s4
  DFA: [s1,s2] becomes [s1,s2,s4]
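
The two steps (simulate letters, then extend by Є transitions) can be sketched as a worklist algorithm over macro states. The example NFA+Є at the bottom combines the transitions shown on the slides into one small hypothetical automaton; the function names and the use of the empty string as the Є label are assumptions of this sketch.

```python
EPS = ""  # label used for Є transitions in this sketch


def eps_closure(states, delta):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction(start, delta, alphabet):
    """Return (dfa_start, dfa_delta); DFA states are frozensets of NFA states."""
    dfa_start = eps_closure({start}, delta)
    dfa_delta, todo, seen = {}, [dfa_start], {dfa_start}
    while todo:
        macro = todo.pop()
        for letter in alphabet:
            # Simulate the letter from every NFA state in the macro state...
            moved = {t for s in macro for t in delta.get((s, letter), ())}
            # ...then extend the result by states reachable via Є transitions.
            target = eps_closure(moved, delta)
            dfa_delta[(macro, letter)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return dfa_start, dfa_delta


delta = {("s1", "a"): {"s2"}, ("s4", "a"): {"s7"}, ("s1", EPS): {"s4"}}
dfa_start, dfa_delta = subset_construction("s1", delta, "a")
print(sorted(dfa_start))                    # ['s1', 's4']
print(sorted(dfa_delta[(dfa_start, "a")]))  # ['s2', 's7']
```

Only macro states reachable from the start are built, which in practice is often far fewer than the exponential worst case. The empty macro state [] acts as a dead state.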

Scanning challenges
Regular expressions allow us to define the language of all sequences of tokens
Automata theory provides an algorithm for checking membership of words
  But we are interested in splitting the text, not just deciding on membership
How do we determine lexemes?
How do we handle ambiguities: lexemes matching more than one token?

Separating lexemes
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
[Figure: automaton in which a-z leads from the start state to an accepting state ID with an a-z self-loop, and 1 leads from the start state to an accepting state ONE]

Maximal munch
ID = (a|b|…|z) (a|b|…|z)*
ONE = 1
Input: abb1
How do we identify ID(abb), ONE?
Solution: find longest matching lexeme
  Keep reading text until automaton leaves accepting state
  Return token corresponding to accepting state
  Reset: go back to start state and continue reading input from there
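
A sketch of this loop for the ID/ONE automaton follows. It remembers the last position at which the automaton was in an accepting state, a slightly more general variant that also copes with automata that pass through non-accepting states; the function names are invented for illustration.

```python
import string


# DFA from the example: start --a-z--> ID (self-loop on a-z), start --1--> ONE.
def step(state, ch):
    if state in ("start", "ID") and ch in string.ascii_lowercase:
        return "ID"
    if state == "start" and ch == "1":
        return "ONE"
    return None                          # missing transition


ACCEPTING = {"ID", "ONE"}


def scan(text):
    tokens, pos = [], 0
    while pos < len(text):
        state, i = "start", pos
        last_kind, last_end = None, pos
        while i < len(text):
            state = step(state, text[i])
            if state is None:
                break
            i += 1
            if state in ACCEPTING:       # remember the longest match so far
                last_kind, last_end = state, i
        if last_kind is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((last_kind, text[pos:last_end]))
        pos = last_end                   # reset: restart from the start state
    return tokens


print(scan("abb1"))  # [('ID', 'abb'), ('ONE', '1')]
```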

Handling ambiguities
ID = (a|b|…|z) (a|b|…|z)*
IF = if
Input: if
Matches both tokens
What should the scanner output?
[Figure: NFA in which a-z leads from the start state to an accepting state ID with an a-z self-loop, and i followed by f leads to an accepting state IF]
[Figure: DFA in which i leads from the start state to a state accepting ID, f leads from there to a state accepting both IF and ID, and every other letter (a-z\i from the start state, a-z\f after i, a-z after that) leads to a state accepting ID with an a-z self-loop]
Solution: break tie using order of definitions
  With ID defined before IF, the output is ID(if)
  With IF defined before ID, the output is IF
Conclusion: list keyword token definitions before identifier definition
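
The effect of the definition order can be reproduced with a small rule-list scanner. The sketch uses Python's `re` in place of a generated DFA, which is an assumption of this example (for these simple patterns the greedy match is also the longest one); `make_scanner` is a made-up helper.

```python
import re


def make_scanner(rules):
    compiled = [(kind, re.compile(pattern)) for kind, pattern in rules]

    def scan(text):
        tokens, pos = [], 0
        while pos < len(text):
            best_kind, best_end = None, pos
            for kind, regex in compiled:
                m = regex.match(text, pos)
                # Strict ">" keeps the earliest rule among equally long matches.
                if m and m.end() > best_end:
                    best_kind, best_end = kind, m.end()
            if best_kind is None:
                raise ValueError(f"no token matches at position {pos}")
            tokens.append((best_kind, text[pos:best_end]))
            pos = best_end
        return tokens

    return scan


id_first = make_scanner([("ID", r"[a-z]+"), ("IF", r"if")])
if_first = make_scanner([("IF", r"if"), ("ID", r"[a-z]+")])
print(id_first("if"))    # [('ID', 'if')]
print(if_first("if"))    # [('IF', 'if')]
print(if_first("iffy"))  # [('ID', 'iffy')]
```

The last call shows that the order only breaks ties: a longer identifier that starts with a keyword is still scanned as an identifier.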

Implementing scanners in practice

Implementing scanners
Manual construction of automata + determinization is
  Very tedious
  Error-prone
  Non-incremental
Fortunately there are tools that automatically generate code from a specification for most languages
  C: Lex, Flex
  Java: JLex, JFlex

Using JFlex
Define tokens (and states)
Run JFlex to generate Java implementation
Usually MyScanner.nextToken() will be called in a loop by parser
[Figure: regular expressions in MyScanner.lex go through JFlex to produce MyScanner.java, which turns a stream of characters into tokens]

Common format for reg-exps
Basic Patterns         Matching
x                      The character x
.                      Any character, usually except a new line
[xyz]                  Any of the characters x,y,z
Repetition Operators
R?                     An R or nothing (=optionally an R)
R*                     Zero or more occurrences of R
R+                     One or more occurrences of R
Composition Operators
R1R2                   An R1 followed by R2
R1|R2                  Either an R1 or R2
Grouping
(R)                    R itself

Escape characters
What is the expression for one or more + symbols?
  (+)+ won’t work
  (\+)+ will
Backslash \ before an operator turns it to standard character
  \*, \?, \+, …
Newline: \n or \r\n depending on OS
Tab: \t

Shorthands
Use names for expressions
  letter = a | b | … | z | A | B | … | Z
  letter_ = letter | _
  digit = 0 | 1 | 2 | … | 9
  id = letter_ (letter_ | digit)*
Use hyphen to denote a range
  letter = a-z | A-Z
  digit = 0-9

Catching errors
What if input doesn’t match any token definition?
Trick: Add a “catch-all” rule that matches any character and reports an error
  Add after all other rules
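
The trick can be tried out with Python's `re` module, used here as a stand-in for a generated scanner. The rule names and patterns are invented for this sketch. Note that Python picks the first alternative that matches rather than the longest one, which is exactly why the catch-all rule has to come last.

```python
import re

RULES = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("SKIP", r"[ \t\n]+"),
    ("ERROR", r"."),          # catch-all: must stay the last rule
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES), re.DOTALL
)


def scan(text):
    tokens, errors = [], []
    for m in MASTER.finditer(text):
        kind = m.lastgroup            # name of the rule that matched
        if kind == "SKIP":
            continue                  # whitespace is consumed, not transmitted
        if kind == "ERROR":
            errors.append(f"unexpected character {m.group()!r} at offset {m.start()}")
        else:
            tokens.append((kind, m.group()))
    return tokens, errors


tokens, errors = scan("x1 = 42 $")
print(tokens)   # [('ID', 'x1'), ('NUM', '42')]
print(errors)
```

Because the catch-all consumes exactly one character, scanning can continue after an error and report several problems in one pass.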

Next lecture: parsing