CS375 Compilers — Lexical Analysis, 4th February 2010


CS375

Compilers

Lexical Analysis
4th February, 2010

2

Outline

• Overview of a compiler
• What is lexical analysis?
• Writing a lexer
  – Specifying tokens: regular expressions
  – Converting regular expressions to NFAs, DFAs
• Optimizations

3

How It Works

Source code (character stream):      if (b == 0) a = b;
        ↓  Lexical Analysis
Token stream:                        if ( b == 0 ) a = b ;
        ↓  Syntax Analysis (Parsing)
Abstract syntax tree (AST):          if( ==(b, 0), =(a, b) )
        ↓  Semantic Analysis
Decorated AST: the same tree with types attached, e.g. int b, int 0,
boolean ==, int lvalue a, int b

Program representation

What is a lexical analyzer?

• What?
  – Reads in a stream of characters and groups them into "tokens" or "lexemes".
  – The language definition describes which tokens are valid.
• Why?
  – Makes writing the parser a lot easier; the parser operates on tokens.
  – Isolates input-dependent functionality such as character codes, EOF, and newline characters.

4

5

First Step: Lexical Analysis

Source code (character stream):   if (b == 0) a = b;
        ↓  Lexical Analysis
Token stream:                     if ( b == 0 ) a = b ;
        ↓  Syntax Analysis
        ↓  Semantic Analysis

6

What should it do?

Input string w  →  [ Lexer + token description ]  →  Token? {Yes/No}

We want some way to describe tokens, and to have our lexer take that description as input and decide whether a string is a token or not.

STARTING OFF

7

8

Tokens

• Logical grouping of characters.
• Identifiers: x y11 elsen _i00
• Keywords: if else while break
• Constants:
  – Integer: 2 1000 -500 5L 0x777
  – Floating-point: 2.0 0.00020 .02 1. 1e5 0.e-10
  – String: "x" "He said, \"Are you?\"\n"
  – Character: 'c' '\000'
  – Symbols: + * { } ++ < << [ ] >=
• Whitespace (typically recognized and discarded):
  – Comment: /** don't change this **/
  – Space: <space>
  – Format characters: <newline> <return>

9

Ad-hoc Lexer

• Hand-write code to generate tokens
• How to read identifier tokens?

Token readIdentifier() {
  String id = "";
  while (true) {
    char c = input.read();
    if (!identifierChar(c))
      return new Token(ID, id, lineNumber);
    id = id + String(c);
  }
}

• Problems
  – How to start?
  – What to do with the following character?
  – How to avoid quadratic complexity of repeated concatenation?
  – How to recognize keywords?

10

Look-ahead Character

• Scan text one character at a time
• Use a look-ahead character (next) to determine what kind of token to read and when the current token ends

char next;
...
while (identifierChar(next)) {
  id = id + String(next);
  next = input.read();
}

e l s e n
        ^
        next (look-ahead)
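The quadratic-concatenation and keyword problems can be sidestepped with a StringBuilder and a keyword table. A minimal sketch of the look-ahead loop above (class and helper names are ours, and a string plus an index stands in for the slides' input stream):

```java
import java.util.Set;

// Sketch of look-ahead identifier scanning: build the lexeme in a
// StringBuilder (amortized linear time) and check a keyword table at the end.
public class LookaheadScan {
    static final Set<String> KEYWORDS = Set.of("if", "else", "while", "break");

    static boolean identifierChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Reads an identifier/keyword starting at position pos; the character
    // at pos plays the role of the look-ahead 'next' from the slides.
    static String readIdentifier(String input, int pos) {
        StringBuilder id = new StringBuilder();
        while (pos < input.length() && identifierChar(input.charAt(pos))) {
            id.append(input.charAt(pos));  // O(1) amortized, not O(n) concat
            pos++;
        }
        return id.toString();
    }

    static String kind(String lexeme) {
        return KEYWORDS.contains(lexeme) ? "KEYWORD" : "ID";
    }
}
```

Checking the full lexeme against the keyword set at the end avoids special-casing prefixes such as "i" / "if" / "ifx" during the scan.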

11

Ad-hoc Lexer: Top-level Loop

class Lexer {
  InputStream s;
  char next;

  Lexer(InputStream _s) { s = _s; next = s.read(); }

  Token nextToken() {
    if (identifierFirstChar(next)) // starts with a letter
      return readIdentifier();     // is an identifier
    if (numericFirstChar(next))    // starts with a digit
      return readNumber();         // is a number
    if (next == '\"')
      return readStringConst();
    ...
  }
}

12

Problems

• Might not know what kind of token we are going to read from seeing the first character:
  – if a token begins with "i", is it an identifier? (what about int, if?)
  – if a token begins with "2", is it an integer constant?
  – interleaved tokenizer code is hard to write correctly, and harder to maintain
  – in general, unbounded look-ahead may be needed

Problems (cont.)

• How to specify tokens unambiguously?

• Once specified, how to implement them in a systematic way?

• How to implement them efficiently?

13

14

Problems (cont.)

• For instance, consider:
  – How to describe tokens unambiguously:
      2.e0  20.e-01  2.0000
      ""  "x"  "\\"  "\"\'"
  – How to break text up into tokens:
      if (x == 0) a = x<<1;
      if (x == 0) a = x<1;
  – How to tokenize efficiently:
      • tokens may have similar prefixes
      • want to look at each character ~1 time

15

Principled Approach

• Need a principled approach. Two options:
  1. Lexer generators
     – a lexer generator (a.k.a. scanner generator) that generates an efficient tokenizer automatically (e.g., lex, flex, JLex)
  2. Your own lexer
     – Describe the programming language's tokens with a set of regular expressions
     – Generate a scanning automaton from that set of regular expressions

Top level idea…

• Have a formal language to describe tokens.
  – Use regular expressions.
• Have a mechanical way of converting this formal description to code.
  – Convert regular expressions to finite automata (acceptors/state machines).
• Run the code on actual inputs.
  – Simulate the finite automaton.

16

An Example : Integers

• Consider integers.
• We can describe integers using the following grammar:
    Num → '-' Pos
    Num → Pos
    Pos → 0 | 1 | … | 9
    Pos → 0 | 1 | … | 9 Pos
• Or, in a more compact notation:
    Num → -? [0-9]+

17

An Example : Integers

• Using Num → -? [0-9]+ we can generate integers such as -12, 23, 0.
• We can also represent the above regular expression as a state machine.
  – This will be useful when simulating the regular expression.

18

An Example : Integers

• The non-deterministic finite automaton is as follows (accepting state: 3):

    0 --'-'--> 1 --[0-9]--> 2 --ε--> 3
    0 --ε----> 1            3 --[0-9]--> 2

• We can verify that -123, 65, 0 are accepted by the state machine.
• But which path to take? When do we follow the ε-paths?

An Example : Integers

• The NFA can be converted to an equivalent deterministic FA as below.
  – We shall see later how.
• It accepts the same tokens: -123, 65, 0.

    {0,1} --'-'--> {1}
    {0,1} --[0-9]--> {2,3}
    {1}   --[0-9]--> {2,3}
    {2,3} --[0-9]--> {2,3}     (accepting state: {2,3})
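This DFA can be simulated directly. A minimal sketch (the state names are ours; they correspond to the subsets {0,1}, {1}, and {2,3} from the diagram):

```java
// Sketch: direct simulation of the integer DFA for -?[0-9]+.
// States: START = {0,1}, AFTER_MINUS = {1}, DIGITS = {2,3} (accepting).
public class IntDfa {
    static final int START = 0, AFTER_MINUS = 1, DIGITS = 2, ERROR = 3;

    static int step(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:       return c == '-' ? AFTER_MINUS : digit ? DIGITS : ERROR;
            case AFTER_MINUS: return digit ? DIGITS : ERROR;
            case DIGITS:      return digit ? DIGITS : ERROR;
            default:          return ERROR;
        }
    }

    static boolean accepts(String w) {
        int state = START;
        for (int i = 0; i < w.length() && state != ERROR; i++)
            state = step(state, w.charAt(i));
        return state == DIGITS;   // only {2,3} is accepting
    }
}
```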

An Example : Integers

• The deterministic finite automaton makes implementation much easier, as we shall see later.
• So, all we have to do is:
  – Express tokens as regular expressions
  – Convert the REs to an NFA
  – Convert the NFA to a DFA
  – Simulate the DFA on inputs

21

The larger picture…

22

Regular expression R describing tokens
        ↓
RE → NFA conversion
        ↓
NFA → DFA conversion
        ↓
DFA simulation on input string w
        ↓
Yes, if w is a valid token; No, if not

QUICK LANGUAGE THEORY REVIEW…

23

24

Language Theory Review

• Let Σ be a finite set
  – Σ is called an alphabet
  – a ∈ Σ is called a symbol
• Σ* is the set of all finite strings consisting of symbols from Σ
• A subset L ⊆ Σ* is called a language
• If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively

25

Language Theory Review, ctd.

• Let L ⊆ Σ* be a language
• Then
  – L0 = {ε} (the set containing only the empty string)
  – Ln+1 = L Ln for all n ≥ 0
• Examples
  – if L = {a, b} then
    • L1 = L = {a, b}
    • L2 = {aa, ab, ba, bb}
    • L3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
    • …
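The concatenation and power definitions above can be sketched in a few lines of code (the method names are ours):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of the definitions above: pairwise concatenation L1 L2,
// and powers L^0 = {""} (empty string only), L^{n+1} = L L^n.
public class LangOps {
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> out = new LinkedHashSet<>();
        for (String x : l1)
            for (String y : l2)
                out.add(x + y);    // every pairwise concatenation
        return out;
    }

    static Set<String> power(Set<String> l, int n) {
        Set<String> out = Set.of("");     // L^0
        for (int i = 0; i < n; i++)
            out = concat(l, out);         // L^{i+1} = L L^i
        return out;
    }
}
```

For L = {a, b}, `power(L, 2)` yields the four strings {aa, ab, ba, bb} as in the slide.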

26

Syntax of Regular Expressions

• The set of regular expressions (RE) over an alphabet Σ is defined inductively by:
  – Let a ∈ Σ and R, S ∈ RE. Then:
    • a ∈ RE
    • ε ∈ RE and ∅ ∈ RE
    • R|S ∈ RE
    • RS ∈ RE
    • R* ∈ RE
• In concrete syntactic form: precedence rules, parentheses, and abbreviations

27

Semantics of Regular Expressions

• A regular expression R ∈ RE denotes the language L(R) ⊆ Σ* given according to the inductive structure of R:
  – L(a) = {a}                           the string "a"
  – L(ε) = {""}                          the empty string
  – L(∅) = {}                            the empty set
  – L(R|S) = L(R) ∪ L(S)                 alternation
  – L(RS) = L(R) L(S)                    concatenation
  – L(R*) = {""} ∪ L(R) ∪ L(R²) ∪ L(R³) ∪ L(R⁴) ∪ …   Kleene closure

28

Simple Examples

• L(R) = the "language" defined by R
  – L( abc ) = { abc }
  – L( hello|goodbye ) = { hello, goodbye }
    • | is the OR operator, so L(a|b) is the language containing either the string a or the string b.
  – L( 1(0|1)* ) = all binary numerals beginning with 1, i.e., all non-zero binary numerals without leading zeros
    • Kleene star: zero or more repetitions of the enclosed expression.

29

Convenient RE Shorthand

R+       one or more strings from L(R): R(R*)
R?       optional R: (R|ε)
[abce]   one of the listed characters: (a|b|c|e)
[a-z]    one character from this range: (a|b|c|d|e|…|y|z)
[^ab]    anything but one of the listed characters
[^a-z]   one character not from this range
"abc"    the string "abc"
\(       the character '('
. . .
id=R     named non-recursive regular expressions

30

More Examples

Regular Expression R                      Strings in L(R)
digit = [0-9]                             "0" "1" "2" "3" …
posint = digit+                           "8" "412" …
int = -? posint                           "-42" "1024" …
real = int ((. posint)?)                  "-1.56" "12" "1.0"
     = (-|ε)([0-9]+)((. [0-9]+)|ε)
[a-zA-Z_][a-zA-Z0-9_]*                    C identifiers
else                                      the keyword "else"
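Most of the patterns in this table are also valid java.util.regex syntax as written; a quick sketch checking them against the table's examples (constant names are ours, and the dot must be escaped in Java regex):

```java
// Sketch: checking the table's token patterns with java.util.regex.
public class TokenPatterns {
    static final String INT  = "-?[0-9]+";
    static final String REAL = "-?[0-9]+(\\.[0-9]+)?";  // int (. posint)?
    static final String C_ID = "[a-zA-Z_][a-zA-Z0-9_]*";

    static boolean matches(String w, String pattern) {
        return w.matches(pattern);   // String.matches is anchored: full match
    }
}
```

Note that `String.matches` tests the whole string, which is exactly the acceptor behaviour described below (w ∈ L(R) or not).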

Historical Anomalies

• PL/I
  – Keywords not reserved
    • IF IF THEN THEN ELSE ELSE;
• FORTRAN
  – Whitespace stripped out prior to scanning, so these two cannot be told apart until the end:
    • DO 123 I = 1
    • DO 123 I = 1 , 2
• By and large, modern language design intentionally makes scanning easier

31

WRITING A LEXER

32

Writing a Lexer

• Regular expressions are very useful for describing languages (tokens). Two options:
  – Use an automatic lexer generator (flex, lex) to generate a lexer from the language specification.
  – Have a systematic way of writing a lexer yourself from a specification such as regular expressions.

33

WRITING YOUR OWN LEXER

34

How To Use Regular Expressions

• Given R ∈ RE and an input string w, we need a mechanism to determine whether w ∈ L(R)
• Such a mechanism is called an acceptor

Input string w (from the program)
R ∈ RE (describing a token family)   →  [ acceptor ]  →  Yes, if w is a token
                                                         No, if w is not a token

Acceptors

• An acceptor determines whether an input string belongs to a language L
• Finite automata are acceptors for the languages described by regular expressions

Input string w
Description of language L   →  [ finite automaton ]  →  Yes, if w ∈ L
                                                        No, if w ∉ L

Finite Automata

• Informally, a finite automaton consists of:
  – A finite set of states
  – Transitions between states
  – An initial state (start state)
  – A set of final states (accepting states)
• Two kinds of finite automata:
  – Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character
  – Non-deterministic finite automata (NFA): there may be multiple possible choices, and some "spontaneous" transitions that consume no input

37

DFA Example

• A finite automaton that accepts the strings in the language denoted by the regular expression ab*a
• Can be represented as a graph or as a transition table.
  – As a graph:
    • Read a symbol
    • Follow the outgoing edge labeled with it

    0 --a--> 1 --a--> 2      (1 has a self-loop on b; accepting state: 2)

38

DFA Example (cont.)

• Representing an FA as a transition table makes the implementation very easy.
• The above FA can be represented as below:
  – The current state and the current symbol determine the next state.
  – Repeat until:
    • the error state is reached, or
    • end of input.

39

State   a       b
0       1       Error
1       2       1
2       Error   Error

Simulating the DFA

• Determine if the DFA accepts an input string:

trans_table[NumSTATES][NumCHARS]
accept_states[NumSTATES]
state = INITIAL

while (state != Error) {
  c = input.read();
  if (c == EOF) break;
  state = trans_table[state][c];
}
return (state != Error) && accept_states[state];
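The pseudocode above, instantiated for the ab*a transition table, looks like this (an illustrative sketch; we encode the Error state as -1):

```java
// Sketch: table-driven acceptor for ab*a, following the slide's loop.
// Rows are states 0..2; column 0 = 'a', column 1 = 'b'; -1 = Error.
public class AbStarA {
    static final int[][] TRANS = {
        {  1, -1 },   // state 0: a -> 1, b -> Error
        {  2,  1 },   // state 1: a -> 2, b -> 1 (self-loop on b)
        { -1, -1 },   // state 2: no outgoing transitions
    };
    static final boolean[] ACCEPT = { false, false, true };

    static boolean accepts(String w) {
        int state = 0;
        for (int i = 0; i < w.length(); i++) {
            char c = w.charAt(i);
            if (c != 'a' && c != 'b') return false;  // not in the alphabet
            state = TRANS[state][c == 'a' ? 0 : 1];
            if (state == -1) return false;           // Error state
        }
        return ACCEPT[state];
    }
}
```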

40

RE → Finite automaton?

• Can we build a finite automaton for every regular expression?
• Strategy: build the finite automaton inductively, based on the definition of regular expressions
  – Base cases: a single transition on a, or a single ε-transition, from the start state to an accepting state

41

RE → Finite automaton?

• Alternation R|S: an optional move chooses between entering the R automaton and entering the S automaton
• Concatenation RS: the R automaton followed by the S automaton
• Recall: "?" indicates an optional move

42

NFA Definition

• A non-deterministic finite automaton (NFA) is an automaton where:
  – There may be ε-transitions (transitions that do not consume input characters)
  – There may be multiple transitions from the same state on the same input character

• Example: an automaton with edges labeled a and b where one state has two outgoing transitions on the same symbol

43

RE → NFA intuition

-?[0-9]+

    0 --'-'--> 1 --[0-9]--> 2 --ε--> 3
    0 --ε----> 1            3 --[0-9]--> 2     (accepting state: 3)

When to take the ε-path?

NFA construction (Thompson)

• An NFA only needs one stop state (why?)
• Canonical NFA form: a single start state and a single accepting (stop) state
• Use this canonical form to inductively construct NFAs for regular expressions

45

Inductive NFA Construction

• RS: connect the stop state of the R automaton to the start state of the S automaton by an ε-transition
• R|S: a new start state with ε-transitions into the R and S automata, and ε-transitions from both of their stop states into a new stop state
• R*: wrap the R automaton with ε-transitions that allow skipping it entirely, plus an ε-transition from its stop state back to its start state to repeat


DFA vs NFA

• DFA: the action of the automaton on each input symbol is fully determined
  – obvious table-driven implementation
• NFA:
  – the automaton may have a choice at each step
  – the automaton accepts a string if there is any way to make choices that arrives at an accepting state
  – every path from the start state to an accept state spells a string accepted by the automaton
  – not obvious how to implement!

Simulating an NFA

• Problem: how to execute an NFA? "The strings accepted are those for which there is some corresponding path from the start state to an accept state."
• Solution: search all paths in the graph consistent with the string, in parallel.
  – Keep track of the subset of NFA states the search could be in after seeing the string prefix.
  – "Multiple fingers" pointing into the graph.
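The "multiple fingers" search can be sketched directly for the -?[0-9]+ NFA (the edges and ε-transitions below are our reconstruction of the slide's diagram: 0 -'-'→ 1, 0 -ε→ 1, 1 -digit→ 2, 2 -ε→ 3, 3 -digit→ 2, accepting state 3):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of NFA simulation by state sets ("multiple fingers").
public class NfaSim {
    // ε-closure for this particular NFA: eps edges 0->1 and 2->3.
    static Set<Integer> closure(Set<Integer> s) {
        Set<Integer> t = new HashSet<>(s);
        if (t.contains(0)) t.add(1);
        if (t.contains(2)) t.add(3);
        return t;
    }

    static boolean accepts(String w) {
        Set<Integer> states = closure(Set.of(0));   // start: {0,1}
        for (char c : w.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : states) {
                if (s == 0 && c == '-') next.add(1);
                if ((s == 1 || s == 3) && c >= '0' && c <= '9') next.add(2);
            }
            states = closure(next);   // advance every "finger" at once
        }
        return states.contains(3);    // some path reached the accept state
    }
}
```

Running it on -23 visits exactly the state sets listed on the next slide: {0,1}, {1}, {2,3}, {2,3}.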

49

Example

• Input string: -23
• NFA states:
  – Start: {0,1}
  – "-"  : {1}
  – "2"  : {2,3}
  – "3"  : {2,3}
• But this is very difficult to implement directly.

50

NFA → DFA conversion

• Can convert an NFA directly to a DFA by the same approach
• Create one DFA state for each distinct subset of NFA states that could arise
  – States: {0,1}, {1}, {2,3}
• Called the "subset construction"

    {0,1} --'-'--> {1}
    {0,1} --[0-9]--> {2,3}
    {1}   --[0-9]--> {2,3}
    {2,3} --[0-9]--> {2,3}     (accepting state: {2,3})

51

Algorithm

• For a set S of states in the NFA, compute ε-closure(S) = the set of states reachable from states in S by zero or more ε-transitions:

    T = S
    repeat
      T = T ∪ { s' | s ∈ T, (s, s') is an ε-transition }
    until T remains unchanged
    ε-closure(S) = T

• For a set S of ε-closed states in the NFA, compute DFAedge(S, c) = the set of states reachable from states in S by transitions on symbol c, followed by ε-transitions:

    DFAedge(S, c) = ε-closure( { s' | s ∈ S, (s, s') is a c-transition } )

52

Algorithm

DFA-initial-state = ε-closure(NFA-initial-state)
Worklist = { DFA-initial-state }

While (Worklist not empty)
  Pick a state S from the Worklist
  For each character c
    S' = DFAedge(S, c)
    If (S' not in DFA states)
      Add S' to the DFA states and the Worklist
    Add an edge (S, S') labeled c in the DFA

For each DFA state S
  If S contains an NFA final state
    Mark S as a DFA final state
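A sketch of ε-closure, DFAedge, and the worklist loop, hard-coded to the -?[0-9]+ NFA from the example (states 0..3, ε edges 0→1 and 2→3; all digits are collapsed into one symbol class 'd'; illustrative only, not a general implementation):

```java
import java.util.*;

// Sketch of the subset construction on the -?[0-9]+ NFA.
public class SubsetConstruction {
    static Set<Integer> epsClosure(Set<Integer> s) {
        Set<Integer> t = new TreeSet<>(s);
        boolean changed = true;
        while (changed) {                 // repeat until T is unchanged
            changed = false;
            if (t.contains(0) && t.add(1)) changed = true;  // eps 0 -> 1
            if (t.contains(2) && t.add(3)) changed = true;  // eps 2 -> 3
        }
        return t;
    }

    static Set<Integer> dfaEdge(Set<Integer> s, char c) {
        Set<Integer> moved = new TreeSet<>();
        for (int q : s) {
            if (q == 0 && c == '-') moved.add(1);
            if ((q == 1 || q == 3) && c == 'd') moved.add(2);
        }
        return epsClosure(moved);
    }

    // Worklist loop: returns all reachable (non-empty) DFA states.
    static Set<Set<Integer>> dfaStates() {
        Set<Integer> start = epsClosure(Set.of(0));
        Set<Set<Integer>> states = new HashSet<>();
        Deque<Set<Integer>> worklist = new ArrayDeque<>(List.of(start));
        states.add(start);
        while (!worklist.isEmpty()) {
            Set<Integer> s = worklist.pop();
            for (char c : new char[] { '-', 'd' }) {
                Set<Integer> s2 = dfaEdge(s, c);
                if (!s2.isEmpty() && states.add(s2))
                    worklist.push(s2);    // newly discovered DFA state
            }
        }
        return states;
    }
}
```

The loop discovers exactly the three DFA states from the diagram: {0,1}, {1}, and {2,3}.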

53

Putting the Pieces Together

Regular expression R
        ↓
RE → NFA conversion
        ↓
NFA → DFA conversion
        ↓
DFA simulation on input string w
        ↓
Yes, if w ∈ L(R); No, if w ∉ L(R)

54

OPTIMIZATIONS

55

State minimization

• State minimization is an optimization that converts a DFA to another DFA that recognizes the same language and has a minimum number of states.
  – Divide all states into "equivalence" groups.
    • Two states p and q are equivalent if, for all symbols, the outgoing edges either lead to error or to the same destination group.
  – Collapse the states of a group into a single state (instead of p and q, have a single state).

56

State Minimization (Equivalence)

• More formally, all states in a group Gi are equivalent iff for any two states p and q in Gi, and for every symbol σ, transition(p,σ) and transition(q,σ) are either both Error, or are states in the same group Gj (possibly Gi itself).
• For example: p and q in Gi both move to group Gj on a, to group Gk on b, and to Gi itself (or Error) on c.

57

58

State Minimization

• Step 1. Partition the states of the original DFA into maximal-sized groups of "equivalent" states S = {G1, …, Gn}
• Step 2. Construct the minimized DFA such that there is one state for each group Gi

DFA Minimization

• Step 1. Partition the states of the original DFA into maximal-sized groups of equivalent states:
  – Step 1a. Discard states not reachable from the start state
  – Step 1b. The initial partition is S = {Final, Non-final}
  – Step 1c. Repeatedly refine the partition {G1, …, Gn} while some group Gi contains states p and q such that, for some symbol σ, the transitions from p and q on σ go to different groups Gj and Gk (or Error), j ≠ k

59

DFA Minimization

• When such p and q are found, the group Gi is split: the states that agree with p stay in Gi and the others move to a new group Gi', so that within each group all states agree on the destination group (or Error) for every symbol.
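Steps 1b and 1c can be sketched as a small partition-refinement loop. The example DFA here is ours, not from the slides: it accepts a+ over the one-symbol alphabet {a} with a redundant state (0 -a→ 1 -a→ 2 -a→ 2, accepting {1, 2}), so states 1 and 2 should end up merged:

```java
import java.util.*;

// Sketch of partition refinement (Steps 1b/1c) on a small example DFA.
public class DfaMinimize {
    static final int[][] TRANS = { { 1 }, { 2 }, { 2 } }; // TRANS[state][symbol]
    static final boolean[] FINAL = { false, true, true };

    static int groupOf(List<Set<Integer>> partition, int state) {
        for (int i = 0; i < partition.size(); i++)
            if (partition.get(i).contains(state)) return i;
        return -1;
    }

    static List<Set<Integer>> minimize() {
        // Step 1b: initial partition is {Non-final, Final}.
        Set<Integer> fin = new TreeSet<>(), nonfin = new TreeSet<>();
        for (int s = 0; s < TRANS.length; s++)
            (FINAL[s] ? fin : nonfin).add(s);
        List<Set<Integer>> partition = new ArrayList<>(List.of(nonfin, fin));

        // Step 1c: split any group whose members disagree on the
        // destination group of their (single-symbol) transition.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Set<Integer> g : new ArrayList<>(partition)) {
                Map<Integer, Set<Integer>> bySig = new HashMap<>();
                for (int p : g)
                    bySig.computeIfAbsent(groupOf(partition, TRANS[p][0]),
                                          k -> new TreeSet<>()).add(p);
                if (bySig.size() > 1) {          // states disagree: split g
                    partition.remove(g);
                    partition.addAll(bySig.values());
                    changed = true;
                }
            }
        }
        return partition;
    }
}
```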

After state minimization

• We have an optimized acceptor:

Regular expression R
        ↓
RE → NFA
        ↓
NFA → DFA
        ↓
Minimize DFA
        ↓
DFA simulation on input string w
        ↓
Yes, if w ∈ L(R); No, if w ∉ L(R)

62

Lexical Analyzers vs Acceptors

• We really need a lexer, not an acceptor.
• Lexical analyzers use the same mechanism, but they:
  – Have multiple RE descriptions for multiple tokens
  – Output a sequence of matching tokens (or an error)
  – Always return the longest matching token
  – For multiple longest matching tokens, use rule priorities

63

Lexical Analyzers

REs for all valid tokens: R1 … Rn
        ↓
RE → NFA  →  NFA → DFA  →  Minimize DFA
        ↓
DFA simulation on the program's character stream
        ↓
Token stream (and errors)

64

Handling Multiple REs

• Construct one NFA for each RE (e.g., whitespace, identifier, number, keywords)
• Associate the final state of each NFA with the given RE
• Combine the NFAs for all REs into one NFA
• Convert the NFA to a minimized DFA, associating each final DFA state with the highest-priority RE of the corresponding NFA states

65

Using Roll Back

Consider three REs, {aa, ba, aabb}, and the input: aaba

• Reach state 3 with no transition on the next character a
• Roll the input back to the position on entering state 2 (i.e., having read aa)
• Emit the token for aa
• On the next call to the scanner, start in state 0 again, with input ba

DFA: 0 --a--> 1 --a--> 2 --b--> 3 --b--> 4,  0 --b--> 5 --a--> 6
     (accepting states: 2 for aa, 6 for ba, 4 for aabb)
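The roll-back behaviour can be sketched for these three REs. Instead of the DFA, this illustrative version simply remembers the last accepting position and rolls the input back to it when no longer match exists (a real scanner would track the last accepting *state* while running the DFA):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of maximal munch with roll-back for the REs {aa, ba, aabb}.
public class RollBack {
    static final String[] TOKENS = { "aa", "ba", "aabb" };

    static boolean isToken(String s) {
        for (String t : TOKENS)
            if (t.equals(s)) return true;
        return false;
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            int lastAccept = -1;          // end of the longest match so far
            for (int end = pos + 1; end <= input.length(); end++)
                if (isToken(input.substring(pos, end)))
                    lastAccept = end;     // remember last accepting position
            if (lastAccept < 0)
                throw new IllegalArgumentException("lexical error at " + pos);
            out.add(input.substring(pos, lastAccept));  // emit longest token
            pos = lastAccept;             // roll back to the last accept
        }
        return out;
    }
}
```

On input aaba, the scan past "aa" toward "aabb" gets stuck at "aab", rolls back, emits aa, then restarts and emits ba, exactly as the slide describes.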

Automatic Lexer Generators

• Input: a token specification
  – a list of regular expressions in priority order
  – an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
• Output: a lexer program
  – a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: "Unexpected character")

66

Automatic Lexer (C) Generator

67

Example: JLex (Java)

%%
digits = 0|[1-9][0-9]*
letter = [A-Za-z]
identifier = {letter}({letter}|[0-9_])*
whitespace = [\ \t\n\r]+
%%
{whitespace} { /* discard */ }
{digits}     { return new Token(INT, Integer.parseInt(yytext())); }
"if"         { return new Token(IF, yytext()); }
"while"      { return new Token(WHILE, yytext()); }
…
{identifier} { return new Token(ID, yytext()); }

68

Example Output (Java)

• A Java lexer which implements the functionality described in the language specification.
• For instance:

case 5: { return new Token(WHILE, yytext()); }
case 6: break;
case 2: { return new Token(ID, yytext()); }
case 7: break;
case 4: { return new Token(IF, yytext()); }
case 8: break;
case 1: { return new Token(INT, Integer.parseInt(yytext())); }

69

Start States

• A mechanism that specifies the state in which to start the execution of the DFA
• Declare states in the second section
  – %state STATE
• Use states as prefixes of regular expressions in the third section:
  – <STATE> regex { action }
• Set the current state in the actions
  – yybegin(STATE)
• There is a pre-defined initial state: YYINITIAL

70

Example

Two start states, YYINITIAL and STRING; reading " switches between them, and . matches any other character within each state.

%%
%state STRING
%%
<YYINITIAL> "if" { return new Token(IF, null); }
<YYINITIAL> "\"" { yybegin(STRING); … }
<STRING>    "\"" { yybegin(YYINITIAL); … }
<STRING>    .    { … }
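The same two-state idea can be hand-coded: the scanner keeps a mode variable that plays the role of yybegin/YYINITIAL (a sketch; the token names and the skip-everything-else rule are ours, not from the JLex spec above):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of start states: a mode variable switches the active rule set.
public class ModalScanner {
    enum Mode { YYINITIAL, STRING }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        Mode mode = Mode.YYINITIAL;
        StringBuilder str = new StringBuilder();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (mode == Mode.YYINITIAL) {
                if (input.startsWith("if", pos)) { tokens.add("IF"); pos += 2; }
                else if (c == '"') { mode = Mode.STRING; str.setLength(0); pos++; }
                else pos++;                        // skip everything else
            } else {                               // Mode.STRING
                if (c == '"') { mode = Mode.YYINITIAL; tokens.add("STR(" + str + ")"); }
                else str.append(c);                // the <STRING> . rule
                pos++;
            }
        }
        return tokens;
    }
}
```

The point of the mode is that " means "end of string" only inside STRING mode, and "start of string" only in YYINITIAL, which a single flat rule set cannot express.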

71

72

Summary

• A lexical analyzer converts a text stream to tokens
• Ad-hoc lexers are hard to get right and hard to maintain
• For most languages, legal tokens are conveniently and precisely defined using regular expressions
• Lexer generators generate a lexer automaton automatically from the tokens' REs and their prioritization

73

Summary

• To write your own lexer:
  – Describe tokens using regular expressions.
  – Construct NFAs for those tokens.
    • If you have no ambiguities in the NFA, or you have a DFA directly from the regular expressions, you are done.
  – Construct a DFA from the NFA using the algorithm described.
  – Systematically implement the DFA using transition tables.

74

Reading

• IC language spec
• JLex manual
• CVS manual
• Links on the course web home page
• "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)", Russ Cox, January 2007

Acknowledgement

The slides are based on similar content by Tim Teitelbaum, Cornell.
