
LEXICAL ANALYSIS

Phung Hua Nguyen

University of Technology

2006

Faculty of IT - HCMUT Lexical Analysis 2

Outline

• Introduction to Lexical Analysis
• Token specification
  – Language
  – Regular Expressions (REs)
• Token recognition
  – REs → NFA (Thompson’s construction, Algorithm 3.3)
  – NFA → DFA (subset construction, Algorithm 3.2)
  – DFA → minimal DFA (Algorithm 3.6)
• Programming


Introduction

• Read the input characters

• Produce as output a sequence of tokens

• Eliminate white space and comments

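[Figure: source program → lexical analyzer ⇄ parser. The parser asks the lexical analyzer to "get next token" and receives a token in return; both components use the symbol table.]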


Why separate lexical analysis from parsing?

• Simplify design

• Improve compiler efficiency

• Enhance compiler portability


Tokens, Patterns, Lexemes

Token      Sample lexemes          Informal description of pattern
const      const                   const
if         if                      if
relation   <, <=, ==, !=, >, >=    < or <= or == or != or > or >=
id         pi, count, x2           letter followed by letters or digits
num        3.14, 25, 6.02E3        any numeric constant
literal    "core dumped"           any characters between " and " except "




Alphabet, Strings and Languages

• Alphabet ∑: any finite set of symbols
  – The Vietnamese alphabet {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – The binary alphabet {0, 1}
  – The ASCII alphabet
• String: a finite sequence of symbols drawn from ∑
  – Length |s| of a string s: the number of symbols in s
  – The empty string, denoted ε, has length |ε| = 0
• Language: any set of strings over ∑
  – Its two special cases:
    • ∅: the empty set
    • {ε}


Examples of Languages

• ∑ = {a, á, à, ả, ã, ạ, b, c, d, đ, …}
  – The Vietnamese language
• ∑ = {0, 1}
  – A string is an instruction
  – The set of Pentium instructions
• ∑ = the ASCII set
  – A string is a program
  – The set of C programs


Terms (Fig.3.7)

Term                 Definition
prefix of s          a string obtained by removing 0 or more trailing symbols of s;
                     e.g. ban is a prefix of banana
suffix of s          a string formed by deleting 0 or more leading symbols of s;
                     e.g. na is a suffix of banana
substring of s       a string obtained by deleting a prefix and a suffix from s;
                     e.g. nan is a substring of banana
proper prefix,       any nonempty string x that is, respectively, a prefix, suffix
suffix or            or substring of s such that s ≠ x
substring of s
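The terms above map directly onto methods of java.lang.String. The following is a minimal sketch; the class name StringTerms and its method names are chosen here for illustration only.

```java
// Sketch: the definitions of prefix, suffix and substring, checked with String methods.
final class StringTerms {
    // x is a prefix of s: s with 0 or more trailing symbols removed.
    static boolean isPrefix(String x, String s) { return s.startsWith(x); }

    // x is a suffix of s: s with 0 or more leading symbols removed.
    static boolean isSuffix(String x, String s) { return s.endsWith(x); }

    // x is a substring of s: s with a prefix and a suffix deleted.
    static boolean isSubstring(String x, String s) { return s.contains(x); }

    // "Proper" additionally requires x to be nonempty and different from s.
    static boolean isProperPrefix(String x, String s) {
        return !x.isEmpty() && !x.equals(s) && isPrefix(x, s);
    }
}
```

For example, StringTerms.isPrefix("ban", "banana") holds, while StringTerms.isProperPrefix("banana", "banana") does not, because a proper prefix must differ from s.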


String operations

• String concatenation
  – If x and y are strings, xy is the string formed by appending y to x.
    E.g.: x = hom, y = nay ⇒ xy = homnay
  – ε is the identity: εy = y; xε = x
• String exponentiation
  – s^0 = ε
  – s^i = s^(i-1)s for i > 0
    E.g. s = 01: s^0 = ε, s^2 = 0101, s^3 = 010101
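The recursive definition of exponentiation can be written almost verbatim in Java. This is a small sketch; the name StringPower is hypothetical, and the empty Java string "" plays the role of ε.

```java
// Sketch: string exponentiation, following s^0 = ε and s^i = s^(i-1) s.
final class StringPower {
    static String power(String s, int i) {
        if (i < 0) throw new IllegalArgumentException("exponent must be >= 0");
        // Base case: s^0 is the empty string; otherwise append s to s^(i-1).
        return i == 0 ? "" : power(s, i - 1) + s;
    }
}
```

With s = "01", power(s, 2) gives "0101" and power(s, 3) gives "010101", matching the example on the slide.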


Language Operations (Fig 3.8)

Term                    Definition
union: L ∪ M            L ∪ M = { s | s ∈ L or s ∈ M }
concatenation: LM       LM = { st | s ∈ L and t ∈ M }
Kleene closure: L*      L* = L^0 ∪ L ∪ LL ∪ LLL ∪ …, where L^0 = {ε}
                        (0 or more concatenations of L)
positive closure: L+    L+ = L ∪ LL ∪ LLL ∪ …
                        (1 or more concatenations of L)


Examples

• L = {A, B, …, Z, a, b, …, z}
• D = {0, 1, …, 9}

Example      Language
L ∪ D        letters and digits
LD           strings consisting of a letter followed by a digit
L^4          all four-letter strings
L*           all strings of letters, including ε
L(L ∪ D)*    all strings of letters and digits beginning with a letter
D+           all strings of one or more digits
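For finite languages the operations can be tried out with ordinary Java sets. The sketch below is illustrative only: LanguageOps and its method names are hypothetical, and because L* is infinite in general, closureUpTo computes only the finite part L^0 ∪ L^1 ∪ … ∪ L^n.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch: language operations on finite sets of strings.
final class LanguageOps {
    // L ∪ M: every string that is in L or in M.
    static Set<String> union(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>(l);
        result.addAll(m);
        return result;
    }

    // LM: every st with s taken from L and t taken from M.
    static Set<String> concat(Set<String> l, Set<String> m) {
        Set<String> result = new TreeSet<>();
        for (String s : l)
            for (String t : m) result.add(s + t);
        return result;
    }

    // L^0 ∪ L^1 ∪ ... ∪ L^n, a finite approximation of the Kleene closure L*.
    static Set<String> closureUpTo(Set<String> l, int n) {
        Set<String> power = new TreeSet<>();
        power.add("");                           // L^0 = {ε}
        Set<String> result = new TreeSet<>(power);
        for (int i = 1; i <= n; i++) {
            power = concat(power, l);            // L^i = L^(i-1) L
            result.addAll(power);
        }
        return result;
    }
}
```

With L = {a, b} and D = {0, 1}, concat(L, D) yields {a0, a1, b0, b1}, which illustrates why concatenation needs "s ∈ L and t ∈ M".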


Regular Expressions (REs) over Alphabet ∑

• Inductive base:
  1. ε is a RE, denoting the regular language (RL) {ε}
  2. a ∈ ∑ is a RE, denoting the RL {a}
• Inductive step: Suppose r and s are REs, denoting the languages L(r) and L(s). Then
  3. (r)|(s) is a RE, denoting the RL L(r) ∪ L(s)
  4. (r)(s) is a RE, denoting the RL L(r)L(s)
  5. (r)* is a RE, denoting the RL (L(r))*
  6. (r) is a RE, denoting the RL L(r)


Precedence and Associativity

• Precedence:
  – "*" has the highest precedence
  – concatenation has the second highest precedence
  – "|" has the lowest precedence
• Associativity:
  – all are left-associative

Unnecessary parentheses can be removed.
E.g.: (a)|((b)*(c)) can be written as a|b*c


Example

• ∑ = {a, b}

1. a|b denotes {a,b}

2. (a|b)(a|b) denotes {aa,ab,ba,bb}

3. a* denotes {ε, a, aa, aaa, aaaa, …}

4. (a|b)* denotes ?

5. a|a*b denotes ?


Notational Shorthands

• One or more instances +: r+ = rr*
  – denotes the language (L(r))+
  – has the same precedence and associativity as *
• Zero or one instance ?: r? = r|ε
  – denotes the language (L(r) ∪ {ε})
• Character classes
  – [abc] denotes a|b|c
  – [A-Z] denotes A|B|…|Z
  – [a-zA-Z_][a-zA-Z0-9_]* denotes ?
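The same shorthands exist in java.util.regex, so they can be tried out directly. This is a sketch under the assumption that Java's pattern syntax is an acceptable stand-in for the formal notation (it is a richer notation than the REs defined here); the class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch: the shorthands +, ? and character classes as java.util.regex patterns.
final class Shorthands {
    static final Pattern DIGITS = Pattern.compile("[0-9]+");    // r+ = rr*
    static final Pattern SIGNED = Pattern.compile("-?[0-9]+");  // r? = r|ε
    static final Pattern NAME   = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // True iff the whole string s belongs to the language denoted by p.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```

NAME accepts x2 and _count but rejects 2x, which suggests that the last character-class pattern on the slide describes identifiers that start with a letter or underscore.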




Overview

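[Figure: conversion pipeline RE → NFA (Algorithm 3.3) → DFA (Algorithm 3.2) → minimal DFA, mDFA (Algorithm 3.6), with a direct RE → DFA edge labeled 3.5]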


Nondeterministic finite automata

• A nondeterministic finite automaton (NFA) is a mathematical model that consists of
  – a finite set of states S
  – a set of input symbols ∑
  – a transition function move that maps state-symbol pairs to sets of states: S × (∑ ∪ {ε}) → 2^S
  – a start state s0
  – a set of final or accepting states F


Transition graph

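[Figure: legend of transition-graph notation: a state, a transition (e.g. an edge labeled a from state A to state B), the start state, and a final state]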


Transition table

State    Input symbol
         a        b
0        {0,1}    {0}
1        -        {2}
2        -        {3}


Acceptance

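• An NFA accepts an input string x iff there is some path in the transition graph from the start state to some accepting state such that the edge labels along this path spell out x.

[Figure: automaton with states A and B, an edge from A to B labeled 0 and an edge from B to A labeled 1]

Input 01010: A →0→ B →1→ A →0→ B →1→ A →0→ B
Input 01011: A →0→ B →1→ A →0→ B →1→ A →1→ ? (error: no transition)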
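Acceptance can be checked mechanically by tracking the set of all states the NFA could be in after each input symbol. The sketch below encodes the transition table shown earlier (states 0 to 2 over a and b). It assumes that state 3 is the only accepting state, which the table itself does not state, and it ignores ε-transitions because that table has none. The class name NfaTable is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: simulating the NFA given by the transition table.
final class NfaTable {
    // MOVE.get(state).get(symbol) is the set of possible next states.
    static final Map<Integer, Map<Character, Set<Integer>>> MOVE = new HashMap<>();
    static final int START = 0;
    static final int ACCEPTING = 3;   // assumption: not given explicitly in the table

    static void edge(int from, char symbol, int... to) {
        Set<Integer> targets = MOVE.computeIfAbsent(from, k -> new HashMap<>())
                                   .computeIfAbsent(symbol, k -> new TreeSet<>());
        for (int t : to) targets.add(t);
    }

    static {
        edge(0, 'a', 0, 1);   // row 0: a -> {0,1}
        edge(0, 'b', 0);      // row 0: b -> {0}
        edge(1, 'b', 2);      // row 1: b -> {2}
        edge(2, 'b', 3);      // row 2: b -> {3}
    }

    // x is accepted iff some path labeled x ends in the accepting state.
    static boolean accepts(String x) {
        Set<Integer> current = new TreeSet<>();
        current.add(START);
        for (char c : x.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                Map<Character, Set<Integer>> row = MOVE.get(s);
                if (row != null && row.containsKey(c)) next.addAll(row.get(c));
            }
            current = next;   // all states reachable after reading c
        }
        return current.contains(ACCEPTING);
    }
}
```

Under that assumption the automaton accepts exactly the strings over {a, b} that end in abb, so accepts("aabb") is true and accepts("abba") is false.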


Deterministic finite automata

• A deterministic finite automaton (DFA) is a special case of NFA in which

1. no state has an ε-transition, and

2. for each state s and input symbol a, there is at most one edge labeled a leaving s.


Thompson’s construction of NFA from REs

• guided by the syntactic structure of the RE r

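• For ε: [Figure: start state i and final state f, joined by an edge labeled ε]

• For a in ∑: [Figure: start state i and final state f, joined by an edge labeled a]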


Thompson’s construction (cont’d)

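• Suppose N(s) and N(t) are NFAs for REs s and t
  – For s|t: [Figure: new start state i with ε-edges to the start states of N(s) and N(t); their final states have ε-edges to a new final state f]
  – For st: [Figure: N(s) followed by N(t); i is the start state of N(s), f is the final state of N(t), and the final state of N(s) is merged with the start state of N(t)]
  – For s*: [Figure: new start state i and new final state f around N(s), with ε-edges i → N(s), N(s) → f, i → f, and from the final state of N(s) back to its start state]
  – For (s), use N(s) itself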
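The cases above can be sketched as four small combinators that build NFA fragments, plus a simulator to check the result. This is an illustrative sketch, not the algorithm as printed: all names are hypothetical, the character '\0' stands for ε (so inputs must not contain it), and concat links N(s) to N(t) with an ε-edge instead of merging the two states, which accepts the same language.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch: Thompson-style construction of an NFA from RE operators.
final class Thompson {
    static final char EPS = '\0';                        // stands for ε (assumption)
    static final List<int[]> EDGES = new ArrayList<>();  // each edge is {from, symbol, to}
    static int stateCount = 0;

    // An NFA N(r) with exactly one start state i and one final state f.
    static final class Frag {
        final int i, f;
        Frag(int i, int f) { this.i = i; this.f = f; }
    }

    static int newState() { return stateCount++; }
    static void edge(int from, char symbol, int to) { EDGES.add(new int[] {from, symbol, to}); }

    // For ε (pass EPS) and for a in ∑: a single edge from i to f.
    static Frag symbol(char a) {
        Frag n = new Frag(newState(), newState());
        edge(n.i, a, n.f);
        return n;
    }

    // For s|t: new i and f, ε-edges into and out of N(s) and N(t).
    static Frag union(Frag s, Frag t) {
        Frag n = new Frag(newState(), newState());
        edge(n.i, EPS, s.i); edge(n.i, EPS, t.i);
        edge(s.f, EPS, n.f); edge(t.f, EPS, n.f);
        return n;
    }

    // For st: an ε-edge from the final state of N(s) to the start state of N(t).
    static Frag concat(Frag s, Frag t) {
        edge(s.f, EPS, t.i);
        return new Frag(s.i, t.f);
    }

    // For s*: skip N(s) entirely, or pass through it one or more times.
    static Frag star(Frag s) {
        Frag n = new Frag(newState(), newState());
        edge(n.i, EPS, s.i); edge(n.i, EPS, n.f);
        edge(s.f, EPS, s.i); edge(s.f, EPS, n.f);
        return n;
    }

    // All states reachable from 'states' on ε-edges alone.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> stack = new ArrayDeque<>(states);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int[] e : EDGES)
                if (e[0] == s && e[1] == EPS && result.add(e[2])) stack.push(e[2]);
        }
        return result;
    }

    static boolean accepts(Frag n, String x) {
        Set<Integer> current = new TreeSet<>();
        current.add(n.i);
        current = closure(current);
        for (char c : x.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int[] e : EDGES)
                if (e[1] == c && current.contains(e[0])) next.add(e[2]);
            current = closure(next);
        }
        return current.contains(n.f);
    }
}
```

For instance, (a|b)*abb is built as concat(concat(concat(star(union(symbol('a'), symbol('b'))), symbol('a')), symbol('b')), symbol('b')), and the resulting fragment accepts abb and babb but not ab.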




Subset construction

Operation       Description
ε-closure(s)    Set of NFA states reachable from state s on ε-transitions alone
ε-closure(T)    Set of NFA states reachable from some state s in T on ε-transitions alone
move(T, a)      Set of NFA states to which there is a transition on input a from some state s in T

• s : an NFA state

• T : a set of NFA states


Subset construction (cont’d)

Let s0 be the start state of the NFA;
initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        DTran[T, a] := U;
    end;
end;


DFA

• Let (∑, S, T, F, s0) be the original NFA. The DFA is:

• The alphabet: ∑
• The states: all states in Dstates
• The transitions: DTran
• The accepting states: all states in Dstates containing at least one accepting state in F of the NFA
• The start state: ε-closure(s0)
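The subset construction loop and the resulting DFA can be sketched in Java as follows. The representation is an assumption made for this example: NFA edges are int triples {from, symbol, to}, '\0' stands for ε, and a DFA state is simply a sorted set of NFA state numbers. The keys of the returned map play the role of Dstates and the map itself is DTran.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch: subset construction (NFA to DFA).
final class SubsetConstruction {
    static final char EPS = '\0';   // stands for ε (assumption of this sketch)

    // ε-closure(T): NFA states reachable from some state in T on ε-edges alone.
    static Set<Integer> closure(List<int[]> edges, Set<Integer> t) {
        Set<Integer> result = new TreeSet<>(t);
        Deque<Integer> stack = new ArrayDeque<>(t);
        while (!stack.isEmpty()) {
            int s = stack.pop();
            for (int[] e : edges)
                if (e[0] == s && e[1] == EPS && result.add(e[2])) stack.push(e[2]);
        }
        return result;
    }

    // move(T, a): NFA states reachable from some state in T on input a.
    static Set<Integer> move(List<int[]> edges, Set<Integer> t, char a) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : edges)
            if (e[1] == a && t.contains(e[0])) result.add(e[2]);
        return result;
    }

    // Returns DTran: DFA state -> input symbol -> DFA state.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(
            List<int[]> edges, int s0, char[] alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtran = new LinkedHashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        Set<Integer> start = new TreeSet<>();
        start.add(s0);
        start = closure(edges, start);            // the DFA start state
        dtran.put(start, new LinkedHashMap<>());
        unmarked.add(start);
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.remove();   // mark T
            for (char a : alphabet) {
                Set<Integer> u = closure(edges, move(edges, t, a));
                if (!dtran.containsKey(u)) {      // U is not in Dstates yet
                    dtran.put(u, new LinkedHashMap<>());
                    unmarked.add(u);
                }
                dtran.get(t).put(a, u);           // DTran[T, a] := U
            }
        }
        return dtran;
    }
}
```

A DFA state is then accepting when it contains at least one accepting NFA state. Note that the empty set can appear as a DFA state; it acts as a dead state from which nothing is accepted.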




Minimise a DFA

Initially, create two states (groups of DFA states):
1. one is the set of all final states: F
2. the other is the set of all non-final states: S - F

while (more splits are possible) {
    Let G = {s1, …, sn} be one of the current states and c be any char in ∑
    Let t1, …, tn be the successor states to s1, …, sn under c
    if (t1, …, tn don't all belong to the same state) {
        Split G into new states so that si and sj remain in the same state iff ti and tj are in the same state
    }
}


Example

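[Figure: DFA with states A, B, C, D, E over {a, b}; E is the only final state]

Step 1: {A,B,C,D} {E}
    For a, the successors of A, B, C, D are B, B, B, B
    For b, the successors of A, B, C, D are C, D, C, E
    Split: {A,B,C} {D} {E}

Step 2:
    For b, the successors of A, B, C are C, D, C
    Split: {A,C} {B} {D} {E}

Step 3:
    For a, the successors of A, C are B, B
    For b, the successors of A, C are C, C
    Terminate

[Figure: the minimized DFA with states A, B, D, E; C has been merged into A]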
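The splitting loop can be sketched in Java by giving every state a "signature": its current group followed by the groups of its successors. States with equal signatures stay together. This variant refines on all input symbols at once rather than one symbol at a time, which is a simplification made here; it arrives at the same final partition. The class name and the array encoding (next[s][c] is the successor of state s under symbol number c) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: DFA minimization by repeated splitting of groups.
final class MinimizeDfa {
    // Returns group[s]; two states share a number iff they end up merged.
    static int[] minimize(int[][] next, boolean[] isFinal) {
        int n = next.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = isFinal[s] ? 1 : 0;   // F and S - F
        int count = 0;                       // number of groups after the previous pass
        while (true) {
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int s = 0; s < n; s++) {
                // Signature: own group, then the group of each successor.
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int t : next[s]) signature.add(group[t]);
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                refined[s] = id;
            }
            if (ids.size() == count) return refined;   // no group was split
            count = ids.size();
            group = refined;
        }
    }
}
```

Encoding the example as A..E = 0..4 and a, b = 0, 1 gives next = {{1,2},{1,3},{1,2},{1,4},{1,2}} with only state 4 final. The last row (E goes to B on a and to C on b) is an assumption, since the steps above never list E's successors. The method then puts A and C in the same group and keeps B, D and E apart, matching {A,C} {B} {D} {E}.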




Input Buffering

[Figure: input buffer holding "begin…" followed by eof, read by the Scanner]

if (forward at end of first half) {
    reload second half
    forward++
} else if (forward at end of second half) {
    reload first half
    forward = 0
} else
    forward++


Input Buffering (cont’d)

[Figure: the same buffer with eof sentinels: one eof closes each half, and one marks the end of the input]

forward = forward + 1
if (forward↑ = eof) {
    if (forward at end of first half) {
        reload second half
        forward++
    } else if (forward at end of second half) {
        reload first half
        forward = 0
    } else
        terminate the analysis
}
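The sentinel scheme can be sketched in Java as a buffer of two halves, each followed by one sentinel slot. Everything below is an assumption for illustration: the class name, the use of '\0' as the eof character (so the input must not contain it), and the buffer layout with halves of n characters at indices 0..n-1 and n+1..2n. Handling of a lexeme that spans a reload is left out.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Sketch: two-half input buffer with eof sentinels.
final class SentinelBuffer {
    static final char EOF = '\0';   // sentinel character (assumption)
    private final Reader in;
    private final int n;            // size of each half, n >= 1
    private final char[] buf;       // [0..n-1] EOF [n+1..2n] EOF
    private int forward = -1;
    private boolean done = false;

    SentinelBuffer(Reader in, int n) {
        this.in = in;
        this.n = n;
        this.buf = new char[2 * n + 2];
        buf[n] = EOF;               // sentinel closing the first half
        buf[2 * n + 1] = EOF;       // sentinel closing the second half
        load(0);
    }

    // Fill the half starting at 'start'; a short read is closed with EOF.
    private void load(int start) {
        try {
            int got = 0;
            while (got < n) {
                int r = in.read(buf, start + got, n - got);
                if (r < 0) break;
                got += r;
            }
            if (got < n) buf[start + got] = EOF;   // end of the input
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Advance forward and return the character there, or EOF at end of input.
    char next() {
        if (done) return EOF;
        forward = forward + 1;
        if (buf[forward] == EOF) {
            if (forward == n) {                    // end of first half
                load(n + 1);
                forward++;
            } else if (forward == 2 * n + 1) {     // end of second half
                load(0);
                forward = 0;
            } else {                               // eof inside a half
                done = true;
                return EOF;
            }
        }
        if (buf[forward] == EOF) done = true;      // input ended exactly at a boundary
        return buf[forward];
    }

    // Convenience for experiments: read a whole string through the buffer.
    static String readAll(String text, int n) {
        SentinelBuffer b = new SentinelBuffer(new StringReader(text), n);
        StringBuilder out = new StringBuilder();
        for (char c = b.next(); c != EOF; c = b.next()) out.append(c);
        return out.toString();
    }
}
```

The point of the sentinels is visible in next(): the common case costs a single comparison against EOF, and the two end-of-half tests run only when that comparison succeeds.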


Transition Diagrams

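relop → <= | < | <>
    state 0 (start): on < go to state 1
    state 1: on = go to state 2; on > go to state 3; on other go to state 4
    state 2 (final): return(relop, LE)
    state 3 (final): return(relop, NE)
    state 4 (final): return(relop, LT)

id → letter(letter|digit)*
    state 5 (start): on letter go to state 6
    state 6: on letter or digit stay in state 6; on other go to state 7
    state 7 (final): return(id, lexeme)

A transition diagram is a DFA in which no edge leaves a final state.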


Implementation

Token nexttoken() {
    while (true) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else state = fail(0);
            break;
        case 1:
            c = nextchar();
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2:
            retract(0);
            return new Token(relop, "<=");
        case 3:
            // state 3 of the relop transition diagram: "<>"
            retract(0);
            return new Token(relop, "<>");
        case 4:
            retract(1);
            return new Token(relop, "<");
        case 5:
            c = nextchar();
            if (Character.isLetter(c)) state = 6;
            else state = fail(5);
            break;
        case 6:
            c = nextchar();
            if (Character.isLetter(c) || Character.isDigit(c)) continue;
            else state = 7;
            break;
        case 7:
            retract(1);
            return new Token(id, getLexeme());
        }
    }
}


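Implementation (cont’d)

int fail(int current_state) {
    forward = beginning;          // rescan the same input with the next diagram
    switch (current_state) {
    case 0: return 5;             // relop did not match: try id
    case 5: error();              // no diagram matches
    }
}

void retract(int flag) {
    if (flag == 1)
        move forward back         // give back the extra character that was read
    get lexeme from beginning to forward
    move forward onward
    beginning = forward
    state = 0
}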

b│e│g│i│n│:│=│ │ │…
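For experimenting with the two transition diagrams, the following self-contained Java sketch scans a string into relop and id tokens. It is a simplification, not the slide's state-switch code: it works on a String rather than an input buffer, encodes the states directly in control flow, represents tokens as plain strings, and skips blanks. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a direct-coded scanner for relop (<, <=, <>) and id.
final class TinyScanner {
    // Returns tokens such as "relop(<=)" or "id(x2)".
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int beginning = 0;                       // start of the current lexeme
        while (beginning < input.length()) {
            char c = input.charAt(beginning);
            int forward = beginning + 1;         // one past the last character read
            if (c == ' ') {                      // eliminate white space
                beginning = forward;
                continue;
            }
            if (c == '<') {                      // states 0 and 1
                char d = forward < input.length() ? input.charAt(forward) : '\0';
                if (d == '=' || d == '>') forward++;     // states 2 and 3
                // otherwise state 4: the lookahead character is not consumed
                tokens.add("relop(" + input.substring(beginning, forward) + ")");
            } else if (Character.isLetter(c)) {  // states 5 and 6
                while (forward < input.length()
                        && Character.isLetterOrDigit(input.charAt(forward))) forward++;
                // state 7: the character that ended the id is not consumed
                tokens.add("id(" + input.substring(beginning, forward) + ")");
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
            beginning = forward;                 // the next lexeme starts here
        }
        return tokens;
    }
}
```

Not advancing forward past the character that ends a lexeme plays the role of retract(1) in the code above: that character is left in place to start the next token.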

