Lexical Analysis III: Recognizing Tokens

Lecture 4
CS 4318/5331
Apan Qasem

Texas State University

Spring 2015

Announcements

• Assg 1 due this Friday at 11:59 PM

• Test instances on github

• No lecture at RRC this week

Lexical Analysis

int main() {

for (i = 0; i < MAX; i++)

printf("Hello World");

Scanner

<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <ID,i>

<SEP,;> <KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;>

<ID,i> <OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,"> <STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>

What do we do if we encounter a missing semi-colon?

Nothing!

Lexical Analysis

int main() {

int i;

for (i = 0; i < MAX; i++)

abcprintf("Hello World");

Scanner

<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <ID,i>

<SEP,;> <KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;> <ID,i>

<OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,abcprintf> <OP,(> <OP,">

<STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>

What do we do if we encounter an undefined function name?

Nothing!

Lexical Analysis

intmain(){inti;for(i=0;i<MAX;i++)printf("Hello World");}

Scanner

<ID,intmain> <OP,(> <OP,)> <OP,{> <ID,inti> <SEP,;>

<KEYWORD,for> <OP,(> <ID,i> <OP,=> <CONST,0> <SEP,;> <ID,i>

<OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,">

<STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>

Legal C program? No

Passes scanner? Yes

Lexical Analysis

int main() {

int %$*&i;

for (i = 0; i < MAX; i++)

printf("Hello World");

Scanner

<KEYWORD,int> <ID,main> <OP,(> <OP,)> <OP,{> <KEYWORD,int> <OP,%>

<ID,$> <OP,*> <OP,&> <ID,i> <SEP,;> <KEYWORD,for> <OP,(> <ID,i>

<OP,=> <CONST,0> <SEP,;> <ID,i> <OP,<> <ID,MAX> <SEP,;> <ID,i> <OP,++> <OP,)> <ID,printf> <OP,(> <OP,"> <STR,Hello World> <OP,"> <OP,)> <SEP,;> <OP,}>

What's an illegal C program at the scanner phase?

Very few! C/C++ has become too large!
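
The point of these slides is that the scanner only groups characters into tokens and never checks whether the tokens form a legal program. A minimal sketch of such a scanner is given below. The function name scan, the output format, and the keyword list (only int and for) are assumptions made for illustration; string literals are not handled, and '$' is treated as an identifier character only to mirror the <ID,$> token on the slide.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Identifier characters; '$' is included only to mirror <ID,$> on the slide. */
static int is_ident_char(int c) {
    return isalnum(c) || c == '_' || c == '$';
}

/* Hypothetical helper: writes "<KIND,lexeme> " for each token of src into out
   (outsz must be at least 1) and returns the number of tokens written. */
int scan(const char *src, char *out, size_t outsz) {
    int count = 0;
    size_t used = 0;
    const char *p = src;
    out[0] = '\0';
    while (*p != '\0') {
        const char *start = p;
        const char *kind;
        unsigned char c = (unsigned char)*p;
        if (isspace(c)) { p++; continue; }
        if (isdigit(c)) {
            /* integer constant: a run of digits */
            while (isdigit((unsigned char)*p)) p++;
            kind = "CONST";
        } else if (is_ident_char(c)) {
            /* identifier or keyword: longest run of identifier characters */
            while (is_ident_char((unsigned char)*p)) p++;
            size_t n = (size_t)(p - start);
            kind = (n == 3 && (strncmp(start, "int", 3) == 0 ||
                               strncmp(start, "for", 3) == 0)) ? "KEYWORD" : "ID";
        } else if (c == ';') {
            p++;
            kind = "SEP";
        } else {
            /* everything else is an operator; "++" is the only two-character one */
            p += (p[0] == '+' && p[1] == '+') ? 2 : 1;
            kind = "OP";
        }
        int w = snprintf(out + used, outsz - used, "<%s,%.*s> ",
                         kind, (int)(p - start), start);
        if (w < 0 || (size_t)w >= outsz - used) {
            out[used] = '\0';   /* drop a token that did not fit */
            break;
        }
        used += (size_t)w;
        count++;
    }
    return count;
}
```

Under these assumptions, scanning int %$*&i; yields seven tokens and no error, and intmain(){inti; yields the identifiers intmain and inti rather than keywords, as on the slides.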

Breaking Down Lexical Analysis Further …

1. Specify patterns for tokens
   • Look at language description and identify the types of tokens needed for the language
      • usually trivial
   • Use regular expressions to specify a pattern for each token
      • patterns for some tokens are trivial

2. Recognize patterns in the input stream and generate tokens for the parser

Recognizing Tokens

• We can specify the regular expression while for the while keyword in C

• How do we recognize it if we see it in the input stream?
   • Essentially a pattern-matching algorithm

Code for Recognizing while

if (nextchar() == 'w') {
    if (nextchar() == 'h') {
        if (nextchar() == 'i') {
            if (nextchar() == 'l') {
                if (nextchar() == 'e')
                    return KEYWORD_WHILE;
                else { /* do something */ }
            } else { /* do something */ }
        } else { /* do something */ }
    } else { /* do something */ }
} else { /* do something */ }

This approach works for more complex REs as well:

while (nextchar() == 'a' || …)

Need to decide what to do for strings like when

Need to account for strings like whileabc

Need to account for strings like abcwhile

Can we generate this code automatically?

Code for Recognizing while

Each ‘if clause’ represents a state

The state is determined solely based on what we have seen so far in the input stream

No need to go back and rescan input

At each state we make a decision to move to a new state based on the next input symbol

This is exactly the idea behind (deterministic) finite state machines
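
The nested ifs can be rewritten with the state made explicit. The sketch below is one possible rendering, not the course's reference code: the names while_step and accepts_while, and the encoding of states as the integers 0 through 5 with -1 as the error state, are assumptions. The input is a string rather than the nextchar() stream, so that the cases when, whileabc and abcwhile can be checked directly.

```c
#include <stddef.h>

/* States 0..5 mirror the nested ifs: state k means the first k characters of
   "while" have been seen. S_ERR is an explicit error state. */
enum { S_ERR = -1, S_FINAL = 5 };

/* Transition function: the next state depends only on the current state
   and the next input symbol. */
static int while_step(int state, char c) {
    static const char word[] = "while";
    if (state >= 0 && state < S_FINAL && c == word[state])
        return state + 1;
    return S_ERR;
}

/* Returns 1 if and only if s is exactly "while". Each character is read
   once; the input is never rescanned. */
int accepts_while(const char *s) {
    int state = 0;
    for (size_t i = 0; s[i] != '\0' && state != S_ERR; i++)
        state = while_step(state, s[i]);
    return state == S_FINAL;
}
```

With this encoding, when fails at the third character, whileabc reaches the final state and then falls into the error state on a, and abcwhile fails on the first character.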

Recognizing Tokens

General idea

• Consume a character from the input stream

• Based on the value of the character, move to a new state

• If the character just consumed
   • produces a valid token and there are no more characters to consume, then DONE
   • leads to a valid token, move to a valid state
   • produces an invalid token, go to the error state and finish

• Repeat

The above recognizes one token

Recognizing Tokens

• Need to construct a recognizer based on regular expressions

• A recognizer for a regular expression is a machine that recognizes the language described by the RE

• Given an input string constructed from the alphabet, the recognizer will
   • Say "yes" if the string is in the language (ACCEPT)
   • Say "no" if the string is not in the language (REJECT)

• Implications
   • Must produce a yes or no answer on every input
   • Cannot say yes when the string is not in the language (false positives)

RE and DFA

For every RE there is a recognizer that recognizes the corresponding RL

If you build it … it will be recognizable!

The recognizers are called deterministic finite automata (DFAs)

Kleene’s Theorem (1952)

Deterministic Finite Automata

Formal mathematical construct

• Abstract state machines that can recognize regular languages

• A set of states with transitions defined on each input symbol on every state

• Formal definition in Text (Section 2.2.1)

• Convenient to reason about DFAs using state transition diagrams

DFA Diagram

s0 --i--> s1 --n--> s2 --t--> s3

• initial state: s0 (only one initial state)

• final state: s3 (can have multiple final states)

• error state: error states sometimes implicit

Acceptance Criteria for DFAs

• A DFA accepts a string if and only if the DFA ends up in a final state after consuming all input symbols

• Implications
   • A DFA built to recognize int will reject intmain
   • A DFA built to recognize intmain will reject int

Easy fix if we want the machine to recognize int AND intmain
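
One reading of the "easy fix" is to keep a single chain of states for intmain and mark two of them as final. The sketch below assumes that reading; the function name accepts_int_or_intmain and the numbering of states 0 through 7 are illustrative, not taken from the slides.

```c
/* Chain of states 0..7 along the path i-n-t-m-a-i-n; -1 is the error state. */
enum { ERR = -1, LAST = 7 };

static const char path[] = "intmain";

/* State 3 (after "int") and state 7 (after "intmain") are both final. */
static const int is_final[LAST + 1] = { 0, 0, 0, 1, 0, 0, 0, 1 };

/* Accepts if and only if the DFA is in a final state after consuming
   every input symbol. */
int accepts_int_or_intmain(const char *s) {
    int state = 0;
    for (; *s != '\0' && state != ERR; s++)
        state = (state < LAST && *s == path[state]) ? state + 1 : ERR;
    return state != ERR && is_final[state];
}
```

With only state 3 marked final the machine would reject intmain, and with only state 7 marked final it would reject int, which matches the two implications above.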

DFA Example : if

s0 --i--> s1 --f--> s2

DFA for if | int

s0 --i--> s1 --f--> s3

s0 --i--> s2 --n--> s4 --t--> s5

Non-determinism: s0 has two transitions on i

DFA Example : Integers

Σ = {0-9}

Digit : 0|1|2|3|… |9

Integer : 0 | (1|2|3|… |9)(Digit)*
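
The Integer pattern needs only three useful states plus an error state. The sketch below is one way to write that DFA by hand; the state names and the function name accepts_integer are assumptions made for illustration.

```c
/* DFA for Integer : 0 | (1|2|...|9)(Digit)*
   S0     initial state
   S_ZERO final, reached on a single 0; no further digits allowed
   S_NUM  final, reached on a leading 1-9; loops on any digit
   S_ERR  error state */
int accepts_integer(const char *s) {
    enum { S0, S_ZERO, S_NUM, S_ERR } state = S0;
    for (; *s != '\0'; s++) {
        char c = *s;
        switch (state) {
        case S0:
            if (c == '0') state = S_ZERO;
            else if (c >= '1' && c <= '9') state = S_NUM;
            else state = S_ERR;
            break;
        case S_NUM:
            state = (c >= '0' && c <= '9') ? S_NUM : S_ERR;
            break;
        default:            /* S_ZERO and S_ERR have no way out */
            state = S_ERR;
            break;
        }
    }
    return state == S_ZERO || state == S_NUM;
}
```

Note that the pattern deliberately excludes leading zeros, so 007 is rejected while 0 and 1024 are accepted.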

REs and DFAs

Every RL has a DFA that recognizes it, and every DFA has a corresponding RL

There are algorithms that allow us to convert an RE to a DFA and vice versa

We can automate scanning!

To convert REs to DFAs we need to first look at non-deterministic finite automata (NFA)

Non-determinism

DFAs do not allow non-determinism

• Must have a transition defined on every state on every possible input symbol

• Cannot move to a new state without consuming an input symbol

• Cannot have multiple transitions on the same input symbol

• NFAs are DFAs with ε-transitions: moves that consume no input symbol

• To run NFAs, start at the initial state and guess the right transition at each step
   • Always guess correctly
   • If some sequence of correct guesses leads to a final state then accept

Sounds dubious, but works!

NFA for if | int

s0 --i--> s1 --f--> s3

s0 --i--> s2 --n--> s4 --t--> s5

NFA, multiple transitions on i in state s0
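
"Always guess correctly" can be simulated without guessing by tracking every state the NFA could be in at once. The sketch below does this for the if | int machine using a bitmask of states. The state numbering (s0 to s1 and s2 on i, s1 to s3 on f, s2 to s4 on n, s4 to s5 on t) is a reconstruction of the diagram and should be treated as an assumption, as are the function names.

```c
/* Transition function of the NFA: returns a bitmask of successor states,
   where bit q stands for state sq. An empty mask means no transition. */
static unsigned delta(int state, char c) {
    switch (state) {
    case 0: return c == 'i' ? (1u << 1) | (1u << 2) : 0u;  /* two edges on i */
    case 1: return c == 'f' ? (1u << 3) : 0u;
    case 2: return c == 'n' ? (1u << 4) : 0u;
    case 4: return c == 't' ? (1u << 5) : 0u;
    default: return 0u;                                    /* s3 and s5 are dead ends */
    }
}

/* Accepts if some sequence of choices ends in a final state (s3 or s5). */
int nfa_accepts(const char *s) {
    unsigned current = 1u << 0;            /* start in s0 only */
    for (; *s != '\0'; s++) {
        unsigned next = 0u;
        for (int q = 0; q < 6; q++)
            if (current & (1u << q))
                next |= delta(q, *s);      /* follow every possible edge */
        current = next;
    }
    return (current & ((1u << 3) | (1u << 5))) != 0u;
}
```

After reading i the set of possible states is {s1, s2}; the next symbol decides which branch survives.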

NFA and DFA

• Although NFAs allow non-determinism, it has been shown that NFAs and DFAs are equivalent!

Scott and Rabin (1959)

• DFAs are just specialized forms of NFAs

• NFAs and DFAs both recognize the same set of languages
   • Can simulate a DFA with an NFA
   • Can construct corresponding DFAs for any NFA

• Implication
   • For every RE there is also an NFA

Relatively easy to construct an NFA from an RE

RE to NFA : Empty String

1. ε is a regular expression that denotes {ε}, the set that contains the empty string

RE to NFA : Symbol

2. For each a ∈ Σ, a is a regular expression denoting {a}, the set containing the string a.

s0 --a--> s1

RE to NFA : Union

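3. r | s is an RE denoting L(r) ∪ L(s)
   e.g., RE = a | b, L(RE) = {a, b}

[Diagram: the NFAs for a (s0 --a--> s1) and b (s0 --b--> s1) are renamed to s1 --a--> s3 and s2 --b--> s4 and combined into one NFA for a | b]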

RE to NFA : Concatenation

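4. rs is an RE denoting L(r)L(s)
   e.g., RE = ab, L(RE) = {ab}

[Diagram: the NFAs for a (s0 --a--> s1) and b (s0 --b--> s1) are chained into one NFA over states s0 to s3 that reads a and then b]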

RE to NFA : Closure

5. r* is an RE denoting L(r)*
   e.g., RE = a*, L(RE) = {ε, a, aa, aaa, aaaa, …}

[Diagram: the NFA for a (s0 --a--> s1) is extended into an NFA for a* over states s0 to s3]

RE to NFA

• The algorithm for converting REs to NFAs is known as Thompson's construction
   • Repeated application of the five conversion rules!
   • Named after Ken Thompson (1968)

Example : NFA for a(b|c)*

Work inside parentheses: b | c

[Diagram: s0 --b--> s1 and s0 --c--> s1]

Rename states and adjust final states

[Diagram: s1 --b--> s3 and s2 --c--> s4, joined by a new initial state s0 and a new final state s5]

Step 3: * (closure): (b | c)*

[Diagram: states renamed to s2 --b--> s4 and s3 --c--> s5, wrapped by s0, s1, s6 and s7]

Step 4: concatenation

[Diagram: the finished NFA uses states s0 to s9, with the a transition leaving s0 and with s4 --b--> s5 and s6 --c--> s7]
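
The construction steps for a(b|c)* can be sketched in code by giving each rule a small function that returns an NFA fragment. Everything named here (Frag, Edge, sym, alt, cat, star, run, matches_a_bc_star) is hypothetical, and the ε-edge layout follows the common textbook form of Thompson's construction, which may differ in detail from the diagrams on the slides.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_STATES 32
#define EPS '\0'                 /* label used for an epsilon edge */

typedef struct { int from, to; char label; } Edge;
typedef struct { int start, accept; } Frag;   /* NFA fragment */

static Edge edges[4 * MAX_STATES];
static int n_edges, n_states;

static int new_state(void) { return n_states++; }
static void add_edge(int from, int to, char label) {
    edges[n_edges].from = from;
    edges[n_edges].to = to;
    edges[n_edges].label = label;
    n_edges++;
}

/* Rule 2: a single symbol */
static Frag sym(char c) {
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, f.accept, c);
    return f;
}

/* Rule 3: union r | s, with a new start and a new final state */
static Frag alt(Frag r, Frag s) {
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, r.start, EPS);
    add_edge(f.start, s.start, EPS);
    add_edge(r.accept, f.accept, EPS);
    add_edge(s.accept, f.accept, EPS);
    return f;
}

/* Rule 4: concatenation rs */
static Frag cat(Frag r, Frag s) {
    Frag f;
    add_edge(r.accept, s.start, EPS);
    f.start = r.start;
    f.accept = s.accept;
    return f;
}

/* Rule 5: closure r* */
static Frag star(Frag r) {
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, r.start, EPS);
    add_edge(f.start, f.accept, EPS);    /* zero repetitions */
    add_edge(r.accept, r.start, EPS);    /* repeat */
    add_edge(r.accept, f.accept, EPS);
    return f;
}

/* Adds every state reachable through epsilon edges alone. */
static void closure(bool set[MAX_STATES]) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < n_edges; i++)
            if (edges[i].label == EPS && set[edges[i].from] && !set[edges[i].to]) {
                set[edges[i].to] = true;
                changed = true;
            }
    }
}

/* Simulates the NFA by tracking the set of all possible current states. */
static bool run(Frag nfa, const char *s) {
    bool cur[MAX_STATES] = { false };
    cur[nfa.start] = true;
    closure(cur);
    for (; *s != '\0'; s++) {
        bool next[MAX_STATES] = { false };
        for (int i = 0; i < n_edges; i++)
            if (edges[i].label == *s && cur[edges[i].from])
                next[edges[i].to] = true;
        closure(next);
        memcpy(cur, next, sizeof cur);
    }
    return cur[nfa.accept];
}

/* Builds the NFA for a(b|c)* in the order used above, then runs it on s. */
bool matches_a_bc_star(const char *s) {
    n_edges = n_states = 0;
    Frag a = sym('a');
    Frag b = sym('b');
    Frag c = sym('c');
    return run(cat(a, star(alt(b, c))), s);
}
```

This build allocates ten states, the same count as s0 through s9 in the final diagram, although the numbering differs.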

Cycle of Construction

RE --Thompson's Construction--> NFA --Subset Construction--> DFA --Hopcroft's Algorithm--> Minimized DFA
