CS 321 Programming Languages and Compilers Lectures 16 & 17 Introduction to Formal Languages Regular Languages Lexical Analysis

Upload: jeffery-clarke

Post on 13-Jan-2016


Page 1: CS 321 Programming Languages and Compilers Lectures 16 & 17 Introduction to Formal Languages Regular Languages Lexical Analysis

CS 321 Programming Languages and Compilers

Lectures 16 & 17

Introduction to Formal Languages

Regular Languages

Lexical Analysis

Page 2: Languages

• Have a finite vocabulary

• Have finite length sentences

• Have possibly infinitely many sentences

Page 3: Grammars and Recognizers

• A Grammar is a finitary method by which all sentences of a language, L, may be generated via well-defined rules.

• A Recognizer is a procedure which, given a “string” x, answers “yes” if x ∈ L.

• We usually also want it to answer “no” if x ∉ L (i.e. we usually demand an algorithm).

Page 4: (Context-Free) Grammars

• Def. A (context-free or Chomsky Type-2) grammar (cfg) is a 4-tuple

G = (N, Σ, P, S)

where

– N is a finite, non-empty set of symbols (non-terminal vocabulary)

– Σ is a finite set of symbols (terminal vocabulary)

– N ∩ Σ = ∅

– V = N ∪ Σ (vocabulary)

– S ∈ N (goal symbol)

– P is a finite subset of N × V* (production rules)

Page 5: Set Operations

• Def. Let X and Y be sets of words

XY = {xy | x ∈ X and y ∈ Y}

X^0 = {ε} (where ε represents the empty string)

X^1 = X

X^(i+1) = X^i X

X* = ∪ (i ≥ 0) X^i

X+ = ∪ (i > 0) X^i (so X+ = X* X)
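The definitions above translate almost directly into set comprehensions. The following is a minimal sketch, assuming words are Python strings and ε is the empty string `""`; the names `concat`, `power` and `star_up_to` are illustrative, and since X* is infinite in general only a bounded approximation is computed.

```python
from itertools import product

EPSILON = ""  # assumption: the empty Python string plays the role of ε


def concat(X, Y):
    """XY = {xy | x in X and y in Y}."""
    return {x + y for x, y in product(X, Y)}


def power(X, i):
    """X^0 = {ε} and X^(i+1) = X^i X."""
    result = {EPSILON}
    for _ in range(i):
        result = concat(result, X)
    return result


def star_up_to(X, n):
    """Finite approximation of X*: the union of X^0 through X^n."""
    result = set()
    for i in range(n + 1):
        result |= power(X, i)
    return result


X = {"a", "b"}
Y = {"c"}
print(sorted(concat(X, Y)))    # ['ac', 'bc']
print(sorted(power(X, 2)))     # ['aa', 'ab', 'ba', 'bb']
print(len(star_up_to(X, 2)))   # 7
```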

Page 6: Example

• G = (N, Σ, P, E)

where

N = {E, T, F}

Σ = {[, ], +, *, id}

P = {(E,T), (E,E+T), (T,F), (T,T*F), (F,id), (F,[E])}

• (so V = N ∪ Σ = {E, T, F, [, ], +, *, id})

• (A, α) ∈ P is usually written

A → α

or A ::= α

or A : α

Page 7: Convention

• Given G = (N, Σ, P, S) (with V = N ∪ Σ)

(or G = (V, Σ, P, S) with N = V − Σ)

– elements of N: A, B, …

– elements of V: … U, V, W, X, Y, Z

– elements of Σ: a, b, …

– elements of Σ*: … u, v, w, x, y, z

– elements of V*: lowercase Greek letters α, β, γ, …

• others:

– names (not underlined): ∈ N

– S: ∈ N

– underlined or courier font: ∈ Σ

– special symbols: ∈ Σ

– a single symbol may also be used to denote a whole production rule A → α

Page 8: Generating L

• How to use a grammar, G, to generate a sentence in L(G):

• Begin with a string, α, consisting of only the goal symbol.

• repeat

select from α a non-terminal “A” and “rewrite” A according to some production (A, β), thereby producing α’ from α

until α’ ∈ Σ*

Page 9: Example

G = (N, Σ, P, S) where P is (abbreviated) as follows:

E → T | E + T

T → F | T * F

F → id | < E >

and where

N = {E, T, F, Q}

Σ = {+, *, <, >, id}

S = E

Page 10: Regular Sets

• Regular sets (also called regular languages) are defined as follows. Let Σ be a finite alphabet.

1) ∅ is a regular set over Σ.

2) {ε} is a regular set over Σ.

3) ∀ a ∈ Σ, {a} is a regular set over Σ.

4) If P and Q are regular sets over Σ, then

a) P ∪ Q is a regular set over Σ.

b) PQ is a regular set over Σ.

c) P* is a regular set over Σ.

5) Nothing else is a regular set over Σ.

Page 11: Regular Expressions

1) ∅ denotes the regular set ∅.

2) ε denotes the regular set {ε}.

3) a denotes the regular set {a}.

4) If p and q are regular expressions denoting the regular sets P and Q respectively, then

a) (p|q) denotes P ∪ Q.

b) (pq) denotes PQ.

c) (p)* denotes P*.

5) Nothing else is a regular expression.

***

Notation: (p)+ = ((p)*p)

(p)? = (p | ε)

Page 12: Right-Linear Grammars (Generators for Regular Sets)

• Def. Let G = (N, Σ, P, S) be a cfg. G is said to be right-linear if

P ⊆ N × (Σ* ∪ Σ*N)

***

• Proposition. If G is a right-linear cfg then L(G) is a regular set over Σ.

• Proposition. If R is a regular set over Σ, then ∃ a right-linear cfg, G, for which L(G) = R.

Page 13: Finite Automata (Recognizers for Regular Sets)

Def. A deterministic finite automaton (deterministic finite state machine) is a 5-tuple:

M = (Q, Σ, δ, q0, F)

where

1) Q is a finite non-empty set of states.

2) Σ is a finite set of input symbols.

3) q0 ∈ Q (initial state)

4) F ⊆ Q (final states)

5) δ is a partial mapping from Q × Σ to Q (transition function or move function)

Page 14: Transition Diagrams

• FSMs are often visualized as transition diagrams.

[Figure: transition diagram over the states p, q, r and s, with a start arrow and edges labeled 0|1]

Page 15: Finite State Machines

• The preceding transition diagram can be represented by a tabular move function:

      0     1     
p     q     q     s
q     q     q     r
r     r     r
s     r     r

Page 16: Finite State Machines

[Figure: the preceding move-function table, annotated with the labels q0, Q and F]

Page 17: Formalizing the Moves of a FSM

• A pair (q, u) in Q × Σ* is called a configuration of M.

• (q0, u) is an initial configuration.

• M proceeds from one configuration to the next by moving according to the transition function:

(q, au) ⊢ (q’, u) if δ(q, a) = q’

(q, u) ⊢ … ⊢ (q’, v)

is written

(q, u) ⊢* (q’, v)

• The language accepted (or defined) by M is

L(M) = {u ∈ Σ* | (q0, u) ⊢* (q, ε) for some q ∈ F}

Note: Sometimes λ is used to denote the empty string.

Page 18: Example

• With the machine

M = ({p,q,r,s}, {0,1, }, δ, p, {q,r})

where the move function δ is shown in the preceding table.

• Question 1: Is 010 ∈ L(M)?

• Question 2: Is L(M)?

• Question 3: Is 010 L(M)?

Page 19: “Complete” Finite State Machines

• Extend δ:

      0     1     
p     q     q     s
q     q     q     r
r     r     r     t
s     r     r     t
t     t     t     t
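Completing a partial move function is mechanical: every undefined entry is routed to one extra non-final state that loops on all input. A minimal sketch, assuming δ is stored as a dictionary keyed by (state, symbol) and that the trap state name does not clash with an existing state; the two-state machine used here is hypothetical, not the one tabulated above.

```python
def complete(states, alphabet, delta, trap="t"):
    """Return (states, delta) with every missing move sent to `trap`."""
    total = dict(delta)
    missing = [(q, a) for q in states for a in alphabet if (q, a) not in delta]
    if not missing:
        return set(states), total          # already complete, no trap needed
    all_states = set(states) | {trap}
    for q in all_states:
        for a in alphabet:
            # setdefault keeps existing moves and fills only the gaps,
            # including the trap state's own self-loops.
            total.setdefault((q, a), trap)
    return all_states, total


# Hypothetical partial machine over {a, b}.
partial = {("p", "a"): "q", ("q", "b"): "q"}
states, delta = complete({"p", "q"}, "ab", partial)
print(sorted(states))        # ['p', 'q', 't']
print(delta[("p", "b")])     # t
```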

Page 20: Complete Finite State Machine: Transition Diagram Version

[Figure: the preceding transition diagram over the states p, q, r and s, extended with the additional state t, which loops on every input symbol]

Page 21: Non-deterministic FSMs

• A FSM may have a choice of moves, i.e. δ is a mapping from Q × Σ to 2^Q.

• Proposition. Let M1 be a non-deterministic FSM. Then ∃ a DFSM M2 for which L(M2) = L(M1).

• Proposition. Given a NFSM, M, one can construct a right-linear cfg, G, for which L(G) = L(M), and conversely.

Page 22: Extended Non-determinism

• Besides allowing multiple moves on the same input symbol, we can allow moves on the empty string, ε; i.e. for a given state q:

δ(q, ε) ⊆ Q

Page 23: Examples

[Figure: NFSM with states 0, 1, 2, 3; start at 0 with a loop labeled a|b, followed by edges labeled a, b, b]

[Figure: NFSM with states 0, 1, 2, 3, 4; start at 0; edges labeled a, b, b, a]

Page 24: Thompson’s Construction

• Given a regular expression, r, representing a regular set R, construct a non-deterministic finite state machine M that recognizes R, i.e. such that L(M) = R.

1) For ε construct

[Figure: start state i joined to final state f by an ε-edge]

Page 25: Thompson’s Construction

2) For a in Σ construct

[Figure: start state i joined to final state f by an edge labeled a]

Page 26: Thompson’s Construction

3) Suppose N(s) and N(t) are NFSM's for regular expressions s and t.

a) For the regular expression s|t, construct

[Figure: N(s) and N(t) in parallel between a new start state and a new final state f]

Page 27: Thompson’s Construction

b) For the regular expression st, construct:

[Figure: N(s) followed by N(t), from start state i to final state f]

Page 28: Thompson’s Construction

c) For the regular expression s*, construct

[Figure: N(s) placed between a new start state i and a new final state f]
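Rules 1 to 3 can be sketched as a recursive builder over a small expression tree. Everything named here is an assumption made for illustration: regular expressions are nested tuples such as `("alt", s, t)`, ε-edges carry the label `None`, and the class `Thompson` and function `accepts` are hypothetical helpers. For concatenation this sketch links N(s) to N(t) with an ε-edge rather than merging the two states, which recognizes the same language.

```python
from itertools import count

EPS = None  # assumed label for ε-edges


class Thompson:
    def __init__(self):
        self.edges = {}          # state -> list of (label, target)
        self._ids = count()

    def _new(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def _edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def build(self, r):
        """Return (start, final) of an NFSM fragment for the tree `r`."""
        kind = r[0]
        if kind == "eps":                       # rule 1
            i, f = self._new(), self._new()
            self._edge(i, EPS, f)
        elif kind == "sym":                     # rule 2
            i, f = self._new(), self._new()
            self._edge(i, r[1], f)
        elif kind == "alt":                     # rule 3a: s|t
            (si, sf), (ti, tf) = self.build(r[1]), self.build(r[2])
            i, f = self._new(), self._new()
            for start in (si, ti):
                self._edge(i, EPS, start)
            for final in (sf, tf):
                self._edge(final, EPS, f)
        elif kind == "cat":                     # rule 3b: st
            (i, sf), (ti, f) = self.build(r[1]), self.build(r[2])
            self._edge(sf, EPS, ti)             # ε-link instead of merging
        elif kind == "star":                    # rule 3c: s*
            si, sf = self.build(r[1])
            i, f = self._new(), self._new()
            self._edge(i, EPS, si)
            self._edge(i, EPS, f)               # skip N(s) entirely
            self._edge(sf, EPS, si)             # repeat N(s)
            self._edge(sf, EPS, f)
        else:
            raise ValueError(f"unknown node {kind!r}")
        return i, f


def accepts(nfa, start, final, text):
    """Check the result by tracking the set of reachable states."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for label, t in nfa.edges[s]:
                if label is EPS and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in text:
        step = {t for s in current for label, t in nfa.edges[s] if label == ch}
        current = closure(step)
    return final in current


a, b = ("sym", "a"), ("sym", "b")
regex = ("cat", ("cat", ("cat", ("star", ("alt", a, b)), a), b), b)  # (a|b)*abb
nfa = Thompson()
start, final = nfa.build(regex)
print(accepts(nfa, start, final, "babb"))   # True
print(accepts(nfa, start, final, "abab"))   # False
```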

Page 29: Transforming a NFSM to a DFSM (The Subset Construction)

• Define:

ε-closure(s ∈ Q) = {t ∈ Q | s can reach t via only ε-moves}

ε-closure(T ⊆ Q) = ∪ (s ∈ T) ε-closure(s)

move(T ⊆ Q, a ∈ Σ) = ∪ (s ∈ T) δ(s, a)

Page 30: NFSM → DFSM

• Given M = (Q, Σ, δ, q0, F) define M’ = (Q’, Σ, δ’, q’0, F’) by:

1) Compute q’0 = ε-closure(q0).

2) Initialize Q’ with q’0 (unmarked).

3) while ∃ an unmarked element q’ of Q’:

a) mark q’

b) ∀ a ∈ Σ:

-- compute p’ = ε-closure(move(q’, a))

-- if p’ ∉ Q’ then add p’ (unmarked) to Q’

-- set δ’(q’, a) = p’

4) F’ = { q’ ∈ Q’ | ∃ q ∈ q’ with q ∈ F }
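The four steps map onto a short worklist loop. This is a sketch under assumptions: the NFSM's δ is a dictionary from (state, symbol) to a set of states, ε is encoded as the empty string, and DFSM states are represented as `frozenset`s of NFSM states. The machine used to exercise it is the four-state NFSM for (a|b)*abb pictured earlier.

```python
EPS = ""  # assumed encoding of ε in the transition dictionary


def eps_closure(delta, states):
    """All states reachable from `states` via ε-moves only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def move(delta, states, a):
    """Union of δ(s, a) over all s in `states`."""
    return {t for s in states for t in delta.get((s, a), ())}


def subset_construction(alphabet, delta, q0, finals):
    start = eps_closure(delta, {q0})                 # step 1
    dstates, unmarked = {start}, [start]             # step 2
    ddelta = {}
    while unmarked:                                  # step 3
        q = unmarked.pop()                           # 3a: popping marks q'
        for a in alphabet:                           # 3b
            p = eps_closure(delta, move(delta, q, a))
            if p not in dstates:
                dstates.add(p)
                unmarked.append(p)
            ddelta[(q, a)] = p
    dfinals = {q for q in dstates if q & set(finals)}   # step 4
    return dstates, ddelta, start, dfinals


nfa_delta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
dstates, ddelta, dstart, dfinals = subset_construction("ab", nfa_delta, 0, {3})
print(len(dstates))                      # 4
print(sorted(sorted(q) for q in dfinals))   # [[0, 3]]
```

When some subset has no move on a symbol, this version adds the empty set as an explicit dead state, which also makes the resulting δ’ total.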

Page 31: Example

• Perform Thompson’s Construction on (a|b)*abb to obtain a non-deterministic finite state machine.

• Perform the subset construction to make it deterministic.

Page 32: Simulating a DFSM

s := q0
a := nextchar
while a ≠ eof {
    s := δ(s, a)
    a := nextchar
}
if s ∈ F then return “yes”
else return “no”
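A runnable rendering of this loop, offered as a sketch: iterating over a Python string stands in for `nextchar` and `eof`, δ is assumed to be a dictionary keyed by (state, symbol), and an undefined move is treated as rejection. The table `DELTA` is a hypothetical four-state DFSM for (a|b)*abb.

```python
def simulate_dfsm(delta, q0, finals, text):
    """Return True iff the DFSM accepts `text`."""
    s = q0
    for a in text:                     # plays the role of nextchar / eof
        if (s, a) not in delta:        # partial move function: no move, reject
            return False
        s = delta[(s, a)]
    return s in finals


# Hypothetical DFSM for (a|b)*abb with states A (start), B, C, D (final).
DELTA = {
    ("A", "a"): "B", ("A", "b"): "A",
    ("B", "a"): "B", ("B", "b"): "C",
    ("C", "a"): "B", ("C", "b"): "D",
    ("D", "a"): "B", ("D", "b"): "A",
}
print(simulate_dfsm(DELTA, "A", {"D"}, "aabb"))   # True
print(simulate_dfsm(DELTA, "A", {"D"}, "abab"))   # False
```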

Page 33: Simulating a NFSM

S := ε-closure({q0})
a := nextchar
while a ≠ eof {
    S := ε-closure(move(S, a))
    a := nextchar
}
if S ∩ F ≠ ∅ then return “yes”
else return “no”
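The same loop in runnable form, as a sketch: δ is assumed to map (state, symbol) to a set of states with ε encoded as `""`, and the three-transition machine below is a hypothetical NFSM for a*b* chosen because it needs an ε-move.

```python
EPS = ""  # assumed encoding of ε


def eps_closure(delta, states):
    """All states reachable from `states` via ε-moves only."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def simulate_nfsm(delta, q0, finals, text):
    """Track the set S of states the NFSM could be in after each symbol."""
    S = eps_closure(delta, {q0})
    for a in text:
        moved = {t for s in S for t in delta.get((s, a), ())}
        S = eps_closure(delta, moved)
    return bool(S & set(finals))       # accept iff S ∩ F is non-empty


# Hypothetical NFSM for a*b*: state 0 loops on a, ε-moves to 1, which loops on b.
NFA = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {1}}
print(simulate_nfsm(NFA, 0, {1}, "aabb"))   # True
print(simulate_nfsm(NFA, 0, {1}, "ba"))     # False
```

Simulating the set of states directly costs more per input symbol than a precomputed DFSM, but it avoids building the possibly much larger subset machine up front.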

Page 34: Transforming from NFSM to Right-Linear CFG

• Given M = (Q, Σ, δ, q0, F), construct G = (Q, Σ, P, q0) where

1) ∀ q ∈ F include in P

q → ε

2) ∀ q1, q2 ∈ Q; a ∈ Σ with q2 ∈ δ(q1, a) include in P

q1 → a q2

3) ∀ q1, q2 ∈ Q with q2 ∈ δ(q1, ε) include in P

q1 → q2
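The three rules amount to one pass over the final states and one pass over the transitions. A minimal sketch, assuming δ maps (state, symbol) to a set of states with ε encoded as `""`, and representing each production as a hypothetical pair (left side, tuple of right-side symbols); the input is a small hand-written machine for (a|b)*abb.

```python
EPS = ""  # assumed encoding of ε


def nfsm_to_grammar(delta, q0, finals):
    """Return (productions, goal symbol) for the right-linear grammar."""
    productions = []
    for q in sorted(finals):
        productions.append((q, ()))                  # rule 1: q -> ε
    for (q1, a), targets in sorted(delta.items()):
        for q2 in sorted(targets):
            rhs = (q2,) if a == EPS else (a, q2)      # rule 3 / rule 2
            productions.append((q1, rhs))
    return productions, q0


delta = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
    ("q2", "b"): {"q3"},
}
productions, goal = nfsm_to_grammar(delta, "q0", {"q3"})
for lhs, rhs in productions:
    print(lhs, "->", " ".join(rhs) if rhs else "ε")
```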

Page 35: Example

• Let M be:

[Figure: NFSM with states 0, 1, 2, 3; start at 0 with a loop labeled a|b, followed by edges labeled a, b, b]

(Note, this is not something obtained from Thompson’s Construction, but written by hand.)

• We have:

q0 → a q0 | b q0 | a q1

q1 → b q2

q2 → b q3

q3 → ε

Page 36: RLG → Regular Expression

• The algorithm resembles Gaussian Elimination.

• Notice that all of the “A-rules” can be “grouped” by the non-terminal on the right side of the right-part and “factored”:

A → α0 A

A → α1 A1

A → α2 A2

…

A → αn-1 An-1

A → αn

where the αi are regular expressions over Σ

Page 37: RLG → Regular Expression

• Then A can be written as the following regular expression over V:

A = α0* ( α1 A1 | α2 A2 | … | αn-1 An-1 | αn )

and the above regular expression can be substituted for A everywhere A appears in the grammar.

• Following that, all rules can again be written in the foregoing “factored” form.
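A short check of why the starred form solves the factored rules may help. It is stated here as a standard fact (often called Arden's rule) rather than something taken from the slides, and β is a name introduced only to abbreviate the non-recursive alternatives.

```latex
\begin{align*}
A &= \alpha_0 A \mid \beta,
  \qquad \beta = \alpha_1 A_1 \mid \cdots \mid \alpha_{n-1} A_{n-1} \mid \alpha_n \\
\alpha_0^{*}\beta
  &= (\varepsilon \mid \alpha_0\alpha_0^{*})\,\beta
   = \beta \mid \alpha_0\,(\alpha_0^{*}\beta)
\end{align*}
```

So substituting α0*β for A on both sides of the first line gives an identity, which is what licenses replacing A by that expression everywhere in the grammar.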

Page 38: RLG → Regular Expression

• Given a right-linear grammar G = (N, Σ, P, S):

A) repeat

1) write all rules in “factored” form.

2) choose some non-terminal, A ≠ S, to eliminate.

3) compute the regular expression, r, which is equivalent to A, and substitute r in place of A everywhere in G.

4) delete all A-rules from G

until only S-rules remain

B) compute the regular expression, r, to which S is equivalent.

Page 39: Example

• Recall

q0 → a q0 | b q0 | a q1
q1 → b q2
q2 → b q3
q3 → ε

• Rewrite

q0 → (a | b) q0 | a q1
q1 → b q2
q2 → b q3
q3 → ε

Page 40: Example

• Eliminate q3

q0 → (a | b) q0 | a q1
q1 → b q2
q2 → b

• Eliminate q2

q0 → (a | b) q0 | a q1
q1 → b b

• Eliminate q1

q0 → (a | b) q0 | a b b

• Compute q0

q0 = (a | b)* a b b

Page 41: Limitations of FSMs

• FSMs have a fixed number of states.

• For this reason, there are objects that cannot be recognized by FSMs.

• For example there is no FSM that can recognize palindromes of arbitrary length.

• The DO keyword in Fortran cannot be expressed as a regular expression.

Page 42: Minimization of DFSM’s

• Well-known algorithm (due to Hopcroft), useful in many other circumstances.

1) Initially partition Q into two groups, F and Q − F.

2) repeat

∀ group, G, of the partition, split G into multiple sub-groups if incompatible transitions are found among members of G.

until no further changes occur
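The repeat/until loop can be sketched as plain partition refinement. Note the assumptions: this version rescans every group on every pass, which is the simple refinement scheme and not Hopcroft's worklist-based optimization; δ is assumed total and stored as a dictionary keyed by (state, symbol); and the five-state DFSM for (a|b)*abb used as input is a hypothetical example.

```python
def minimize(states, alphabet, delta, finals):
    """Return the partition of `states` into groups of equivalent states.

    Assumes `delta` is total: every (state, symbol) pair has an entry.
    """
    finals = set(finals)
    partition = [g for g in (finals, set(states) - finals) if g]   # step 1
    while True:                                                    # step 2
        index = {q: i for i, g in enumerate(partition) for q in g}
        refined = []
        for group in partition:
            # Two states stay together only if, on every symbol, they move
            # into the same group of the current partition.
            buckets = {}
            for q in group:
                key = tuple(index[delta[(q, a)]] for a in alphabet)
                buckets.setdefault(key, set()).add(q)
            refined.extend(buckets.values())
        if len(refined) == len(partition):       # no group was split
            return refined
        partition = refined


# Hypothetical DFSM for (a|b)*abb; A is the start state, E the final state.
DELTA = {
    ("A", "a"): "B", ("A", "b"): "C",
    ("B", "a"): "B", ("B", "b"): "D",
    ("C", "a"): "B", ("C", "b"): "C",
    ("D", "a"): "B", ("D", "b"): "E",
    ("E", "a"): "B", ("E", "b"): "C",
}
groups = minimize("ABCDE", "ab", DELTA, {"E"})
print(sorted(sorted(g) for g in groups))
# [['A', 'C'], ['B'], ['D'], ['E']]
```

Each resulting group becomes one state of the minimized machine; here A and C merge because they agree on every transition.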

Page 43: Example

       0      1      
p      q1     q2     s
q1     q1     q2     r1
q2     q1     q2     r1
r1     r1     r1
r2     r2     r2
s      r2     r2

[Figure: the same table repeated, with the final states marked]

Page 44: Algebraic Properties

Axiom                                     Description

r | s = s | r                             | is commutative
r | (s | t) = (r | s) | t                 | is associative
(rs)t = r(st)                             concatenation is associative
r(s | t) = rs | rt;  (s | t)r = sr | tr   concatenation distributes over |
εr = r;  rε = r                           ε is the identity element for concatenation
r* = (r | ε)*                             relation between * and ε
r** = r*                                  * is idempotent

Page 45: Shorthand Notations

• (r)+ denotes one or more instances: r* = r+ | ε and r+ = rr*

• (r)? denotes zero or one instance: r? = r | ε

• [a-z] denotes a|b|c|…|z

Page 46: Examples

• [a-zA-Z]+ denotes a string of one or more letters

• [a-zA-Z][a-zA-Z0-9]* denotes valid identifiers in Fortran

• [0-9]+(.[0-9]+)?(E(+|-)?[0-9]+)? denotes valid unsigned Pascal numbers
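These patterns can be tried out with Python's `re` module; the sketch below is only an illustration of the notation, not a full lexer. Two details are choices made here: the literal dot is escaped as `\.` because an unescaped `.` matches any character in `re`, and the identifier pattern uses `*` after the first letter so that one-letter names are accepted. The helper name `matches` is hypothetical.

```python
import re

LETTERS = re.compile(r"[a-zA-Z]+")
IDENT = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")
UNSIGNED = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")


def matches(pattern, text):
    """True iff the whole of `text` is denoted by `pattern`."""
    return pattern.fullmatch(text) is not None


print(matches(IDENT, "x1"))           # True
print(matches(IDENT, "1x"))           # False
print(matches(UNSIGNED, "3.14E-2"))   # True
print(matches(UNSIGNED, "3."))        # False
```

`fullmatch` is used instead of `match` so that a pattern must account for the entire string, which mirrors membership in the regular set.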

Page 47: Extended Transition Diagrams for Parts of Pascal