speech & nlp: syntax & parsing

Speech & NLP Syntax & Parsing Vladimir Kulyukin

Upload: vladimir-kulyukin

Post on 14-Dec-2014


Page 1: Speech & NLP: Syntax & Parsing

Speech & NLP

Syntax & Parsing

Vladimir Kulyukin

Page 2: Speech & NLP: Syntax & Parsing

Outline

Syntax & Parsing

Context-Free Grammars

Definition

Epsilon Productions

Useful & Useless Symbols

Chomsky Normal Form (CNF)

Cocke-Younger-Kasami Algorithm & CFL Membership Problem

Page 3: Speech & NLP: Syntax & Parsing

Syntax & Parsing

Syntax in the NLP context refers to the study of sentence or text structure

Parsing is the process of assigning a parse tree to a string

A grammar is required to generate parse trees

Grammars for natural languages consist of syntactic categories and parts of speech

Page 4: Speech & NLP: Syntax & Parsing

Context-Free Grammars

Page 5: Speech & NLP: Syntax & Parsing

Context-Free Grammar (CFG): Definition

A CFG $G$ is a 4-tuple $G = (V, \Sigma, S, P)$, where

$V$ is the nonterminal alphabet;

$\Sigma$ is the terminal alphabet;

$S \in V$ is the start symbol;

$P$ is a finite set of productions; each production is of the form $X \to y$, where $X \in V$ and $y$ is a string over $V \cup \Sigma$.
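The 4-tuple can be written down directly as a small data structure. The following is a minimal sketch, not part of the slides: the class name `CFG`, its field names, and the `is_well_formed` helper are illustrative assumptions, and the check that the two alphabets are disjoint is a common convention that the definition above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """A context-free grammar G = (V, Sigma, S, P)."""
    nonterminals: frozenset  # V
    terminals: frozenset     # Sigma
    start: str               # S
    productions: tuple       # P: pairs (X, y) with y a tuple of symbols

    def is_well_formed(self) -> bool:
        """Check the side conditions of the definition."""
        symbols = self.nonterminals | self.terminals
        return (
            self.start in self.nonterminals
            # Assumption: V and Sigma share no symbols.
            and self.nonterminals.isdisjoint(self.terminals)
            # Every production is X -> y with X in V and y over V u Sigma.
            and all(
                lhs in self.nonterminals and all(sym in symbols for sym in rhs)
                for lhs, rhs in self.productions
            )
        )


# Hypothetical toy grammar: S -> a S b | a b
toy = CFG(
    nonterminals=frozenset({"S"}),
    terminals=frozenset({"a", "b"}),
    start="S",
    productions=(("S", ("a", "S", "b")), ("S", ("a", "b"))),
)
print(toy.is_well_formed())  # True
```

Storing each right-hand side as a tuple of symbols keeps multi-character symbols such as NP or VP unambiguous.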

Page 6: Speech & NLP: Syntax & Parsing

A Sample NL CFG

S → NP VP

S → AUX NP VP

S → VP

NP → DET NOMINAL

NOMINAL → NOUN

NOMINAL → NOUN NOMINAL

NP → ProperNoun

VP → VERB

DET → that | this | a

NOUN → left

AUX → does

VERB → make

PREP → from | to | on

ProperNoun → USU

NOMINAL → NOMINAL PP

PP → PREP NP

Page 7: Speech & NLP: Syntax & Parsing

Formal Context-Free Languages

Page 8: Speech & NLP: Syntax & Parsing

Example 01

Claim: Let $L = \{a^n b^n \mid n \ge 0\}$. Show that $L$ is context-free.

Proof: Consider the following CFG: $S \to aSb \mid \varepsilon$. By induction on $n$, show that $a^n b^n$ is derived from $S$.
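The induction can be made concrete by generating the derivation for a given n. The sketch below assumes the grammar on this slide is S → aSb | ε (the second alternative is a reading of the slide, not something the transcript shows clearly); the function name `derive` is hypothetical.

```python
def derive(n: int) -> list[str]:
    """Return the sentential forms of a derivation of a^n b^n from S.

    Assumes the grammar S -> a S b | epsilon.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    forms = ["S"]
    for _ in range(n):
        # Apply S -> a S b to the single occurrence of S.
        forms.append(forms[-1].replace("S", "aSb"))
    # Finish with S -> epsilon, which erases the remaining S.
    forms.append(forms[-1].replace("S", ""))
    return forms


for n in range(4):
    steps = derive(n)
    assert steps[-1] == "a" * n + "b" * n
    print(" => ".join(step or "ε" for step in steps))
```

Each loop iteration mirrors one inductive step: one more application of S → aSb adds one a on the left and one b on the right.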

Page 9: Speech & NLP: Syntax & Parsing

Example 02

Claim: Let $L = \{a^n b^{3n} \mid n \ge 0\}$. Show that $L$ is context-free.

Proof: Consider the following CFG: $S \to aSbbb \mid \varepsilon$. By induction on $n$, show that $a^n b^{3n}$ is derived from $S$.
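One way the induction might be written out, assuming the grammar on this slide is S → aSbbb | ε (the ε alternative is an assumption about the slide):

```latex
\begin{align*}
\textbf{Claim: } & S \Rightarrow^{*} a^{n} b^{3n} \text{ for all } n \ge 0. \\
\textbf{Base } (n = 0)\textbf{: } & S \Rightarrow \varepsilon = a^{0} b^{0}. \\
\textbf{Step: } & \text{if } S \Rightarrow^{*} a^{n} b^{3n}, \text{ then }
  S \Rightarrow aSbbb \Rightarrow^{*} a\,a^{n} b^{3n}\,bbb = a^{n+1} b^{3(n+1)}.
\end{align*}
```

This shows that every string of L is generated. A complete proof also needs the converse, that every terminal string derived from S has the form a^n b^{3n}, which follows by a similar induction on the length of the derivation.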

Page 10: Speech & NLP: Syntax & Parsing

Useful & Useless Symbols in CFGs

Page 11: Speech & NLP: Syntax & Parsing

Useful & Useless Symbols

Let G = (V, T, P, S) be a CFG

A symbol X is useful if there is a derivation S ⇒* αXβ ⇒* w for some α, β in (V ∪ T)* and some w in T*

A symbol X is useless if there is no such derivation

Page 13: Speech & NLP: Syntax & Parsing

Example: Useful & Useless Symbols

Suppose CFG G has the following productions:

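S → AB | a

A → a

A and B are useless symbols: B has no productions, so S → AB can never lead to a terminal string, and A appears in no other derivation from S

S is a useful symbol, since S ⇒ a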
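The example can be checked mechanically. The sketch below uses the standard two-pass procedure (first keep symbols that derive some terminal string, then keep symbols reachable from S); the slides do not spell this procedure out, and the function name `useful_symbols` and the (lhs, rhs) encoding of productions are assumptions made for illustration.

```python
def useful_symbols(terminals, productions, start):
    """Return the useful symbols of a CFG.

    productions: list of (lhs, rhs) pairs, rhs a tuple of symbols.
    """
    # Pass 1: generating symbols, i.e. those that derive a terminal string.
    generating = set(terminals)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in generating and all(sym in generating for sym in rhs):
                generating.add(lhs)
                changed = True

    # Drop every production that mentions a non-generating symbol.
    kept = [
        (lhs, rhs) for lhs, rhs in productions
        if lhs in generating and all(sym in generating for sym in rhs)
    ]

    # Pass 2: symbols reachable from the start symbol via kept productions.
    reachable = {start} if start in generating else set()
    frontier = list(reachable)
    while frontier:
        current = frontier.pop()
        for lhs, rhs in kept:
            if lhs == current:
                for sym in rhs:
                    if sym not in reachable:
                        reachable.add(sym)
                        frontier.append(sym)
    return reachable


P = [("S", ("A", "B")), ("S", ("a",)), ("A", ("a",))]
useful = useful_symbols({"a"}, P, "S")
useless = {"S", "A", "B", "a"} - useful
print(sorted(useful), sorted(useless))  # ['S', 'a'] ['A', 'B']
```

The order of the two passes matters: A derives a, but it becomes unreachable only after S → AB is discarded because B generates nothing.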

Page 14: Speech & NLP: Syntax & Parsing

Elimination of Useless Symbols & ɛ-Productions

Every non-empty context-free language that does not contain ɛ can be generated by a grammar with no useless symbols or ɛ-productions

Page 15: Speech & NLP: Syntax & Parsing

Chomsky Normal Form (CNF)

A grammar G = (V, T, S, P) is said to be in Chomsky Normal Form (CNF) if each production in P has one of the following forms:

1) A → BC

2) A → a

where A, B, C are in V and a is in T
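The two allowed shapes are easy to test for. A minimal sketch, assuming productions are encoded as (lhs, rhs) pairs with rhs a tuple of symbols; the function name `is_cnf` and both sample grammars are hypothetical.

```python
def is_cnf(variables, terminals, productions):
    """True if every production has the form A -> B C or A -> a."""
    for lhs, rhs in productions:
        if lhs not in variables:
            return False
        # Form 1: exactly two variables on the right-hand side.
        binary = len(rhs) == 2 and all(sym in variables for sym in rhs)
        # Form 2: exactly one terminal on the right-hand side.
        single_terminal = len(rhs) == 1 and rhs[0] in terminals
        if not (binary or single_terminal):
            return False
    return True


# Hypothetical grammar in CNF: S -> A B, A -> a, B -> b
in_cnf = [("S", ("A", "B")), ("A", ("a",)), ("B", ("b",))]
# Hypothetical grammar not in CNF: S -> a S b | a b
not_in_cnf = [("S", ("a", "S", "b")), ("S", ("a", "b"))]

print(is_cnf({"S", "A", "B"}, {"a", "b"}, in_cnf))   # True
print(is_cnf({"S"}, {"a", "b"}, not_in_cnf))         # False
```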

Page 16: Speech & NLP: Syntax & Parsing

CNF Theorem

Let G be a grammar with no useless symbols and no ε-productions. There is a CNF grammar G’ such that L(G) = L(G’).

Page 17: Speech & NLP: Syntax & Parsing

Cocke-Younger-Kasami (CYK) Algorithm

Page 18: Speech & NLP: Syntax & Parsing

CYK Algorithm’s Problem

Problem: Given a CFG G = (V, T, P, S) and a string x in T*, determine whether x is in L(G)

The Cocke-Younger-Kasami (CYK) algorithm takes a CFG in CNF and a string x and determines whether S is one of the symbols that derive x

Page 19: Speech & NLP: Syntax & Parsing

Substring Notation x_{s,l}

Let x be a string such that |x| = n ≥ 1

Let x_{s,l} be the substring of x of length l that starts at position s, where 1 ≤ s ≤ n and 1 ≤ l ≤ n

For example, if x = aabbabb, then x_{1,3} = aab = x[1]x[2]x[3] and x_{2,4} = abba = x[2]x[3]x[4]x[5]

In general, if we do 1-based array indexing and the length of the substring is l, the last available position s at which the substring can start is n - l + 1

For example, if |x| = 4 and l = 2, the possible values for s in x_{s,2} are 1, 2, and 3 = 4 - 2 + 1
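Since Python strings are 0-based, the 1-based notation needs a small translation. A minimal sketch; the helper names `substring` and `valid_starts` are hypothetical.

```python
def substring(x: str, s: int, l: int) -> str:
    """Return the substring of x of length l starting at 1-based position s."""
    n = len(x)
    if not (1 <= l <= n and 1 <= s <= n - l + 1):
        raise ValueError("need 1 <= l <= n and 1 <= s <= n - l + 1")
    # 1-based position s corresponds to 0-based index s - 1.
    return x[s - 1 : s - 1 + l]


def valid_starts(n: int, l: int) -> list[int]:
    """All 1-based start positions of substrings of length l in a string of length n."""
    return list(range(1, n - l + 2))


x = "aabbabb"
print(substring(x, 1, 3))  # aab
print(substring(x, 2, 4))  # abba
print(valid_starts(4, 2))  # [1, 2, 3]
```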

Page 20: Speech & NLP: Syntax & Parsing

CYK Algorithm: Basic Insight

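[Figure: parse tree with root A and children B and C. B spans x_{s,k}, from position s to position s+k-1; C spans x_{s+k,l-k}, starting at position s+k; together they span x_{s,l}.]

For l > 1, A ⇒* x_{s,l} iff

1) A → BC is a production;

2) B ⇒* x_{s,k};

3) C ⇒* x_{s+k,l-k}, for some k, 1 ≤ k < l

In other words, A ⇒* x_{s,l} holds exactly when there is a rule A → BC and some k, 1 ≤ k < l, for which B ⇒* x_{s,k} and C ⇒* x_{s+k,l-k}.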
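The split points k can be enumerated explicitly. The sketch below lists, for a substring of length l starting at 1-based position s, every way to cut it into a left part of length k and a right part of length l - k; the function name `splits` and the sample string are assumptions chosen for illustration.

```python
def splits(x: str, s: int, l: int):
    """List (k, left, right) for every split point k with 1 <= k < l.

    left is the substring of length k starting at s;
    right is the substring of length l - k starting at s + k.
    """
    whole = x[s - 1 : s - 1 + l]
    result = []
    for k in range(1, l):
        left = x[s - 1 : s - 1 + k]
        right = x[s + k - 1 : s + l - 1]
        # The two parts always concatenate back to the whole substring.
        assert left + right == whole
        result.append((k, left, right))
    return result


print(splits("baaba", 1, 3))  # [(1, 'b', 'aa'), (2, 'ba', 'a')]
```

A variable A derives the whole substring if, for at least one of these splits, some rule A → BC has B deriving the left part and C deriving the right part.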

Page 21: Speech & NLP: Syntax & Parsing

Table D[s, l]

CYK is a dynamic programming algorithm that, given a CNF grammar G = (V, T, S, P) and a string x over T such that |x| = n > 0, incrementally builds an n x n table D (D stands for ‘derives’)

D[s, l] is a set, possibly empty, of symbols A in V such that A ⇒* x_{s,l}

In other words, D[s, l] records all variables in G that derive x_{s,l}

Page 23: Speech & NLP: Syntax & Parsing

D[s, l] Initialization

Let G = (V, T, S, P) be a CNF grammar and x be a string such that |x| = n > 0

Let x_{s,l} be the substring of x of length l that starts at position s

If l = 1, then, for each s, 1 ≤ s ≤ n, we can check if x_{s,1} can be derived directly from some variable A of G

How? By checking if G has a production A → x_{s,1}

Page 24: Speech & NLP: Syntax & Parsing

D[s, l] Initialization

Assume that our CNF grammar is as follows:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Assume that the input is x = baaba

What does D[s, l] look like?

Page 25: Speech & NLP: Syntax & Parsing

5 x 5 D[s, l]

[Empty 5 x 5 table: columns are start positions s = 1, ..., 5; rows are substring lengths l = 1, ..., 5.]

Page 26: Speech & NLP: Syntax & Parsing

Computing D[1,1]

The input is x = baaba

The 1st symbol of the input is b

Thus, D[1,1] = {X | X → b}, where X is in V

There is only one production that qualifies: B → b

So D[1,1] = {B}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 27: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | | | |

Page 28: Speech & NLP: Syntax & Parsing

Computing D[2,1]

The input is x = baaba

The 2nd symbol of the input is a

We compute {X | X → a}, where X is in V

There are two such productions: A → a, C → a

So D[2,1] = {A, C}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 29: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | | |

Page 30: Speech & NLP: Syntax & Parsing

Computing D[3,1]

The input is x = baaba

The 3rd symbol of the input is a

We compute {X | X → a}, where X is in V

There are two such productions: A → a, C → a

So D[3,1] = {A, C}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 31: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | |

Page 32: Speech & NLP: Syntax & Parsing

Computing D[4,1]

The input is x = baaba

The 4th symbol of the input is b

Thus, D[4,1] = {X | X → b}, where X is in V

There is only one production that qualifies: B → b

So D[4,1] = {B}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 33: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} |

Page 34: Speech & NLP: Syntax & Parsing

Computing D[5,1]

The input is x = baaba

The 5th symbol of the input is a

We compute {X | X → a}, where X is in V

There are two such productions: A → a and C → a

So D[5,1] = {A, C}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 35: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
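The five length-1 cells computed so far can be reproduced in a few lines. This is a sketch under an assumed encoding: productions are (lhs, rhs) pairs with rhs a tuple of symbols, the table is a dictionary keyed by (s, l), and the names `PRODUCTIONS` and `init_row` are hypothetical.

```python
# Productions of the walkthrough grammar as (lhs, rhs) pairs.
PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def init_row(x, productions):
    """Return {(s, 1): D[s, 1]} for every 1-based position s of x."""
    return {
        # D[s, 1] holds every variable with a production X -> x[s].
        (s, 1): {lhs for lhs, rhs in productions if rhs == (symbol,)}
        for s, symbol in enumerate(x, start=1)
    }


row = init_row("baaba", PRODUCTIONS)
for s in range(1, 6):
    print(f"D[{s},1] =", sorted(row[(s, 1)]))
```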

Page 36: Speech & NLP: Syntax & Parsing

Computing D[1,2]

The only k with 1 ≤ k < 2 is k = 1, so we look for productions X → YZ where Y is in D[1,1] and Z is in D[2,1]

Since D[1,1] = {B} and D[2,1] = {A, C}, the possibilities for the right-hand sides are {B} x {A, C} = {BA, BC}

The rules that match these possibilities are S → BC and A → BA

So D[1,2] = {S, A}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a
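The cross-product step for D[1,2] can be sketched as follows. The encoding of productions as (lhs, rhs) pairs and the function name `combine` are assumptions made for illustration.

```python
from itertools import product

PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def combine(left_cell, right_cell, productions):
    """Variables X with a rule X -> Y Z, Y in left_cell and Z in right_cell."""
    return {
        lhs for lhs, rhs in productions
        if len(rhs) == 2 and rhs[0] in left_cell and rhs[1] in right_cell
    }


# Candidate right-hand sides for D[1,2]: D[1,1] x D[2,1].
candidates = sorted("".join(pair) for pair in product({"B"}, {"A", "C"}))
print(candidates)                                       # ['BA', 'BC']
print(sorted(combine({"B"}, {"A", "C"}, PRODUCTIONS)))  # ['A', 'S']
```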

Page 37: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | | | |

Page 38: Speech & NLP: Syntax & Parsing

Computing D[2,2]

The only k with 1 ≤ k < 2 is k = 1, so we look for rules X → YZ, where Y is in D[2,1] and Z is in D[3,1]

Since D[2,1] = D[3,1] = {A, C}, the right-hand side possibilities are AA, AC, CA, CC

There is only one rule that qualifies: B → CC

So D[2,2] = {B}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 39: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | | |

Page 40: Speech & NLP: Syntax & Parsing

Computing D[3,2]

The only k with 1 ≤ k < 2 is k = 1, so we look for rules of the form X → YZ, where Y is in D[3,1] and Z is in D[4,1]

D[3,1] = {A, C} and D[4,1] = {B}

So the right-hand side (RHS) possibilities are AB, CB

The rules whose RHSs match these possibilities are: S → AB and C → AB

So D[3,2] = {S, C}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 41: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | {S, C} | |

Page 42: Speech & NLP: Syntax & Parsing

Computing D[4,2]

The only k with 1 ≤ k < 2 is k = 1, so we look for rules of the form X → YZ, where Y is in D[4,1] and Z is in D[5,1]

D[4,1] = {B}; D[5,1] = {A, C}

So the RHS possibilities are BA and BC

The rules whose RHSs match these possibilities are: S → BC and A → BA

So D[4,2] = {S, A}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 43: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | {S, C} | {S, A} |

Page 44: Speech & NLP: Syntax & Parsing

Computing D[1,3]

We look for k, such that 1 ≤ k < 3, and rules of the form X → YZ, where, for k = 1, Y is in D[1,1] and Z is in D[2,2] or, for k = 2, Y is in D[1,2] and Z is in D[3,1]

For k = 1, D[1,1] = {B} and D[2,2] = {B}, so there is only one right-hand side possibility: BB

The grammar does not have any productions whose right-hand side is BB

For k = 2, D[1,2] = {S, A} and D[3,1] = {A, C}, so the RHS possibilities are: SA, SC, AA, AC

The grammar does not have any productions whose RHSs are SA, SC, AA, or AC

So D[1,3] = { }

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a

Page 45: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | {S, C} | {S, A} |
3 | { } | | | |

Page 46: Speech & NLP: Syntax & Parsing

Computing D[2,3]

We look for k, such that 1 ≤ k < 3, and rules of the form X → YZ, where, for k = 1, Y is in D[2,1] and Z is in D[3,2] or, for k = 2, Y is in D[2,2] and Z is in D[4,1]

For k = 1, D[2,1] = {A, C} and D[3,2] = {S, C}

The RHS possibilities are: AS, AC, CS, CC

The only rule that matches is B → CC

For k = 2, D[2,2] = {B} and D[4,1] = {B}

The only possibility is BB

No rules match

So D[2,3] = {B}

G’s Productions:

1. S → AB | BC

2. A → BA | a

3. B → CC | b

4. C → AB | a
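For lengths above 2, a cell is the union of the matches over all split points k. The sketch below recomputes D[1,3] and D[2,3] from the cells of lengths 1 and 2 found on the earlier slides; the dictionary layout, the `RULES` index, and the name `fill_cell` are assumptions.

```python
from itertools import product

# Binary rules of the grammar, indexed by right-hand side pair.
RULES = {
    ("A", "B"): {"S", "C"},
    ("B", "C"): {"S"},
    ("B", "A"): {"A"},
    ("C", "C"): {"B"},
}

# Cells of lengths 1 and 2, keyed by (s, l), as computed so far.
D = {
    (1, 1): {"B"}, (2, 1): {"A", "C"}, (3, 1): {"A", "C"},
    (4, 1): {"B"}, (5, 1): {"A", "C"},
    (1, 2): {"S", "A"}, (2, 2): {"B"}, (3, 2): {"S", "C"}, (4, 2): {"S", "A"},
}


def fill_cell(D, s, l):
    """Union, over all k with 1 <= k < l, of the variables whose rule X -> Y Z
    has Y in D[s, k] and Z in D[s + k, l - k]."""
    cell = set()
    for k in range(1, l):
        for pair in product(D[(s, k)], D[(s + k, l - k)]):
            cell |= RULES.get(pair, set())
    return cell


D[(1, 3)] = fill_cell(D, 1, 3)
D[(2, 3)] = fill_cell(D, 2, 3)
print(D[(1, 3)], D[(2, 3)])  # set() {'B'}
```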

Page 47: Speech & NLP: Syntax & Parsing

D[s, l] So Far

(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | {S, C} | {S, A} |
3 | { } | {B} | | |

Page 48: Speech & NLP: Syntax & Parsing

Rest of D[s, l]

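(rows: substring length l; columns: start position s)

l \ s | 1 | 2 | 3 | 4 | 5
1 | {B} | {A, C} | {A, C} | {B} | {A, C}
2 | {S, A} | {B} | {S, C} | {S, A} |
3 | { } | {B} | {B} | |
4 | { } | {S, A, C} | |
5 | {S, A, C} | | | |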
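The whole table can be filled by looping over lengths, start positions, and split points. This is a minimal sketch of the procedure the slides walk through, not a reference implementation: the (lhs, rhs) encoding, the dictionary keyed by (s, l), and the name `cyk_table` are assumptions.

```python
from itertools import product

PRODUCTIONS = [
    ("S", ("A", "B")), ("S", ("B", "C")),
    ("A", ("B", "A")), ("A", ("a",)),
    ("B", ("C", "C")), ("B", ("b",)),
    ("C", ("A", "B")), ("C", ("a",)),
]


def cyk_table(x, productions):
    """Return D as a dict: D[(s, l)] is the set of variables deriving the
    substring of x of length l that starts at 1-based position s."""
    n = len(x)
    D = {}
    # Length 1: variables with a production X -> x[s].
    for s in range(1, n + 1):
        D[(s, 1)] = {lhs for lhs, rhs in productions if rhs == (x[s - 1],)}
    # Lengths 2..n: combine two shorter cells at every split point k.
    for l in range(2, n + 1):
        for s in range(1, n - l + 2):
            cell = set()
            for k in range(1, l):
                for y, z in product(D[(s, k)], D[(s + k, l - k)]):
                    cell |= {lhs for lhs, rhs in productions if rhs == (y, z)}
            D[(s, l)] = cell
    return D


x = "baaba"
D = cyk_table(x, PRODUCTIONS)
for l in range(1, len(x) + 1):
    print(l, [sorted(D[(s, l)]) for s in range(1, len(x) - l + 2)])
print("accepted:", "S" in D[(1, len(x))])  # accepted: True
```

Scanning the production list for every pair keeps the sketch short; indexing the binary rules by right-hand side would avoid the repeated scans.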

Page 49: Speech & NLP: Syntax & Parsing

Is x=baaba Accepted?

Yes, because D[1,5] contains S. It means that S ⇒* x_{1,5}. In other words, the substring of x that starts at position 1 and has length 5, which is x itself, is derivable from S.

Page 50: Speech & NLP: Syntax & Parsing

How & Why CYK Works

CYK runs in O(n^3) time for a fixed grammar, where |x| = n > 0

Both k and l-k are strictly less than l

If we know which variables derive each of the two shorter substrings (i.e., whether B ⇒* x_{s,k} and C ⇒* x_{s+k,l-k}), we can determine whether A ⇒* x_{s,l} via a rule A → BC

When we reach l = n, we can determine whether S ⇒* x_{1,n}
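The cubic bound can be seen by counting the (l, s, k) triples the algorithm visits when filling cells of length 2 and above. A small sketch; the function name `count_splits` is hypothetical, and the closed form (n^3 - n) / 6 is stated here as a check rather than taken from the slides.

```python
def count_splits(n: int) -> int:
    """Number of (l, s, k) triples visited for an input of length n."""
    return sum(
        1
        for l in range(2, n + 1)          # substring length
        for s in range(1, n - l + 2)      # start position
        for k in range(1, l)              # split point
    )


for n in (5, 10, 20):
    # The count matches (n^3 - n) / 6, which grows as n^3.
    print(n, count_splits(n), (n**3 - n) // 6)
```

Each triple costs a bounded amount of work for a fixed grammar, which gives the O(n^3) running time.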

Page 51: Speech & NLP: Syntax & Parsing

References & Reading Suggestions

Hopcroft and Ullman. Introduction to Automata Theory, Languages, and Computation. Narosa Publishing House.

Moll, Arbib, and Kfoury. An Introduction to Formal Language Theory.

Jurafsky & Martin. Speech & Language Processing. Prentice Hall.

www.youtube.com/vkedco