
MASTER DI SCIENZE COGNITIVE
GENOVA 2005

14-10-05

Natural Language Grammars and Parsing

Alessandro Mazzei
Dipartimento di Informatica
Università di Torino

Natural Language Processing

Phonetics: acoustic and perceptual elements

Phonology: inventory of basic sounds (phonemes) and basic rules for their combination, e.g. vowel harmony

Morphology: how morphemes combine to form words, relationship of phonemes to meaning

Syntax: sentence formation, word order and the formation of constituents from word groupings

Semantics: how word meanings recursively compose to form sentence meanings (from syntax to logical formulas)

Pragmatics: meaning that is not part of compositional meaning

Natural Language Syntax

Syntactic Parsing: deriving a syntactic structure from the word sequence

Syntactic structure

Word sequence

Natural Language Syntax

Syntactic Parsing: deriving a syntactic structure from the word sequence

[Figure: two syntactic structures for "Paolo ama Francesca": the constituency tree [S [NP [N Paolo]] [VP [V ama] [NP [N Francesca]]]], and a dependency structure linking "ama" to "Paolo" (sub) and "Francesca" (obj)]

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Formal Languages

Σ = {a1, a2, ..., an}  alphabet

Σ* = the set of all finite strings over Σ

Ex. Σ = {0,1}: 001, 111110, ε, 0 ∈ Σ*

Formal Language L ⊆ Σ*

Formal Languages

Σ = {0,1}

L1 = {01,0101,010101,01010101,...}

L2 = {01,0011,000111,00001111,...}

L3 = {11,1111,11111111,...}
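To make the three example languages concrete, here is a small sketch in Python (the function names are mine, not from the slides) that tests membership in L1 = (01)^n, L2 = 0^n 1^n, and L3 = runs of 1s of power-of-two length:

```python
import re

def in_L1(s: str) -> bool:
    # L1 = {01, 0101, 010101, ...} = (01)^n, n >= 1
    return re.fullmatch(r"(01)+", s) is not None

def in_L2(s: str) -> bool:
    # L2 = {01, 0011, 000111, ...} = 0^n 1^n, n >= 1
    n = len(s) // 2
    return len(s) >= 2 and s == "0" * n + "1" * n

def in_L3(s: str) -> bool:
    # L3 = {11, 1111, 11111111, ...} = 1^(2^n), n >= 1
    n = len(s)
    return s == "1" * n and n >= 2 and (n & (n - 1)) == 0

print(in_L1("0101"), in_L2("0011"), in_L3("1111"))  # True True True
```

Note that L2 already requires counting (0s must match 1s), a first hint of why different languages need different grammar classes.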

Formal Languages

Σ4 = {I,Anna,John,Harry,saw,see,swimming}

L4 = {I saw Harry swimming, Anna saw John swimming, ...}

Natural and Formal languages

Σ = {a,aback,...,zoom,zucchini}

Natural Language L ⊆ Σ*

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Rewriting Systems

● Turing, Post

Rewriting rule: Ψ → θ

Generative grammar

G = (Σ, V, S, P)

Σ = alphabet (terminal symbols)

V = {A, B, ...} non-terminal symbols

S ∈ V  start symbol

P = {Ψ → θ, ...} rewriting rules

Grammar and derivation

If A → β ∈ P

αAγ ⇒ αβγ directly derives

if α1 ⇒ α2, α2 ⇒ α3, ..., αm-1 ⇒ αm

α1 ⇒* αm  derives

L(G)={x ∈ Σ* : S ⇒* x}

Grammar 1

● G1=({0,1},{A,B},A,{A→0B,B→1A,B→1})

A⇒0B⇒01

A⇒0B⇒01A⇒010B⇒0101

A⇒0B⇒01A⇒010B⇒0101A⇒01010B⇒010101

L(G1)={01,0101,010101,...}
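The derivations above can be mechanized. A sketch (my own encoding, not from the slides: non-terminals are uppercase letters, terminals are digits) that enumerates L(G1) up to a length bound by breadth-first rewriting of sentential forms:

```python
def generate(grammar, start, max_len):
    """Enumerate terminal strings of L(G) up to max_len by breadth-first
    rewriting of sentential forms (feasible because G1 is tiny)."""
    results, queue, seen = set(), [start], {start}
    while queue:
        form = queue.pop(0)
        # find the leftmost non-terminal (uppercase in this toy encoding)
        i = next((k for k, c in enumerate(form) if c.isupper()), None)
        if i is None:                      # no non-terminal left: a sentence
            results.add(form)
            continue
        for lhs, rhs in grammar:           # apply every rule rewriting form[i]
            if lhs == form[i]:
                new = form[:i] + rhs + form[i + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return sorted(results, key=len)

G1 = [("A", "0B"), ("B", "1A"), ("B", "1")]
print(generate(G1, "A", 6))  # ['01', '0101', '010101']
```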

Grammar 2

● G2=({0,1},{S},S,{S→0S1,S→01})

S⇒01

S⇒0S1⇒0011

S⇒0S1⇒00S11⇒000111

L(G2)={01,0011,000111,...}

Derivation tree

S⇒0S1⇒00S11⇒000111

[Figure: derivation tree for 000111: [S 0 [S 0 [S 0 1] 1] 1]]

Generative Grammars and Natural Languages

● Generative grammars can model a natural language as a formal language

● The derivation tree can model the syntactic structure of sentences

Grammar 3

● G4 = (Σ4, {S, NP, VP, V1, V2}, S, P4)

Σ4 = {I, Anna, John, Harry, saw, see, swimming}

P4 = {S → NP VP, VP → V1 S, VP → V2,
NP → I | John | Harry | Anna, V1 → saw | see, V2 → swimming}

Grammar 3

● G4 = (Σ4, {S, NP, VP, V1, V2}, S, P4)

S ⇒ NP VP ⇒ I VP ⇒ I V1 S ⇒ I saw S ⇒ I saw NP VP ⇒
I saw Harry VP ⇒ I saw Harry V2 ⇒ I saw Harry swimming

L(G4) = {I swimming, I saw Harry swimming, ...}
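As an illustration, sentences of G4 can be sampled by randomly expanding non-terminals; the dictionary encoding and the `expand` helper below are mine, not from the slides:

```python
import random

# G4's productions as a dict from non-terminal to alternative right-hand sides
P4 = {
    "S":  [["NP", "VP"]],
    "VP": [["V1", "S"], ["V2"]],
    "NP": [["I"], ["John"], ["Harry"], ["Anna"]],
    "V1": [["saw"], ["see"]],
    "V2": [["swimming"]],
}

def expand(symbol):
    """Randomly expand a symbol of G4 into a list of terminal words."""
    if symbol not in P4:               # terminal: emit as-is
        return [symbol]
    rhs = random.choice(P4[symbol])    # pick one right-hand side
    return [w for s in rhs for w in expand(s)]

random.seed(0)
print(" ".join(expand("S")))
```

Every sampled sentence ends in "swimming", because the only way to stop the VP → V1 S recursion is the rule VP → V2.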

Grammar 3

[Figure: derivation tree [S [NP I] [VP [V1 saw] [S [NP Harry] [VP [V2 swimming]]]]]]

S ⇒ NP VP ⇒ I VP ⇒ I V1 S ⇒ I saw S ⇒ I saw NP VP ⇒
I saw Harry VP ⇒ I saw Harry V2 ⇒ I saw Harry swimming

Generative Power

● What is the smallest class of generative grammars that can generate the natural languages?

● Weak vs. Strong Generative power

Languages Chomsky hierarchy

Example languages: (ab)^n, a^n b^n, a^n b^n c^n, a^(2^n), L_Diag

Grammar classes, with an example rule for each:

Linear: A → aB
Context-free: S → aSb
Context-sensitive: Caa → aaCa
Type 0: Ψ → θ

Languages Chomsky hierarchy

Example languages: (ab)^n, a^n b^n, a^n b^n c^n, a^(2^n), L_Diag

Grammar classes, with an example rule for each:

Linear: A → aB
Context-free: S → aSb
Mildly context-sensitive: CB → f(C,B)
Context-sensitive: Caa → aaCa
Type 0: Ψ → θ

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Context-Free Grammars

G=(Σ,V,S,P) A → β

● Constituency
● Grammatical relations
● Subcategorization

Constituency

Constituent = group of contiguous (?!) words

● that act as a unit [Fodor-Bever, Bock-Loebell]

● that have syntactic properties

Ex. preposing-postposing, substitutability.

Noun Phrases (NP), Verb Phrases (VP),...

● CFG: Constituent ⇔ non-terminal symbols V

[Figure: an example grammar lexicon, grammar rules, and the resulting derivation tree]

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Parser

[Figure: a parser maps the word sequence "Paolo ama Francesca" to the parse tree [S [NP [N Paolo]] [VP [V ama] [NP [N Francesca]]]]]

Anatomy of a Parser

(1) Grammar

Context-Free, ...

(2) Algorithm

I. Search strategy top-down, bottom-up, left-to-right, ...

II.Memory organization back-tracking, dynamic programming, ...

(3) Oracle

Probabilistic, rule-based, ...

[Figure: a grammar and a target parse, illustrating top-down search (from S) vs. bottom-up search (from the words)]

Parser 1

(1) Grammar

Context-Free, ...

(2) Algorithm

I. Search strategy top-down, bottom-up, left-to-right, ...

II.Memory organization back-tracking, dynamic programming, ...

(3) Oracle

Probabilistic, rule-based, ...

Parser 1 (1)

S → NP VP; NP → DET Nom; NP → PropN
S → AUX NP VP; AUX → does; NP → DET Nom
DET → this; Nom → Noun
Noun → flight; VP → Verb

Parser 1 (2)

VP → Verb NP; Verb → include
NP → Det Nom; Det → a
Nom → Noun; Noun → meal
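The top-down, left-to-right, backtracking strategy that Parser 1 applies to these rules can be sketched as a recursive recognizer (a minimal illustration of the strategy, not the actual parser from the lecture):

```python
# The rules from the Parser 1 slides, as (lhs, rhs) pairs.
# PropN has no lexical entries on these slides, so NP -> PropN never succeeds.
RULES = [
    ("S", ["NP", "VP"]), ("S", ["AUX", "NP", "VP"]),
    ("NP", ["DET", "Nom"]), ("NP", ["PropN"]),
    ("Nom", ["Noun"]),
    ("VP", ["Verb"]), ("VP", ["Verb", "NP"]),
    ("AUX", ["does"]), ("DET", ["this"]), ("DET", ["a"]),
    ("Noun", ["flight"]), ("Noun", ["meal"]), ("Verb", ["include"]),
]
NONTERMS = {lhs for lhs, _ in RULES}

def parse(symbols, words):
    """Top-down, left-to-right, backtracking recognizer:
    True iff `symbols` can derive exactly `words`."""
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    if first not in NONTERMS:                   # terminal: must match next word
        return bool(words) and words[0] == first and parse(rest, words[1:])
    return any(parse(rhs + rest, words)         # try each expansion; backtrack on failure
               for lhs, rhs in RULES if lhs == first)

print(parse(["S"], "does this flight include a meal".split()))  # True
```

This naive strategy terminates here only because no rule is left-recursive; a rule like NP → NP PP would send it into an infinite loop, which is exactly the problem the next slide raises.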

Left-Recursion

NP → NP PP

Repeated parsing of subtrees

Ambiguity

● One sentence can have several “legal” parse trees

● 15 words ⇒ ~1,000,000 parse trees

Dynamic Programming ⇒ Earley Algorithm
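The blow-up can be made concrete: the number of binary bracketings of an n-word sentence is the Catalan number C(n-1), which for 15 words is already in the millions (a rough illustration of the growth; the exact tree count depends on the grammar):

```python
from math import comb

def catalan(n):
    # Catalan number C_n = (2n choose n) / (n + 1)
    return comb(2 * n, n) // (n + 1)

# binary bracketings of a 15-word sentence
print(catalan(14))  # 2674440
```

Dynamic programming (as in Earley or CKY) avoids enumerating these trees by sharing subtrees across analyses.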

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Probabilistic CFG

G=(Σ,V,S,P)

A → β [p]   p ∈ (0,1); for each non-terminal A, the probabilities of all A-rules sum to 1

PCFG

P(Ta) = .15 * .4 * .05 * .05 * .35 * .75 * .4 * .4 * .4 * .3 * .4 * .5 = 1.5 x 10^-7

P(Tb) = .15 * .4 * .4 * .05 * .05 * .75 * .4 * .4 * .4 * .3 * .4 * .5 = 1.7 x 10^-7
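In a PCFG the probability of a tree is the product of the probabilities of all rules used in its derivation. The two factor lists on this slide can be multiplied mechanically (values copied from the slide):

```python
from math import prod

# rule probabilities read off the two candidate trees on the slide
factors_Ta = [.15, .4, .05, .05, .35, .75, .4, .4, .4, .3, .4, .5]
factors_Tb = [.15, .4, .4, .05, .05, .75, .4, .4, .4, .3, .4, .5]

p_Ta, p_Tb = prod(factors_Ta), prod(factors_Tb)
print(f"P(Ta) = {p_Ta:.2e}, P(Tb) = {p_Tb:.2e}")
```

The comparison P(Tb) > P(Ta) is what a probabilistic parser uses to choose between the two analyses.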

Parser 2 (CKY)

(1) Grammar

Context-Free, ...

(2) Algorithm

I. Search strategy top-down, bottom-up, left-to-right, ...

II.Memory organization back-tracking, dynamic programming, ...

(3) Oracle

Probabilistic, rule-based, ...

CKY idea

Chart over the word sequence w1 w2 w3 w4 w5. Suppose B has been built over span (1,2) and C over span (3,4), and the grammar contains A → B C [pA] and D → B C [pD]. Then both A and D are added over the combined span:

P(1,4,A) = pA * P(1,2,B) * P(3,4,C)

P(1,4,D) = pD * P(1,2,B) * P(3,4,C)
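A minimal probabilistic CKY recognizer built around this chart update can be sketched as follows; the toy grammar and its probabilities are invented for the example, not taken from the slides:

```python
from collections import defaultdict

def pcky(words, lexicon, rules):
    """Minimal probabilistic CKY for a binary PCFG.
    best[(i, j)][A] = probability of the best A spanning words[i:j]."""
    n = len(words)
    best = defaultdict(dict)
    for i, w in enumerate(words):                      # fill width-1 cells from the lexicon
        for A, p in lexicon.get(w, {}).items():
            best[(i, i + 1)][A] = p
    for span in range(2, n + 1):                       # widen spans bottom-up
        for i in range(0, n - span + 1):
            k = i + span
            for j in range(i + 1, k):                  # try every split point
                for (A, B, C), pA in rules.items():
                    p = pA * best[(i, j)].get(B, 0.0) * best[(j, k)].get(C, 0.0)
                    if p > best[(i, k)].get(A, 0.0):   # keep the best derivation of A
                        best[(i, k)][A] = p
    return best[(0, n)]

# illustrative toy grammar for "Paolo ama Francesca"
lexicon = {"Paolo": {"NP": .5}, "ama": {"V": 1.0}, "Francesca": {"NP": .5}}
rules = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): .9}
print(pcky("Paolo ama Francesca".split(), lexicon, rules))  # {'S': 0.225}
```

Each cell stores only the best probability per non-terminal, so the work is polynomial in sentence length even when the number of parse trees is exponential.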

Parser 2 (CKY)

Generative approach to Syntax

Formal languages Generative grammars Context-Free Parser Probabilistic parsing Treebank

Treebank

● How can we estimate the probabilities of a PCFG? By counting

● Treebank: collection of syntactic annotated sentences (trees)

● Penn TB: 1M words

Treebank Grammars (PCFG)

P(A→β)=Count(A→β)/Count(A)

P(S→NP VP) =2/2=1 P(NP→N) =2/2=1

P(VP→V N) =1/2=.5 P(VP→V) =1/2=.5

P(N→Paolo) =2/3=.66 P(N→Francesca) =1/3=.33

P(V→corre) =1/2=.5 P(V→ama) =1/2=.5
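The relative-frequency estimate above can be reproduced directly from the two treebank trees; the rule-list encoding below is mine, not from the slides:

```python
from collections import Counter

# the two treebank trees, listed as the rules they use
tree1 = [("S", "NP VP"), ("NP", "N"), ("VP", "V N"),
         ("N", "Paolo"), ("V", "ama"), ("N", "Francesca")]
tree2 = [("S", "NP VP"), ("NP", "N"), ("VP", "V"),
         ("N", "Paolo"), ("V", "corre")]

rule_counts = Counter(tree1 + tree2)
lhs_counts = Counter(lhs for lhs, _ in tree1 + tree2)

# relative-frequency estimate: P(A -> beta) = Count(A -> beta) / Count(A)
probs = {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}
print(round(probs[("N", "Paolo")], 2))  # 0.67
print(probs[("VP", "V")])               # 0.5
```

Note that P(N → Paolo) is exactly 2/3 (the slide's .66 is truncated), and the probabilities for each left-hand side sum to 1 by construction.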

[Figure: the two treebank trees: [S [NP [N Paolo]] [VP [V ama] [N Francesca]]] and [S [NP [N Paolo]] [VP [V corre]]]]

References

● Speech and Language Processing, D. Jurafsky and J.H. Martin, Prentice Hall, 2000

● Introduction to Automata Theory, Languages, and Computation, J.E. Hopcroft and J.D. Ullman, Addison-Wesley, 1979

● Natural Language Understanding, J.F. Allen, Benjamin Cummings, 1995