data-oriented parsing and discontinuous constituents · 2020. 6. 25. · data-orientedparsing...

56
Data-Oriented Parsing and Discontinuous Constituents Andreas van Cranenburgh Huygens ING Institute for Logic, Language and Computation Royal Netherlands Academy of Arts and Sciences University of Amsterdam March 4, 2014 Amsterdam 2014

Upload: others

Post on 15-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Data-Oriented Parsingand Discontinuous Constituents

Andreas van Cranenburgh

Huygens ING Institute for Logic, Language and ComputationRoyal Netherlands Academy of Arts and Sciences University of Amsterdam

March 4, 2014

Amsterdam 2014

Page 2: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

This lecture

1. Context & context freedom2. Parsing with . . .

I discontinuous constituents:Linear Context-Free Rewriting Systems (LCFRS)

I treebank fragments:Data-Oriented Parsing (DOP)Tree-Substitution Grammar (TSG)

3. Other non-context-free challenges

Page 3: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Context and Context Freedom

Two meanings for context-free:1. Rewrite operations are independent from anything

not being rewritten.Counterexamples: HPSG, LFG, &c.model theoretic syntax;unification based⇒ exp. time complexity

2. CFG: a grammar 〈V , T , S,P〉 with the above property,such that productions in P are of the form:

α→ β1 . . . βn

Counterexamples: TAG, CCG, &c.NB: a formalism can be context-free (1)without being Context-Free (2) . . .

Page 4: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Context and Context Freedom

Two meanings for context-free:1. Rewrite operations are independent from anything

not being rewritten.Counterexamples: HPSG, LFG, &c.model theoretic syntax;unification based⇒ exp. time complexity

2. CFG: a grammar 〈V , T , S,P〉 with the above property,such that productions in P are of the form:

α→ β1 . . . βn

Counterexamples: TAG, CCG, &c.NB: a formalism can be context-free (1)without being Context-Free (2) . . .

Page 5: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

The Domain of LocalityDomain of Locality: the information which is availablewhile applying rewrite operations

CFG: parent→ nonterminals dominatingadjacent terminals

LCFRS: parent→ nonterminals dominatingterminals regardless of position in sentence

Page 6: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous Constituents

Example:I Why did the chicken cross the road?I The chicken crossed the road to get to the other side.

Page 7: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Non-local information in PTB: traces

ROOT

SBARQ

WHADVP-1

WRB

Why

SQ

VBD

did

NP

DT

the

NN

chicken

VP

VB

cross

NP

DT

the

NN

road

ADVP

-NONE-

*T*-1

.

?

Figure : PTB-style annotation.

Page 8: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous treesROOT

SBARQ

SQ

VP

WHADVP

WRBWhy

VBcross

NP

DTthe

NNroad

VBDdid

NP

DTthe

NNchicken

.?

Figure : A tree with a discontinuous constituent.

Page 9: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous constituents

Motivation:I Handle flexible word-order, extraposition, &c.I Capture argument structureI Combine information from

constituency & dependency structures(NB: non-projectivity is a subset ofdiscontinuous phenomena)

Page 10: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous treebanks

Treebanks with discontinuous constituents:German/Negra: Skut et al. (1997). An annotation scheme

for free word order languages.Dutch/Alpino: van der Beek (2002). The Alpino

dependency treebank.English/PTB (after conversion): Evang & Kallmeyer (2011).

PLCFRS Parsing of English DiscontinuousConstituents.

Swedish, Polish, . . .

Page 11: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous treesROOT

SBARQ

SQ

VP

WHADVP

WRBWhy

VBcross

NP

DTthe

NNroad

VBDdid

NP

DTthe

NNchicken

.?

Figure : A tree with a discontinuous constituent.

Context-Free Grammar (CFG)NP(ab)→ DT(a) NN(b)

Page 12: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous treesROOT

SBARQ

SQ

VP

WHADVP

WRBWhy

VBcross

NP

DTthe

NNroad

VBDdid

NP

DTthe

NNchicken

.?

Figure : A tree with a discontinuous constituent.

Linear Context-Free Rewriting System (LCFRS)VP2(a,bc)→WHADVP(a) VB(b) NP(c)

Page 13: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Discontinuous treesROOT

SBARQ

SQ

VP

WHADVP

WRBWhy

VBcross

NP

DTthe

NNroad

VBDdid

NP

DTthe

NNchicken

.?

Figure : A tree with a discontinuous constituent.

Linear Context-Free Rewriting System (LCFRS)VP2(a,bc)→WHADVP(a) VB(b) NP(c)SQ(abcd)→ VBD(b) NP(c) VP2(a,d)

Page 14: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Linear Context-Free Rewriting Systems: I

LCFRS are a generalization of CFG:⇒ rewrite tuples, trees or graphs!

linear: each variable on the left occurs once on theright & vice versa

context-free: apply productions based on what theyrewrite

rewriting system: i.e., formal grammar

Vijay-Shanker, Weir, Joshi (1987): Structural descriptionsproduced by various grammar formalisms

Page 15: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Linear Context-Free Rewriting Systems: I

LCFRS are a generalization of CFG:⇒ rewrite tuples, trees or graphs!

linear: each variable on the left occurs once on theright & vice versa

context-free: apply productions based on what theyrewrite

rewriting system: i.e., formal grammar

Vijay-Shanker, Weir, Joshi (1987): Structural descriptionsproduced by various grammar formalisms

Page 16: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Linear Context-Free Rewriting Systems: II

LCFRS are weakly equivalent to:I Combinatory Categorial GrammarI Tree-Adjoining GrammarI Synchronous Context-Free GrammarI Multiple Context-Free GrammarI Minimalist GrammarI &c. . . .⇒ LCFRS form a ‘lingua franca’ formalism.

Page 17: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Complexity of LCFRS parsing

I LCFRS are Mildly Context-Sensitive grammarformalisms.

I Parsing an LCFRS has polynomial time complexity:

O(nrϕ)

where . . .I r is the number of non-terminals (rank)I ϕ is the maximum number of components covered by

a non-terminal (fan-out).

⇒ infinite 2D hierarchy of languages betweencontext-free and context-sensitive.

I Both CFG & LCFRS ∈ LOGCFL

Page 18: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Parsing with LCFRS

I An LCFRS can be parsed with tabular parsingalgorithm (similar to CKY):

I Agenda-based probabilistic parser for LCFRS(Kallmeyer & Maier 2010);extended to produce k-best derivations

I Rules can be read off from treebank,relative frequencies give probabilistic LCFRS (PLCFRS)

Kallmeyer & Maier (2010). Data-driven parsing with probabilistic linearcontext-free rewriting systems.

Page 19: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

But . . .

0 10 20 30 40

length

0

200

400

600

800

1000

1200

Avg. C

PU

tim

e (

seco

nds)

PLCFRS

Negra dev. set, gold tags

Page 20: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

PCFG approximation of PLCFRS

S

B*1 B*2

A X C Y D

a b c b d

S

B

A X C Y D

a b c b d

I Transformation is reversibleI Increased independence assumption:⇒ every component is a new node

I Language is a superset of original PLCFRS⇒ coarser, overgenerating PCFG (‘split-PCFG’)

Boyd (2007). Discontinuity revisited.

Page 21: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Coarse-to-fine pipeline

G0

G1

G2

Split-PCFG

PLCFRS

a largegrammar

Treebankgrammars

Mildlycontext-sensitive

prune parsing with Gm+1 by only consideringitems in k-best Gm derivations.

Page 22: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

With coarse-to-fine

0 10 20 30 40

length

0

200

400

600

800

1000

1200Avg

. C

PU

tim

e (

seco

nd

s)

PLCFRS (k=10,000)Split-PCFGPLCFRS

Negra dev. set, gold tags

Page 23: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Data-Oriented Parsing

Treebank grammartrees⇒ productions + rel. frequencies⇒ problematic independence assumptions

Data-Oriented Parsing (DOP)trees⇒ fragments + rel. frequenciesfragments are arbitrarily sized chunksfrom the corpus

consider all possible fragments from treebank. . .and “let the statistics decide”

Scha (1990): Lang. theory and lang. tech.; competence and performanceBod (1992): A computational model of language performance

Page 24: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

DOP fragmentsS

VP2

VB NP ADJis Gatsby rich

S

VP2

VB NP ADJis rich

S

VP2

VB NP ADJGatsby rich

S

VP2

VB NP ADJis Gatsby

S

VP2

VB NP ADJrich

S

VP2

VB NP ADJGatsby

S

VP2

VB NP ADJis

S

VP2

VB NP ADJ

S

VP2

NPGatsby

S

VP2

NPVP2

VB ADJis rich

VP2

VB ADJrich

VP2

VB ADJis

VP2

VB ADJ

NPGatsby

VBis

ADJrich

P(f ) = count(f )∑f ′∈F count(f ′) where F = { f ′ | root(f ′) = root(f ) }

Note: discontinuous frontier non-terminalsmark destination of components

Page 25: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

DOP derivation

S

VP2

VB NP ADJrich

VBis

NPGatsby

S

VP2

VB NP ADJis Gatsby rich P(d) = 0.2

S

VP2

VB NP ADJis

NPGatsby

ADJrich

S

VP2

VB NP ADJis Gatsby rich P(d) = 0.3

Derivations for this tree P(t) = 0.5

P(d) = P(f1 ◦ · · · ◦ fn) =∏f∈d

p(f )

P(t) = P(d1) + · · ·+ P(dn) =∑

d∈D(t)

∏f∈d

p(f )

Page 26: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Tree-Substitution Grammar

This DOP model (Bod 1992) is based onTree-Substitution Grammar (TSG):

I Weakly equivalent to CFG; typically stronglyequivalent as well; advantage is in stochastic powerof Probabilistic TSG.

I Same Context-Free property as CFG, but multipleproductions applied at once;⇒ captures more structural relations than PCFG.

I CFG backbone can be replaced with LCFRS to getDiscontinuous Tree-Substitution Grammar (PTSGLCFRS).

Page 27: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

DOP implementation issues

Exponential number of fragmentsdue to all-fragments assumption

I Can use DOP reduction (Goodman 2003);weight of fragments spread over many productions

I Can restrict number of fragmentsby depth or frontier nodes &c.,⇒ but: not data-oriented!

Goodman (2003): Efficient parsing of DOP with PCFG-reductions

Page 28: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Double-DOP

I Extract fragments that occurat least twice in treebank

I For every pair of trees,extract maximal overlapping fragments

I Can be extracted in linear average timeI Number of fragments is small enough

to parse with directly

Sangati & Zuidema (2011). Accurate parsing w/compact TSGs: Double-DOPvan Cranenburgh (2012). Extracting tree fragments in linear average time

Page 29: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

From fragments to grammar

I Fragments mapped to unique rules,relative frequencies as probabilities

I Remove internal nodes,leaves root node, substitution sites & terminalsX → X1 . . .Xn

I Reconstruct derivations after parsing

S

VP2

VB NP ADJrich

S

VB NP rich

Sangati & Zuidema (2011). Accurate parsing w/compact TSGs: Double-DOP

Page 30: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Preprocessing

I Remove function labelsI Binarize w/markovization (h=1, v=1)I Simple unknown word model

I Rare words replaced by features(model 4 from Stanford parser)

I Reserve probability massfor unseen (tag, word) pairs

Page 31: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Results w/Double-DOP

F1 %DOP reduction 74.3Double-DOP

(Negra dev set ≤ 40 words, gold tags)

Page 32: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Results w/Double-DOP

F1 %DOP reduction 74.3Double-DOP 76.3

(Negra dev set ≤ 40 words, gold tags)

Also: parsing 3× faster, grammar 3× smaller

Page 33: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Results w/Double-DOP

k=50 k=5000F1 % F1 %

DOP reduction 74.3 73.5Double-DOP 76.3

(Negra dev set ≤ 40 words, gold tags)

What if we reduce pruning?

Page 34: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Results w/Double-DOP

k=50 k=5000F1 % F1 %

DOP reduction 74.3 73.5Double-DOP 76.3 77.7

(Negra dev set ≤ 40 words, gold tags)

What if we reduce pruning?⇒ For Double-DOP, performance does not deterioriatewith expanded search space.

Page 35: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Parsing results: test sets

Parser, treebank F1 EXGERMAN

HaNi2008, Tiger 75.3 32.6CrBo2013, Tiger 78.8 40.8

ENGLISHSaZu2011, wsj 87.9 33.7EvKa2011, disc. wsj 79.0CrBo2013, disc. wsj 85.6 31.3

DUTCHCrBo2013, Lassy 77.0 35.2

CrBo: van Cranenburgh & Bod (2013);HaNi: Hall & Nivre (2008); SaZu: Sangati & Zuidema (2011).

Page 36: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Can DOP handle discontuinity without LCFRS?

Split-PCFG⇓

PLCFRS⇓

PLCFRS Double-DOP77.7 % F141.5 % EX

Split-PCFG

Split-Double-DOP

78.1 % F142.0 % EX

Answer: Yes!

Fragments can capture discontinuous contexts

Page 37: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Can DOP handle discontuinity without LCFRS?

Split-PCFG⇓

PLCFRS⇓

PLCFRS Double-DOP77.7 % F141.5 % EX

Split-PCFG

Split-Double-DOP78.1 % F142.0 % EX

Answer: Yes!

Fragments can capture discontinuous contexts

Page 38: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Summary

Formally, CFG = TSGCFG ⊂ LCFRSStochastically, PCFG ⊂ PTSGCFGEmpirically, PTSGLCFRS ≈ PTSGCFG

Page 39: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Limitations of parsing with (approximated) LCFRS

I Just as with CFG, an LCFRS production strictly covers asingle configuration of constituents.

I Consider:A man who was seeking a unicorn walked into the room.vs.A man walked into the room who was seeking a unicorn.

I Discontinuous constituents are relatively rare⇒ sparse data problem

Page 40: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Unsupervised discontinuous parsing?

Problems:I A sentence with n words has

O(n2) possible continuous constituents.The same sentence may haveO(2n) discontinuous constituents.

I For continuous constituents, adjacent co-occurrenceis the key heuristic to determine constituency.Is there a surface heuristic for discontinuousconstituency?⇒ Morphology in case of morphologically-richlanguages.

Page 41: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Unsupervised discontinuous parsing?

Problems:I A sentence with n words has

O(n2) possible continuous constituents.The same sentence may haveO(2n) discontinuous constituents.

I For continuous constituents, adjacent co-occurrenceis the key heuristic to determine constituency.Is there a surface heuristic for discontinuousconstituency?⇒ Morphology in case of morphologically-richlanguages.

Page 42: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Unsupervised discontinuous parsing

Rule-based: e.g., Yoshinaka (2010), Polynomial-timeidentification of multiple context-freelanguages from positive data andmembership queries

Statistical: Stanford parser jointly parsers with a PCFGand a dependency model.Same technique could be applied to inducea PCFG and a non-projective dependencymodel.

Page 43: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Unsupervised discontinuous parsing

Rule-based: e.g., Yoshinaka (2010), Polynomial-timeidentification of multiple context-freelanguages from positive data andmembership queries

Statistical: Stanford parser jointly parsers with a PCFGand a dependency model.Same technique could be applied to inducea PCFG and a non-projective dependencymodel.

Page 44: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: supervised

Adjunction: More compact grammar and less sparsitywhen adjuncts can be inserted with anoperation of the grammar formalism;e.g., Tree-Insertion Grammar.

Grammatical functions: Most models reconstructfunctions after parsing using machine learning

Multiple parents: e.g.,John1 walks and [John1] talks to Mary

Page 45: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: supervised

Adjunction: More compact grammar and less sparsitywhen adjuncts can be inserted with anoperation of the grammar formalism;e.g., Tree-Insertion Grammar.

Grammatical functions: Most models reconstructfunctions after parsing using machine learning

Multiple parents: e.g.,John1 walks and [John1] talks to Mary

Page 46: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: supervised

Adjunction: More compact grammar and less sparsitywhen adjuncts can be inserted with anoperation of the grammar formalism;e.g., Tree-Insertion Grammar.

Grammatical functions: Most models reconstructfunctions after parsing using machine learning

Multiple parents: e.g.,John1 walks and [John1] talks to Mary

Page 47: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: unsupervised

Free word-order: Separate surface from canonicalstructure by reordering.But: Information on canonical orderis not in treebanks.Some order variation is pragmatic,some affects syntax/semantics.

Condition on lexical relations as opposed tojust structural relations.But: which relations are relevant?

Exploit sentence context; e.g., current topic, otherdiscourse effects

Page 48: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: unsupervised

Free word-order: Separate surface from canonicalstructure by reordering.But: Information on canonical orderis not in treebanks.Some order variation is pragmatic,some affects syntax/semantics.

Condition on lexical relations as opposed tojust structural relations.But: which relations are relevant?

Exploit sentence context; e.g., current topic, otherdiscourse effects

Page 49: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Other non-context-free challenges: unsupervised

Free word-order: Separate surface from canonicalstructure by reordering.But: Information on canonical orderis not in treebanks.Some order variation is pragmatic,some affects syntax/semantics.

Condition on lexical relations as opposed tojust structural relations.But: which relations are relevant?

Exploit sentence context; e.g., current topic, otherdiscourse effects

Page 50: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

THE END

Codes: http://github.com/andreasvc/disco-dop

Papers: http://staff.science.uva.nl/~acranenb

Page 51: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Wait . . . there’s more

BACKUP SLIDES

Page 52: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Binarization

I mark heads of constituentsI head-outward binarization (parse head first)I no parent annotation: v = 1I horizontal Markovization: h = 1

X

A B C D E F

X

XA

XB

XF

XE

XD

XC$

A B C D E FKlein & Manning (2003): Accurate unlexicalized parsing.

Page 53: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Implementation details

I Cython: combines best of both worldsC speed, Python convenience.

I Where it matters, manual memorymanagement & layout;

I e.g., grammar rules & edges compactly packed inarrays of structs.

I 14k lines of code; FWIW:

Collins parser C 3k (!?)bitpar C++ 6kBerkeley parser Java 58kCharniak & Johnson parser C++ 62kStanford parser Java 151k

Page 54: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Efficiency (Negra dev set)

10 20 30 40

# words

0

10

20

30

40

cpu t

ime (

seco

nds)

dopplcfrspcfg

Page 55: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Parser setuptraincorpus='wsj02-21.export',testcorpus='wsj24.export',corpusdir='../../dptb',stages=[

dict(name='pcfg', mode='pcfg',split=True, markorigin=True,

),dict(

name='plcfrs', mode='plcfrs',prune=True, splitprune=True, k=10000,

),dict(

name='dop', mode='plcfrs',prune=True, k=5000,dop=True, usedoubledop=True, m=10000,estimator='dop1', objective='mpp',

),],[...]

Page 56: Data-Oriented Parsing and Discontinuous Constituents · 2020. 6. 25. · Data-OrientedParsing andDiscontinuousConstituents AndreasvanCranenburgh HuygensING InstituteforLogic,LanguageandComputation

Web-based interface