Robust Constituent-to-Dependency Conversion for English
DESCRIPTION
This paper suggests a robust way of converting constituent-based trees in the Penn Treebank style into dependency trees for several different English corpora. For English, conversion tools already exist. However, these tools are often customized enough for a specific corpus that they do not necessarily work as well when applied to different corpora involving newly introduced POS tags or annotation schemes. The desire to improve conversion portability motivated us to build a new conversion tool that produces more robust results across different corpora. In particular, we have modified the treatment of head-percolation rules, function tags, coordination, gapping, and empty category mappings. We compare our method with the LTH conversion tool used for the CoNLL'07-09 shared tasks. For our experiments, we use 6 different English corpora from OntoNotes release 4.0. To demonstrate the impact our approach has on parsing, we train and test two state-of-the-art dependency parsers, MaltParser and MSTParser, and our own parser, ClearParser, using converted output from both the LTH tool and our method. Our results show that our method removes certain unnecessary non-projective dependencies and generates fewer unclassified dependencies. All three parsers give higher parsing accuracies on average across these corpora using data generated by our method, especially on semantic dependencies.
TRANSCRIPT
Robust Constituent-to-Dependency Conversion for English
Jinho D. Choi & Martha Palmer
University of Colorado at Boulder
December 3rd, 2010
The 9th International Workshop on Treebanks and Linguistic Theories
Dependency Structure
• What is dependency?
- Syntactic or semantic relation between a pair of words.
• Phrase structure vs. dependency structure
- Constituents vs. dependencies
[Figure: dependency examples: "places in this city" (LOC, NMOD, PMOD) and "events ... year" (TMP); phrase-structure tree (S → NP VP) vs. dependency tree for "He bought a car" (root → bought; SBJ → He; OBJ → car; NMOD → a)]
Dependency Graph
• For a sentence s = w1 ... wn, a dependency graph Gs = (Vs, Es)
- Vs = {w0 = root, w1, ... , wn}
- Es = {(wi, r, wj) : wi ≠ wj, wi ∈ Vs, wj ∈ Vs - {w0}, r ∈ Rs}
- Rs = a set of all dependency relations in s
• A well-formed dependency graph ➜ dependency tree
- Unique root, single head, connected, acyclic
- Projective vs. non-projective
[Figure: non-projective vs. projective dependency trees for "He bought a car yesterday that is red" and "He bought a car that is red"]
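The four well-formedness conditions above can be made concrete with a small checker. A minimal sketch (a hypothetical helper, not the authors' code, assuming words are indexed 1..n with the artificial root at index 0):

```python
# A minimal well-formedness check following the slide's definition of a
# dependency graph Gs = (Vs, Es) with Vs = {w0 = root, w1, ..., wn}.
# Illustrative sketch only, not the authors' conversion tool.

def is_well_formed(n, edges):
    """n: number of words (excluding the artificial root w0).
    edges: (head, relation, dependent) triples, indices 0..n."""
    heads = {}
    for head, _, dep in edges:
        if dep == 0 or head == dep:
            return False            # the root takes no head; no self-loops
        if dep in heads:
            return False            # single head per word
        heads[dep] = head
    if len(heads) != n:
        return False                # every word needs exactly one head
    if list(heads.values()).count(0) != 1:
        return False                # unique root: exactly one child of w0
    for w in range(1, n + 1):       # acyclic + connected: reach w0 from w
        seen, cur = set(), w
        while cur != 0:
            if cur in seen:
                return False
            seen.add(cur)
            cur = heads[cur]
    return True

# "He bought a car": root -> bought(2); SBJ: He(1), OBJ: car(4), NMOD: a(3)
tree = [(0, "ROOT", 2), (2, "SBJ", 1), (4, "NMOD", 3), (2, "OBJ", 4)]
print(is_well_formed(4, tree))  # True
```

Projectivity is a separate property and is not checked here; both projective and non-projective graphs can be well-formed trees.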
Why Constituent-to-Dependency?
• Dependency Treebanks in English
- Prague English Dependency Treebank (≈ 0.28M words)
- ISLE English Dependency Treebank (?)
• Constituent Treebanks in English
- The Penn Treebank (> 1M words)
- The OntoNotes Treebank (> 2M words)
- The Genia Treebank (≈ 0.48M words)
- The Craft Treebank (≈ 0.79M words)
• By performing the conversion, we get larger corpora with more diversity.
Previous Conversion Tools
• Constituent-to-dependency conversion tools
- Penn2Malt.
- LTH conversion tool (Johansson and Nugues, 2007).
- Stanford dependencies (de Marneffe et al., 2006).
• LTH conversion tool
- Used for CoNLL 2007 - 2009.
- Generates semantic dependencies from function tags and non-projective dependencies using empty categories.
- Customized for the original Penn Treebank.
• The Penn Treebank style phrase structure has since undergone some changes.
Changes in Penn Treebank Style Phrase Structure
• Tokenized hyphenated words, inserted NML phrases.
• Introduced some new phrase/token-level tags.
[Figure: ADJP "New York-based": originally one hyphenated token (NNP JJ); now tokenized with an inserted NML phrase (NML → NNP NNP "New York", HYPH "-", VBN "based"). Example of the new EDITED tag: "He met these people, this group of people"]
Updated Conversion Tool
• Motivations
- The conversion tool needs to be updated as the phrase structure format changes.
- The conversion tool needs to perform robustly across different corpora.
- The conversion tool may generate dependency trees with empty categories.
• Contributions
- Fewer unclassified dependencies.
- Fewer unnecessary non-projective dependencies.
- More robust parsing performance across different corpora.
Constituent-to-Dependency Conversion
• Conversion steps
1. Use head-percolation rules to find the head of each constituent, and make it the parent of all other nodes in the constituent.
2. For certain empty categories (e.g., *T*, *ICH*), make their antecedents children of the empty categories’ parents.
3. Label all dependencies by comparing relations between all head-dependent pairs.
• Head-percolation rules
- A set of rules that defines the head of each constituent.
- e.g., the head of a noun phrase ::= the rightmost noun.
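Step 1 above can be sketched as a small head-finding routine. The rule format here (a per-constituent search direction plus a POS-tag priority list) is a common convention and an assumption, not the authors' exact rule file:

```python
# Illustrative head-percolation sketch: pick the head child of a
# constituent by scanning its children in a rule-specified direction,
# trying higher-priority tags first. Hypothetical rule set, not the
# authors' actual rules.

HEAD_RULES = {
    # constituent label: (search direction, tag priorities)
    "NP": ("right-to-left", ["NN", "NNS", "NNP", "NNPS", "NP"]),
    "VP": ("left-to-right", ["VBD", "VBZ", "VBP", "VB", "VBN", "VBG", "VP"]),
}

def find_head(label, children):
    """children: list of (tag, index) pairs for one constituent."""
    direction, priorities = HEAD_RULES.get(label, ("left-to-right", []))
    order = children if direction == "left-to-right" else list(reversed(children))
    for tag in priorities:              # try higher-priority tags first
        for child_tag, idx in order:
            if child_tag == tag:
                return idx
    return order[0][1]                  # default: first child in search order

# NP "a car": the rightmost noun wins
print(find_head("NP", [("DT", 0), ("NN", 1)]))  # 1
```

Once each constituent's head is found, all other nodes in the constituent become its dependents, which yields the unlabeled tree that steps 2 and 3 then adjust and label.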
Head-percolation Rules
Constituent-to-Dependency Conversion
[Figure: conversion of "How far do you expect to run?": constituent tree (1:WHNP-1, 2:VP, 3:VP, 4:S, 5:VP, 6:SQ, 7:SBARQ, 8:TOP) with empty categories *-2 and *T*-1; dependency trees with and without the empty categories; mapping the *T* trace to its antecedent yields a non-projective dependency]
Small Clauses
• Object predicates in adjectival small clauses
- LTH tool: direct children of the main verbs.
- Ours: direct children of the subject-nouns.
[Figure: "He made us happy" (S → NP VP; VP → made S-1; S-1 → NP-SBJ ADJP-PRD): LTH attaches "happy" to "made" (OPRD); ours attaches "happy" to "us" (PRD). PropBank roleset for "make" (cause to be): ARG0 impeller to action, ARG1 impelled, predication]
Coordination
• A phrase contains coordination if
- It contains a conjunction (CC) or a conj-phrase (CONJP).
- It is tagged as an unlike coordinated phrase (UCP).
- It contains a child annotated with the function tag ETC.
• Find correct left and right conjunct pairs
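The detection conditions above amount to a simple predicate over a constituent and its children. A sketch under an assumed tuple representation (not the authors' data structures):

```python
# Sketch of the coordination test from the slide. A constituent is given
# as its label plus a list of (child_label, child_function_tags) pairs.
# Hypothetical representation for illustration only.

def has_coordination(label, children):
    if label == "UCP":                       # unlike coordinated phrase
        return True
    for child_label, ftags in children:
        if child_label in ("CC", "CONJP"):   # conjunction or conj-phrase
            return True
        if "ETC" in ftags:                   # child with function tag ETC
            return True
    return False

# VP "sold old books and then bought new books": VP CC ADVP VP
print(has_coordination("VP", [("VP", []), ("CC", []), ("ADVP", []), ("VP", [])]))  # True
print(has_coordination("NP", [("DT", []), ("NN", [])]))                            # False
```

Finding the correct left and right conjunct pairs is a separate step that runs only on constituents this test flags.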
Coordination
[Figure: "We sold old books and then bought new books": constituent tree (S → NP VP; VP → VP CC ADVP VP) and the resulting dependency trees showing the left and right conjunct pairing]
[Figure: constituent tree for the gapping example "Some said Putin visited ... in April, some ... May": SBAR with empty category *0*, gap coindexing NP-1/NP=1 and NP-2/NP=2]
Gapping Relations
• Parsing gapping relations is hard.
- It is hard to distinguish them from coordination.
- There are not many instances to train with.
Gapping Relations
[Figure: dependency trees for "Some said1 Putin visited in April, some said2 May": LTH labels the gapped arguments with composed labels (GAP-SBJ, GAP-PMOD); ours links said2 to said1 with a single GAP relation and keeps regular labels (SBJ, TMP) on its arguments]
Empty Category Mappings
[Figure: "I know his admiration for *RNR*-1 and trust in *RNR*-1 you": coordinated NML phrases in which both *RNR* traces reference the same antecedent NP-1 "you"]
• *RNR* (right node raising)
- Two or more *RNR* nodes can be referenced to the same antecedent.
- Map the antecedent to its closest *RNR* node.
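The remapping above reduces to picking the nearest trace. A sketch that measures closeness by token distance (an assumption; hypothetical helper, not the authors' implementation):

```python
# Sketch of the *RNR* remapping: when several *RNR* traces share one
# antecedent, attach the antecedent at the trace closest to it. Token
# distance is used as the closeness measure here, which is an assumption.

def closest_rnr(antecedent_pos, rnr_positions):
    """Return the token position of the *RNR* trace nearest the antecedent."""
    return min(rnr_positions, key=lambda p: abs(p - antecedent_pos))

# "I know his admiration for *RNR*-1 and trust in *RNR*-1 you":
# antecedent "you" at token 10; traces at tokens 5 and 9.
print(closest_rnr(10, [5, 9]))  # 9
```

Attaching the antecedent at only the closest trace avoids giving one word two heads, keeping the output a single-head dependency tree.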
Experiments
• Corpora
- OntoNotes v4.0.
• Dependency parsers
- MaltParser: swap-lazy algorithm, LibLinear
- MSTParser: Chu-Liu-Edmonds algorithm, MIRA
- ClearParser: shift-eager algorithm, LibLinear
        EBC     EBN     SIN     XIN     WEB     WSJ     ALL
Train   14,873  11,968  7,259   3,156   13,419  12,311  62,986
Eval.   1,291   1,339   1,066   1,296   1,172   1,381   7,545
Avg.    15.21   19.49   23.36   29.77   22.01   24.03   21.02
Constituent-to-Dependency Conversion
• Distributions of unclassified dependencies (in %)

       EBC    EBN    SIN    XIN    WEB    WSJ    ALL
LTH    4.77   1.51   1.16   1.63   1.93   1.93   2.20
Our    0.86   0.57   0.33   0.44   1.03   0.25   0.60

• Distributions of non-projective dependencies (in %)

            EBC     EBN    SIN    XIN    WEB     WSJ    ALL
LTH - Dep   1.44    0.81   0.70   0.29   0.95    0.51   0.82
Our - Dep   1.29    0.73   0.69   0.21   0.83    0.46   0.73
LTH - Sen   11.14   8.66   8.47   5.30   11.29   7.27   9.27
Our - Sen   9.19    7.39   8.22   3.75   9.02    6.24   7.78
Constituent-to-Dependency Conversion
• Parsing accuracy when trained and tested on the same corpora (in %)
              EBC     EBN     SIN     XIN     WEB     WSJ     ALL
Malt - LTH    82.91   86.38   86.20   84.61   85.10   86.93   85.44
Malt - Our    83.20   86.40   86.03   84.85   85.45   87.40   85.65
Clear - LTH   83.36   86.32   86.80   85.50   85.53   87.15   85.88
Clear - Our   84.06   86.77   86.55   85.41   85.70   87.58   86.09
MST - LTH     81.64   85.47   85.02   84.10   84.05   85.93   84.49
MST - Our     82.54   85.68   85.11   83.85   84.03   86.43   84.69
Constituent-to-Dependency Conversion
• Parsing accuracy when trained and tested on different corpora (in %)
              EBC     EBN     SIN     XIN     WEB     WSJ     ALL
Malt - LTH    74.80   82.40   81.74   79.39   80.42   80.59   80.01
Malt - Our    75.60   83.05   81.81   81.46   80.81   81.17   80.85
Clear - LTH   76.37   83.16   83.53   81.29   81.83   81.29   81.36
Clear - Our   77.14   84.16   83.66   82.45   82.26   82.32   82.16
MST - LTH     76.65   82.45   82.29   80.46   80.64   80.02   80.49
MST - Our     77.20   83.06   82.52   80.88   80.82   81.04   81.01
All parsers gave significantly more accurate results with our conversion when trained and tested on different corpora.
Conclusion
• Aims
- Updated the conversion tool with respect to the changes in Penn Treebank style phrase structure.
- Robust conversion across different corpora.
• Contributions
- Fewer unclassified dependencies.
- Fewer unnecessary non-projective dependencies.
- More robust parsing performance across different corpora.
• ClearParser open-source project
- http://code.google.com/p/clearparser/
Acknowledgements
• Special thanks to Joakim Nivre for helpful insights.
• We gratefully acknowledge the support of the National Science Foundation Grants CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation, and CISE-CRI-0709167, Collaborative: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu, and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.