GALE Banks 11/9/06

Parsing Arabic: Key Aspects of Treebank Annotation

Seth Kulick, Ryan Gabbard, Mitch Marcus

Outline

Summary of recent results
Part of Speech/Treebank “mismatches”
Components of Flat NPs
Test and Train Results
Conclusion

Recent Results

Effect of sentence splitting: S -> S (wa) S (wa) S
  Breaking these improves F-measure by 1.25%
  Investigating automatic accuracy of S splitting

Effect of “spurious NPs” in coordination
  (NP (NP x) and (NP y)) changed to (NP x and y and z)
  Improves F-measure by 0.5%
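A minimal sketch of the coordination flattening described above, not the authors' actual transformation. It assumes trees are bracketed strings, that the conjunction is tagged CONJ, and that a conjunct NP counts as "spurious" when it contains only (POS word) children; `parse`, `show` and `flatten_coordination` are hypothetical helper names.

```python
def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def show(t):
    """Render a nested-list tree back into a bracketed string."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [show(c) for c in t[1:]]) + ")"


def is_preterminal(t):
    """True for a (POS word) node."""
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def flatten_coordination(t):
    """Splice conjunct NPs into a coordinated parent NP."""
    if isinstance(t, str) or is_preterminal(t):
        return t
    kids = [flatten_coordination(c) for c in t[1:]]
    has_conj = any(is_preterminal(k) and k[0] == "CONJ" for k in kids)
    if t[0] == "NP" and has_conj:
        flat = []
        for k in kids:
            # Assumption: a conjunct NP is "spurious" if it holds only (POS word) children.
            if isinstance(k, list) and k[0] == "NP" and all(is_preterminal(c) for c in k[1:]):
                flat.extend(k[1:])
            else:
                flat.append(k)
        kids = flat
    return [t[0]] + kids


tree = parse("(NP (NP (NOUN x)) (CONJ wa) (NP (NOUN y)) (CONJ wa) (NP (NOUN z)))")
print(show(flatten_coordination(tree)))
```

Conjuncts with internal structure are left untouched, so only the extra single-level NP brackets disappear.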

Pos/Treebank Mismatches

“Ideal”: an XP projection is headed by X
Ideal and reality in the PTB and ATB
Ambiguities for (POS word) make the parser’s job harder

VP headed by noun

6% of VPs in ATB have a nonverbal head
Changed heads to have new POS tag: “DV”
Temporary approximation to current annotation changes
0.7 increase in F-measure

(VP (NOUN mugAdar+at+i- [departure])
    (NP-SBJ (POSS_PRON -hi [his]))
    (NP-OBJ (DET+NOUN Al+bayot+a [the house])
            (DET+ADJ Al+>aboyaD+a [the white])))
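A rough sketch of the "DV" retagging, under assumptions that are not in the slides: the head of a VP is taken to be its first (POS word) child, and only heads whose tag contains NOUN are retagged (narrower than "nonverbal"). Glosses are omitted from the example tree; `parse`, `show` and `retag_dv` are hypothetical names.

```python
def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def show(t):
    """Render a nested-list tree back into a bracketed string."""
    if isinstance(t, str):
        return t
    return "(" + " ".join([t[0]] + [show(c) for c in t[1:]]) + ")"


def is_preterminal(t):
    """True for a (POS word) node."""
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def retag_dv(t):
    """Return a copy of the tree in which noun-headed VPs get a DV head tag."""
    if isinstance(t, str) or is_preterminal(t):
        return t
    kids = [retag_dv(c) for c in t[1:]]
    if t[0].split("-")[0] == "VP":
        for i, k in enumerate(kids):
            if is_preterminal(k):
                # Assumption: the first (POS word) child is the head of the VP.
                if "NOUN" in k[0]:
                    kids[i] = ["DV", k[1]]
                break
    return [t[0]] + kids


vp = parse("(VP (NOUN mugAdar+at+i-) (NP-SBJ (POSS_PRON -hi)) "
           "(NP-OBJ (DET+NOUN Al+bayot+a) (DET+ADJ Al+>aboyaD+a)))")
print(show(retag_dv(vp)))
```

Nouns inside the NP arguments keep their tags, since only the VP's own head is inspected.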

NP headed by adj – #1

(S (NP-SBJ (PRON_1S -niy [I]))
   (NP-PRD (ADJ saEiyd+N [happy])))

ADJ heads NP-PRD, elsewhere ADJP-PRD

(VP (PV+PVSUFF_SUBJ kAn+a [be+he])
    (NP-SBJ-1 (-NONE- *T*))
    (ADJP-PRD (ADJ saEiyd+AF [happy])
              (PP … [with the voting])))

NP headed by adj - #2

(VP (IV ta+Eomal+a [they work])
    (NP-SBJ rAbiT+ap+u Al+maxAtyr+i [league of the mukhtars (village chiefs)])
    (NP-ADV (ADJ dA}im+AF [always])))

ADJ heads NP-ADV, elsewhere ADVP, ADJP

(VP (IV na+>omal+a [we hope for])
    (NP-SBJ (-NONE- *))
    (ADVP (ADJ dA}im+AF [always])))

(VP (IV ya+SiH~+u [he/it+be correct])
    (NP-SBJ-1 (-NONE- *T*))
    (ADJP (ADJ dA}im+AF [always])))

ADJP headed by noun

(S (NP-SBJ (NOUN >um~ah+At+u- [mothers])
           (POSS_PRON_3P -hum [their]))
   (ADJP-PRD (NOUN >amiyrokiy~+At+N [American])))

Also as ADJ

(NP (NOUN >um~ah+At+K [mothers]) (ADJ >amiyrokiy~+At+K [American]))

ADVP headed by conj

(S (ADVP (FOCUS_PART >am~A [as_for/concerning]))
   (NP-TPC-1 Haqiyb+ap+u Al+xArijiy~+ap+I [the foreign ministry’s portfolio])
   (ADVP (CONJ fa- [and/so]))
   (VP …))

(CONJ fa-) also as child of S

(S (S …) (PUNC ,) (CONJ fa- [and/so]) (S …))

Mismatches in ATB and PTB

        ATB3      PTB2.0
VP      6.0%      0.5%
NP      5.0%      1.6%
ADJP    7.3%      23.4%
ADVP    45.37%    8.0%
PP      0.8%      1.8%
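One way such XP/X mismatch rates could be counted, as a sketch only. The head rule (first (POS word) child) and the tag substrings in `EXPECTED` are assumptions made for illustration, not the definitions behind the figures above; phrases whose first (POS word) child is an empty element are skipped.

```python
from collections import Counter


def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def is_preterminal(t):
    """True for a (POS word) node."""
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


# Assumption: a head tag "matches" its phrase if it contains one of these substrings.
EXPECTED = {
    "VP": ("PV", "IV", "CV", "VERB"),
    "NP": ("NOUN", "PRON"),
    "ADJP": ("ADJ",),
    "ADVP": ("ADV",),
    "PP": ("PREP",),
}


def mismatch_rates(trees):
    """Fraction of phrases per category whose assumed head has an unexpected POS tag."""
    total, bad = Counter(), Counter()

    def visit(t):
        if isinstance(t, str) or is_preterminal(t):
            return
        cat = t[0].split("-")[0]      # strip function tags such as -SBJ
        # Assumption: the first (POS word) child is the head.
        head = next((k for k in t[1:] if is_preterminal(k)), None)
        if cat in EXPECTED and head is not None and head[0] != "-NONE-":
            total[cat] += 1
            if not any(p in head[0] for p in EXPECTED[cat]):
                bad[cat] += 1
        for k in t[1:]:
            visit(k)

    for t in trees:
        visit(t)
    return {c: bad[c] / total[c] for c in total}


trees = [
    parse("(S (NP-SBJ (PRON_1S -niy)) (NP-PRD (ADJ saEiyd+N)))"),
    parse("(S (ADVP (CONJ fa-)) (VP (IV ya+SiH~+u) (ADJP (ADJ dA}im+AF))))"),
]
print(mismatch_rates(trees))
```

A real count would need a proper head-finding table for each treebank; the first-child rule here is only a stand-in.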

XP/X mismatches - Summary

This matters:
  Headless VPs to “DV” modification: +0.7%
  PTB: 23.4% mismatch for ADJP
    Overall: 88.28; ADJP: 70.68

Real-life linguistic complexity
  Need guidelines - visual prop time
  Some automatic changes likely

No guarantee of level of improvement, but:
  Should be a priority

Flat NPs

Flat NPs: only (POS word) children
Experiment:
  Evaluate with Flat NPs as a different bracket
  Affects overall score

(Gold)
(NP (NOUN -<ijorA'+i [conducting])
    (NP (NOUN {inotixAb+At+K [elections])
        (ADJ niyAbiy~+ap+K [representative])))

Flat NPs

(Gold)
(NP (NOUN -<ijorA'+i [conducting])
    (NP (NOUN {inotixAb+At+K [elections])
        (ADJ niyAbiy~+ap+K [representative])))

(Test)
(NP (NN -<ijorA'+i [conducting])
    (NNS {inotixAb+At+K [elections])
    (JJ niyAbiy~+ap+K [representative]))

Under regular evaluation, top NPs match

Flat NPs

(Gold)
(NP (NOUN -<ijorA'+i [conducting])
    (FLATNP (NOUN {inotixAb+At+K [elections])
            (ADJ niyAbiy~+ap+K [representative])))

(Test)
(FLATNP (NN -<ijorA'+i [conducting])
        (NNS {inotixAb+At+K [elections])
        (JJ niyAbiy~+ap+K [representative]))

With FlatNP evaluation, no match
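The Gold/Test comparison can be reproduced with a small labeled-bracket scorer. This is a sketch under assumptions: every word sits under a POS tag, function tags are stripped, and an NP is relabeled FLATNP when all its children are (POS word) nodes. It is not the evaluation tool used for the reported scores; all function names are hypothetical and glosses are omitted.

```python
from collections import Counter


def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def is_preterminal(t):
    """True for a (POS word) node."""
    return isinstance(t, list) and len(t) == 2 and isinstance(t[1], str)


def mark_flat_nps(t):
    """Relabel NPs that have only (POS word) children as FLATNP."""
    if isinstance(t, str) or is_preterminal(t):
        return t
    kids = [mark_flat_nps(c) for c in t[1:]]
    label = t[0]
    if label.split("-")[0] == "NP" and all(is_preterminal(k) for k in kids):
        label = "FLATNP"
    return [label] + kids


def brackets(t):
    """Multiset of (label, start, end) spans for all phrasal nodes."""
    spans = []

    def walk(node, start):
        if is_preterminal(node):
            return start + 1          # one word consumed
        end = start
        for k in node[1:]:
            end = walk(k, end)
        spans.append((node[0].split("-")[0], start, end))
        return end

    walk(t, 0)
    return Counter(spans)


def f_measure(gold, test):
    """Harmonic mean of labeled bracket precision and recall."""
    g, t = brackets(gold), brackets(test)
    matched = sum((g & t).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(t.values())
    recall = matched / sum(g.values())
    return 2 * precision * recall / (precision + recall)


gold = parse("(NP (NOUN -<ijorA'+i) (NP (NOUN {inotixAb+At+K) (ADJ niyAbiy~+ap+K)))")
test = parse("(NP (NN -<ijorA'+i) (NNS {inotixAb+At+K) (JJ niyAbiy~+ap+K))")
print(f_measure(gold, test))                                 # top NPs match
print(f_measure(mark_flat_nps(gold), mark_flat_nps(test)))   # no bracket matches
```

Under the regular labels the test tree gets credit for the top NP; once flat NPs carry their own label, the same test tree matches nothing.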

Flat NPs

Importance of Flat NPs
  30% of brackets are Flat NPs
  Errors percolate up

ATB3 score on Flat NPs not good enough
  Unclear why, but need some things from ATB

          Flat NPs    Overall
PTB2.0    94.20       87.54
ATB3      86.77       77.27

Flat NPs

Clear statement of what can go in flat NPs
Regular expressions for each head
Certain things fall out:
  Questionable categories, e.g. (DET+NOUN DET+NOUN)
    (NP Al+baHor+i [the sea] Al+>aHomar+i [the red])
  Nouns that occur before a head noun are limited to a small class: quantifiers
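To illustrate what a per-head regular expression over POS tags might look like, here is a sketch. The pattern is hypothetical and is not taken from any ATB guideline: an optional bare NOUN standing in for a quantifier, a head noun, then any number of adjectives.

```python
import re

# Hypothetical pattern for noun-headed flat NPs, written over space-joined POS tags.
FLAT_NP = re.compile(r"(NOUN )?(DET\+)?NOUN( (DET\+)?ADJ)*")


def flat_np_ok(tags):
    """True if the POS tag sequence of a flat NP fits the assumed pattern."""
    return FLAT_NP.fullmatch(" ".join(tags)) is not None


print(flat_np_ok(["NOUN", "DET+NOUN", "DET+ADJ"]))   # quantifier + noun + adjective
print(flat_np_ok(["DET+NOUN", "DET+NOUN"]))          # the questionable case above
```

A check of this kind would flag the (DET+NOUN DET+NOUN) sequence for review while accepting a bare NOUN before the head.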

Flat NPs

Quantifier as prenominal modifier in flat NP:

(NP (NOUN kul~+a [every/all/each_one])
    (DET+NOUN Al+nuSuws+I [the texts])
    (DET+ADJ Al+tijAriy~+ap+I [the business]))

Quantifier as taking NP complement:

(NP (NOUN kul~+a [every/all/each_one])
    (NP (DET+NOUN Al+duwal+i [the countries])
        (DET+ADJ Al+Earabiy~+ap+I [the Arabic])))

Quantifiers take an NP complement in 15% of cases

Flat NPs - Summary

Real-life linguistic complexity
  Need guidelines for NP structure, quantifiers
  Some automatic changes likely
  Maybe different POS tag for NOUNs with different distribution?

No guarantee of level of improvement, but:
  Should be a priority

Test on Train

ATB3 lower, but not so much
Analysis of dependency errors

          All      <=40
PTB2.0    96.80    97.10
ATB3      94.31    95.34

Dependency Analysis

                     PTB2.0               ATB3
Dependency           % all     Fmeas      % all     Fmeas
NPB -> head mod      31.08%    99.19%     16.33%    95.83%
NP -> head NP        0.0%      N/A        10.13%    97.08%

% all = % of all dependencies
NPB = “base NP”, non-recursive NP
More evidence that minimal NPs matter a lot

Dependency Analysis

                     PTB2.0               ATB3
Dependency           % all     Fmeas      % all     Fmeas
NP -> NPB PP         5.23%     94.78      5.74%     89.05
NP -> NP PP          0.04%     30.40      1.28%     65.08

Why the difference in PP adjoining to NP, and not just NPB?

PP attachment in PTB

Adjuncts at the same level

Okay:
(NP (NP ….) (PP ….) (PP …))

Not okay:
(NP (NP (NP …) (PP …))
    (PP …))

This is true for ATB also

PP attachment in PTB

(NP (NP streets) (PP of (NP (NP the city) (PP of (NP Long Beach)) (PP in (NP the state…)))))

(NP (NP streets) (PP of (NP (NP (NP the city) (PP of (NP Long Beach))) (PP in (NP the state…)))))

First is okay, second is not
PPs in PTB do not adjoin to recursive NPs
PPs in ATB do, because of Al<DAfp (the idafa construction)

PP attachment in PTB and ATB

(NP (NP streets) (PP of (NP (NP (NP the city) (PP of (NP Long Beach))) (PP in (NP the state…)))))

(NP ($awAriE [streets]) (NP (NP (NP madinyn+ap [the city]) (NP luwnog byt$ [Long Beach])) (PP fiy [in] (NP wilAy+ap [the state] .. ))))

PTB: PP adjoining to recursive NP – bad structure

ATB: PP adjoining to recursive NP – good structure
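The PTB constraint lends itself to an automatic consistency check. The sketch below flags every NP whose first child is a recursive NP (one that itself has an NP child) followed by a PP sibling. It assumes bracketed-string input with function tags after a hyphen; `pp_on_recursive_np` and the other helper names are hypothetical.

```python
def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def label(t):
    """Phrase label without function tags, e.g. NP-SBJ -> NP."""
    return t[0].split("-")[0]


def is_phrase(t, cat):
    return isinstance(t, list) and label(t) == cat


def is_recursive_np(t):
    """An NP that directly contains another NP."""
    return is_phrase(t, "NP") and any(is_phrase(k, "NP") for k in t[1:])


def pp_on_recursive_np(t, found=None):
    """Collect NPs of the form (NP (NP (NP ...) ...) (PP ...))."""
    found = [] if found is None else found
    if isinstance(t, str):
        return found
    kids = t[1:]
    if (label(t) == "NP" and kids and is_recursive_np(kids[0])
            and any(is_phrase(k, "PP") for k in kids[1:])):
        found.append(t)
    for k in kids:
        pp_on_recursive_np(k, found)
    return found


ok = parse("(NP (NP streets) (PP of (NP (NP the city) "
           "(PP of (NP Long Beach)) (PP in (NP the state)))))")
bad = parse("(NP (NP streets) (PP of (NP (NP (NP the city) "
            "(PP of (NP Long Beach))) (PP in (NP the state)))))")
print(len(pp_on_recursive_np(ok)), len(pp_on_recursive_np(bad)))
```

On PTB trees a hit would indicate a likely annotation error; on ATB trees the same configuration is legitimate, so there the check only measures how often it occurs.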

Dependency Analysis

                     PTB2.0               ATB3
Dependency           % all     Fmeas      % all     Fmeas
NP -> NPB PP         5.23%     94.78      5.74%     89.05
NP -> NP PP          0.04%     30.40      1.28%     65.08

Parser distinguishes NPB, which helps for PTB
A wider range of attachment possibilities for ATB
Challenge for the parser

Conclusion

We need guidelines
We need to create the guidelines

Interaction - Parsing and Treebank
  Identify useful consistency checks
  Run as part of each release

Better understanding of problematic areas
  What sort of changes are necessary?
  Parsing: automatic transformations
  Treebank: POS changes, etc.

Proper time allocation?