
Exploring Hindi Treebank and PropBank

Ashwini Vaidya, 30th September 2015

Hindi and Urdu Treebank Project

• 400,000 Hindi words* and 200,000 Urdu words
• Multi-layered and multi-representational
  – Syntactic + semantic annotation ('layers')
  – Dependency + phrase structure ('representation')

• Hindi corpus consists of newswire text from ‘Amar Ujala’

• Create a linguistic resource for Hindi in the tradition of other Treebanks (Penn Treebank, Prague Dependency Treebank)

*~21,000 sentences

Outline

• 3 Representations
  – DS: Dependency Structure
  – PB: PropBank (lexical predicate-argument structure)
  – PS: Phrase Structure
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs

Hindi Dependency Treebank

• The Hindi/Urdu Treebank is annotated using a dependency grammar framework

• Framework used is CPG (Computational Paninian Grammar): Panini's 'karaka' theory adapted for the annotation scheme

Dependency labels

[Figure: example dependency tree for raam ne kal kaam kiyaa 'Ram did the work yesterday': raam ne 'Ram-erg' is labelled k1, kaam 'work' is k2, and kal 'yesterday' is k7t, all dependents of the verb kiyaa 'do'.]

Example of a dependency tree. Labels denote relations between a modifier and the modified element.

• Karakas are the relations between head and child nodes in the treebank

• Relations are depicted between word chunks and not individual tokens
  – E.g. a verb chunk can consist of a finite verb along with its auxiliaries
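To make the representation concrete, here is a minimal sketch of how such a chunk-level dependency tree could be held in code, using the tree from the figure above; the triple layout is an illustrative assumption, not the Treebank's actual file format.

# Illustrative sketch: the dependency tree from the figure above as
# (dependent chunk, karaka label, head chunk) triples.
deps = [
    ("raam ne", "k1",  "kiyaa"),   # Ram-erg   -> karta
    ("kaam",    "k2",  "kiyaa"),   # work      -> karma
    ("kal",     "k7t", "kiyaa"),   # yesterday -> temporal location
]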

An Example

Example:

meraa badZaa bhaaii bahuta phala khaataa hai

‘My elder brother eats lots of fruits.’

(COLING'12)

An Example (Contd...)

Morph Analysis:

meraa   <fs root=meraa, cat=pron, gend=any, num=sg, pers=1, case=o>

badZaa  <fs root=badZaa, cat=adj, gend=m>

bhaaii  <fs root=bhaaii, cat=n, gend=m, num=sg, pers=3, case=d>

bahuta  <fs root=bahuta, cat=adj, gend=any>

phala   <fs root=phala, cat=n, gend=m, num=any, pers=3, case=d>

khaataa <fs root=khaa, cat=v, gend=m, num=sg, pers=3, TAM=taa>

hai     <fs root=hai, cat=v, gend=any, num=any, pers=3>

(COLING'12)

An Example (Contd ..)

POS Tagging:

meraa_PRP baDzaa_JJ bhaaii_NN bahuta_QF

phala_NN khaataa_VM hai_VAUX

Chunking:

((meraa_PRP))_NP

((baDzaa_JJ bhaaii_NN))_NP

((bahuta_QF phala_NN))_NP

((khaataa_VM hai_VAUX))_VG

(COLING'12)
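As an illustration of how the chunked sentence above might be held in memory before dependency annotation, a minimal sketch; the list-of-tuples layout is an assumption, not the project's actual data format.

# Each chunk: (chunk tag, [(token, POS tag), ...]), mirroring the output above.
chunks = [
    ("NP", [("meraa", "PRP")]),
    ("NP", [("baDzaa", "JJ"), ("bhaaii", "NN")]),
    ("NP", [("bahuta", "QF"), ("phala", "NN")]),
    ("VG", [("khaataa", "VM"), ("hai", "VAUX")]),
]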

An Example (Contd...)

Dependency Relation:

[Figure: dependency tree of the example sentence] (COLING'12)

Selected dependency labels [total = 43]

k1      karta (similar to agent/doer)
k2      karma (similar to patient/theme)
k3      instrument
k4      beneficiary
k5      source
k7t     temporal location
k7p     spatial location
k1s     noun complement
k2p     destination
pk1, mk1, jk1   causer, mediator-causer, causee
rh      cause
rt      purpose
rsp     duration
adv     adverb (manner)
pof     part-of (complex predicates)
ccof    conjunction
fragof  fragment-of

Classification of labels in tagset

• Based on their syntactic and semantic behaviour, we find 6 categories of labels (Vaidya and Husain, 2010)

[Figure: the labels k1, k2, k3, k4, k5, k2p, k1s, k7p, k7t, rh, rt, nmod, pof, fragof, ccof grouped into six categories:]
  – Invariant syntactic labels
  – Local semantic labels
  – Global semantic labels
  – Modifier labels
  – 'pof'-type labels
  – 'ccof' labels

Invariant syntactic labels

• k1 'karta' and k2 'karma'
• Invariant across syntactic alternations like voice
• E.g. aaj khuub mithai khaai gai
       today many sweets eaten go.pst
       'Today many sweets were eaten'

• The label for ‘sweets’ is k2, although ‘sweets’ is now the passivized subject

• This property allows for mapping with PropBank roles Arg0, Arg1

Local semantic labels

• The relation between verb and dependent is 'local'
• These labels are "relevant to the verb meaning in question"

E.g. Ram ne Mohan ko kahaani sunaai
     Ram erg Mohan acc story told.perf
     'Ram told Mohan a story'

• Mohan is labelled 'k4' (beneficiary), a local semantic label
• However, it is a label specific to certain verbs only, e.g. denaa 'to give', kahnaa 'to say'

Local semantic labels

• Other local semantic labels include
  – k4a 'anubhava karta' (experiencer), with verbs like mila 'find', dikha 'see', laga 'feel'
  – k2p 'goal', with verbs like pahuMca 'reach', jaa 'go'

• The interpretation of these labels is closely bound with the meaning of the verb

Global semantic labels

• These labels are relevant “across different verbs and verb meanings”

E.g. maine aaj pustak khariidi
     I-erg today book bought
     'I bought a book today'

• Here ‘aaj (today)’ has the label k7t or ‘time’ which does not change across different verb meanings

Global semantic labels

• More examples of global semantic labels are:
  – k7p 'place'
  – rh 'reason'
  – rsp 'duration'

• Not tied to the meaning of a verb

Invariant syntactic relations
  k1   karta (similar to agent/doer)
  k2   karma (similar to patient/theme)

Local semantic relations
  k3   instrument
  k4   beneficiary
  k5   source
  k2p  destination
  k1s  noun complement

Global semantic relations
  k7p  spatial location
  k7t  temporal location
  rh   cause
  rt   purpose
  rsp  duration

Modifier relations
  nmod  noun modification

pof-type relations
  pof     part-of (complex predicates)
  fragof  fragment-of

ccof-type relations
  ccof  conjunction

Dependency Structure

• Local vs. global distinctions would identify the core participants in the verb’s event

• Mapping to other frameworks that make distinctions between the core and non-core participants will be easier

• We examine such a mapping with the PropBank labels in the following section

Outline

– DS: Dependency Structure ✓
– PB: PropBank (lexical predicate-argument structure)
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs

Proposition Bank

• A PropBank is a large annotated corpus of predicate-argument information

• A set of semantic roles is defined for each verb
• A syntactically parsed corpus is then tagged with verb-specific semantic role information

English PropBank

• English PropBank envisioned as the next level of Penn Treebank (Kingsbury & Palmer, 2003)

• Added a layer of predicate-argument information to the Penn Treebank

• Broad in its coverage, covering every instance of a verb and its semantic arguments in the corpus

English PropBank Annotation

• Two steps are involved in annotation
  – Choose a sense ID for the predicate
  – Annotate the arguments of that predicate with semantic roles
• This requires two components: frame files and the PropBank tagset

PropBank Frame files

• PropBank defines semantic roles on a verb-by-verb basis

• This is defined in a verb lexicon consisting of frame files

• Each predicate will have a set of roles associated with a distinct usage

• A polysemous predicate can have several rolesets within its frame file

An example

• John rings the bell

ring.01  Make sound of bell
  Arg0   Causer of ringing
  Arg1   Thing rung
  Arg2   Ring for

An example

• John rings the bell
• Tall aspen trees ring the lake

ring.01  Make sound of bell
  Arg0   Causer of ringing
  Arg1   Thing rung
  Arg2   Ring for

ring.02  To surround
  Arg1   Surrounding entity
  Arg2   Surrounded entity

An example

• [John] rings [the bell]  → ring.01
• [Tall aspen trees] ring [the lake]  → ring.02

ring.01  Make sound of bell
  Arg0   Causer of ringing
  Arg1   Thing rung
  Arg2   Ring for

ring.02  To surround
  Arg1   Surrounding entity
  Arg2   Surrounded entity

An example

• [John]Arg0 rings [the bell]Arg1  → ring.01
• [Tall aspen trees]Arg1 ring [the lake]Arg2  → ring.02

ring.01  Make sound of bell
  Arg0   Causer of ringing
  Arg1   Thing rung
  Arg2   Ring for

ring.02  To surround
  Arg1   Surrounding entity
  Arg2   Surrounded entity
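As a simplified illustration, the frame file for 'ring' can be pictured as a small lexicon entry mapping each roleset to its argument descriptions, and an annotated instance as a roleset choice plus labelled spans. The Python layout below is illustrative only, not the actual PropBank frame-file format.

# Illustrative sketch of a frame "file" for 'ring', using the two rolesets
# shown on the slides above.
ring_frame = {
    "ring.01": {  # Make sound of bell
        "Arg0": "causer of ringing",
        "Arg1": "thing rung",
        "Arg2": "ring for",
    },
    "ring.02": {  # To surround
        "Arg1": "surrounding entity",
        "Arg2": "surrounded entity",
    },
}

# Annotation follows the two steps described earlier:
# 1) pick the roleset that fits the usage, 2) label the arguments.
# "Tall aspen trees ring the lake" -> ring.02, trees = Arg1, lake = Arg2
annotation = {"roleset": "ring.02",
              "Arg1": "Tall aspen trees",
              "Arg2": "the lake"}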

[Figure: PropBank annotation pane in Jubilee]

English PropBank Tagset

• Numbered arguments: Arg0, Arg1, and so on up to Arg4
• Modifiers with function tags, e.g. ArgM-LOC (location), ArgM-TMP (time), ArgM-PRP (purpose)

• Modifiers give additional information about when, where or how the event occurred

Using PropBank

• As a computational resource
  – Train semantic role labellers (Pradhan et al., 2005)
  – Question answering systems (with FrameNet)
  – Project semantic roles onto a parallel corpus in another language (Pado & Lapata, 2005)
• For linguists, to study various phenomena related to predicate-argument structure

Developing Hindi PropBank

• Making a PropBank resource for a new language
  – Linguistic differences
    • Capturing relevant language-specific phenomena
  – Annotation practices
    • Maintain similar annotation practices
  – Consistency across PropBanks

Hindi PropBank

• PropBank annotation on dependency trees has some advantages

• Hindi Treebank uses a large set of dependency labels that have rich semantic information

[Figure: dependency tree for 'Ram gave the woman money': raam ne 'Ram-erg' is labelled k1/Arg0, aurat ko 'woman-dat' is k4/Arg2, and paise 'money' is k2/Arg1, all dependents of the verb diyaa 'gave'.]

Annotating Hindi PropBank

• HPB consists of 26 labels, including arguments and modifiers
  – Numbered arguments: an individual verb's semantic arguments, e.g. Arg0-Arg3
  – Modifiers: not specific to the verb, labelled ArgM, e.g. ArgM-LOC, ArgM-PRP
• 44,546 individual verb tokens in Hindi PropBank, of which 40% are complex predicates

Annotating Hindi PropBank

Label  Description
ARG0   Agent
ARG1   Patient, theme, undergoer
ARG2   Beneficiary
ARG3   Instrument

Annotating Hindi PropBank

Label     Description
ARG0      Agent
ARG1      Patient, theme, undergoer
ARG2      Beneficiary
ARG3      Instrument
ARG2-ATR  Attribute
ARG2-LOC  Location
ARG2-GOL  Goal
ARG2-SOU  Source

(numbered arguments)

Annotating Hindi PropBank

Label     Description
ARG0      Agent
ARG1      Patient, theme, undergoer
ARG2      Beneficiary
ARG3      Instrument
ARG2-ATR  Attribute
ARG2-LOC  Location
ARG2-GOL  Goal
ARG2-SOU  Source
ARGC      Causer
ARGA      Secondary causer

(numbered arguments; ARGC and ARGA are used for causatives)

Annotating Hindi PropBank

Label                Description
ARG0                 Agent
ARG1                 Patient, theme, undergoer
ARG2                 Beneficiary
ARG3                 Instrument
ARG2-ATR             Attribute
ARG2-LOC             Location
ARG2-GOL             Goal
ARG2-SOU             Source
ARGA                 Causer
ARGA-MNS             Intermediate causer
ARG0-GOL, ARG0-MNS   Causees
ARGM-VLV             Verb-verb construction
ARGM-PRX             Noun-verb construction

(numbered arguments; ARGA, ARGA-MNS, ARG0-GOL and ARG0-MNS are used for causatives; ARGM-VLV and ARGM-PRX for complex predicates)

Annotating Hindi PropBank

• Other modifier labels:

Label      Description   Label      Description
ARGM-ADV   adverb        ARGM-CAU   cause
ARGM-DIR   direction     ARGM-DIS   discourse
ARGM-EXT   extent        ARGM-LOC   location
ARGM-MNR   manner        ARGM-MNS   means
ARGM-MOD   modal         ARGM-NEG   negation
ARGM-PRP   purpose       ARGM-TMP   time

Outline

– DS: Dependency Structure ✓
– PB: PropBank (lexical predicate-argument structure) ✓
• Mapping between DS and PB
• Linguistic phenomena: Causative verbs

Hindi Dependency Treebank

• The DS tagset has labels that are in some ways fairly similar to PB
  – Verb-specific labels: k1-k5
  – Verb modifier labels: k7p, k7t, rh, etc.
  – Labels for non-dependencies, like pof and ccof, for complex predicates and co-ordination

Dependency structure and PropBank

• In ‘Ram gave the woman money’, the dependents of give are k1 (primary doer), k2 (patient) and k4 (recipient)

• These correspond fairly neatly to Arg0, Arg1 and Arg2

• Dependency labels and, to some extent, the tree structure are helpful for deriving PropBank annotations

Dependency structure and PropBank

• A mapping between the dependency karaka labels (HDT) and Hindi PropBank labels (HPB) is feasible

• Such a mapping would increase annotation speed, improve inter-annotator agreement and help in a full-fledged semantic role labeling task

Label comparison

• Using linguistic intuition, we can compare HDT labels with the numbered arguments in HPB

Label comparison

• Similarly, linguistic intuition gives us the mapping from HDT labels to HPB modifiers

Label comparison

• These mappings are included in the HPB frame files, for example, the verb ‘A: to come’

• Only for numbered arguments

Roleset  Usage            Rule
A.01     to come (path)   k1 → Arg1; k2p → Arg2-GOL
A.03     to arrive        k1 → Arg0; k2p → Arg2-GOL; k5 → Arg2-SOU
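A minimal sketch of how the usage rules in the table above could be stored and looked up; the dictionary layout and the function name are illustrative assumptions, not the actual HPB frame-file format.

# Illustrative only: per-roleset karaka -> PropBank mappings as plain data.
FRAME_RULES = {
    "A.01": {"k1": "Arg1", "k2p": "Arg2-GOL"},                    # 'to come (path)'
    "A.03": {"k1": "Arg0", "k2p": "Arg2-GOL", "k5": "Arg2-SOU"},  # 'to arrive'
}

def map_by_frame(roleset_id, drel):
    # Linguistically motivated rule: map a dependency label to a numbered
    # argument via the roleset's stored rule (None if no rule exists).
    return FRAME_RULES.get(roleset_id, {}).get(drel)

print(map_by_frame("A.03", "k5"))  # -> Arg2-SOU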

Automatic mapping of HDT to HPB

• A rule based, probabilistic system for automatic mapping

• We use two kinds of resources:
  – Annotated corpus [Treebank + PropBank]: 32,300 tokens, 2,005 predicates
  – Frame files with mapping rules

Argument classification

• We use two kinds of rules to carry out automatic mapping
  – Empirically derived rules
    • Using corpus statistics associated with dependency & PropBank labels
  – Linguistically motivated rules
    • Derived from linguistic intuition & captured in frame files

Linguistically motivated rules

• Helpful for predicates not seen in training data
• We use the mapping captured in the frame files
• Applied after empirically derived rules
• Limitation: available for numbered arguments only

Empirically derived rules

• These rules estimate the maximum likelihood of each PropBank label (pbrel) being associated with a feature tuple

Rule(id, v, drel) = argmax_i P(pbrel_i | id, v, drel)

Empirically derived rules

• The feature tuple consists of
  – id: predicate lemma OR predicate ID
  – v: voice (passive or active, given in HDT)
  – drel: dependency label

• For example, the tuple (xe ‘to give’, active, k1)

Example of the rules

Features                  Count   PropBank labels
xe.01_active_k1 ('give')  32      Arg0: 0.93, Arg1: 0.03, Arg2: 0.03
xe.01_active_k2           65      Arg1: 0.95, Arg2: 0.01, Arg0: 0.01
xe.01_active_k4           34      Arg2: 0.94, Arg0: 0.02

• Associate the probability of each PropBank label with a particular feature tuple
• We use only three features: roleset ID, voice, dependency label
• For the verb 'give', we get the correct mapping for its Hindi dependency labels
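A minimal sketch of how these empirically derived rules could be estimated from the annotated corpus and applied, with the frame-file mapping as back-off for unseen tuples. The function names, the instance format and the frame_rules dictionary (as in the earlier sketch) are illustrative assumptions, not the authors' implementation.

from collections import Counter, defaultdict

def learn_empirical_rules(instances):
    # instances: iterable of (roleset_id, voice, drel, pbrel) tuples taken
    # from the PropBank-annotated treebank (assumed format)
    counts = defaultdict(Counter)
    for roleset_id, voice, drel, pbrel in instances:
        counts[(roleset_id, voice, drel)][pbrel] += 1
    rules = {}
    for tup, label_counts in counts.items():
        label, n = label_counts.most_common(1)[0]   # argmax over pbrel
        rules[tup] = (label, n / sum(label_counts.values()))
    return rules

def classify(rules, frame_rules, roleset_id, voice, drel):
    # Empirical rule first; back off to the frame-file (linguistically
    # motivated) mapping for tuples unseen in training.
    hit = rules.get((roleset_id, voice, drel))
    if hit is not None:
        return hit[0]
    return frame_rules.get(roleset_id, {}).get(drel)

# With enough instances of ('xe.01', 'active', 'k1') tagged Arg0 in training,
# classify(...) returns 'Arg0', matching the 0.93 estimate in the table above.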

Evaluation

• 32,300 tokens of annotated data
• Ten-fold cross-validation for evaluation

Evaluation

• Empirically derived rules

Label      Dist. (%)  Precision  Recall  F1 score
ALL        100.00     90.59      47.92   62.69
ARG0       17.50      95.83      67.27   79.05
ARG1       27.28      94.47      61.62   74.59
ARG2       3.42       81.48      37.93   51.76
ARG2-ATR   2.54       94.55      40.31   56.52
ARG2-GOL   1.61       64.29      21.95   32.73
ARG2-SOU   0.83       78.26      42.86   55.38
ARGM-ADV   3.50       31.82      3.93    7.00
ARGM-CAU   1.44       50.00      5.48    9.88
ARGM-LOC   10.77      83.80      27.42   41.32
ARGM-TMP   7.01       74.63      14.04   23.64

Evaluation

                                 Precision  Recall  F1 score
Empirically derived rules        90.59      47.92   62.69
Linguistically motivated rules   89.80      55.28   68.44

Numbered argument accuracy
                                 Precision  Recall  F1 score
Empirically derived rules        93.63      58.76   72.21
Linguistically motivated rules   91.87      72.36   80.96

Evaluation

• Linguistically motivated rules improve the recall with a slight drop in the precision

• With the most frequent PropBank labels, empirically derived rules perform well

• More data should improve the performance for modifier arguments


Summary

• We demonstrate the similarities between Hindi dependency labels and PropBank semantic arguments

• Using three kinds of rules, we achieve a mapping with 93% confidence for almost half the data

• Initial experiments show that mapping reduces annotation time by 33%

Outline

• 3 Representations
  – DS: Dependency Structure ✔
  – PB: PropBank (lexical predicate-argument structure) ✔
• Mapping between DS and PB ✔
• Linguistic phenomena: Causative verbs

Linguistic issues: Causatives

• Syntactic/semantic phenomena ("what")
  – Relative clauses
  – Causatives
  – Complex predicates, etc.
• Representational issues ("how")
  – Dependency/phrase structure
• Lexical semantics

Causatives

• Direct causative: khaa + -aa → khilaa
• Indirect causative: khilaa + -vaa → khilvaa
• Adding the causative morpheme -aa: to cause someone to do X
• It is also possible to add another causative morpheme -vaa: to cause A to cause B to do X

Causatives

• Problem: Should the relation between causative verbs and their underlying forms be represented?

• Base form (khaa)
• Causativized forms (khilaa, khilvaa)

Causatives

• In PropBank frame files, we decided to represent this relation
  – The same frame file, with separate rolesets used for the base form and its causatives
  – Frame labels also represent this relation

Causatives

[raam ne]Arg0 [khaana]Arg1 khaayaa
Ram erg       food         eat-perf
'Ram ate the food'

Roleset id: KA.01 to eat

Arg0 eater

Arg1 the thing being eaten

Causatives

[raam ne]Arg0 [khaana]Arg1 khaayaa
Ram erg       food         eat-perf
'Ram ate the food'

[mohan ne]ArgA [raam ko]Arg0-GOL [khaana]Arg1 khilaayaa
Mohan erg      Ram dat           food         eat-caus-perf
'Mohan made Ram eat the food'

Roleset id: KA.01 to eat

Arg0 eater

Arg1 the thing being eaten

Roleset id: KilA.01 to feed

ArgA feeder

Arg0-GOL eater

Arg1 the thing being eaten

Causatives

[raam ne]Arg0 [khaana]Arg1 khaayaa
Ram erg       food         eat-perf
'Ram ate the food'

[mohan ne]ArgA [raam ko]Arg0-GOL [khaana]Arg1 khilaayaa
Mohan erg      Ram dat           food         eat-caus-perf
'Mohan made Ram eat the food'

[sita ne]ArgA [mohan se]ArgA-MNS [raam ko]Arg0-GOL [khaana]Arg1 khilvaayaa
Sita erg      Mohan instr        Ram acc           food         eat-ind.caus-perf
'Sita, through Mohan, made Ram eat the food'

Roleset id: KA.01 to eat

Arg0 eater

Arg1 the thing being eaten

Roleset id: KilA.01 to feed

ArgA feeder

Arg0-GOL eater

Arg1 the thing being eaten

Roleset id: KilvA.01 to cause to be fed

ArgA       causer of feeding

ArgA-MNS   feeder

Arg0-GOL   eater

Arg1       the thing eaten
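The frame-file decision above can be pictured as one entry holding the base verb and its causativized forms, each with its own roleset. The sketch below mirrors the KA.01 / KilA.01 / KilvA.01 rolesets from these slides; the Python layout is illustrative, not the actual frame-file format.

# Illustrative sketch only: one frame "file" for khaa 'eat' and its
# causativized forms khilaa 'feed' and khilvaa 'cause to be fed'.
KHAA_FRAME = {
    "KA.01": {                  # khaa, 'to eat'
        "Arg0": "eater",
        "Arg1": "the thing being eaten",
    },
    "KilA.01": {                # khilaa, direct causative, 'to feed'
        "ArgA": "feeder (causer)",
        "Arg0-GOL": "eater (causee)",
        "Arg1": "the thing being eaten",
    },
    "KilvA.01": {               # khilvaa, indirect causative, 'to cause to be fed'
        "ArgA": "causer of feeding",
        "ArgA-MNS": "feeder (intermediate causer)",
        "Arg0-GOL": "eater",
        "Arg1": "the thing eaten",
    },
}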

Causatives

[raam ne]Arg0 [ticket]Arg1 kharidaa
Ram erg       ticket       buy-perf
'Ram bought the ticket'

Roleset id: khariid.01 to buy

Arg0 buyer

Arg1 the thing being bought

Causatives

[raam ne]Arg0 [ticket]Arg1 kharidaa
Ram erg       ticket       buy-perf
'Ram bought the ticket'

[mohan ne]ArgA [raam se]Arg0-MNS [ticket]Arg1 khariidvaayaa
Mohan erg      Ram instr         ticket       buy-caus-perf
'Mohan made Ram buy the ticket'

Roleset id: khariid.01 to buy

Arg0 buyer

Arg1 the thing being bought

Roleset id: kharidvaa.01 to cause to buy

ArgA       causer of buying

Arg0-MNS   causee

Arg1       the thing being bought

Causatives

[raam ne]Arg0 [ticket]Arg1 kharidaa
Ram erg       ticket       buy-perf
'Ram bought the ticket'

[mohan ne]ArgA [raam se]Arg0-MNS [ticket]Arg1 khariidvaayaa
Mohan erg      Ram inst          ticket       buy-caus-perf
'Mohan made Ram buy the ticket'

[sita ne]ArgA [mohan dwara]ArgA-MNS [raam se]Arg0-MNS [ticket]Arg1 khariidvaayaa
Sita erg      Mohan by            Ram inst           ticket       buy-ind.caus-perf
'Sita, through Mohan, made Ram buy the ticket'

Roleset id: khariid.01 to buy

Arg0 buyer

Arg1 the thing being bought

Roleset id: kharidvaa.01 to cause to buy

ArgA       causer of buying

Arg0-MNS   causee

Arg1       the thing being bought

Representing causees

• The ARG0 label represents agents, but for certain causativized forms this is further split into:
  – ARG0-GOL: affected agent
  – ARG0-MNS: non-affected agent
  – ARGA: causer

• For any other intermediate causers: ARGA-MNS

Other linguistic issues

• Empty categories
• Representation of complex predicates
• Intransitive verbs: unaccusatives and unergatives
• Syntactic phenomena: small clauses, relative clauses, co-ordination

Questions?