Transition-based Semantic Role Labeling Using Predicate Argument Clustering
DESCRIPTION
This paper suggests two ways of improving semantic role labeling (SRL). First, we introduce a novel transition-based SRL algorithm that takes a quite different approach to SRL. Our algorithm is inspired by shift-reduce parsing and brings the advantages of the transition-based approach to SRL. Second, we present a self-learning clustering technique that effectively improves labeling accuracy in the test domain. For better generalization of the statistical models, we cluster verb predicates by comparing their predicate argument structures and apply the clustering information to the final labeling decisions. All approaches are evaluated on the CoNLL'09 English data. The new algorithm shows comparable results to another state-of-the-art system. The clustering technique improves labeling accuracy for both in-domain and out-of-domain tasks.
TRANSCRIPT
Transition-based Semantic Role Labeling Using Predicate Argument Clustering
Jinho D. Choi & Martha Palmer, University of Colorado at Boulder
June 23rd, 2011
Workshop on Relational Models of Semantics
Thursday, June 23, 2011
Dependency-based SRL
• Semantic role labeling
- Task of identifying arguments of each predicate and labeling them with semantic roles in relation to the predicate.
• Dependency-based semantic role labeling
- Advantages over constituent-based semantic role labeling.
• Dependency parsing is faster (2.29 milliseconds / sentence).
• Dependency structure is more similar to predicate argument structure.
- Labels headwords instead of phrases.
• Can still recover the original semantic chunks most of the time (Choi and Palmer, LAW 2010).
Dependency-based SRL
• Constituent-based vs. dependency-based SRL
[Figure: "He opened the door with his foot at ten" as a constituent tree (S, NP, VP, PP) and as a dependency tree, with semantic roles Agent, Theme, Instrument, and Temporal.]
Dependency-based SRL
• Constituent-based vs. dependency-based SRL
[Figure: dependency-based SRL of the same sentence: the syntactic dependencies of "opened" (SBJ, OBJ, ADV, TMP) beside its semantic arguments ARG0 (He), ARG1 (the door), ARG2 (with his foot), and TMP (at ten).]
Motivations
• Do argument identification and classification need to be in separate steps?
- They may require two different feature sets.
- Training them in a pipeline takes less time than as a joint-inference task.
- We have seen advantages of treating them as a joint-inference task in dependency parsing; why not in SRL?
Transition-based SRL
• Dependency parsing vs. dependency-based SRL
- Both try to find relations between word pairs.
- Dep-based SRL is a special kind of dep. parsing.
• It restricts the search only to top-down relations between predicate (head) and argument (dependent) pairs.
• It allows multiple predicates for each argument.
• Transition-based SRL algorithm
- Top-down, bidirectional search.
- Easier to develop a joint-inference system between dependency parsing and semantic role labeling.
→ More suitable for SRL
Transition-based SRL
• Parsing states
- (λ1, λ2, p, λ3, λ4, A)
- p - index of the current predicate candidate.
- λ1 - indices of lefthand-side argument candidates.
- λ4 - indices of righthand-side argument candidates.
- λ2, λ3 - indices of processed tokens.
- A - labeled arcs with semantic roles
• Initialization: ([ ], [ ], 1, [ ], [2, ..., n], ∅)
• Termination: (λ1, λ2, ∄, [ ], [ ], A)
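As a rough sketch, the state tuple with its initialization and termination conditions could look like this in Python (a hypothetical rendering with illustrative names, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass
class SRLState:
    """Parsing state (lambda1, lambda2, p, lambda3, lambda4, A) for a
    sentence of n tokens, indexed 1..n."""
    l1: list  # lefthand-side argument candidates (unprocessed)
    l2: list  # processed lefthand-side tokens
    p: int    # index of the current predicate candidate (None when exhausted)
    l3: list  # processed righthand-side tokens
    l4: list  # righthand-side argument candidates (unprocessed)
    A: set = field(default_factory=set)  # labeled arcs (pred, arg, role)

def initial_state(n):
    # Initialization: ([ ], [ ], 1, [ ], [2, ..., n], {})
    return SRLState([], [], 1, [], list(range(2, n + 1)))

def is_terminal(state):
    # Termination: no predicate candidate left, both candidate lists empty.
    return state.p is None and not state.l3 and not state.l4

s = initial_state(6)   # "John1 wants2 to3 buy4 a5 car6"
print(s.p, s.l4)       # 1 [2, 3, 4, 5, 6]
```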
Transition-based SRL
• Transitions
- No-Pred - finds the next predicate candidate.
- No-Arc← - rejects the lefthand-side argument candidate.
- No-Arc→ - rejects the righthand-side argument candidate.
- Left-Arc← - accepts the lefthand-side argument candidate.
- Right-Arc→ - accepts the righthand-side argument candidate.
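A schematic rendering of how these transitions might manipulate the candidate lists (hypothetical Python; the exact list operations are assumptions, and the Shift transition used in the worked example is omitted for brevity):

```python
def no_pred(p, n):
    # No-Pred: move on to the next predicate candidate (None when exhausted).
    return p + 1 if p < n else None

def left_arc(l1, l2, p, A, label):
    # Left-Arc: accept the nearest lefthand-side candidate as an argument.
    arg = l1.pop()            # rightmost element of lambda1 is closest to p
    l2.insert(0, arg)
    A.add((p, arg, label))

def no_arc_left(l1, l2):
    # No-Arc(left): reject the nearest lefthand-side candidate.
    l2.insert(0, l1.pop())

def right_arc(l3, l4, p, A, label):
    # Right-Arc: accept the nearest righthand-side candidate as an argument.
    arg = l4.pop(0)           # leftmost element of lambda4 is closest to p
    l3.append(arg)
    A.add((p, arg, label))

def no_arc_right(l3, l4):
    # No-Arc(right): reject the nearest righthand-side candidate.
    l3.append(l4.pop(0))

# "John1 wants2 ...": with p = 2, John (index 1) becomes A0 of "wants".
l1, l2, A = [1], [], set()
left_arc(l1, l2, 2, A, "A0")
print(A)   # {(2, 1, 'A0')}
```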
• Left-Arc : John ← wants
• No-Pred
• Right-Arc : wants → to
• No-Arc x 3
• Shift
• No-Arc x 2
• No-Pred
• Left-Arc : John ← buy
• No-Arc
• Right-Arc : buy → car
[Figure: step-by-step parsing states (λ1, λ2, λ3, λ4, A) for "John1 wants2 to3 buy4 a5 car6", yielding the arcs John ← wants (A0), wants → to (A1), John ← buy (A0), and buy → car (A1).]
Features
• Baseline features
- N-gram and binary features (similar to those in Johansson and Nugues, EMNLP 2008).
- Structural features.
[Figure: dependency subtree of "wants", with children John (PRP, SBJ) and to (TO, OPRD), and grandchild buy (VB, IM).]
- Subcategorization of "wants": SBJ ← VV → OPRD
- Path from "John" to "buy": PRP ↑ LCA ↓ TO ↓ VB (POS tags); SBJ ↑ LCA ↓ OPRD ↓ IM (dependency labels)
- Depth from "John" to "buy": 1 ↑ LCA ↓ 2
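As an illustration of the path and depth features, here is a hypothetical helper over a head-indexed dependency tree (the `^`/`|` encoding in place of the arrows, and all names, are assumptions for this sketch):

```python
def path_to_root(tok, heads):
    """Return the chain of tokens from tok up to the root (inclusive)."""
    chain = [tok]
    while heads[tok] is not None:
        tok = heads[tok]
        chain.append(tok)
    return chain

def path_feature(a, b, heads, tags):
    """POS-tag path between a and b through their lowest common ancestor,
    e.g. 'PRP^LCA|TO|VB' (^ for up-arrows, | for down-arrows)."""
    up, down = path_to_root(a, heads), path_to_root(b, heads)
    lca = next(t for t in up if t in set(down))
    ups = [tags[t] for t in up[:up.index(lca)]]        # from a up to the LCA
    downs = [tags[t] for t in down[:down.index(lca)]]  # from b up to the LCA
    # The depth feature is simply (len(ups), len(downs)), e.g. 1 ^ LCA | 2.
    return "^".join(ups) + "^LCA|" + "|".join(reversed(downs))

# "John1 wants2 to3 buy4": head indices and POS tags (root head = None).
heads = {1: 2, 2: None, 3: 2, 4: 3}
tags = {1: "PRP", 2: "VBZ", 3: "TO", 4: "VB"}
print(path_feature(1, 4, heads, tags))  # PRP^LCA|TO|VB
```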
Features
• Dynamic features
- Derived from previously identified arguments.
- The argument label previously assigned to w_arg.
- The label of the most recently predicted numbered argument of w_pred.
- These features can narrow down the scope of expected arguments of w_pred.
[Figure: "John1 wants2 to3 buy4 a5 car6" with A0 and A1 arcs for both "wants" and "buy".]
Experiments
• Corpora
- CoNLL’09 English data.
- In-domain task: the Wall Street Journal.
- Out-of-domain task: the Brown corpus.
• Input to our semantic role labeler
- Automatically generated dependency trees.
- Used our open-source dependency parser, ClearParser.
• Machine learning algorithm
- LIBLINEAR L2-regularization, L1-loss SVM.
Experiments
• Results
- AI - Argument Identification.
- AC - Argument Classification.
Task              In-domain              Out-of-domain
                  P      R      F1       P      R      F1
Baseline  AI      92.57  88.44  90.46    90.96  81.57  86.01
Baseline  AI+AC   87.20  83.31  85.21    77.11  69.14  72.91
+Dynamic  AI      92.38  88.76  90.54    90.90  82.25  86.36
+Dynamic  AI+AC   87.33  83.91  85.59    77.41  70.05  73.55
JN'08     AI+AC   88.46  83.55  85.93    77.67  69.63  73.43
Summary
• Introduced a transition-based SRL algorithm, showing near state-of-the-art results.
- No need to design separate systems for argument identification and classification.
- Makes it easier to develop a joint-inference system between dependency parsing and semantic role labeling.
• Future work
- Several techniques designed to improve transition-based parsing can be applied (e.g., dynamic programming, k-best ranking).
- We can apply more features, such as clustering information, to improve labeling accuracy.
Predicate Argument Clustering
• Verb clusters can give more generalization to the statistical models.
- Previous work: clustering verbs using bag-of-words or syntactic structure.
- Our approach: clustering verbs using predicate argument structure.
• Self-learning clustering
- Cluster verbs in the test data using automatically generated predicate argument structures.
- Cluster verbs in the training data using the verb clusters found in the test data as seeds.
- Re-run our semantic role labeler on the test data using the clustering information.
Predicate Argument Clustering
Verb   A0   A1   john:A0   to:A1   car:A1   ...
want   1    1    1         1       0        0s
buy    1    1    1         0       1        0s
Table 2 shows parsing states generated by our algorithm. Our experiments show that this algorithm gives comparable results against another state-of-the-art system.
3 Predicate argument clustering
Some studies showed that verb clustering information could improve performance in semantic role labeling (Gildea and Jurafsky, 2002; Pradhan et al., 2008). This is because semantic role labelers usually perform worse on verbs not seen during training, for which the clustering information can provide useful features. Most previous studies used either bag-of-words or syntactic structure to cluster verbs; however, this may or may not capture the nature of predicate argument structure, which is more semantically oriented. Thus, it is preferable to cluster verbs by their predicate argument structures to get optimized features for semantic role labeling.
In this section, we present a self-learning clustering technique that effectively improves labeling accuracy in the test domain. First, we perform semantic role labeling on the test data using the algorithm in Section 2. Next, we cluster verbs in the test data using predicate argument structures generated by our semantic role labeler (Section 3.2). Then, we cluster verbs in the training data using the verb clusters we found in the test data (Section 3.3). Finally, we re-run our semantic role labeler on the test data using the clustering information. Our experiments show that this technique gives improvement to labeling accuracy for both in-domain and out-of-domain tasks.
3.1 Projecting predicate argument structure into vector space
Before clustering, we need to project the predicate argument structure of each verb into vector space. Two kinds of features are used to represent these vectors: semantic role labels, and joined tags of semantic role labels and their corresponding word lemmas. Figure 2 shows vector representations of the predicate argument structures of the verbs want and buy in Figure 1.
Figure 2: Projecting the predicate argument structure of each verb into vector space.

Verb   A0   A1   john:A0   to:A1   car:A1   ...
want   1    1    1         1       0        0s
buy    1    1    1         0       1        0s

Initially, all existing and non-existing features are assigned a value of 1 and 0, respectively. However, assigning equal values to all existing features is not necessarily fair because some features have higher confidence, or are more important, than the others; e.g., ARG0 and ARG1 are generally predicted with higher confidence than modifiers, nouns give more important information than some other grammatical categories, etc. Instead, we assign each existing feature a value computed by the following equations:
\[ s(l_j \mid v_i) = \frac{1}{1 + \exp(-\mathrm{score}(l_j \mid v_i))} \]

\[ s(m_j, l_j) = \begin{cases} 1 & (w_j \neq \text{noun}) \\ \exp\!\left(\dfrac{\mathrm{count}(m_j, l_j)}{\sum_{\forall k} \mathrm{count}(m_k, l_k)}\right) & \text{otherwise} \end{cases} \]
v_i is the current verb, l_j is the j'th label of v_i, and m_j is l_j's corresponding lemma. score(l_j|v_i) is a score of l_j being a correct argument label of v_i; this is always 1 for training data and is provided by our statistical models for test data. Thus, s(l_j|v_i) is an approximated probability of l_j being a correct argument label of v_i, estimated by the logistic function. s(m_j, l_j) is equal to 1 if w_j is not a noun. If w_j is a noun, it gets a value ≥ 1 given the maximum likelihood of m_j co-occurring with l_j.²
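The two weighting equations can be reproduced numerically as follows (a minimal sketch under the definitions above; the counts are invented for illustration):

```python
import math

def s_label(score):
    # s(l_j | v_i): logistic of the classifier score; a gold label
    # (training data, score -> large) approaches 1, an uncertain
    # prediction stays near 0.5.
    return 1.0 / (1.0 + math.exp(-score))

def s_lemma(count_ml, counts_all, is_noun):
    # s(m_j, l_j): 1 for non-nouns; for nouns, exp of the maximum-
    # likelihood estimate of lemma m_j co-occurring with label l_j,
    # which is always a value >= 1.
    if not is_noun:
        return 1.0
    return math.exp(count_ml / sum(counts_all))

print(round(s_label(0.0), 2))     # 0.5
print(s_lemma(3, [3, 1], False))  # 1.0  (non-noun)
print(round(s_lemma(3, [3, 1], True), 3))  # exp(0.75) ~ 2.117
```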
With this vector representation, we can apply any kind of clustering algorithm (Hofmann and Puzicha, 1998; Kamvar et al., 2002). For our experiments, we use k-best hierarchical clustering for the test data, and k-means clustering for the training data.
3.2 Clustering verbs in test data

Given automatically generated predicate argument structures in the test data, we apply k-best hierarchical clustering, a relaxation of classical hierarchical agglomerative clustering (from now on, HAC; Ward (1963)), to find verb clusters. Unlike HAC, which merges one pair of clusters at each iteration, k-best hierarchical clustering merges the k best pairs at each iteration.
² Assigning different weights for nouns resulted in more meaningful clusters in our experiments. We will explore additional grammatical-category-specific weighting schemes in future work.
• Vector representation
- Semantic role labels; semantic role labels + word lemmas.
- score(l_j | v_i): score of l_j being a label of v_i.
- count(m_j, l_j) / Σ_k count(m_k, l_k): maximum likelihood of m_j co-occurring with l_j.
Predicate Argument Clustering
• Clustering verbs in the test data
- K-best hierarchical agglomerative clustering.
• Merges k-best pairs at each iteration.
• Uses a threshold to dynamically determine the top k clusters.
- We set another threshold for early break-out.
• Clustering verbs in the training data
- K-means clustering.
• Starts with centroids estimated from the clusters found in the test data.
• Uses a threshold to filter out verbs not close enough to any cluster.
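The training-data step could be sketched as a seeded k-means with a similarity threshold (pure-Python cosine similarity; the threshold value, vectors, and function names are illustrative assumptions, not the authors' settings):

```python
def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def seeded_kmeans(verbs, centroids, threshold=0.5, iters=10):
    """Assign each training verb vector to the nearest seed centroid
    (estimated from the test-data clusters); drop verbs whose best
    similarity falls below the threshold."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for name, vec in verbs.items():
            sims = [cosine(vec, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= threshold:   # filter verbs not close enough
                clusters[best].append((name, vec))
        # Re-estimate each centroid as the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                dim = len(centroids[i])
                centroids[i] = [sum(v[d] for _, v in members) / len(members)
                                for d in range(dim)]
    return [[name for name, _ in members] for members in clusters]

verbs = {"want": [1, 1, 1, 1, 0], "buy": [1, 1, 1, 0, 1], "rain": [0, 0, 0, 0, 1]}
seeds = [[1, 1, 1, 0.5, 0.5]]
print(seeded_kmeans(verbs, seeds))   # want and buy cluster; rain is filtered out
```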
Experiments
• Results
Task              In-domain              Out-of-domain
                  P      R      F1       P      R      F1
Baseline  AI      92.57  88.44  90.46    90.96  81.57  86.01
Baseline  AI+AC   87.20  83.31  85.21    77.11  69.14  72.91
+Dynamic  AI      92.38  88.76  90.54    90.90  82.25  86.36
+Dynamic  AI+AC   87.33  83.91  85.59    77.41  70.05  73.55
+Cluster  AI      92.62  88.90  90.72    90.87  82.43  86.44
+Cluster  AI+AC   87.43  83.92  85.64    77.47  70.28  73.70
JN'08     AI+AC   88.46  83.55  85.93    77.67  69.63  73.43
Conclusion
• Introduced a self-learning clustering technique with the potential for improving labeling accuracy in a new domain.
- Needs to be tried with larger-scale data to see a clear impact of the clustering.
- Can also be improved by using different features or clustering algorithms.
• ClearParser open-source project
- http://code.google.com/p/clearparser/
Acknowledgements
• We gratefully acknowledge the support of National Science Foundation Grant CISE-IIS-RI-0910992, Richer Representations for Machine Translation; a subcontract from the Mayo Clinic and Harvard Children's Hospital based on a grant from the ONC, 90TR0002/01, Strategic Health Advanced Research Project Area 4: Natural Language Processing; and a grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.