Machine Learning 2
Inductive Dependency Parsing
Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Växjö University, School of Mathematics and Systems Engineering
Inductive Dependency Parsing
• Dependency-based representations …
  – have restricted expressivity but provide a transparent encoding of semantic structure.
  – have restricted complexity in parsing.
• Inductive machine learning …
  – is necessary for accurate disambiguation.
  – is beneficial for robustness.
  – makes (formal) grammars superfluous.
Dependency Graph
[Figure: dependency graph for the sentence "Economic news had little effect on financial markets ." (tokens 1–9, with 0 as the artificial root), PoS-tagged JJ NN VBD JJ NN IN JJ NNS ., with arcs labeled ROOT, SBJ, OBJ, PMOD, NMOD, and P]
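Such a graph can be encoded by giving every token a head index and a dependency type. A minimal sketch in Python, with the arc assignment reconstructed from the figure:

    # Dependency graph as parallel arrays: heads[i] is the head of token i,
    # labels[i] its dependency type; index 0 is the artificial root.
    words  = ["<root>", "Economic", "news", "had", "little", "effect",
              "on", "financial", "markets", "."]
    tags   = ["ROOT", "JJ", "NN", "VBD", "JJ", "NN", "IN", "JJ", "NNS", "."]
    heads  = [None, 2, 3, 0, 5, 3, 5, 8, 6, 3]     # reconstructed from the figure
    labels = [None, "NMOD", "SBJ", "ROOT", "NMOD",
              "OBJ", "NMOD", "NMOD", "PMOD", "P"]

    for i in range(1, len(words)):
        print(f"{labels[i]}({words[heads[i]]}-{heads[i]}, {words[i]}-{i})")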
Key Ideas
• Deterministic:
  – Deterministic algorithms for building dependency graphs (see the transition sketch after this list)
    (Kudo and Matsumoto 2002, Yamada and Matsumoto 2003, Nivre 2003)
• History-based:
  – History-based models for predicting the next parser action
    (Black et al. 1992, Magerman 1995, Ratnaparkhi 1997, Collins 1997)
• Discriminative:
  – Discriminative machine learning to map histories to actions
    (Veenstra and Daelemans 2000, Kudo and Matsumoto 2002, Yamada and Matsumoto 2003, Nivre et al. 2004)
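The deterministic algorithm of Nivre (2003) builds the graph in a single left-to-right pass over the input, using four transitions over a configuration consisting of a stack, an input buffer, and a set of arcs. A minimal sketch of the arc-eager transitions; tokens are integers and the representation details are illustrative rather than MaltParser's actual code:

    # Arc-eager transitions (Nivre 2003). A configuration is (stack, buffer, arcs);
    # arcs are (head, label, dependent) triples and token 0 is the artificial root.

    def left_arc(stack, buffer, arcs, label):
        # Precondition: stack top is not the root and has no head yet.
        # The next input token becomes the head of the stack top, which is popped.
        arcs.append((buffer[0], label, stack.pop()))

    def right_arc(stack, buffer, arcs, label):
        # The stack top becomes the head of the next input token,
        # which is then pushed onto the stack.
        arcs.append((stack[-1], label, buffer[0]))
        stack.append(buffer.pop(0))

    def reduce(stack, buffer, arcs):
        # Precondition: stack top already has a head. Pop it.
        stack.pop()

    def shift(stack, buffer, arcs):
        # Push the next input token without adding an arc.
        stack.append(buffer.pop(0))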
Guided Parsing
• Deterministic parsing:
  – Greedy algorithm for disambiguation
  – Optimal strategy given an oracle
• Guided deterministic parsing:
  – Guide = approximation of the oracle
  – Desiderata:
    • High prediction accuracy
    • Efficient implementation (constant time per prediction)
  – Solution:
    • Discriminative classifier induced from treebank data (see the sketch below)
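Putting the pieces together, the guide replaces the oracle in an otherwise unchanged greedy loop: one classifier call per transition, no backtracking. A minimal sketch, assuming the transitions above; guide.predict, extract_features, and apply_action are hypothetical interfaces:

    def parse(tokens, guide, extract_features, apply_action):
        """Greedy guided parsing: commit to the classifier's prediction at each step."""
        stack, buffer, arcs = [0], list(range(1, len(tokens) + 1)), []
        while buffer:
            state = (stack, buffer, arcs)
            action = guide.predict(extract_features(state))  # approximates the oracle
            apply_action(action, stack, buffer, arcs)        # one of the four transitions
        return arcs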
Learning
• Classification problem (S → T):
  – Parser states: S = { s | s = (f1, …, fp) }
  – Parser actions: T = { t1, …, tm }
• Training data (constructed as sketched below):
  – D = { (si−1, ti) | ti(si−1) = si in a gold standard derivation s1, …, sn }
• Learning methods:
  – Memory-based learning
  – Support vector machines
  – Maximum entropy modeling
  – …
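Each training instance pairs the feature representation of a state with the transition taken from it in the gold standard derivation. A minimal sketch, assuming an oracle function that reads the correct transition off the gold arcs (helper names are illustrative):

    def training_instances(treebank, oracle, extract_features, apply_action):
        """Build D = { (s_{i-1}, t_i) }: one (features, action) pair per transition."""
        D = []
        for tokens, gold_arcs in treebank:
            stack, buffer, arcs = [0], list(range(1, len(tokens) + 1)), []
            while buffer:
                state = (stack, buffer, arcs)
                t = oracle(state, gold_arcs)          # correct action in this state
                D.append((extract_features(state), t))
                apply_action(t, stack, buffer, arcs)  # follow the gold derivation
        return D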
Feature Models
• Model P: PoS: t1, top, next, n1, n2
• Model D: P + DepTypes: top.hd, top.ld, top.rd, next.ld
• Model L2: D + Words: top, next
• Model L4: L2 + Words: top.hd, n1
[Figure: parser configuration with stack tokens (…, t1, top) and input tokens (next, n1, n2, n3), and the head (hd), leftmost dependent (ld), and rightmost dependent (rd) relations addressed by the feature models; extraction is sketched below]
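A feature model is a list of addresses into the current configuration, each yielding a PoS tag, dependency type, or word form. A minimal sketch of extraction for model P; the richer models add dependency-type and word features at the addresses listed above, and the token representation (and reading of t1 as the token below the stack top) is an assumption:

    def pos(tokens, i):
        """PoS tag at token index i, or NIL for undefined addresses."""
        return tokens[i]["tag"] if i is not None and 0 <= i < len(tokens) else "NIL"

    def features_P(tokens, stack, buffer):
        """Model P: PoS of t1, top, next, n1, n2 (NIL where undefined)."""
        t1  = stack[-2] if len(stack) > 1 else None   # assumed: token below stack top
        top = stack[-1] if stack else None
        nxt = buffer[0] if len(buffer) > 0 else None  # next input token
        n1  = buffer[1] if len(buffer) > 1 else None
        n2  = buffer[2] if len(buffer) > 2 else None
        return [pos(tokens, i) for i in (t1, top, nxt, n1, n2)]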
Experimental Results (MBL)
Model   Swedish                      English
        ASU   ASL   EMU   EML        ASU   ASL   EMU   EML
P       77.4  70.1  26.6  17.8       79.0  76.1  14.4  10.0
D       82.5  75.1  33.5  22.2       83.4  80.5  21.9  17.0
L2      85.6  81.5  39.1  30.2       86.6  84.8  29.9  26.2
L4      85.9  81.6  39.8  30.4       87.3  85.6  31.1  27.7

(AS = attachment score, EM = exact match; U = unlabeled, L = labeled; computed as sketched below)
• Results:
  – Dependency features help
  – Lexicalisation helps …
  – … up to a point (?)
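For concreteness, the two metrics can be computed per token (AS) and per sentence (EM). A minimal sketch, assuming gold and predicted graphs as parallel lists of (head, label) pairs:

    def as_and_em(gold, pred, labeled=True):
        """AS = % of tokens with the correct head (and label, if labeled);
        EM = % of sentences whose whole graph is correct."""
        correct = total = exact = 0
        for g_sent, p_sent in zip(gold, pred):   # one [(head, label), ...] per sentence
            ok = [gh == ph and (not labeled or gl == pl)
                  for (gh, gl), (ph, pl) in zip(g_sent, p_sent)]
            correct += sum(ok)
            total += len(ok)
            exact += all(ok)
        return 100 * correct / total, 100 * exact / len(gold)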
Parameter Optimization
Model = L4 + PoS of n3

                                      Swedish          English
Parameter                             Manual  Param    Manual  Param
Number of neighbors (-k)              5       11       7       19
Distance metric (-m)                  MVDM    MVDM     MVDM    MVDM
Switching threshold (-L)              3       5        5       2
Feature weighting (-w)                None    GR       None    GR
Distance-weighted class voting (-d)   ID      IL       ID      IL
Unlabeled attachment score (ASU)      86.2    86.0     87.7    86.8
Labeled attachment score (ASL)        81.9    82.0     85.9    84.9
• Learning algorithm parameter optimization:
  – Manual (Nivre 2005) vs. paramsearch (van den Bosch 2003); a naive grid search is sketched below
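As an illustration only (paramsearch itself works differently, using progressively larger data samples rather than exhaustive search), a naive grid search over the TiMBL parameters from the table might look like this; train_and_score is a hypothetical wrapper around a TiMBL run:

    from itertools import product

    def grid_search(train, dev, train_and_score, grid):
        """Exhaustive search: return the setting with the best dev accuracy."""
        best, best_score = None, -1.0
        for values in product(*grid.values()):
            setting = dict(zip(grid, values))
            score = train_and_score(train, dev, setting)  # e.g. wraps a TiMBL run
            if score > best_score:
                best, best_score = setting, score
        return best, best_score

    # Parameter values drawn from the table above (illustrative grid):
    grid = {"-k": [5, 7, 11, 19], "-m": ["MVDM"], "-L": [2, 3, 5],
            "-w": ["None", "GR"], "-d": ["ID", "IL"]}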
Learning Curves
[Two plots: attachment score (65–90) against number of training sections (1–10 for Swedish, 1–8 for English), with one curve each for models D and L2, unlabeled (U) and labeled (L)]

• Swedish:
  – Attachment score (U/L)
  – Models: D, L2
  – 10K tokens/section
• English:
  – Attachment score (U/L)
  – Models: D, L2
  – 100K tokens/section
Dependency Types: Swedish
• High accuracy (labeled F ≥ 84%):
  – IM (marker → infinitive) 98.5%
  – PR (preposition → noun) 90.6%
  – UK (complementizer → verb) 86.4%
  – VC (auxiliary verb → main verb) 86.1%
  – DET (noun → determiner) 89.5%
  – ROOT 87.8%
  – SUB (verb → subject) 84.5%
• Medium accuracy (76% ≤ labeled F ≤ 80%):
  – ATT (noun modifier) 79.2%
  – CC (coordination) 78.9%
  – OBJ (verb → object) 77.7%
  – PRD (verb → predicative) 76.8%
  – ADV (adverbial) 76.3%
• Low accuracy (labeled F ≤ 70%):
  – INF, APP, XX, ID
Dependency Types: English
• High accuracy (labeled F ≥ 86%):
  – VC (auxiliary verb → main verb) 95.0%
  – NMOD (noun modifier) 91.0%
  – SBJ (verb → subject) 89.3%
  – PMOD (preposition modifier) 88.6%
  – SBAR (complementizer → verb) 86.1%
• Medium accuracy (73% ≤ labeled F ≤ 83%):
  – ROOT 82.4%
  – OBJ (verb → object) 81.1%
  – VMOD (verb modifier) 76.8%
  – AMOD (adjective/adverb modifier) 76.7%
  – PRD (predicative) 73.8%
• Low accuracy (labeled F ≤ 70%):
  – DEP (null label)
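The per-type figures on these two slides are labeled F scores. A minimal sketch of computing them from gold and predicted (head, label) token annotations; an arc counts as correct for a label only if both head and label match:

    from collections import Counter

    def per_type_f(gold, pred):
        """Labeled F per dependency type from (head, label) pairs per token."""
        tp, gold_n, pred_n = Counter(), Counter(), Counter()
        for (gh, gl), (ph, pl) in zip(gold, pred):
            gold_n[gl] += 1
            pred_n[pl] += 1
            if gh == ph and gl == pl:
                tp[gl] += 1
        scores = {}
        for lab in (gold_n | pred_n):
            p = tp[lab] / pred_n[lab] if pred_n[lab] else 0.0
            r = tp[lab] / gold_n[lab] if gold_n[lab] else 0.0
            scores[lab] = 2 * p * r / (p + r) if p + r else 0.0
        return scores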
MaltParser
• Software for inductive dependency parsing:
  – Freely available for research and education
    (http://www.msi.vxu.se/users/nivre/research/MaltParser.html)
• Version 0.3:
  – Parsing algorithms:
    • Nivre (2003) (arc-eager, arc-standard)
    • Covington (2001) (projective, non-projective)
  – Learning algorithms:
    • MBL (TiMBL)
    • SVM (LIBSVM)
  – Feature models:
    • Arbitrary combinations of part-of-speech features, dependency type features, and lexical features
• Auxiliary tools:
  – MaltEval
  – MaltConverter
  – Proj
CoNLL-X Shared Task
Language     #Tokens   #DTypes   ASU    ASL
Japanese     150K      8         92.2   90.3
English*     1000K     12        89.7   88.3
Bulgarian    200K      19        88.0   82.5
Chinese      350K      134       88.0   82.2
Swedish      200K      64        87.9   81.3
Danish       100K      53        86.9   82.0
Portuguese   200K      55        86.0   81.5
German       700K      46        85.0   82.0
Italian*     40K       17        82.9   75.7
Czech        1250K     82        80.1   72.8
Spanish      90K       21        79.0   74.3
Dutch        200K      26        76.0   71.7
Arabic       50K       27        74.0   61.7
Turkish      60K       26        73.8   63.0
Slovene      30K       26        73.3   62.2
Possible Projects
• CoNLL Shared Task:
  – Work on one or more languages
  – With or without MaltParser
  – Data sets available
• Parsing spoken language:
  – Talbanken05: Swedish treebank with written and spoken data; cross-training experiments
  – GSLC: 1.2M-word corpus of spoken Swedish