How much do word embeddings encode about syntax?
Jacob Andreas and Dan Klein, UC Berkeley
TRANSCRIPT
Everybody loves word embeddings
[Figure: 2D projection of an embedding space in which function words such as "few", "most", "that", "the", "a", "each", "this", "every" cluster together]
[Collobert 2011, Mikolov 2013, Freitag 2004, Schuetze 1995, Turian 2010]
What might embeddings bring?
Cathleen complained about the magazine’s shoddy editorial quality.
[Figure: in embedding space, the unseen word "Cathleen" lies near "Mary", and "editorial" lies near "executive" and "average"]
Three hypotheses
Vocabulary expansion (good for OOV words)
Statistic pooling (good for medium-frequency words)
Embedding structure (good for features)
[Figure: one illustration per hypothesis: "Cathleen" mapped to "Mary"; "editorial" pooled with "average" and "executive"; embedding directions labeled "transitivity" and "tense"]
Vocabulary expansion:
Embeddings help handling of out-of-vocabulary words
Vocabulary expansion
[Figure: embedding space containing the training words "John", "Mary", "Pierre", "yellow", "enormous", "hungry"; the unseen word "Cathleen" falls near "Mary"]
Vocabulary expansion
Cathleen complained about the magazine’s shoddy editorial quality.
[Figure: before parsing, the OOV word "Cathleen" is replaced by its nearest in-vocabulary embedding neighbor, "Mary"]
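This substitution step is easy to picture in code. Below is a minimal sketch of the idea, assuming a dictionary of embedding vectors and a set of training-vocabulary words; the function names and data structures are illustrative, not the parser's actual implementation.

import numpy as np

def nearest_in_vocab(word, embeddings, train_vocab):
    # Find the in-vocabulary word whose embedding has the highest
    # cosine similarity to the OOV word's embedding.
    if word not in embeddings:
        return word  # no embedding either: fall back to the parser's own OOV model
    v = embeddings[word] / np.linalg.norm(embeddings[word])
    best, best_sim = word, -1.0
    for candidate in train_vocab:
        if candidate in embeddings:
            u = embeddings[candidate]
            sim = float(np.dot(v, u) / np.linalg.norm(u))
            if sim > best_sim:
                best, best_sim = candidate, sim
    return best

def expand_vocabulary(sentence, embeddings, train_vocab):
    # Replace each OOV token (e.g. "Cathleen") with a seen token (e.g. "Mary").
    return [w if w in train_vocab else nearest_in_vocab(w, embeddings, train_vocab)
            for w in sentence]

After this preprocessing, the parser only ever sees words it has training statistics for; everything downstream is unchanged.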
Vocab. expansion results
[Bar chart, full training set: Baseline 91.13, +OOV 91.22]
Vocab. expansion results (300 sentences)
[Bar chart: Baseline 71.88, +OOV 72.20]
Statistic pooling hypothesis:
Embeddings help handling of medium-frequency words
Statistic pooling
[Figure: the words "executive", "kind", "giant", "editorial", "average" shown in embedding space with their observed tag sets ({NN, JJ}, {NN}, {NN, JJ}, {JJ}, {NN}), some of which are incomplete]
Statistic pooling
[Figure: after pooling tag statistics across embedding neighbors, every word in the cluster is credited with the full tag set {NN, JJ}]
Statistic pooling
[Figure: the observed tag sets again; pooled counts let the parser score lexical emissions such as "editorial → NN" even for words with sparse statistics]
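A minimal sketch of this pooling step follows, assuming a word-to-tag-count table and a hypothetical knn helper that returns a word's nearest embedding neighbors; the smoothing weight alpha is an illustrative choice, not a value from the talk.

from collections import Counter

def pooled_tag_counts(word, tag_counts, knn, k=5, alpha=0.1):
    # tag_counts: dict mapping each word to a Counter of observed POS tags.
    # knn(word, k): hypothetical helper returning k embedding-space neighbors.
    pooled = Counter(tag_counts.get(word, Counter()))
    for neighbor in knn(word, k):
        for tag, count in tag_counts.get(neighbor, Counter()).items():
            # Add a downweighted share of each neighbor's tag counts, so
            # "editorial" (seen only as NN) picks up JJ mass from "average".
            pooled[tag] += alpha * count
    return pooled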
Statistic pooling results
[Bar chart, full training set: Baseline 91.13, +Pooling 91.11]
Statistic pooling results (300 sentences)
[Bar chart: Baseline 71.88, +Pooling 72.21]
Embedding structure hypothesis:
The organization of the embedding space directly encodes useful features
Embedding structure
[Figure: the verbs "vanished", "vanishing", "dined", "dining", "devoured", "devouring", "assassinated", "assassinating" arranged in embedding space along a "transitivity" direction and a "tense" direction; coordinates along these directions serve as features for emissions like "dined → VBD"]
[Huang 2011]
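Here is a minimal sketch of how such coordinates could be exposed as lexical features, assuming dense embedding vectors; the feature-naming scheme and the number of dimensions used are illustrative, not the parser's actual feature templates.

def embedding_features(word, tag, embeddings, dims=10):
    # Conjoin each of the first `dims` embedding coordinates with the
    # candidate tag, producing real-valued features like ("dim3&VBD", 0.42).
    # If directions in the space track properties such as tense or
    # transitivity, a linear model can pick them up from these features.
    feats = {}
    vec = embeddings.get(word)
    if vec is None:
        return feats
    for i in range(min(dims, len(vec))):
        feats["dim%d&%s" % (i, tag)] = float(vec[i])
    return feats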
Embedding structure results
[Bar chart, full training set: Baseline 91.13, +Features 91.08]
Embedding structure results (300 sentences)
[Bar chart: Baseline 71.88, +Features 70.32]
To summarize
[Bar chart, 300 sentences: Baseline 71.88, +OOV 72.20, +Pooling 72.21, +Features 70.32]
Combined results
[Bar chart: Baseline 90.70, +OOV+Pooling 90.11]
Combined results (300 sentences)
[Bar chart: Baseline 71.88, +OOV+Pooling 72.21]
What about…
• Domain adaptation? (no significant gain)
• French? (no significant gain)
• Other kinds of embeddings? (no significant gain)
Why didn’t it work?
• Context clues often provide enough information to reason around words with incomplete or incorrect statistics
• The parser already has robust OOV and small-count models
• Sometimes “help” from embeddings is worse than nothing, e.g. unhelpful embedding neighbors: bifurcate, Soap, homered, Paschi, tuning, unrecognized
What about other parsers?
• Dependency parsers (continuous repr. as syntactic abstraction)
• Neural networks (continuous repr. as structural requirement)
[Henderson 2004, Socher 2013, Koo 2008, Bansal 2014]
Conclusion
• Embeddings provide no apparent benefit to a state-of-the-art parser for:
– OOV handling
– Parameter pooling
– Lexicon features
• Code online at http://cs.berkeley.edu/~jda