How much do word embeddings encode about syntax?
Jacob Andreas and Dan Klein, UC Berkeley
TRANSCRIPT
Everybody loves word embeddings
[Figure: 2D projection of an embedding space in which function words such as "few", "most", "that", "the", "a", "each", "this", "every" cluster together]
[Collobert 2011, Mikolov 2013, Freitag 2004, Schuetze 1995, Turian 2010]
What might embeddings bring?
Cathleen complained about the magazine’s shoddy editorial quality.
[Figure: in embedding space, the unseen word "Cathleen" lies near "Mary", and "editorial" lies near "executive" and "average"]
Three hypotheses
Vocabulary expansion (good for OOV words)
Statistic pooling (good for medium-frequency words)
Embedding structure (good for features)
[Figure: one illustration per hypothesis: "Cathleen" mapped to "Mary"; "editorial" pooled with "average" and "executive"; embedding directions labeled "transitivity" and "tense"]
Vocabulary expansion:
Embeddings help handling of out-of-vocabulary words
Vocabulary expansion
[Figure: embedding space containing the training words "John", "Mary", "Pierre", "yellow", "enormous", "hungry"; the unseen word "Cathleen" falls near "Mary"]
Vocabulary expansion
Cathleen complained about the magazine’s shoddy editorial quality.
[Figure: before parsing, the OOV word "Cathleen" is replaced by its nearest in-vocabulary embedding neighbor, "Mary"]
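This substitution step is easy to picture in code. Below is a minimal sketch of the idea, assuming a dictionary of embedding vectors and a set of training-vocabulary words; the function names and data structures are illustrative, not the parser's actual implementation.

import numpy as np

def nearest_in_vocab(word, embeddings, train_vocab):
    # Find the in-vocabulary word whose embedding has the highest
    # cosine similarity to the OOV word's embedding.
    if word not in embeddings:
        return word  # no embedding either: fall back to the parser's own OOV model
    v = embeddings[word] / np.linalg.norm(embeddings[word])
    best, best_sim = word, -1.0
    for candidate in train_vocab:
        if candidate in embeddings:
            u = embeddings[candidate]
            sim = float(np.dot(v, u) / np.linalg.norm(u))
            if sim > best_sim:
                best, best_sim = candidate, sim
    return best

def expand_vocabulary(sentence, embeddings, train_vocab):
    # Replace each OOV token (e.g. "Cathleen") with a seen token (e.g. "Mary").
    return [w if w in train_vocab else nearest_in_vocab(w, embeddings, train_vocab)
            for w in sentence]

After this preprocessing, the parser only ever sees words it has training statistics for; everything downstream is unchanged.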
Vocab. expansion results
[Bar chart, full training set: Baseline 91.13, +OOV 91.22]
Vocab. expansion results (300 sentences)
[Bar chart: Baseline 71.88, +OOV 72.20]
Statistic pooling hypothesis:
Embeddings help handling of medium-frequency words
Statistic pooling
[Figure: the words "executive", "kind", "giant", "editorial", "average" shown in embedding space with their observed tag sets ({NN, JJ}, {NN}, {NN, JJ}, {JJ}, {NN}), some of which are incomplete]
Statistic pooling
[Figure: after pooling tag statistics across embedding neighbors, every word in the cluster is credited with the full tag set {NN, JJ}]
Statistic pooling
[Figure: the observed tag sets again; pooled counts let the parser score lexical emissions such as "editorial → NN" even for words with sparse statistics]
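A minimal sketch of this pooling step follows, assuming a word-to-tag-count table and a hypothetical knn helper that returns a word's nearest embedding neighbors; the smoothing weight alpha is an illustrative choice, not a value from the talk.

from collections import Counter

def pooled_tag_counts(word, tag_counts, knn, k=5, alpha=0.1):
    # tag_counts: dict mapping each word to a Counter of observed POS tags.
    # knn(word, k): hypothetical helper returning k embedding-space neighbors.
    pooled = Counter(tag_counts.get(word, Counter()))
    for neighbor in knn(word, k):
        for tag, count in tag_counts.get(neighbor, Counter()).items():
            # Add a downweighted share of each neighbor's tag counts, so
            # "editorial" (seen only as NN) picks up JJ mass from "average".
            pooled[tag] += alpha * count
    return pooled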
Statistic pooling results
[Bar chart, full training set: Baseline 91.13, +Pooling 91.11]
Statistic pooling results (300 sentences)
[Bar chart: Baseline 71.88, +Pooling 72.21]
Embedding structure hypothesis:
The organization of the embedding space directly encodes useful features
Embedding structure
[Figure: the verbs "vanished", "vanishing", "dined", "dining", "devoured", "devouring", "assassinated", "assassinating" arranged in embedding space along a "transitivity" direction and a "tense" direction; coordinates along these directions serve as features for emissions like "dined → VBD"]
[Huang 2011]
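Here is a minimal sketch of how such coordinates could be exposed as lexical features, assuming dense embedding vectors; the feature-naming scheme and the number of dimensions used are illustrative, not the parser's actual feature templates.

def embedding_features(word, tag, embeddings, dims=10):
    # Conjoin each of the first `dims` embedding coordinates with the
    # candidate tag, producing real-valued features like ("dim3&VBD", 0.42).
    # If directions in the space track properties such as tense or
    # transitivity, a linear model can pick them up from these features.
    feats = {}
    vec = embeddings.get(word)
    if vec is None:
        return feats
    for i in range(min(dims, len(vec))):
        feats["dim%d&%s" % (i, tag)] = float(vec[i])
    return feats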
Embedding structure results
[Bar chart, full training set: Baseline 91.13, +Features 91.08]
Embedding structure results (300 sentences)
[Bar chart: Baseline 71.88, +Features 70.32]
To summarize
[Bar chart, 300 sentences: Baseline 71.88, +OOV 72.20, +Pooling 72.21, +Features 70.32]
Combined results
[Bar chart: Baseline 90.70, +OOV+Pooling 90.11]
Combined results (300 sentences)
[Bar chart: Baseline 71.88, +OOV+Pooling 72.21]
What about…
• Domain adaptation? (no significant gain)
• French? (no significant gain)
• Other kinds of embeddings? (no significant gain)
Why didn’t it work?
• Context clues often provide enough information to reason around words with incomplete or incorrect statistics
• The parser already has robust OOV and small-count models
• Sometimes “help” from embeddings is worse than nothing, e.g. unhelpful embedding neighbors: bifurcate, Soap, homered, Paschi, tuning, unrecognized
What about other parsers?
• Dependency parsers (continuous repr. as syntactic abstraction)
• Neural networks (continuous repr. as structural requirement)
[Henderson 2004, Socher 2013, Koo 2008, Bansal 2014]
Conclusion
• Embeddings provide no apparent benefit to a state-of-the-art parser for:
– OOV handling
– Parameter pooling
– Lexicon features
• Code online at http://cs.berkeley.edu/~jda