
Page 1

Improving Distributional Similarity with Lessons Learned from Word Embeddings

Omer Levy, Yoav Goldberg, Ido Dagan
Bar-Ilan University, Israel

Page 2

Word Similarity & Relatedness

• How similar is pizza to pasta?
• How related is pizza to Italy?

• Representing words as vectors allows easy computation of similarity
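To make "easy computation of similarity" concrete, here is a minimal sketch of cosine similarity between word vectors; the three-dimensional vectors are made up for illustration, not trained embeddings.

```python
import numpy as np

# Toy "embeddings" (illustrative values only).
vectors = {
    "pizza": np.array([0.9, 0.8, 0.1]),
    "pasta": np.array([0.8, 0.9, 0.2]),
    "italy": np.array([0.7, 0.3, 0.8]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["pizza"], vectors["pasta"]))  # similarity
print(cosine(vectors["pizza"], vectors["italy"]))  # relatedness, same machinery
```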

Page 3

Approaches for Representing Words

Distributional Semantics (Count)
• Used since the 90's
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD

Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)

Underlying Theory: The Distributional Hypothesis (Harris, '54; Firth, '57): "Similar words occur in similar contexts"
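For the count-based side, the following sketch shows the classic pipeline in miniature: count word-context co-occurrences, convert them to a PPMI matrix, and optionally reduce it with SVD. The toy corpus, window size, and dimensionality are assumptions for illustration, not the settings used in the talk.

```python
import numpy as np
from collections import Counter

corpus = [["pizza", "is", "similar", "to", "pasta"],
          ["pizza", "is", "related", "to", "italy"]]  # toy corpus (assumption)
window = 2

# 1) Count word-context co-occurrences within a fixed window.
counts = Counter()
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(w, sent[j])] += 1

words = sorted({w for w, _ in counts})
contexts = sorted({c for _, c in counts})
M = np.zeros((len(words), len(contexts)))
for (w, c), n in counts.items():
    M[words.index(w), contexts.index(c)] = n

# 2) PPMI: positive pointwise mutual information.
total = M.sum()
p_wc = M / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
pmi[~np.isfinite(pmi)] = 0.0  # unobserved pairs
ppmi = np.maximum(pmi, 0)

# 3) Dense vectors via truncated SVD.
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
d = 2  # tiny dimensionality for the toy example
word_vectors = U[:, :d] * S[:d]
```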

Page 4

Approaches for Representing Words

Both approaches:
• Rely on the same linguistic theory
• Use the same data
• Are mathematically related
  • "Neural Word Embedding as Implicit Matrix Factorization" (NIPS 2014)

• How come word embeddings are so much better?
  • "Don't Count, Predict!" (Baroni et al., ACL 2014)

• More than meets the eye…

Pages 5-8

What's really improving performance?

The Contributions of Word Embeddings:

Novel Algorithms (objective + training method)
• Skip-Grams + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …

New Hyperparameters (preprocessing, smoothing, etc.)
• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …



Pages 9-10

Our Contributions

1) Identifying the existence of new hyperparameters
   • Not always mentioned in papers

2) Adapting the hyperparameters across algorithms
   • Must understand the mathematical relation between algorithms

3) Comparing algorithms across all hyperparameter settings
   • Over 5,000 experiments

Page 11

Background


Pages 12-13

What is word2vec?

How is it related to PMI?


Pages 14-15

What is word2vec?

• word2vec is not a single algorithm
• It is a software package for representing words as vectors, containing:

  • Two distinct models
    • CBOW
    • Skip-Gram (SG)

  • Various training methods
    • Negative Sampling (NS)
    • Hierarchical Softmax

  • A rich preprocessing pipeline
    • Dynamic Context Windows
    • Subsampling
    • Deleting Rare Words
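One way to see these components as separate knobs is to map them onto the parameters of an off-the-shelf implementation. The sketch below assumes gensim 4.x; the specific parameter values are illustrative and not the configuration used in this work.

```python
from gensim.models import Word2Vec  # assumes gensim 4.x is installed

sentences = [["marco", "saw", "a", "furry", "little", "wampimuk"],
             ["the", "wampimuk", "was", "hiding", "in", "the", "tree"]]

model = Word2Vec(
    sentences,
    sg=1,               # Skip-Gram (SG) rather than CBOW
    hs=0, negative=5,   # Negative Sampling (NS) instead of Hierarchical Softmax
    window=5,           # dynamic context window (effective size sampled per token)
    sample=1e-3,        # subsampling of frequent words
    min_count=1,        # "deleting rare words" (1 here only because the corpus is tiny)
    ns_exponent=0.75,   # smoothed negative-sampling distribution
    vector_size=50,
)
print(model.wv.most_similar("wampimuk", topn=3))
```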


Pages 16-18

Skip-Grams with Negative Sampling (SGNS)

Marco saw a furry little wampimuk hiding in the tree.

words        contexts
wampimuk     furry
wampimuk     little
wampimuk     hiding
wampimuk     in
…            …

(data)

"word2vec Explained…" (Goldberg & Levy, arXiv 2014)
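A minimal sketch of how such (word, context) pairs are extracted with a symmetric window of size 2; this is an illustrative re-implementation, not word2vec's actual code.

```python
sentence = "Marco saw a furry little wampimuk hiding in the tree .".split()
window = 2

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

# e.g. pairs for "wampimuk": [('wampimuk', 'furry'), ('wampimuk', 'little'),
#                             ('wampimuk', 'hiding'), ('wampimuk', 'in')]
```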

Page 19

Skip-Grams with Negative Sampling (SGNS)

• SGNS finds a vector w for each word w in our vocabulary V_W
• Each such vector has d latent dimensions
• Effectively, it learns a matrix W (of size |V_W| × d) whose rows represent V_W
• Key point: it also learns a similar auxiliary matrix C (of size |V_C| × d) of context vectors
• In fact, each word has two embeddings

[Figure: the word matrix W and the context matrix C, each with the row for "wampimuk" highlighted]

"word2vec Explained…" (Goldberg & Levy, arXiv 2014)


Pages 20-22

Skip-Grams with Negative Sampling (SGNS)

• Maximize: σ(w · c), where c was observed with w

  words        contexts
  wampimuk     furry
  wampimuk     little
  wampimuk     hiding
  wampimuk     in

• Minimize: σ(w · c′), where c′ was hallucinated with w

  words        contexts
  wampimuk     Australia
  wampimuk     cyber
  wampimuk     the
  wampimuk     1985

"word2vec Explained…" (Goldberg & Levy, arXiv 2014)
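The per-pair objective can be written in a few lines. This sketch shows the loss SGNS optimizes for one observed pair and its hallucinated (negative) contexts, using random vectors purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_loss(w, c_pos, c_negs):
    """Negative log-likelihood for one observed (word, context) pair
    plus its sampled negative contexts."""
    loss = -np.log(sigmoid(w @ c_pos))          # maximize sigma(w . c)
    for c_neg in c_negs:
        loss += -np.log(sigmoid(-(w @ c_neg)))  # minimize sigma(w . c')
    return loss

rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=50), rng.normal(size=50)
c_negs = rng.normal(size=(5, 50))               # k = 5 hallucinated contexts
print(sgns_pair_loss(w, c_pos, c_negs))
```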

Page 23

Skip-Grams with Negative Sampling (SGNS)

• "Negative Sampling"
• SGNS samples contexts at random as negative examples
• "Random" = unigram distribution

• Spoiler: Changing this distribution has a significant effect
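A small sketch of the sampling step itself, with made-up context counts: negatives are drawn from the unigram distribution, and the `power` parameter anticipates the smoothed variant discussed later in the talk.

```python
import numpy as np

context_counts = {"the": 1000, "furry": 10, "Australia": 50, "cyber": 5, "1985": 20}
words = list(context_counts)
counts = np.array([context_counts[w] for w in words], dtype=float)

def sample_negatives(k, power=1.0, seed=0):
    # power = 1.0 -> plain unigram; power = 0.75 -> word2vec's smoothed unigram
    probs = counts ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return list(rng.choice(words, size=k, p=probs))

print(sample_negatives(5))               # unigram
print(sample_negatives(5, power=0.75))   # smoothed
```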

Page 24

What is SGNS learning?

Pages 25-30

What is SGNS learning?

• Take SGNS's embedding matrices (W and C)
• Multiply them: what do you get?
• A |V_W| × |V_C| matrix, where each cell describes the relation between a specific word-context pair
• We proved that, for large enough d and enough iterations, we get the word-context PMI matrix, shifted by a global constant:

  W · Cᵀ = M^PMI − log k

"Neural Word Embeddings as Implicit Matrix Factorization" (Levy & Goldberg, NIPS 2014)
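To make the factorization view concrete, here is a sketch of the explicit, count-based counterpart: shift PMI by log k, keep the positive part (often called SPPMI), and factorize it with SVD. The tiny PMI matrix is assumed for illustration, and the square-root split of the singular values is one common symmetric choice rather than the only option.

```python
import numpy as np

# Toy PMI matrix (illustrative values only).
pmi = np.array([[2.3, 0.4, -0.7],
                [0.2, 1.9,  0.6],
                [-0.3, 0.8, 2.5]])
k = 2                                    # number of negative samples in SGNS
sppmi = np.maximum(pmi - np.log(k), 0)   # shifted positive PMI

U, S, Vt = np.linalg.svd(sppmi, full_matrices=False)
d = 2                                    # embedding dimensionality
W = U[:, :d] * np.sqrt(S[:d])            # word vectors
C = Vt[:d].T * np.sqrt(S[:d])            # context vectors; W @ C.T approximates SPPMI
```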

Page 31

What is SGNS learning?

• SGNS is doing something very similar to the older approaches

• SGNS is factorizing the traditional word-context PMI matrix

• So does SVD!

• GloVe factorizes a similar word-context matrix

Page 32

But embeddings are still better, right?

• Plenty of evidence that embeddings outperform traditional methods
  • "Don't Count, Predict!" (Baroni et al., ACL 2014)
  • GloVe (Pennington et al., EMNLP 2014)

• How does this fit with our story?

Page 33

The Big Impact of “Small” Hyperparameters

Page 34

The Big Impact of "Small" Hyperparameters

• word2vec & GloVe are more than just algorithms…

• Introduce new hyperparameters

• May seem minor, but make a big difference in practice

Page 35

Identifying New Hyperparameters

Pages 36-39

New Hyperparameters

• Preprocessing (word2vec)
  • Dynamic Context Windows
  • Subsampling
  • Deleting Rare Words

• Postprocessing (GloVe)
  • Adding Context Vectors

• Association Metric (SGNS)
  • Shifted PMI
  • Context Distribution Smoothing



Pages 40-42

Dynamic Context Windows

Marco saw a furry little wampimuk hiding in the tree.

• word2vec: samples the effective window size per token, which amounts to a linearly decaying weighting of context words by distance
• GloVe: harmonic weighting of context words (1/distance)
• Aggressive: an even steeper decay with distance

The Word-Space Model (Sahlgren, 2006)
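A sketch of word2vec-style dynamic context windows, written as an illustrative re-implementation rather than word2vec's source: sampling the effective window size per token weights nearby contexts more heavily on average.

```python
import random

def dynamic_window_pairs(sentence, window=4, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i, word in enumerate(sentence):
        reduced = rng.randint(1, window)  # effective window for this token
        for j in range(max(0, i - reduced), min(len(sentence), i + reduced + 1)):
            if j != i:
                pairs.append((word, sentence[j]))
    return pairs

sentence = "Marco saw a furry little wampimuk hiding in the tree".split()
print(dynamic_window_pairs(sentence)[:5])
```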


Pages 43-44

Adding Context Vectors

• SGNS creates word vectors w
• SGNS also creates auxiliary context vectors c
  • So do GloVe and SVD

• Instead of just w, represent a word as: w + c

• Introduced by Pennington et al. (2014)
  • Only applied to GloVe
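A minimal sketch of the w + c postprocessing with toy matrices: each word is represented by the sum of its word vector and its context vector before computing similarities.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 100))   # word vectors, one row per vocabulary word
C = rng.normal(size=(1000, 100))   # auxiliary context vectors (same shape assumed)

combined = W + C                   # w + c representation for every word

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity between words i and j using the combined representation.
i, j = 3, 7
print(cosine(combined[i], combined[j]))
```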

Page 45

Adapting Hyperparameters across Algorithms


Pages 46-47

Context Distribution Smoothing

• SGNS samples c′ ~ P to form negative examples

• Our analysis assumes P is the unigram distribution

• In practice, it's a smoothed unigram distribution (unigram counts raised to the power of 0.75)

• This little change makes a big difference

Page 48

Context Distribution Smoothing

• We can adapt context distribution smoothing to PMI!

• Replace P(c) with P^0.75(c):

  PMI_0.75(w, c) = log [ P(w, c) / (P(w) · P^0.75(c)) ],  where P^0.75(c) = #(c)^0.75 / Σ_c #(c)^0.75

• Consistently improves PMI on every task

• Always use Context Distribution Smoothing!
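A sketch of PPMI with context distribution smoothing, using a toy count matrix: the context counts are raised to the power 0.75 before normalization, mirroring the negative-sampling distribution above.

```python
import numpy as np

def ppmi(counts, cds=0.75):
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    c_counts = counts.sum(axis=0) ** cds          # smoothed context counts
    p_c = (c_counts / c_counts.sum())[None, :]
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0)

counts = np.array([[10., 2., 0.],
                   [ 3., 8., 1.],
                   [ 0., 1., 6.]])
print(ppmi(counts))            # smoothed PPMI
print(ppmi(counts, cds=1.0))   # vanilla PPMI for comparison
```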

Page 49

Comparing Algorithms


Pages 50-51

Controlled Experiments

• Prior art was unaware of these hyperparameters

• Essentially, comparing “apples to oranges”

• We allow every algorithm to use every hyperparameter*

* If transferable

Pages 52-53

Systematic Experiments

• 9 Hyperparameters
  • 6 New

• 4 Word Representation Algorithms
  • PPMI (Sparse & Explicit)
  • SVD(PPMI)
  • SGNS
  • GloVe

• 8 Benchmarks
  • 6 Word Similarity Tasks
  • 2 Analogy Tasks

• 5,632 experiments



Pages 54-55

Hyperparameter Settings

Classic Vanilla Setting (commonly used for distributional baselines)
• Preprocessing: <None>
• Postprocessing: <None>
• Association Metric: Vanilla PMI/PPMI

Recommended word2vec Setting (tuned for SGNS)
• Preprocessing: Dynamic Context Window, Subsampling
• Postprocessing: <None>
• Association Metric: Shifted PMI/PPMI, Context Distribution Smoothing
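Read as configurations, the two settings above might look like the following dictionaries; the key names are illustrative assumptions, not identifiers from the paper's code.

```python
# Illustrative configuration dictionaries; key names are assumptions.
vanilla_setting = {
    "dynamic_context_window": False,
    "subsampling": False,
    "add_context_vectors": False,
    "shifted_pmi": False,                    # vanilla PMI/PPMI
    "context_distribution_smoothing": 1.0,   # no smoothing
}

word2vec_setting = {
    "dynamic_context_window": True,
    "subsampling": True,
    "add_context_vectors": False,
    "shifted_pmi": True,                     # shifted PMI/PPMI
    "context_distribution_smoothing": 0.75,  # unigram counts raised to 0.75
}
```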


Pages 56-58

Experiments: WordSim-353 Relatedness (Spearman's correlation)

Setting             PPMI (Sparse Vectors)   SGNS (Embeddings)
Vanilla setting     0.54                    0.587
word2vec setting    0.688                   0.623
Optimal setting     0.697                   0.681

Page 59

Overall Results

• Hyperparameters often have stronger effects than algorithms

• Hyperparameters often have stronger effects than more data

• Prior superiority claims were not accurate

Page 60

Re-evaluating Prior Claims


Pages 61-62

Don't Count, Predict! (Baroni et al., 2014)

• "word2vec is better than count-based methods"

• Hyperparameter settings account for most of the reported gaps

• Embeddings do not really outperform count-based methods*

* Except for one task…

Pages 63-64

GloVe (Pennington et al., 2014)

• "GloVe is better than word2vec"

• Hyperparameter settings account for most of the reported gaps
  • Adding context vectors applied only to GloVe
  • Different preprocessing

• We observed the opposite
  • SGNS outperformed GloVe on every task

• Our largest corpus: 10 billion tokens
  • Perhaps larger corpora behave differently?


Page 65

Linguistic Regularities in Sparse and Explicit Word Representations (Levy and Goldberg, 2014)

• "PPMI vectors perform on par with SGNS on analogy tasks"

• Holds for semantic analogies
• Does not hold for syntactic analogies (MSR dataset)

• Hyperparameter settings account for most of the reported gaps
  • Different context type for PPMI vectors

• Syntactic Analogies: there is a real gap in favor of SGNS
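For context, analogy questions of the form a : b :: c : ? are usually answered with vector arithmetic. Below is a sketch of the common 3CosAdd rule with toy vectors; either sparse or embedded representations can be plugged in.

```python
import numpy as np

# Toy embedding matrix: one row per vocabulary word (illustrative values).
vocab = ["king", "queen", "man", "woman", "paris", "france"]
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 50))
E /= np.linalg.norm(E, axis=1, keepdims=True)    # unit-normalize rows

def analogy(a, b, c):
    """3CosAdd: return the word whose vector is closest to b - a + c."""
    target = E[vocab.index(b)] - E[vocab.index(a)] + E[vocab.index(c)]
    target /= np.linalg.norm(target)
    scores = E @ target
    for w in (a, b, c):                          # exclude the question words
        scores[vocab.index(w)] = -np.inf
    return vocab[int(np.argmax(scores))]

print(analogy("man", "woman", "king"))           # ideally "queen" with real vectors
```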

Page 66

Conclusions


Pages 67-68

Conclusions: Distributional Similarity

The Contributions of Word Embeddings:
• Novel Algorithms
• New Hyperparameters

What's really improving performance?
• Hyperparameters (mostly)
• The algorithms are an improvement
• SGNS is robust & efficient

Page 69

Conclusions: Methodology

• Look for hyperparameters

• Adapt hyperparameters across different algorithms

• For good results: tune hyperparameters

• For good science: tune baselines’ hyperparameters

Thank you :)