
Distributional Initialization of Neural Networks

Irina Sergienya

Inaugural dissertation submitted for the degree of Doctor of Philosophy at the Ludwig-Maximilians-Universität München

München, 05.07.2016

Overview

2

Overview

3

NLP

Overview

4

NLP

Representation Learning

Overview

5

NLP

Machine Learning

Representation Learning

Overview

6

NLP

Machine Learning

NN

Representation Learning

Motivation

Main Goal: Improve performance of ML techniques on NLP tasks.

7

Motivation

Main Goal: Improve performance of ML techniques on NLP tasks.

1. Improve distributed word representations, especially for rare words.

2. Improve performance of language models.

3. Analyse performance of word2vec tool.

8

Motivation

Main Goal: Improve performance of ML techniques on NLP tasks.

1. Improve distributed word representations, especially for rare words.

2. Improve performance of language models.

3. Analyse performance of word2vec tool.

9

Motivation

1. Improve distributed word representations, especially for rare words.

Why do we care about the quality of representations?
● Better representations lead to better performance.

Why do we care about rare words?
● Languages with rich morphology.
● New words/senses.

10

Overview

11

Word representations

NLP

Machine Learning

NN

Representation Learning

Motivation

Main Goal: Improve performance of ML techniques on NLP tasks.

1. Improve distributed word representations, especially for rare words.

2. Improve performance of language models.

3. Analyse performance of word2vec tool.

12

Motivation

2. Improve performance of language models (LMs).

What are LMs?
● LMs predict the probability of a given sequence of words.

Why are LMs important?
● LMs are used in many NLP tasks.
● LMs are parts of real-world systems:
○ automatic speech recognition,
○ machine translation.

13

Overview

14

Word representations

NLP

Machine Learning

NN

Representation Learning

Language modeling

Motivation

Main Goal: Improve performance of ML techniques on NLP tasks.

1. Improve distributed word representations, especially for rare words.

2. Improve performance of language models.

3. Analyse performance of word2vec tool.

15

3. Analyse performance of word2vec tool.

What is word2vec?
● A tool to learn distributed word representations [Mikolov et al. 2013].
● Widely used in the NLP community.

Why is it important to analyse it?
● To verify that previous work is valid.
● To better understand it.

Motivation

16

Overview

17

Word representations

Word2vec tool

NLP

Machine Learning

NN

Representation Learning

Language modeling

Contributions

Proposals:
1. Initialization of NNs.
a. Distributional initialization of NNs (word2vec, LBL).
b. A range of different distributional representations with different combination schemes, association functions, and normalizations.
c. Combination of distributional and one-hot representations.
2. Analysis of the word2vec embeddings with respect to different initial random seeds.

18

Contributions

Proposals:
1. Initialization of NNs.
a. Distributional initialization of NNs (word2vec, LBL).
b. A range of different distributional representations with different combination schemes, association functions, and normalizations.
c. Combination of distributional and one-hot representations.
2. Analysis of the word2vec embeddings with respect to different initial random seeds.

19

Initialization

Initialization:
● Input representations
● Parameter initialization
● Hyperparameter initialization

The right initialization can help:
● to find a solution faster,
● to find a better solution.

20

Findings

1. On the word similarity judgment task, distributional initialization is better than traditional one-hot initialization.

2. On the language modeling task, initialization of NNLMs with distributional representations brings no or only minor improvements.

3. Word2vec performance is stable with respect to random seeds.

21

0. Motivation.

1. Learning Better Embeddings for Rare Words Using Distributional Representations.

2. Language Modeling.

3. Variability of word2vec.

4. Conclusion.

Talk outline

22

23

Questions?

24

Learning Better Embeddings for Rare Words Using Distributional Representations*

25

* Irina Sergienya and Hinrich Schütze. Learning Better Embeddings for Rare Words Using Distributional Representations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 280–285, Lisbon, Portugal, 2015

* Irina Sergienya and Hinrich Schütze. Distributional models and deep learning embeddings: Combining the best of both worlds. In International Conference on Learning Representations (ICLR), Banff, Canada, 2014

Motivation

Main goal: Improve distributed word representations, especially for rare words.

Why do we care about the quality of representations?
- Better representations lead to better performance.

Why do we care about rare words?
- Languages with rich morphology.
- New words/senses.

What is a distributed representation, and what are the other kinds?
● Clustering-based representation;
● Distributional representation (count vectors);
● Distributed representation (word embeddings).

26

Distributional representations

Distributional representation (count vectors)

● Long
● Sparse
● Contain co-occurrence statistics (e.g. raw counts, binary counts, PPMI)
● Easy to interpret: each dimension corresponds to a context
● Created by computing statistics from a corpus

27

Distributional representations

Distributional representation (count vectors)

28

1 brave         70  1  2  0  0  1  2  4 17  4  0  3 ...
2 hysteria      16 59  0 22 29 11  0  0  0 30  0  0 ...
3 mushroomed    25  0 33  0  0 23 32  6  0  0 31 18 ...
4 provisionally 45  0  4 55  0 39 12  0  0  0  0  0 ...
5 conformism    44  0  0 41 89 45  0  0 37 39  0  0 ...
6 profession     0  5  3  0  0 52 49 19  4 15 39  0 ...
7 incalculable  27  0 46 30 23 26 87 20  0  0  0 43 ...
8 monoclinic    25  0  0  0  0 41 24 31 19  0  0  0 ...
9 cat            0  3  0  0  0 18  3 40 45  0  0  0 ...
10 discharged    0 19 18  0 33  5 26  0 31 57  0 25 ...
11 untracked    10  0 40  0  0 49 16 14  5 43 80 31 ...
12 religion     33  0 26 48 25  0  2 16 18  0 30 34 ...
...

Distributed representations

Distributed representation (word embeddings)

● Short
● Dense
● Contain associations between a word and latent topics
● Hard to interpret: each dimension corresponds to a latent topic
● Created by dimensionality reduction techniques and NNs

29

Word embeddings via NN training

30

The cat sat on the mat

[Figure: numbered vocabulary list (1 brave, 2 hysteria, 3 mushroomed, 4 provisionally, 5 conformism, 6 profession, 7 incalculable, 8 monoclinic, 9 cat, 10 discharged, 11 untracked, 12 religion, ...)]

Word embeddings via NN training

31

The cat sat on the mat

[Figure: the same numbered vocabulary list; the word "cat" enters the network as a one-hot vector (0 0 0 0 0 0 0 0 1 0 0 0 ...)]

One-hot initialization

32

1 brave         1 0 0 0 0 0 0 0 0 0 0 0 ...
2 hysteria      0 1 0 0 0 0 0 0 0 0 0 0 ...
3 mushroomed    0 0 1 0 0 0 0 0 0 0 0 0 ...
4 provisionally 0 0 0 1 0 0 0 0 0 0 0 0 ...
5 conformism    0 0 0 0 1 0 0 0 0 0 0 0 ...
6 profession    0 0 0 0 0 1 0 0 0 0 0 0 ...
7 incalculable  0 0 0 0 0 0 1 0 0 0 0 0 ...
8 monoclinic    0 0 0 0 0 0 0 1 0 0 0 0 ...
9 cat           0 0 0 0 0 0 0 0 1 0 0 0 ...
10 discharged   0 0 0 0 0 0 0 0 0 1 0 0 ...
11 untracked    0 0 0 0 0 0 0 0 0 0 1 0 ...
12 religion     0 0 0 0 0 0 0 0 0 0 0 1 ...
...

Distributional initialization

33

1 brave         70  1  2  0  0  1  2  4 17  4  0  3 ...
2 hysteria      16 59  0 22 29 11  0  0  0 30  0  0 ...
3 mushroomed    25  0 33  0  0 23 32  6  0  0 31 18 ...
4 provisionally 45  0  4 55  0 39 12  0  0  0  0  0 ...
5 conformism    44  0  0 41 89 45  0  0 37 39  0  0 ...
6 profession     0  5  3  0  0 52 49 19  4 15 39  0 ...
7 incalculable  27  0 46 30 23 26 87 20  0  0  0 43 ...
8 monoclinic    25  0  0  0  0 41 24 31 19  0  0  0 ...
9 cat            0  3  0  0  0 18  3 40 45  0  0  0 ...
10 discharged    0 19 18  0 33  5 26  0 31 57  0 25 ...
11 untracked    10  0 40  0  0 49 16 14  5 43 80 31 ...
12 religion     33  0 26 48 25  0  2 16 18  0 30 34 ...
...

Proposal:

Proposed initialization

Proposed initialization schemes:
● Rare-word frequency threshold: parameter ∈ {10, 20, 50, 100};

34

Proposed initialization

Proposed initialization schemes:
● Rare-word frequency threshold: parameter ∈ {10, 20, 50, 100};
● Binary and PPMI vectors (BINARY and PPMI, Positive Pointwise Mutual Information):
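The BINARY and PPMI formulas were rendered as images on the slide; the standard definitions (my reconstruction) are:

BINARY:  x_{w,c} = 1 if word w and context c co-occur, and 0 otherwise.

PPMI:    \mathrm{PMI}(w,c) = \log \frac{P(w,c)}{P(w)\,P(c)}, \qquad \mathrm{PPMI}(w,c) = \max\bigl(0,\ \mathrm{PMI}(w,c)\bigr)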

35

Proposed initialization

Proposed initialization schemes:
● Rare-word frequency threshold: parameter ∈ {10, 20, 50, 100};
● Binary and PPMI vectors;
● Separate and Mixed.

36

Experimental setup

Training:
Corpus: ukWaC + WaCkypedia (2.4B tokens / 2.7M word types)
NN model: modified word2vec

Evaluation:
Word similarity judgment task

● RG [Rubenstein and Goodenough, 1965]

● MC [Miller and Charles, 1991]

● MEN [Bruni et al., 2012]

● WordSim353 (WS) [Finkelstein et al., 2001]

● Stanford Rare Word (RW) [Luong et al., 2013]

● SimLex-999 (SL) [Hill et al., 2015]

37

        #pairs   #words
RG      65       48
MC      30       39
MEN     3000     751
WS      353      437
RW      2034     2942
SL      999      1028
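For concreteness, a minimal sketch of how a word similarity benchmark is typically scored (Spearman correlation between human judgments and cosine similarity of the embeddings); the function and data names are illustrative, not the actual evaluation code:

import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(embeddings, pairs):
    """Spearman correlation between human similarity scores and
    cosine similarity of learned embeddings.
    embeddings: dict word -> vector; pairs: list of (w1, w2, human_score)."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:   # score only covered pairs
            v1, v2 = embeddings[w1], embeddings[w2]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            model_scores.append(cos)
            human_scores.append(gold)
    return spearmanr(model_scores, human_scores).correlation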

Results

38

Results

Main result: Distributional initialization improves the quality of word embeddings for rare words.

Recommendation: Mixed initialization with PPMI values and frequency threshold = 20.

Future work:
● Detailed analysis of performance.
● Effect on words with high frequencies.

39

40

0. Motivation.

1. Learning Better Embeddings for Rare Words Using Distributional Representations.

2. Language Modeling.

3. Variability of word2vec.

4. Conclusion.

Talk outline

41

42

Language modeling

Motivation (recap)

Main goal: Improve the performance of language models (LMs).

What are LMs?
● LMs predict the probability of a given sequence of words.

Why are LMs important?
● LMs are used in many NLP tasks.
● LMs are parts of real-world systems:
○ automatic speech recognition,
○ machine translation.

43

LMs

LMs predict the probability of a given sequence of words:

44

LMs

LMs predict the probability of a given sequence of words:

n-gram LMs: take into account a history of length n:
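The equations on these slides were images; in standard notation, the chain rule and the n-gram approximation are:

P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \dots, w_{t-1})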

45

LMs

LMs predict the probability of a given sequence of words:

n-gram LMs: take into account a history of length n:

Modified Kneser-Ney smoothing LM [Chen and Goodman, 1999]

46

NNLMs

Neural Network LMs: Log-bilinear Language Model [Mnih and Hinton, 2008]:

47

[Diagram: context words "The" and "cat" are mapped through context matrices C1 and C2 to predict "sat".]
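The slide's equations were images; in its standard form, the log-bilinear model predicts a representation for the next word as a linear combination of the context word embeddings and scores candidates by similarity to it:

\hat{r} = \sum_{i=1}^{n-1} C_i\, r_{w_i}, \qquad P(w_n = w \mid w_1, \dots, w_{n-1}) = \frac{\exp(\hat{r}^{\top} r_w + b_w)}{\sum_{v \in V} \exp(\hat{r}^{\top} r_v + b_v)}

where r_w are word embeddings, C_i are position-specific context matrices, and b_w are bias terms.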


Experimental setup

Training corpus:
Wall Street Journal corpus [Marcus et al. 1999]
Training: parts 00-20 (1M tokens, 45K word types)
Test: parts 21-22 (80K tokens)

Vocabularies:
● 45K - full vocabulary;
● 2vocab - words with frequency > 1;
● 10K - the 10K most frequent words.

Evaluation:
Perplexity of interpolated language models

49

Experimental framework

Framework:
● Baseline

● Explore NNLMs with different distributional initializations

● Report perplexity results for models with and without distributional initialization.

50

Experimental framework

Framework:
● Baseline:
○ train n-gram LM (Modified Kneser-Ney):
■ 173.07 for the 3-gram model (KN3),
■ 186.17 for the 5-gram model;
○ train a neural LM with one-hot initialization;
○ interpolate the n-gram LM and the NNLM;
○ evaluate the perplexity of the interpolated model.
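The interpolation and perplexity formulas were shown as images; a standard formulation (assuming the interpolation weight λ is tuned on held-out data) is:

P_{\mathrm{interp}}(w \mid h) = \lambda\, P_{\mathrm{NNLM}}(w \mid h) + (1 - \lambda)\, P_{\mathrm{KN}}(w \mid h)

\mathrm{PP} = \exp\Bigl(-\frac{1}{T} \sum_{t=1}^{T} \log P_{\mathrm{interp}}(w_t \mid h_t)\Bigr)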

51

Experimental framework

Framework:
● Baseline:
○ train n-gram LM (Modified Kneser-Ney),
○ train a neural LM with one-hot initialization,
○ interpolate the n-gram LM and the NNLM,
○ evaluate the perplexity of the interpolated model.
● Explore NNLMs with different distributional initializations.
● Report perplexity results for models with and without distributional initialization.

52


Method: Distributional initialization of NNLM.

Proposed Method

54

Represent each rare word by its distributional vector; the word's embedding is the product of this distributional vector with a matrix of randomly initialized low-dimensional embeddings. The distributional vectors stay fixed during training, while the values of the low-dimensional embeddings are adapted.
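A minimal sketch of this idea in code; the variable names, shapes, and toy counts are my own, not from the slide:

import numpy as np

V, d = 12, 50                           # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)

F = np.eye(V)                           # one-hot rows for frequent words ...
F[8, [1, 5, 7]] = [3.0, 18.0, 40.0]     # ... plus distributional counts for a rare word (hypothetical values)
M = rng.normal(scale=0.1, size=(V, d))  # randomly initialized low-dimensional embeddings (trainable)

def embed(word_id):
    """Effective input embedding: fixed distributional row of F times M.
    F stays fixed during training; only M receives gradient updates."""
    return F[word_id] @ M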

Distributional representations

Combination schemes:
● SEPARATE and MIXED.

Association measurement schemes:
● BINARY,
● Positioned PMI/PPMI,
● Letter 3-grams.

Normalization schemes:
● constant (W),
● scale (S),
● row normalization (RN),
● column normalization (CN).

55

Distributional representations

Combination schemes:● SEPARATE and MIXED.

56

Distributional representations

Association measurement schemes:
● BINARY: vector elements are from {0, 1},
● Positioned PMI/PPMI,
● Letter 3-grams.

57

Distributional representations

Association measurement schemes:
● BINARY: vector elements are from {0, 1},
● Positioned PMI/PPMI,
● Letter 3-grams.

For positions {-2, -1, 1, 2}, compute PPMI vectors.

58

SEPARATE: concatenate the positional PPMI vectors and apply normalization.

MIXED: concatenate the positional PPMI vectors, compute similarities between these representations, and create vectors based on the similarity scores.
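A rough sketch of positioned PPMI under my reading of the slide (one PPMI block per relative position, later concatenated); details such as smoothing and the exact MIXED construction are omitted:

import numpy as np

def positioned_ppmi(corpus, positions=(-2, -1, 1, 2)):
    """For each relative position p, count (word, context-at-offset-p) pairs,
    compute PPMI per position, and concatenate one block of columns per position."""
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    blocks = []
    for p in positions:
        counts = np.zeros((V, V))
        for t, w in enumerate(corpus):
            if 0 <= t + p < len(corpus):
                counts[idx[w], idx[corpus[t + p]]] += 1
        total = counts.sum()
        pw = counts.sum(axis=1, keepdims=True) / total      # P(w) for this position
        pc = counts.sum(axis=0, keepdims=True) / total      # P(c) for this position
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.where(counts > 0, np.log((counts / total) / (pw * pc)), 0.0)
        blocks.append(np.maximum(pmi, 0.0))                  # PPMI: keep only positive associations
    return np.concatenate(blocks, axis=1)                    # SEPARATE-style concatenation of blocks

# toy usage
vecs = positioned_ppmi("the cat sat on the mat".split())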

Distributional representations

Normalization schemes:
● constant (W): set all non-diagonal elements to a constant;
● scale (S): divide all non-diagonal elements by a fixed constant to scale values to [0, 1], [0, .5], or [0, .1];
● row normalization (RN): divide every non-diagonal element by the sum of the row values; scale to [0, 1], [0, .5], or [0, .1];
● column normalization (CN): divide every non-diagonal element by the sum of the column values; scale to [0, 1], [0, .5], or [0, .1].
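A sketch of these four schemes as I understand them, applied to a square word-word association matrix A; the exact treatment of the diagonal and the choice of scaling constants are assumptions:

import numpy as np

def normalize(A, scheme, top=1.0):
    """Normalize the off-diagonal entries of a square association matrix A.
    scheme: 'W' (constant), 'S' (scale), 'RN' (row norm.), 'CN' (column norm.).
    top: upper end of the target range, e.g. 1.0, 0.5 or 0.1."""
    A = A.astype(float).copy()
    off = ~np.eye(A.shape[0], dtype=bool)      # mask of non-diagonal entries
    if scheme == "W":                          # constant: set all off-diagonal entries to one value
        A[off] = top
    elif scheme == "S":                        # scale: divide by a fixed constant so values land in [0, top]
        A[off] = A[off] / max(A[off].max(), 1e-12) * top
    elif scheme == "RN":                       # row normalization: divide by row sums, then scale
        rows = np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
        A[off] = (A / rows)[off] * top
    elif scheme == "CN":                       # column normalization: divide by column sums, then scale
        cols = np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
        A[off] = (A / cols)[off] * top
    return A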

59

Hyperparameters

● Frequency ranges for distributional initialization:
○ Interval ∈ {[1,1], [1,2], [1,5], [1,10]},
○ ONLY ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20}.
● Initial learning rate ∈ {1, .5, .1, .05, .01}.
● Similarity threshold adjustment for MIXED models:
○ NOO ∈ {10, 30}.
● Corpus preprocessing:
○ Replace all digits with one token,
○ Replace all digits with one token and lowercase all words.

60

Experiments

Association     Combination   Normalization   Vocabularies                                            Frequency range
BINARY          MIXED         -               45K/45K, 45K/10K, 2vocab/2vocab, 2vocab/10K, 10K/10K    interval
Pos. PPMI       MIXED         -               45K/45K                                                 interval
                SEPARATE      -               45K                                                     interval
                MIXED         W               45K/45K                                                 only
                SEPARATE      S, CN, RN       45K                                                     only
Letter 3-gram   MIXED         W               45K/45K                                                 interval
                SEPARATE      S, CN, RN       45K                                                     interval
                MIXED         W               45K/45K                                                 only
                SEPARATE      S, CN, RN       45K                                                     only

+ Initial learning rate ∈ {1, .5, .1, .05, .01}.
+ Similarity threshold adjustment for MIXED models: NOO ∈ {10, 30}.
+ Corpus preprocessing.

61

Mixed ONLY models

62

Result

Main result: For the proposed distributional initialization schemes, no or only minor improvement over the one-hot baseline was observed.

Recommendations:
● Mixed initialization with a weighting scheme for words with frequency up to 10.
● Words with frequency 1 need special treatment.

Future work:
● Use distributional initialization both on the input and the output of the NN;
● Employ more sophisticated NNLMs: LSTM, CNN.

63

64

0. Motivation.

1. Learning Better Embeddings for Rare Words Using Distributional Representations.

2. Language Modeling.

3. Variability of word2vec.

4. Conclusion.

Talk outline

65

66

Variability of word2vec

Main goal: Analyse the performance of the word2vec tool.

What is word2vec?
● A tool to learn distributed word representations [Mikolov et al. 2013].
● Widely used in the NLP community.

Why is it important to analyse it?
● To verify that previous work is valid.
● To better understand it.

Motivation (recap)

67

Word2vec overview

Models: CBOW and Skip-gram

Training: hierarchical softmax and negative sampling
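For reference, the skip-gram negative sampling training objective for a word-context pair (w, c) with k negative samples, in its standard formulation (not shown on the slide):

\log \sigma\bigl(v'^{\top}_{c} v_w\bigr) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n(w)} \bigl[\log \sigma\bigl(-v'^{\top}_{c_i} v_w\bigr)\bigr]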

68

Pseudo-randomization in word2vec

Pseudo-random number generation in word2vec:

// initialize the PRNG state with a given seed (1, or the thread ID)
next_random = seed;
// generate pseudo-random values with a linear congruential generator
while a random value is needed do
    next_random = next_random * 25214903917 + 11;
    // use the generated pseudo-random value (e.g. via modulo or by taking low-order bits)
end

69

Random initialization of word2vec

Pseudo-randomization happens in several places in the word2vec code:

● first seed, initial value is 1:
○ to initialize the matrix of word embeddings;
● second seed, initial value is the thread ID:
○ during the subsampling of frequent words;
○ during training, to choose a context window size for each target word;
○ during negative sampling, to choose the indices of words that are used as negative examples.

Multithreading and asynchronous updates of NN weights.

70


Experiments

Experiments:
1. Determine the number of training epochs:
● fix seed = 1,
● train models for different numbers of epochs,
● pick a meaningful number of epochs.
2. Analyze random seeds:
Using the determined number of epochs, train models with different random seeds.

72

Evaluation I

73

3 evaluation metrics:
● Comparing 2 trained models:
1. topNN: common words among the top 10 nearest neighbors;
2. rankNN: the correlation of distances between words.
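A rough sketch of how these two metrics could be computed for two embedding matrices (one row per word, cosine similarity); the exact definitions used in the thesis may differ:

import numpy as np
from scipy.stats import spearmanr

def top_nn_overlap(E1, E2, word_idx, k=10):
    """topNN: number of shared words among the top-k nearest neighbors
    of word_idx in two embedding matrices."""
    def neighbors(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        sims = E @ E[word_idx]
        sims[word_idx] = -np.inf          # exclude the word itself
        return set(np.argsort(-sims)[:k])
    return len(neighbors(E1) & neighbors(E2))

def rank_nn(E1, E2, word_idx):
    """rankNN: Spearman correlation of the distances from word_idx
    to all other words, compared across the two models."""
    def dists(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        return 1.0 - E @ E[word_idx]
    return spearmanr(dists(E1), dists(E2)).correlation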

Evaluation II

Evaluation words:

74

Frequency interval   #words   20 randomly picked words
[1, 10]              29043    -0.00, ayers, calisto, chiappa, el-sadr, flower-bordered, kidnappers, mattone, mountaintop, norma, piracy, subskill, configuration, loot, rexall, envisioned, plentiful, endorsing, curbs, templeton
[10, 100]            5670     bags, oils, belong, curry, deliberately, responses, constant, yale, tax-exempt, denies, jerry, chosen, iowa, 0000.0, cellular, hearings, extremely, ounce, option, authority
[100, 1000]          1013     age, won, announcement, france, plc, thought, thing, merrill, role, growing, 0.0000, black, stores, los, provide, increased, real, public, what, federal
[1000, ...]          105      &, all, corp., not, who, up, were, would, company, 000, have, he, its, mr., from, by, it, that, a, the
[all_words]          35382    age-discrimination(1), citizenry(1), less-creditworthy(1), lugging(1), oneyear(1), ton(24), profit-margin(1), sewing(1), unwitting(1), rounds(2), simplicity(2), awesome(3), brush(5), cartoonist(3), brushed(5), unpredictable(5), wsj(7), aluminum(18), dennis(24), contributed(83)

Evaluation III

3 evaluation metrics:
● Comparing 2 trained models:
1. topNN: common words among the top 10 nearest neighbors;
2. rankNN: the correlation of distances between words.
● Quality of the learned embeddings of a single model:
3. word similarity judgment task.

75

Word2vec hyperparameters

Model architecture: CBOW, Skip-gram.
Objective: hierarchical softmax, negative sampling.

Skip-gram with negative sampling (k=5 samples);
Skip-gram with hierarchical softmax;
CBOW with negative sampling (k=5 samples);
CBOW with hierarchical softmax.

Number of epochs: {1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

Seeds (randomly picked from [1,1000]): {32, 291, 496, 614, 264, 724, 549, 802, 315, 77}

76

Experiment setup

Training corpus:
Wall Street Journal corpus, parts 00-20 (~1M tokens / ~35K word types)

Evaluation:
Word similarity judgment task

● RG [Rubenstein and Goodenough, 1965]

● MC [Miller and Charles, 1991]

● MEN [Bruni et al., 2012]

● WordSim353 (WS) [Finkelstein et al., 2001]

● Stanford Rare Word (RW) [Luong et al., 2013]

● SimLex-999 (SL) [Hill et al., 2015]

77

        #pairs   #covered
RG      65       46
MC      30       21
MEN     3000     2212
WS      353      321
RW      2034     405
SL      999      907

Number of epochs experiment

78

[Plot: number of epochs experiment; Skip-gram, 5 negative samples.]

Results of epoch number experiment

79

topNN and rankNN for different seeds

topNN, 1 thread, 20 epochs.

rankNN, 1 thread, 20 epochs.

80

topNN and rankNN for different seeds

rankNN, 1 thread, 20 epochs.

81

Similarity with different seeds


Skip-gram, 5 negative samples

82

Similarity for different seeds

83

Results

Main result: Word2vec seems to produce remarkably stable results for different initial random seeds.

The 10 models trained with different random seeds produce different embeddings, yet the structure of the learned embedding spaces is very similar.

Future work:
● Learn and compare transformation matrices.
● Full randomization of word2vec.
● Different initialization strategies for the embedding matrix.

84

85

0. Motivation.

1. Learning Better Embeddings for Rare Words Using Distributional Representations.

2. Language Modeling.

3. Variability of word2vec.

4. Conclusion.

Talk outline

86

87

Conclusion

Contributions

Proposals:
1. Initialization of NNs.
a. Distributional initialization of NNs (word2vec, LBL).
b. A range of different distributional representations with different combination schemes, association functions, and normalizations.
c. Combination of distributional and one-hot representations.
2. Analysis of the word2vec embeddings with respect to different initial random seeds.

88

Findings

1. On the word similarity judgment task, distributional initialization is better than traditional one-hot initialization.

2. On the language modeling task, initialization of NNLMs with distributional representations brings no or only minor improvements.

3. Word2vec performance is stable with respect to random seeds.

89

Thank you!

References

[Bruni et al., 2012] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In ACL, pages 136–145.

[Chen and Goodman, 1999] Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 13(4):359–394, October.

[Finkelstein et al., 2001] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In WWW, pages 406–414.

[Hill et al., 2015] Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, December.

[Luong et al., 2013] Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL.

[Marcus et al., 1999] Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3 LDC99T42. Web download. Philadelphia: Linguistic Data Consortium.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Workshop at ICLR.

[Miller and Charles, 1991] George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language & Cognitive Processes, 6(1):1–28.

[Mnih and Hinton, 2008] Andriy Mnih and Geoffrey E. Hinton. 2008. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.

[Rubenstein and Goodenough, 1965] Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Communications of the ACM, 8(10):627–633, October.

91

Corpus downsampling

Substitute “fire” → “*fire*”

92

Frequency parameter

Frequency ranges for distributional initialization:
● Interval;
● ONLY.

93

Word2vec parameters

Skip-gram model with hierarchical softmax; context window size 10 (10 words to the left and 10 to the right), min-count 1 (train on all tokens), embedding size 100, sampling rate 10⁻³; models are trained for one epoch.

Context window size 5 (5 words to the left and 5 to the right), embedding size 100, sampling rate 10⁻³, and initial learning rate alpha of 0.025 for Skip-gram and 0.05 for CBOW.

94