Distributional Initialization of Neural Networks
Irina Sergienya
Inaugural dissertation for the degree of Doctor of Philosophy at Ludwig-Maximilians-Universität München
Munich, 05.07.2016
Motivation
Main Goal: Improve performance of ML techniques on NLP tasks.
1. Improve distributed word representations, especially for rare words.
2. Improve performance of language models.
3. Analyse performance of word2vec tool.
Motivation
1. Improve distributed word representations, especially for rare words.
Why do we care about quality of representations?
● Better representations lead to better performance.
Why do we care about rare words?
● Languages with rich morphology.
● New words/senses.
Motivation
2. Improve performance of language models (LMs).
What are LMs?
● LMs predict probability of a given sequence of words.
Why are LMs important?
● LMs are used in NLP tasks,
● LMs are parts of real-world systems:
  ○ automatic speech recognition,
  ○ machine translation.
Motivation
3. Analyse performance of word2vec tool.
What is word2vec?
● Tool to learn distributed word representations [Mikolov et al., 2013].
● Widely used in NLP community.
Why is it important to analyse it?
● To verify that previous work is valid.
● To better understand it.
Overview
[Diagram: word representations, the word2vec tool, and language modeling, positioned among NLP, Machine Learning, NN, and Representation Learning.]
Contributions
Proposals:
1. Initialization of NNs.
  a. Distributional initialization of NNs (word2vec, LBL).
  b. A handful of distributional representations with different combination schemes, association functions, and normalizations.
  c. Combination of distributional and one-hot representations.
2. Analysis of the word2vec embeddings with respect to different initial random seeds.
Initialization
Initialization:
● Input representations
● Parameters initialization
● Hyperparameters initialization
Right initialization can help:
● To find solution faster
● To find better solution
Findings
1. On the word similarity judgment task, distributional initialization is better than traditional one-hot initialization.
2. On the language modeling task, initializing NNLMs with distributional representations brings no or only minor improvements.
3. Word2vec performance is stable with respect to random seeds.
Talk outline
0. Motivation.
1. Learning Better Embeddings for Rare Words Using Distributional Representations.
2. Language Modeling.
3. Variability of word2vec.
4. Conclusion.
Learning Better Embeddings for Rare Words Using Distributional Representations*
* Irina Sergienya and Hinrich Schütze. Learning Better Embeddings for Rare Words Using Distributional Representations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 280-285, Lisbon, Portugal, 2015.
* Irina Sergienya and Hinrich Schütze. Distributional models and deep learning embeddings: Combining the best of both worlds. In International Conference on Learning Representations (ICLR), Banff, Canada, 2014.
Motivation
Main goal: Improve distributed word representations, especially for rare words.
Why do we care about quality of representations?
● Better representations lead to better performance.
Why do we care about rare words?
● Languages with rich morphology.
● New words/senses.
What is the distributed representation, and what are the others?
● Clustering-based representation;
● Distributional representation (count vectors);
● Distributed representation (word embeddings).
Distributional representations
Distributional representation (count vectors)
● Long
● Sparse
● Contain co-occurrence statistics (e.g. raw counts, binary counts, PPMI)
● Easy to interpret: each dimension corresponds to a context
● Created by computing statistics from a corpus
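The last bullet, creating count vectors from corpus statistics, can be illustrated with a minimal sketch. The toy corpus, the window size of 1, and the function name `count_vectors` are assumptions made for this illustration and are not taken from the thesis.

```python
from collections import Counter, defaultdict

def count_vectors(sentences, window=1):
    """Map each word to a Counter of context words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sentence[j]] += 1
    return vectors

# Toy corpus (hypothetical, for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the cat ate".split(),
]
vectors = count_vectors(corpus, window=1)

# One dimension per context word, in a fixed (sorted) order.
vocab = sorted({w for s in corpus for w in s})
cat_row = [vectors["cat"][c] for c in vocab]
print(vocab)
print(cat_row)
```

With a real corpus the rows become long and sparse, which is why a sparse structure such as `Counter` is a natural choice here.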
Example count vectors:
 1 brave          70  1  2  0  0  1  2  4 17  4  0  3 ...
 2 hysteria       16 59  0 22 29 11  0  0  0 30  0  0 ...
 3 mushroomed     25  0 33  0  0 23 32  6  0  0 31 18 ...
 4 provisionally  45  0  4 55  0 39 12  0  0  0  0  0 ...
 5 conformism     44  0  0 41 89 45  0  0 37 39  0  0 ...
 6 profession      0  5  3  0  0 52 49 19  4 15 39  0 ...
 7 incalculable   27  0 46 30 23 26 87 20  0  0  0 43 ...
 8 monoclinic     25  0  0  0  0 41 24 31 19  0  0  0 ...
 9 cat             0  3  0  0  0 18  3 40 45  0  0  0 ...
10 discharged      0 19 18  0 33  5 26  0 31 57  0 25 ...
11 untracked      10  0 40  0  0 49 16 14  5 43 80 31 ...
12 religion       33  0 26 48 25  0  2 16 18  0 30 34 ...
... ...
Distributed representations
Distributed representation (word embeddings)
● Short
● Dense
● Contain associations between word and latent topic
● Hard to interpret: each dimension corresponds to a latent topic
● Created by dimensionality reduction techniques and NNs
Word embeddings via NN training
Example input: The cat sat on the mat
Vocabulary:
 1 brave
 2 hysteria
 3 mushroomed
 4 provisionally
 5 conformism
 6 profession
 7 incalculable
 8 monoclinic
 9 cat
10 discharged
11 untracked
12 religion
...
One-hot input vector for "cat" (vocabulary index 9):
00000000100000000000...
One-hot initialization
 1 brave          1 0 0 0 0 0 0 0 0 0 0 0 ...
 2 hysteria       0 1 0 0 0 0 0 0 0 0 0 0 ...
 3 mushroomed     0 0 1 0 0 0 0 0 0 0 0 0 ...
 4 provisionally  0 0 0 1 0 0 0 0 0 0 0 0 ...
 5 conformism     0 0 0 0 1 0 0 0 0 0 0 0 ...
 6 profession     0 0 0 0 0 1 0 0 0 0 0 0 ...
 7 incalculable   0 0 0 0 0 0 1 0 0 0 0 0 ...
 8 monoclinic     0 0 0 0 0 0 0 1 0 0 0 0 ...
 9 cat            0 0 0 0 0 0 0 0 1 0 0 0 ...
10 discharged     0 0 0 0 0 0 0 0 0 1 0 0 ...
11 untracked      0 0 0 0 0 0 0 0 0 0 1 0 ...
12 religion       0 0 0 0 0 0 0 0 0 0 0 1 ...
... ...
Proposal: Distributional initialization
 1 brave          70  1  2  0  0  1  2  4 17  4  0  3 ...
 2 hysteria       16 59  0 22 29 11  0  0  0 30  0  0 ...
 3 mushroomed     25  0 33  0  0 23 32  6  0  0 31 18 ...
 4 provisionally  45  0  4 55  0 39 12  0  0  0  0  0 ...
 5 conformism     44  0  0 41 89 45  0  0 37 39  0  0 ...
 6 profession      0  5  3  0  0 52 49 19  4 15 39  0 ...
 7 incalculable   27  0 46 30 23 26 87 20  0  0  0 43 ...
 8 monoclinic     25  0  0  0  0 41 24 31 19  0  0  0 ...
 9 cat             0  3  0  0  0 18  3 40 45  0  0  0 ...
10 discharged      0 19 18  0 33  5 26  0 31 57  0 25 ...
11 untracked      10  0 40  0  0 49 16 14  5 43 80 31 ...
12 religion       33  0 26 48 25  0  2 16 18  0 30 34 ...
... ...
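To make the contrast between the two input schemes concrete, here is a small sketch of what the first network layer computes in each case. The vocabulary size, embedding size, weight values, and the distributional vector are all hypothetical; the sketch only assumes that the input vector is multiplied by a weight matrix, as in a standard embedding layer.

```python
import random

def embed(input_vector, weights):
    """Hidden-layer activation: input_vector (length V) times weights (V x d)."""
    dim = len(weights[0])
    return [sum(x * row[k] for x, row in zip(input_vector, weights))
            for k in range(dim)]

random.seed(0)
V, d = 4, 3  # toy vocabulary size and embedding size (assumed values)
weights = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(V)]

one_hot = [0, 0, 1, 0]                 # third word under one-hot initialization
distributional = [0.2, 0.0, 1.0, 0.7]  # hypothetical association values for a rare word

e_one_hot = embed(one_hot, weights)      # selects a single row of `weights`
e_dist = embed(distributional, weights)  # weighted sum of several rows
print(e_one_hot)
print(e_dist)
```

With a one-hot input, a word only ever updates its own row. With a distributional input, a rare word also draws on the rows of the contexts it co-occurs with, which is the intuition behind using it for rare words.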
Proposed initialization
Proposed initialization schemes:
● Rare words threshold: parameter ∊ {10, 20, 50, 100};
● Binary and PPMI vectors:
  ○ BINARY: an entry is 1 if the word co-occurs with the context, 0 otherwise;
  ○ PPMI (Positive Pointwise Mutual Information): $\mathrm{PPMI}(w, c) = \max\left(0, \log \frac{P(w, c)}{P(w)\,P(c)}\right)$;
● Separate and Mixed.
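The BINARY and PPMI association schemes can be sketched as follows, using the standard textbook definition of PPMI (PMI with negative values clipped to zero). The count matrix is a made-up example, and any additional scaling or normalization used in the thesis is not reproduced here.

```python
import math

def ppmi_matrix(counts):
    """counts[i][j]: co-occurrence count of word i with context j. Returns PPMI values."""
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    result = []
    for i, row in enumerate(counts):
        out = []
        for j, n in enumerate(row):
            if n == 0:
                out.append(0.0)  # unseen pairs get 0 (PMI would be minus infinity)
            else:
                pmi = math.log(n * total / (row_sums[i] * col_sums[j]))
                out.append(max(0.0, pmi))
        result.append(out)
    return result

def binary_matrix(counts):
    """1 where a word and a context co-occur at least once, else 0."""
    return [[1 if n > 0 else 0 for n in row] for row in counts]

counts = [[4, 0], [1, 3]]  # hypothetical 2-word, 2-context count matrix
ppmi = ppmi_matrix(counts)
binary = binary_matrix(counts)
print(ppmi)
print(binary)
```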
Experimental setup
Training:
Corpus: ukWaC+WaCkypedia (2.4B / 2.7M)
NN model: modified word2vec
Evaluation:
Word similarity judgment task
● RG [Rubenstein and Goodenough, 1965]
● MC [Miller and Charles, 1991]
● MEN [Bruni et al., 2012]
● WordSim353 (WS) [Finkelstein et al., 2001]
● Stanford Rare Word (RW) [Luong et al., 2013]
● SimLex-999 (SL) [Hill et al., 2015]
Dataset  #pairs  #words
RG           65      48
MC           30      39
MEN        3000     751
WS          353     437
RW         2034    2942
SL          999    1028
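A common way to score the word similarity judgment task is to compute the Spearman correlation between the model's cosine similarities and the human ratings. The sketch below follows that convention as an assumption; the embeddings, word pairs, and ratings are invented for illustration and do not come from any of the datasets listed above.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def evaluate(embeddings, pairs):
    """pairs: (word1, word2, human_score). Pairs with an unknown word are skipped."""
    covered = [(a, b, s) for a, b, s in pairs if a in embeddings and b in embeddings]
    model = [cosine(embeddings[a], embeddings[b]) for a, b, _ in covered]
    human = [s for _, _, s in covered]
    return spearman(model, human), len(covered)

# Hypothetical toy data.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.3, 0.9]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5), ("cat", "car", 2.0), ("dog", "unicorn", 1.0)]
rho, n_covered = evaluate(embeddings, pairs)
print(rho, n_covered)
```

Skipping pairs with out-of-vocabulary words is one possible policy; it means the score is computed only over the covered pairs.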
Results
Main result: Distributional initialization improves the quality of word embeddings for rare words.
Recommendation: Mixed initialization with PPMI values and the frequency threshold = 20.
Future work:
● Detailed analysis of performance.
● Effect on words with high frequencies.
Language Modeling
Motivation (recap)
Main goal: Improve performance of language models (LMs).
What are LMs?
● LMs predict probability of a given sequence of words.
Why are LMs important?
● LMs are used in NLP tasks,
● LMs are parts of real-world systems:
  ○ automatic speech recognition,
  ○ machine translation.
LMs
LMs predict probability of a given sequence of words:
$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})$
n-gram LMs condition only on a history of the previous $n-1$ words:
$P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$
Modified Kneser-Ney smoothing LM [Chen and Goodman, 1999]
Experimental setup
Training corpus:
Wall Street Journal corpus [Marcus et al., 1999]
Training: parts 00-20 (1M tokens, 45K word types)
Test: parts 21-22 (80K tokens)
Vocabularies:
● 45K - full vocabulary;
● 2vocab - words with frequency >1;
● 10K - top frequent 10K words.
Evaluation:
Perplexity of interpolated language models
Experimental framework
Framework:
● Baseline:
  ○ train n-gram LM (Modified Kneser-Ney),
    ■ perplexity 173.07 for 3-gram model (KN3)
    ■ perplexity 186.17 for 5-gram model
  ○ train neural LM with one-hot initialization,
  ○ interpolate n-gram LM and NNLM,
  ○ evaluate perplexity of the interpolated model.
● Explore NNLMs with different distributional initializations.
● Report perplexity results for models with and without distributional initialization.
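The interpolation and perplexity steps of the framework can be sketched as below. The sketch assumes plain linear interpolation with a single weight `lam`, which is a common choice but is not confirmed by the slides; the per-token probabilities are invented for illustration.

```python
import math

def interpolate(p_ngram, p_nnlm, lam):
    """Linear interpolation of two per-word probabilities; lam weights the NNLM."""
    return lam * p_nnlm + (1.0 - lam) * p_ngram

def perplexity(word_probs):
    """exp of the average negative log-probability over the test tokens."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# Hypothetical probabilities each model assigns to the tokens of a 4-token test text.
p_kn = [0.10, 0.02, 0.30, 0.05]
p_nn = [0.20, 0.01, 0.25, 0.10]
lam = 0.5  # assumed interpolation weight, normally tuned on held-out data

p_mix = [interpolate(a, b, lam) for a, b in zip(p_kn, p_nn)]
ppl_kn, ppl_nn, ppl_mix = perplexity(p_kn), perplexity(p_nn), perplexity(p_mix)
print(ppl_kn, ppl_nn, ppl_mix)
```

Lower perplexity is better; a model that assigns probability 0.25 to every token has perplexity 4.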
Proposed Method
Method: Distributional initialization of NNLM.
Represent each word through:
- its distributional vector;
- randomly initialized low-dimensional embeddings.
The distributional vectors stay fixed during training, while the values of the embeddings are adapted.
Distributional representations
Combination schemes:
● SEPARATE and MIXED.
Association measurement schemes:
● BINARY,
● Positioned PMI/PPMI,
● Letter 3-grams.
Normalization schemes:
● constant (W),
● scale (S),
● row normalization (RN),
● column normalization (CN).
Distributional representations
Association measurement schemes:
● BINARY: vector elements are from {0, 1},
● Positioned PMI/PPMI: for positions {-2, -1, 1, 2}, compute PPMI vectors,
● Letter 3-grams.
Combination schemes:
● SEPARATE: concatenate PPMIpos vectors; apply normalization.
● MIXED: concatenate PPMIpos vectors; compute similarities of these representations; create vectors depending on similarity scores.
Distributional representations
Normalization schemes:
● constant (W): set all non-diagonal elements to a constant;
● scale (S): divide all non-diagonal elements by a fixed constant to scale values to [0, 1], [0, .5], or [0, .1];
● row normalization (RN): divide every non-diagonal element by the sum of the row values; scale to [0, 1], [0, .5], or [0, .1];
● column normalization (CN): divide every non-diagonal element by the sum of the column values; scale to [0, 1], [0, .5], or [0, .1].
Hyperparameters
● Frequency ranges for distributional initialization:
  ○ Interval ∊ {[1,1], [1,2], [1,5], [1,10]},
  ○ ONLY ∊ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20}.
● Initial learning rate ∊ {1, .5, .1, .05, .01}.
● Similarity threshold adjustment for MIXED models:
  ○ NOO ∊ {10, 30}.
● Corpus preprocessing:
  ○ Replace all digits with one token,
  ○ Replace all digits with one token and lowercase all words.
Experiments
Association | Combination | Normalization | Vocabularies | Frequency range
BINARY | MIXED | - | 45K/45K, 45K/10K, 2vocab/2vocab, 2vocab/10K, 10K/10K | interval
Pos. PPMI | MIXED | - | 45K/45K | interval
Pos. PPMI | SEPARATE | - | 45K | interval
Pos. PPMI | MIXED | W | 45K/45K | only
Pos. PPMI | SEPARATE | S,CN,RN | 45K | only
Letter 3-gram | MIXED | W | 45K/45K | interval
Letter 3-gram | SEPARATE | S,CN,RN | 45K | interval
Letter 3-gram | MIXED | W | 45K/45K | only
Letter 3-gram | SEPARATE | S,CN,RN | 45K | only
+ Initial learning rate ∊ {1, .5, .1, .05, .01}.
+ Similarity threshold adjustment for MIXED models: NOO ∊ {10, 30}.
+ Corpus preprocessing.
Result
Main result: For the proposed distributional initialization schemes, no or only minor improvement over the one-hot baseline was observed.
Recommendations:
● Mixed initialization with weighting scheme for words with frequency up to 10.
● Words with frequency 1 need special treatment.
Future work:
● Use distributional initialization both on input and output of NN;
● Employ more sophisticated NNLMs: LSTM, CNN.
Variability of word2vec
Motivation (recap)
Main goal: Analyse performance of word2vec tool.
What is word2vec?
● Tool to learn distributed word representations [Mikolov et al., 2013].
● Widely used in NLP community.
Why is it important to analyse it?
● To verify that previous work is valid.
● To better understand it.
Word2vec overview
Models: CBOW and Skip-gram
Training: hierarchical softmax and negative sampling
Pseudo-randomization in word2vec
Pseudo-random number generation in word2vec:
    // initialize the random variable with a given value
    next_random = {1 or threadID};
    // generate random values
    while a random value is needed do
        next_random = next_random * 25214903917 + 11;
        // use the generated pseudo-random value next_random
        {...};
    end
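The generator above is a linear congruential generator. A runnable Python sketch follows; it assumes 64-bit unsigned wraparound, as would happen with a C `unsigned long long`. The two uses shown (embedding initialization from the low 16 bits, and a random context-window reduction) are written from memory of the reference C implementation and should be treated as assumptions, as should the thread ID of 3.

```python
MASK64 = (1 << 64) - 1  # emulate C unsigned 64-bit overflow

def lcg(seed):
    """Yield successive values of next_random = next_random * 25214903917 + 11."""
    next_random = seed
    while True:
        next_random = (next_random * 25214903917 + 11) & MASK64
        yield next_random

def init_embedding_row(gen, size):
    """Assumed initialization: small uniform values in [-0.5/size, 0.5/size)."""
    return [(((next(gen) & 0xFFFF) / 65536) - 0.5) / size for _ in range(size)]

gen = lcg(1)  # first seed: initial value 1
row = init_embedding_row(gen, 100)

thread_gen = lcg(3)  # second seed: a hypothetical thread ID of 3
window = 5
shrink = next(thread_gen) % window  # assumed use: random reduction of the context window
print(row[:3], shrink)
```

Because the seed fully determines the sequence, a single-threaded run with a fixed seed is reproducible; changing the seed changes every value drawn afterwards.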
Random initialization of word2vec
Pseudo-randomization happens in several places in the word2vec code:
● first seed, initial value is 1:
  ○ to initialize the matrix of word embeddings;
● second seed, initial value is thread ID:
  ○ during the subsampling of frequent words;
  ○ during the training, to choose a context window size for each target word;
  ○ during negative sampling, to choose indices of words that are used as negative examples.
Another source of randomness: multithreading and asynchronous updates of NN weights.
Experiments
Experiments:
1. Determine number of training epochs:
  ● fix seed=1,
  ● train models for different numbers of epochs,
  ● pick a meaningful number of epochs.
2. Analyze random seeds:
  ● using the determined number of epochs, train models for different random seeds.
Evaluation I
3 evaluation metrics:
● Comparing 2 trained models:
  1. topNN: common words among the top 10 nearest neighbors;
  2. rankNN: the correlation of distances between words.
Evaluation II
Evaluation words:
Frequency interval | Number of words | 20 randomly picked words
[1, 10] | 29043 | -0.00, ayers, calisto, chiappa, el-sadr, flower-bordered, kidnappers, mattone, mountaintop, norma, piracy, subskill, configuration, loot, rexall, envisioned, plentiful, endorsing, curbs, templeton
[10, 100] | 5670 | bags, oils, belong, curry, deliberately, responses, constant, yale, tax-exempt, denies, jerry, chosen, iowa, 0000.0, cellular, hearings, extremely, ounce, option, authority
[100, 1000] | 1013 | age, won, announcement, france, plc, thought, thing, merrill, role, growing, 0.0000, black, stores, los, provide, increased, real, public, what, federal
[1000, ...] | 105 | &, all, corp., not, who, up, were, would, company, 000, have, he, its, mr., from, by, it, that, a, the
[all_words] | 35382 | age-discrimination(1), citizenry(1), less-creditworthy(1), lugging(1), oneyear(1), ton(24), profit-margin(1), sewing(1), unwitting(1), rounds(2), simplicity(2), awesome(3), brush(5), cartoonist(3), brushed(5), unpredictable(5), wsj(7), aluminum(18), dennis(24), contributed(83)
Evaluation III
3 evaluation metrics:
● Comparing 2 trained models:
  1. topNN: common words among the top 10 nearest neighbors;
  2. rankNN: the correlation of distances between words.
● Quality of learned embeddings of single model:
  3. word similarity judgment task.
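The two model-comparison metrics can be sketched as follows. The slides give only one-line definitions, so the details here are assumptions: neighbors are ranked by cosine similarity, and rankNN is taken to be the Spearman rank correlation between the two models' similarities from a query word to all other words. The toy models are invented; the second is the first rotated by 90 degrees.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarities(model, word):
    """Cosine similarity from `word` to every other word in the model."""
    return {w: cosine(model[word], vec) for w, vec in model.items() if w != word}

def top_nn(model_a, model_b, word, k=10):
    """Number of shared words among the k nearest neighbors of `word` in both models."""
    def nearest(model):
        sims = similarities(model, word)
        return set(sorted(sims, key=sims.get, reverse=True)[:k])
    return len(nearest(model_a) & nearest(model_b))

def rank_nn(model_a, model_b, word):
    """Spearman correlation of similarity ranks (assumes a shared vocabulary, no ties)."""
    sims_a, sims_b = similarities(model_a, word), similarities(model_b, word)
    words = sorted(sims_a)
    def ranks(sims):
        return {w: r for r, w in enumerate(sorted(words, key=sims.get))}
    ra, rb = ranks(sims_a), ranks(sims_b)
    n = len(words)
    d2 = sum((ra[w] - rb[w]) ** 2 for w in words)
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical toy models: model_b is model_a rotated by 90 degrees.
model_a = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
model_b = {"a": [0.0, 1.0], "b": [-0.1, 0.9], "c": [-0.5, 0.5], "d": [-1.0, 0.0]}

shared = top_nn(model_a, model_b, "a", k=2)
rho = rank_nn(model_a, model_b, "a")
print(shared, rho)
```

The two toy models have different coordinates for every word, yet both metrics report full agreement, because a rotation leaves all cosine similarities unchanged.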
Word2vec hyperparameters
Model architecture: CBOW, Skip-gram.
Objective: hierarchical softmax, negative sampling.
Configurations:
● Skip-gram with negative sampling (k=5 samples);
● Skip-gram with hierarchical softmax;
● CBOW with negative sampling (k=5 samples);
● CBOW with hierarchical softmax.
Number of epochs: {1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100}
Seeds (randomly picked from [1,1000]): {32, 291, 496, 614, 264, 724, 549, 802, 315, 77}
Experiment setup
Training corpus:
Wall Street Journal corpus, parts 00-20 (~1M/~35K)
Evaluation:
Word similarity judgment task
● RG [Rubenstein and Goodenough, 1965]
● MC [Miller and Charles, 1991]
● MEN [Bruni et al., 2012]
● WordSim353 (WS) [Finkelstein et al., 2001]
● Stanford Rare Word (RW) [Luong et al., 2013]
● SimLex-999 (SL) [Hill et al., 2015]
Dataset  #pairs  #covered
RG           65        46
MC           30        21
MEN        3000      2212
WS          353       321
RW         2034       405
SL          999       907
topNN and rankNN for different seeds
[Figure: topNN, 1 thread, 20 epochs.]
[Figure: rankNN, 1 thread, 20 epochs.]
Results
Main result: Word2vec seems to produce remarkably stable results for different initial random seeds.
The 10 models trained with different random seeds are not identical, yet the structure of their learned embedding spaces was found to be very similar.
Future work:
● Learn and compare transformation matrices.
● Full randomization of word2vec.
● Different initialization strategies of the embedding matrix.
Conclusion
Contributions
Proposals:
1. Initialization of NNs.
  a. Distributional initialization of NNs (word2vec, LBL).
  b. A handful of distributional representations with different combination schemes, association functions, and normalizations.
  c. Combination of distributional and one-hot representations.
2. Analysis of the word2vec embeddings with respect to different initial random seeds.
Findings
1. On the word similarity judgment task, distributional initialization is better than traditional one-hot initialization.
2. On the language modeling task, initializing NNLMs with distributional representations brings no or only minor improvements.
3. Word2vec performance is stable with respect to random seeds.
References
[Bruni et al., 2012] Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In ACL, pages 136–145.
[Chen and Goodman, 1999] Stanley F. Chen and Joshua Goodman. 1999. An Empirical Study of Smoothing Techniques for Language Modeling. Computer Speech and Language, 13(4):359–394, October.
[Finkelstein et al., 2001] Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In WWW, pages 406–414.
[Hill et al., 2015] Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, December.
[Luong et al., 2013] Minh-Thang Luong, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL.
[Marcus et al., 1999] Mitchell Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, and Ann Taylor. 1999. Treebank-3 LDC99T42. Web Download. Philadelphia: Linguistic Data Consortium.
[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Workshop at ICLR.
[Miller and Charles, 1991] George A. Miller and Walter G. Charles. 1991. Contextual correlates of semantic similarity. Language & Cognitive Processes, 6(1):1–28.
[Mnih and Hinton, 2008] Andriy Mnih and Geoffrey E. Hinton. 2008. A Scalable Hierarchical Distributed Language Model. In Advances in Neural Information Processing Systems 21, pages 1081–1088. Curran Associates, Inc.
[Rubenstein and Goodenough, 1965] Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Commun. ACM, 8(10):627–633, October.
Word2vec parameters
Skip-gram model with hierarchical softmax: context window size 10 (10 words to the left and 10 to the right), min-count 1 (train on all tokens), embedding size 100, sampling rate $10^{-3}$; models are trained for one epoch.
Context window size 5 (5 words to the left and 5 to the right), embedding size 100, sampling rate $10^{-3}$, and initial learning rate alpha 0.025 for Skip-gram and 0.05 for CBOW.