Lecture 7: Tricks + Word Embeddings
aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf

TRANSCRIPT

Alan Ritter (many slides from Greg Durrett)

Recall: Feedforward NNs

P(y|x) = softmax(W g(V f(x)))

[Figure: f(x) is a vector of n features; V is a d x n matrix; z = g(V f(x)) gives d hidden units, where g is a nonlinearity (tanh, relu, …); W is a num_classes x d matrix; the softmax over W z yields num_classes probabilities P(y|x).]

Recall: Backpropagation

P(y|x) = softmax(W g(V f(x)))

[Figure: the same network annotated with the backward pass: the error at the root gives ∂L/∂W; err(z) flows back through the d hidden units z, and combining ∂z/∂V with f(x) gives the gradient for V.]

This Lecture

‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings

Training Tips

Training Basics

‣ Basic formula: compute gradients on batch, use first-order opt. method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

P(y|x) = softmax(W g(V f(x)))

[Figure: the feedforward network again: f(x) with n features, V a d x n matrix, z with d hidden units, nonlinearity g (tanh, relu, …), W an m x d matrix, softmax output P(y|x).]

‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative

IniJalizaJon
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
‣Wantvarianceofinputsandgradientsforeachlayertobethesame
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
‣Wantvarianceofinputsandgradientsforeachlayertobethesame
‣ BatchnormalizaJon(IoffeandSzegedy,2015):periodicallyshiG+rescaleeachlayertohavemean0andvariance1overabatch(usefulifnetisdeep)
2)IniJalizetoolargeandcellsaresaturated
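As a concrete sketch, here is how these choices might look in PyTorch; the layer sizes and the use of nn.Linear modules are our own illustration, not from the slides:

```python
import torch
import torch.nn as nn

d_in, d_hidden, n_classes = 100, 50, 2    # illustrative sizes

V = nn.Linear(d_in, d_hidden)
W = nn.Linear(d_hidden, n_classes)

# Xavier/Glorot uniform: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
nn.init.xavier_uniform_(V.weight)
nn.init.xavier_uniform_(W.weight)
nn.init.zeros_(V.bias)   # biases can start at zero; the random weights break symmetry
nn.init.zeros_(W.bias)

# Batch normalization over the hidden layer (Ioffe and Szegedy, 2015)
bn = nn.BatchNorm1d(d_hidden)
x = torch.randn(32, d_in)             # a batch of 32 feature vectors
z = torch.tanh(bn(V(x)))              # mean 0 / variance 1 before the nonlinearity
probs = torch.softmax(W(z), dim=-1)   # P(y|x) = softmax(W g(V f(x)))
```
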
Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Form of stochastic regularization
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ One line in Pytorch/Tensorflow

Srivastava et al. (2014)

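That one line, in a PyTorch sketch (the surrounding layer sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.Tanh(),
    nn.Dropout(p=0.5),   # the one line: randomly zero hidden units during training
    nn.Linear(50, 2),
)

x = torch.randn(8, 100)
model.train()            # dropout active: each forward pass drops a different subset
y_train = model(x)
model.eval()             # dropout off at test time; PyTorch rescales activations
y_test = model(x)        # during training, so the whole network is used here as-is
```
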
Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (in the paper's plots, Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress

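A minimal sketch of that recipe in PyTorch, assuming a toy model and synthetic data standing in for a real task; the "dev accuracy" here is just a placeholder metric:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.Tanh(), nn.Linear(50, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decrease the learning rate when the dev metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

x = torch.randn(256, 100)          # toy data standing in for a real dataset
y = torch.randint(0, 2, (256,))

for epoch in range(20):
    for i in range(0, 256, 32):    # minibatches
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x[i:i+32]), y[i:i+32])
        loss.backward()
        optimizer.step()
    with torch.no_grad():          # stand-in for checking the dev set periodically
        dev_acc = (model(x).argmax(-1) == y).float().mean().item()
    scheduler.step(dev_acc)        # decay LR when not making progress
```
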
Structured Prediction

‣ Four elements of a machine learning method:
‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar, just changes the last layer of the neural network
‣ Inference: define the network, your library of choice takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters

Word Representations

‣ Neural networks work very well at continuous data, but words are discrete
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)

slide credit: Dan Klein

Discrete Word Representations

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

[Figure: a binary tree over the vocabulary; following 0/1 branches from the root puts, e.g., good/enjoyable/great under one path, fish/cat/dog under another, and words like "is" and "go" elsewhere. Each word's cluster is its bit string of branch choices.]

‣ Maximize P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i)
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)

Word Embeddings

‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form input

[Figure: tagging "interest" in "… Fed raises interest rates in order to …": f(x) concatenates emb(raises), emb(interest), emb(rates), i.e., previous word, curr word, next word, plus other words, feats, etc.]

‣ What properties should these vectors have?

Botha et al. (2017)

Word Embeddings

[Figure: a 2D vector space in which good, great, and enjoyable cluster together, far from bad, dog, and is.]

‣ Want a vector space where similar words have similar embeddings
  "the movie was great" ~~ "the movie was good"
‣ Goal: come up with a way to produce these embeddings

word2vec/GloVe
Continuous Bag-of-Words

‣ Predict word from context: "the dog bit the man"

[Figure: the d-dimensional word embeddings of "the" and "dog" are summed (size d), multiplied by W (size |V| x d), and fed through a softmax.]

‣ gold label = bit, no manual labeling required!

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))

‣ Parameters: d x |V| (one d-length vector per voc word), |V| x d output parameters (W)

Mikolov et al. (2013)

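A minimal sketch of this formula as a model, with a toy four-word vocabulary and an illustrative embedding size; cross_entropy below folds the softmax and the log-loss together:

```python
import torch
import torch.nn as nn

vocab = ["the", "dog", "bit", "man"]
V, d = len(vocab), 8

c = nn.Embedding(V, d)            # one d-length vector per vocab word (d x |V|)
W = nn.Linear(d, V, bias=False)   # |V| x d output parameters

# Context for "bit" in "the dog bit the man": previous word "dog", next word "the"
w_prev, w_next, gold = vocab.index("dog"), vocab.index("the"), vocab.index("bit")

hidden = c(torch.tensor(w_prev)) + c(torch.tensor(w_next))   # c(w-1) + c(w+1), size d
logits = W(hidden)                                           # size |V|
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([gold]))
loss.backward()   # gold label = bit comes straight from the running text
```
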
Skip-Gram

‣ Predict one word of context from word: "the dog bit the man"

[Figure: the embedding of "bit" is multiplied by W and fed through a softmax; gold = dog.]

P(w' | w) = softmax(W e(w))

‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)

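The same toy setup, flipped to the skip-gram direction (again a sketch, not the word2vec implementation):

```python
import torch
import torch.nn as nn

vocab = ["the", "dog", "bit", "man"]
V, d = len(vocab), 8

e = nn.Embedding(V, d)            # d x |V| word vectors
W = nn.Linear(d, V, bias=False)   # |V| x d output parameters (also usable as vectors)

center, context = vocab.index("bit"), vocab.index("dog")   # training pair bit -> dog
logits = W(e(torch.tensor(center)))                        # P(w'|w) = softmax(W e(w))
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([context]))
loss.backward()   # another example from the same sentence would be bit -> the
```
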
Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))
P(w' | w) = softmax(W e(w))

‣ Standard softmax: [|V| x d] x d
‣ Hierarchical softmax: log(|V|) binary decisions; log(|V|) dot products of size d, |V| x d parameters

[Figure: a binary tree whose leaves are vocabulary words ("the", "a", …).]

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

Mikolov et al. (2013)

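To make the cost comparison concrete, here is a sketch of scoring one word as a product of sigmoid branch decisions along its code path; the tiny two-node tree and the code assignment are invented for illustration, not word2vec's actual data structures:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(0)
h = rng.normal(size=d)                       # hidden vector from the input context

# One parameter vector per internal tree node; a word's Huffman code determines
# which nodes it visits and which branch (0 = left, 1 = right) it takes at each.
node_vecs = {"n0": rng.normal(size=d), "n1": rng.normal(size=d)}
code = [("n0", 0), ("n1", 1)]                # hypothetical path for one word

p = 1.0
for node, branch in code:
    p_right = sigmoid(node_vecs[node] @ h)   # binary classifier at this node
    p *= p_right if branch == 1 else (1.0 - p_right)

print(p)   # P(word | context): log(|V|) dot products instead of a [|V| x d] x d matmul
```
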
Skip-Gram with Negative Sampling

‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution

(bit, the) => +1    (bit, cat) => -1
(bit, a)  => -1     (bit, fish) => -1

P(y = 1 | w, c) = e^{w·c} / (e^{w·c} + 1)

‣ words in similar contexts select for similar c vectors
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1}^{n} log P(y = 0 | w_i, c), with the w_i sampled negatives

Mikolov et al. (2013)

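A numpy sketch of this objective for one (word, context) pair; the vectors and unigram distribution are toy placeholders, and the negative term is averaged over the k sampled words:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 4
word_vecs = rng.normal(size=(V, d))       # d x |V| word vectors
ctx_vecs = rng.normal(size=(V, d))        # d x |V| context vectors
unigram = np.array([0.4, 0.3, 0.2, 0.1])  # unigram distribution over the vocab

def p_real(w, c):
    """P(y = 1 | w, c) = e^{w.c} / (e^{w.c} + 1); w may be an array of word ids."""
    s = np.exp(word_vecs[w] @ ctx_vecs[c])
    return s / (s + 1.0)

w, c, k = 2, 0, 5                         # e.g. (bit, the), with k negative samples
negatives = rng.choice(V, size=k, p=unigram)

# log P(y=1 | w, c) for the real pair, plus the averaged log P(y=0 | w_i, c)
# for the sampled negatives; training ascends this objective.
objective = np.log(p_real(w, c)) + np.log(1.0 - p_real(negatives, c)).mean()
```
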
Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: a |V| x |V| matrix of word pair counts, approximated by a |V| x d matrix of word vecs times a d x |V| matrix of context vecs.]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Levy et al. (2014)

Skip-Gram as Matrix Factorization

[Figure: the |V| x |V| matrix M being factored, with entries:]

M_ij = PMI(w_i, c_j) − log k,   k = num negative samples

PMI(w_i, c_j) = log [ P(w_i, c_j) / (P(w_i) P(c_j)) ] = log [ (count(w_i, c_j)/D) / ((count(w_i)/D) (count(c_j)/D)) ]

‣ Skip-gram objective exactly corresponds to factoring this matrix…
‣ …if we sample negative examples from the uniform distribution over words
‣ …and it's a weighted factorization problem (weighted by word freq)

Levy et al. (2014)

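A sketch of building this shifted-PMI matrix from raw co-occurrence counts; the 2-word count table is a toy example:

```python
import numpy as np

k = 5                                      # number of negative samples
counts = np.array([[10., 2.],              # count(w_i, c_j) for a 2-word vocab
                   [3., 7.]])
D = counts.sum()                           # total number of pairs

p_w = counts.sum(axis=1) / D               # P(w_i)
p_c = counts.sum(axis=0) / D               # P(c_j)
pmi = np.log((counts / D) / np.outer(p_w, p_c))

M = pmi - np.log(k)                        # the matrix SGNS implicitly factors
# as M ~ word_vecs @ context_vecs.T
```
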
GloVe (Global Vectors)

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

[Figure: the |V| x |V| matrix of word pair counts.]

‣ Loss = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j − log count(w_i, c_j))^2

‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (5000+ citations)

Pennington et al. (2014)

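Writing the loss out directly as a numpy sketch; the dimensions and random initial vectors are illustrative, and f is the capped weighting function from the GloVe paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 4, 8
counts = rng.integers(1, 20, size=(V, V)).astype(float)  # co-occurrence counts
w = rng.normal(size=(V, d))   # word vectors w_i
c = rng.normal(size=(V, d))   # context vectors c_j
a = np.zeros(V)               # word biases a_i
b = np.zeros(V)               # context biases b_j

def f(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting: down-weight rare pairs, cap frequent ones at 1
    return np.minimum((x / x_max) ** alpha, 1.0)

loss = 0.0
for i in range(V):
    for j in range(V):
        err = w[i] @ c[j] + a[i] + b[j] - np.log(counts[i, j])
        loss += f(counts[i, j]) * err ** 2   # weighted squared regression error
```
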
Preview: Context-dependent Embeddings

‣ How to handle different word senses? One vector for balls:
  they hit the balls / they dance at balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over GloVe

Peters et al. (2018)

Evaluation

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: a vector space where good/great/enjoyable cluster, bad sits apart, dog/cat/wolf/tiger cluster, and is/was cluster.]

‣ Similarity: similar words are close to each other
‣ Analogy: good is to best as smart is to ???
  Paris is to France as Tokyo is to ???

Similarity

‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Levy et al. (2015)

Hypernymy Detection

‣ Hypernyms: detective is a person, dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here

Chang et al. (2017)

Analogies

[Figure: king, queen, man, and woman in vector space, with parallel offsets between the two pairs.]

(king − man) + woman = queen
king + (woman − man) = queen

‣ Why would this be?
‣ woman − man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen

Analogies

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for b_2:
Add = cos(b_2, a_2 − a_1 + b_1)
Mul = cos(b_2, a_2) cos(b_2, b_1) / (cos(b_2, a_1) + ε)

Levy et al. (2015)

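A sketch of both scoring rules over an assumed embedding matrix E (rows aligned with a vocab list); real implementations normalize the vectors and shift cosines to be nonnegative for the Mul rule, which this toy version skips:

```python
import numpy as np

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(E, vocab, a1, a2, b1, method="add", eps=1e-3):
    """Answer 'a1 is to a2 as b1 is to ???' by maximizing over candidates b2."""
    ia1, ia2, ib1 = vocab.index(a1), vocab.index(a2), vocab.index(b1)
    best, best_score = None, -np.inf
    for ib2, b2 in enumerate(vocab):
        if ib2 in (ia1, ia2, ib1):
            continue                     # exclude the query words themselves
        if method == "add":              # Add = cos(b2, a2 - a1 + b1)
            score = cos(E[ib2], E[ia2] - E[ia1] + E[ib1])
        else:                            # Mul = cos(b2,a2) cos(b2,b1) / (cos(b2,a1) + eps)
            score = (cos(E[ib2], E[ia2]) * cos(E[ib2], E[ib1])
                     / (cos(E[ib2], E[ia1]) + eps))
        if score > best_score:
            best, best_score = b2, score
    return best

# e.g. analogy(E, vocab, "man", "king", "woman") should return "queen", ideally
```
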
Using Semantic Knowledge

‣ Structure derived from a resource like WordNet

[Figure: the original vector for "false" and an adapted vector for "false" after incorporating the WordNet structure.]

‣ Doesn't help most problems

Faruqui et al. (2015)

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data. Often works pretty well
‣ Approach 2: initialize using GloVe/ELMo, keep fixed. Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune. Works best for some tasks, but not used for ELMo

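A PyTorch sketch of the three approaches; glove_matrix below is a random stand-in for real pretrained rows aligned with your vocabulary:

```python
import numpy as np
import torch
import torch.nn as nn

V, d = 10000, 300
glove_matrix = np.random.randn(V, d).astype("float32")  # stand-in for real GloVe vectors

# Approach 1: embeddings learned from scratch as parameters of your model
emb_scratch = nn.Embedding(V, d)

# Approach 2: initialize from pretrained vectors, keep fixed (no gradient updates)
emb_fixed = nn.Embedding.from_pretrained(torch.from_numpy(glove_matrix), freeze=True)

# Approach 3: initialize from pretrained vectors, fine-tune during training
emb_finetune = nn.Embedding.from_pretrained(torch.from_numpy(glove_matrix), freeze=False)
```
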
Compositional Semantics

‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing?
‣ Will return to this in a few weeks as we move on to syntax and semantics

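The summing idea from the slide, in two lines (the token ids are hypothetical; real systems would look them up in a vocabulary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 300)              # any word embedding table
token_ids = torch.tensor([[3, 17, 42, 7]])  # "the movie was great" as hypothetical ids
sent_vec = emb(token_ids).mean(dim=1)       # (1, 300): average of the word vectors
```
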
Takeaways

‣ Lots to tune with neural networks
‣ Training: optimizer, initializer, regularization (dropout), …
‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo)
‣ Next time: RNNs and CNNs