Lecture 7: Tricks + Word Embeddings
aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf

TRANSCRIPT

Alan Ritter (many slides from Greg Durrett)

Recall: Feedforward NNs

P(y|x) = softmax(W g(V f(x)))

[Figure: f(x) is a vector of n features; V is a d x n matrix; z = g(V f(x)) gives d hidden units, where g is a nonlinearity (tanh, relu, …); W is a num_classes x d matrix; the softmax over W z yields num_classes probabilities P(y|x).]

Recall: Backpropagation

P(y|x) = softmax(W g(V f(x)))

[Figure: the same network annotated with the backward pass: the error at the root gives ∂L/∂W; err(z) flows back through the d hidden units z, and combining ∂z/∂V with f(x) gives the gradient for V.]

This Lecture

‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings

Training Tips

Training Basics

‣ Basic formula: compute gradients on batch, use first-order opt. method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further

How does initialization affect learning?

P(y|x) = softmax(W g(V f(x)))

[Figure: the feedforward network again: f(x) with n features, V a d x n matrix, z with d hidden units, nonlinearity g (tanh, relu, …), W an m x d matrix, softmax output P(y|x).]

‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative

IniJalizaJon
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
‣Wantvarianceofinputsandgradientsforeachlayertobethesame
2)IniJalizetoolargeandcellsaresaturated
IniJalizaJon1)Can’tusezeroesforparameterstoproducehiddenlayers:allvaluesinthathiddenlayerarealways0andhavegradientsof0,neverchange
‣ Candorandomuniform/normaliniJalizaJonwithappropriatescale
U
"�r
6
fan-in + fan-out,+
r6
fan-in + fan-out
#‣ XavieriniJalizer:
‣Wantvarianceofinputsandgradientsforeachlayertobethesame
‣ BatchnormalizaJon(IoffeandSzegedy,2015):periodicallyshiG+rescaleeachlayertohavemean0andvariance1overabatch(usefulifnetisdeep)
2)IniJalizetoolargeandcellsaresaturated
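As a concrete sketch, here is how these choices might look in PyTorch; the layer sizes and the use of nn.Linear modules are our own illustration, not from the slides:

```python
import torch
import torch.nn as nn

d_in, d_hidden, n_classes = 100, 50, 2    # illustrative sizes

V = nn.Linear(d_in, d_hidden)
W = nn.Linear(d_hidden, n_classes)

# Xavier/Glorot uniform: U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
nn.init.xavier_uniform_(V.weight)
nn.init.xavier_uniform_(W.weight)
nn.init.zeros_(V.bias)   # biases can start at zero; the random weights break symmetry
nn.init.zeros_(W.bias)

# Batch normalization over the hidden layer (Ioffe and Szegedy, 2015)
bn = nn.BatchNorm1d(d_hidden)
x = torch.randn(32, d_in)             # a batch of 32 feature vectors
z = torch.tanh(bn(V(x)))              # mean 0 / variance 1 before the nonlinearity
probs = torch.softmax(W(z), dim=-1)   # P(y|x) = softmax(W g(V f(x)))
```
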
Dropout

‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Form of stochastic regularization
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ One line in Pytorch/Tensorflow

Srivastava et al. (2014)

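That one line, in a PyTorch sketch (the surrounding layer sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.Tanh(),
    nn.Dropout(p=0.5),   # the one line: randomly zero hidden units during training
    nn.Linear(50, 2),
)

x = torch.randn(8, 100)
model.train()            # dropout active: each forward pass drops a different subset
y_train = model(x)
model.eval()             # dropout off at test time; PyTorch rescales activations
y_test = model(x)        # during training, so the whole network is used here as-is
```
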
Optimizer

‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (in the paper's plots, Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress

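A minimal sketch of that recipe in PyTorch, assuming a toy model and synthetic data standing in for a real task; the "dev accuracy" here is just a placeholder metric:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 50), nn.Tanh(), nn.Linear(50, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decrease the learning rate when the dev metric stops improving
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

x = torch.randn(256, 100)          # toy data standing in for a real dataset
y = torch.randint(0, 2, (256,))

for epoch in range(20):
    for i in range(0, 256, 32):    # minibatches
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x[i:i+32]), y[i:i+32])
        loss.backward()
        optimizer.step()
    with torch.no_grad():          # stand-in for checking the dev set periodically
        dev_acc = (model(x).argmax(-1) == y).float().mean().item()
    scheduler.step(dev_acc)        # decay LR when not making progress
```
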
Structured Prediction

‣ Four elements of a machine learning method:
‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar, just changes the last layer of the neural network
‣ Inference: define the network, your library of choice takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters

Word Representations

‣ Neural networks work very well at continuous data, but words are discrete
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)

slide credit: Dan Klein

Discrete Word Representations

‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)

[Figure: a binary tree over the vocabulary; following 0/1 branches from the root puts, e.g., good/enjoyable/great under one path, fish/cat/dog under another, and words like "is" and "go" elsewhere. Each word's cluster is its bit string of branch choices.]

‣ Maximize P(w_i | w_{i-1}) = P(c_i | c_{i-1}) P(w_i | c_i)
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)

Word Embeddings

‣ Part-of-speech tagging with FFNNs
‣ Word embeddings for each word form input

[Figure: tagging "interest" in "… Fed raises interest rates in order to …": f(x) concatenates emb(raises), emb(interest), emb(rates), i.e., previous word, curr word, next word, plus other words, feats, etc.]

‣ What properties should these vectors have?

Botha et al. (2017)

Word Embeddings

[Figure: a 2D vector space in which good, great, and enjoyable cluster together, far from bad, dog, and is.]

‣ Want a vector space where similar words have similar embeddings
  "the movie was great" ~~ "the movie was good"
‣ Goal: come up with a way to produce these embeddings

word2vec/GloVe
Continuous Bag-of-Words

‣ Predict word from context: "the dog bit the man"

[Figure: the d-dimensional word embeddings of "the" and "dog" are summed (size d), multiplied by W (size |V| x d), and fed through a softmax.]

‣ gold label = bit, no manual labeling required!

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))

‣ Parameters: d x |V| (one d-length vector per voc word), |V| x d output parameters (W)

Mikolov et al. (2013)

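A minimal sketch of this formula as a model, with a toy four-word vocabulary and an illustrative embedding size; cross_entropy below folds the softmax and the log-loss together:

```python
import torch
import torch.nn as nn

vocab = ["the", "dog", "bit", "man"]
V, d = len(vocab), 8

c = nn.Embedding(V, d)            # one d-length vector per vocab word (d x |V|)
W = nn.Linear(d, V, bias=False)   # |V| x d output parameters

# Context for "bit" in "the dog bit the man": previous word "dog", next word "the"
w_prev, w_next, gold = vocab.index("dog"), vocab.index("the"), vocab.index("bit")

hidden = c(torch.tensor(w_prev)) + c(torch.tensor(w_next))   # c(w-1) + c(w+1), size d
logits = W(hidden)                                           # size |V|
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([gold]))
loss.backward()   # gold label = bit comes straight from the running text
```
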
Skip-Gram

‣ Predict one word of context from word: "the dog bit the man"

[Figure: the embedding of "bit" is multiplied by W and fed through a softmax; gold = dog.]

P(w' | w) = softmax(W e(w))

‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)

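The same toy setup, flipped to the skip-gram direction (again a sketch, not the word2vec implementation):

```python
import torch
import torch.nn as nn

vocab = ["the", "dog", "bit", "man"]
V, d = len(vocab), 8

e = nn.Embedding(V, d)            # d x |V| word vectors
W = nn.Linear(d, V, bias=False)   # |V| x d output parameters (also usable as vectors)

center, context = vocab.index("bit"), vocab.index("dog")   # training pair bit -> dog
logits = W(e(torch.tensor(center)))                        # P(w'|w) = softmax(W e(w))
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([context]))
loss.backward()   # another example from the same sentence would be bit -> the
```
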
Hierarchical Softmax

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

P(w | w_{-1}, w_{+1}) = softmax(W (c(w_{-1}) + c(w_{+1})))
P(w' | w) = softmax(W e(w))

‣ Standard softmax: [|V| x d] x d
‣ Hierarchical softmax: log(|V|) binary decisions; log(|V|) dot products of size d, |V| x d parameters

[Figure: a binary tree whose leaves are vocabulary words ("the", "a", …).]

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

Mikolov et al. (2013)

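To make the cost comparison concrete, here is a sketch of scoring one word as a product of sigmoid branch decisions along its code path; the tiny two-node tree and the code assignment are invented for illustration, not word2vec's actual data structures:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 8
rng = np.random.default_rng(0)
h = rng.normal(size=d)                       # hidden vector from the input context

# One parameter vector per internal tree node; a word's Huffman code determines
# which nodes it visits and which branch (0 = left, 1 = right) it takes at each.
node_vecs = {"n0": rng.normal(size=d), "n1": rng.normal(size=d)}
code = [("n0", 0), ("n1", 1)]                # hypothetical path for one word

p = 1.0
for node, branch in code:
    p_right = sigmoid(node_vecs[node] @ h)   # binary classifier at this node
    p *= p_right if branch == 1 else (1.0 - p_right)

print(p)   # P(word | context): log(|V|) dot products instead of a [|V| x d] x d matmul
```
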
Skip-Gram with Negative Sampling

‣ Take (word, context) pairs and classify them as "real" or not. Create random negative examples by sampling from unigram distribution

(bit, the) => +1    (bit, cat) => -1
(bit, a)  => -1     (bit, fish) => -1

P(y = 1 | w, c) = e^{w·c} / (e^{w·c} + 1)

‣ words in similar contexts select for similar c vectors
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
‣ Objective = log P(y = 1 | w, c) + (1/k) Σ_{i=1}^{n} log P(y = 0 | w_i, c), with the w_i sampled negatives

Mikolov et al. (2013)

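A numpy sketch of this objective for one (word, context) pair; the vectors and unigram distribution are toy placeholders, and the negative term is averaged over the k sampled words:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 4
word_vecs = rng.normal(size=(V, d))       # d x |V| word vectors
ctx_vecs = rng.normal(size=(V, d))        # d x |V| context vectors
unigram = np.array([0.4, 0.3, 0.2, 0.1])  # unigram distribution over the vocab

def p_real(w, c):
    """P(y = 1 | w, c) = e^{w.c} / (e^{w.c} + 1); w may be an array of word ids."""
    s = np.exp(word_vecs[w] @ ctx_vecs[c])
    return s / (s + 1.0)

w, c, k = 2, 0, 5                         # e.g. (bit, the), with k negative samples
negatives = rng.choice(V, size=k, p=unigram)

# log P(y=1 | w, c) for the real pair, plus the averaged log P(y=0 | w_i, c)
# for the sampled negatives; training ascends this objective.
objective = np.log(p_real(w, c)) + np.log(1.0 - p_real(negatives, c)).mean()
```
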
Connections with Matrix Factorization

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: a |V| x |V| matrix of word pair counts, approximated by a |V| x d matrix of word vecs times a d x |V| matrix of context vecs.]

‣ Looks almost like a matrix factorization… can we interpret it this way?

Levy et al. (2014)

Skip-Gram as Matrix Factorization

[Figure: the |V| x |V| matrix M being factored, with entries:]

M_ij = PMI(w_i, c_j) − log k,   k = num negative samples

PMI(w_i, c_j) = log [ P(w_i, c_j) / (P(w_i) P(c_j)) ] = log [ (count(w_i, c_j)/D) / ((count(w_i)/D) (count(c_j)/D)) ]

‣ Skip-gram objective exactly corresponds to factoring this matrix…
‣ …if we sample negative examples from the uniform distribution over words
‣ …and it's a weighted factorization problem (weighted by word freq)

Levy et al. (2014)

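A sketch of building this shifted-PMI matrix from raw co-occurrence counts; the 2-word count table is a toy example:

```python
import numpy as np

k = 5                                      # number of negative samples
counts = np.array([[10., 2.],              # count(w_i, c_j) for a 2-word vocab
                   [3., 7.]])
D = counts.sum()                           # total number of pairs

p_w = counts.sum(axis=1) / D               # P(w_i)
p_c = counts.sum(axis=0) / D               # P(c_j)
pmi = np.log((counts / D) / np.outer(p_w, p_c))

M = pmi - np.log(k)                        # the matrix SGNS implicitly factors
# as M ~ word_vecs @ context_vecs.T
```
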
GloVe (Global Vectors)

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

[Figure: the |V| x |V| matrix of word pair counts.]

‣ Loss = Σ_{i,j} f(count(w_i, c_j)) (w_i^T c_j + a_i + b_j − log count(w_i, c_j))^2

‣ Constant in the dataset size (just need counts), quadratic in voc size
‣ By far the most common word vectors used today (5000+ citations)

Pennington et al. (2014)

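Writing the loss out directly as a numpy sketch; the dimensions and random initial vectors are illustrative, and f is the capped weighting function from the GloVe paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 4, 8
counts = rng.integers(1, 20, size=(V, V)).astype(float)  # co-occurrence counts
w = rng.normal(size=(V, d))   # word vectors w_i
c = rng.normal(size=(V, d))   # context vectors c_j
a = np.zeros(V)               # word biases a_i
b = np.zeros(V)               # context biases b_j

def f(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting: down-weight rare pairs, cap frequent ones at 1
    return np.minimum((x / x_max) ** alpha, 1.0)

loss = 0.0
for i in range(V):
    for j in range(V):
        err = w[i] @ c[j] + a[i] + b[j] - np.log(counts[i, j])
        loss += f(counts[i, j]) * err ** 2   # weighted squared regression error
```
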
Preview: Context-dependent Embeddings

‣ How to handle different word senses? One vector for balls:
  they hit the balls / they dance at balls
‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors
‣ Context-sensitive word embeddings: depend on rest of the sentence
‣ Huge improvements across nearly all NLP tasks over GloVe

Peters et al. (2018)

Evaluation

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: a vector space where good/great/enjoyable cluster, bad sits apart, dog/cat/wolf/tiger cluster, and is/was cluster.]

‣ Similarity: similar words are close to each other
‣ Analogy: good is to best as smart is to ???
  Paris is to France as Tokyo is to ???

Similarity

‣ SVD = singular value decomposition on PMI matrix
‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Levy et al. (2015)

Hypernymy Detection

‣ Hypernyms: detective is a person, dog is an animal
‣ Do word vectors encode these relationships?
‣ word2vec (SGNS) works barely better than random guessing here

Chang et al. (2017)

Analogies

[Figure: king, queen, man, and woman in vector space, with parallel offsets between the two pairs.]

(king − man) + woman = queen
king + (woman − man) = queen

‣ Why would this be?
‣ woman − man captures the difference in the contexts that these occur in
‣ Dominant change: more "he" with man and "she" with woman, similar to the difference between king and queen

Analogies

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for b_2:
Add = cos(b_2, a_2 − a_1 + b_1)
Mul = cos(b_2, a_2) cos(b_2, b_1) / (cos(b_2, a_1) + ε)

Levy et al. (2015)

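A sketch of both scoring rules over an assumed embedding matrix E (rows aligned with a vocab list); real implementations normalize the vectors and shift cosines to be nonnegative for the Mul rule, which this toy version skips:

```python
import numpy as np

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def analogy(E, vocab, a1, a2, b1, method="add", eps=1e-3):
    """Answer 'a1 is to a2 as b1 is to ???' by maximizing over candidates b2."""
    ia1, ia2, ib1 = vocab.index(a1), vocab.index(a2), vocab.index(b1)
    best, best_score = None, -np.inf
    for ib2, b2 in enumerate(vocab):
        if ib2 in (ia1, ia2, ib1):
            continue                     # exclude the query words themselves
        if method == "add":              # Add = cos(b2, a2 - a1 + b1)
            score = cos(E[ib2], E[ia2] - E[ia1] + E[ib1])
        else:                            # Mul = cos(b2,a2) cos(b2,b1) / (cos(b2,a1) + eps)
            score = (cos(E[ib2], E[ia2]) * cos(E[ib2], E[ib1])
                     / (cos(E[ib2], E[ia1]) + eps))
        if score > best_score:
            best, best_score = b2, score
    return best

# e.g. analogy(E, vocab, "man", "king", "woman") should return "queen", ideally
```
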
Using Semantic Knowledge

‣ Structure derived from a resource like WordNet

[Figure: the original vector for "false" and an adapted vector for "false" after incorporating the WordNet structure.]

‣ Doesn't help most problems

Faruqui et al. (2015)

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data. Often works pretty well
‣ Approach 2: initialize using GloVe/ELMo, keep fixed. Faster because no need to update these parameters
‣ Approach 3: initialize using GloVe, fine-tune. Works best for some tasks, but not used for ELMo

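A PyTorch sketch of the three approaches; glove_matrix below is a random stand-in for real pretrained rows aligned with your vocabulary:

```python
import numpy as np
import torch
import torch.nn as nn

V, d = 10000, 300
glove_matrix = np.random.randn(V, d).astype("float32")  # stand-in for real GloVe vectors

# Approach 1: embeddings learned from scratch as parameters of your model
emb_scratch = nn.Embedding(V, d)

# Approach 2: initialize from pretrained vectors, keep fixed (no gradient updates)
emb_fixed = nn.Embedding.from_pretrained(torch.from_numpy(glove_matrix), freeze=True)

# Approach 3: initialize from pretrained vectors, fine-tune during training
emb_finetune = nn.Embedding.from_pretrained(torch.from_numpy(glove_matrix), freeze=False)
```
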
Compositional Semantics

‣ What if we want embedding representations for whole sentences?
‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)
‣ Is there a way we can compose vectors to make sentence representations? Summing?
‣ Will return to this in a few weeks as we move on to syntax and semantics

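The summing idea from the slide, in two lines (the token ids are hypothetical; real systems would look them up in a vocabulary):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10000, 300)              # any word embedding table
token_ids = torch.tensor([[3, 17, 42, 7]])  # "the movie was great" as hypothetical ids
sent_vec = emb(token_ids).mean(dim=1)       # (1, 300): average of the word vectors
```
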
Takeaways

‣ Lots to tune with neural networks
‣ Training: optimizer, initializer, regularization (dropout), …
‣ Hyperparameters: dimensionality of word embeddings, layers, …
‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)
‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties
‣ Even better: context-sensitive word embeddings (ELMo)
‣ Next time: RNNs and CNNs