Lecture 7: Tricks + Word Embeddings · aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf ·...

Lecture 7: Tricks + Word Embeddings. Alan Ritter (many slides from Greg Durrett)


Post on 28-Jul-2020


Lecture 7: Tricks + Word Embeddings
aritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11

Alan Ritter (many slides from Greg Durrett)

Recall: Feedforward NNs

$P(y \mid x) = \mathrm{softmax}(W g(V f(x)))$

‣ f(x): n features
‣ V: d x n matrix
‣ g: nonlinearity (tanh, relu, …)
‣ z: d hidden units
‣ W: num_classes x d matrix
‣ softmax output P(y|x): num_classes probs
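As a rough illustration of the formula above, the following plain-Python sketch computes P(y|x) = softmax(W g(V f(x))) for a single input. The matrix values, the toy sizes (n = 3, d = 2, two classes) and the helper names `matvec` and `feedforward` are assumptions made for this example, not part of the slides.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def feedforward(V, W, fx, g=math.tanh):
    """Return P(y|x) = softmax(W g(V f(x))) for one feature vector fx."""
    hidden = [g(a) for a in matvec(V, fx)]   # d hidden units
    return softmax(matvec(W, hidden))        # num_classes probabilities

# Toy sizes (assumed): n = 3 features, d = 2 hidden units, 2 classes.
V = [[0.1, -0.2, 0.3],
     [0.0, 0.5, -0.1]]          # d x n
W = [[1.0, -1.0],
     [-0.5, 0.5]]               # num_classes x d
probs = feedforward(V, W, [1.0, 0.0, 2.0])
```

The output `probs` has one entry per class and sums to 1, matching the "num_classes probs" label on the diagram.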

Recall: Backpropagation

$P(y \mid x) = \mathrm{softmax}(W g(V f(x)))$

‣ Same network: f(x), V, g, z (d hidden units), W, softmax, P(y|x)
‣ $\partial L / \partial W$: err(root), z
‣ $\partial z / \partial V$: err(z), f(x)

This Lecture

‣ Training
‣ Word representations
‣ word2vec/GloVe
‣ Evaluating word embeddings

Training Tips

Training Basics
‣ Basic formula: compute gradients on batch, use first-order optimization method
‣ How to initialize? How to regularize? What optimizer to use?
‣ This lecture: some practical tricks. Take deep learning or optimization courses to understand this further
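The "basic formula" above (gradient on a batch, then a first-order update) can be sketched with plain minibatch SGD. This is only a minimal illustration on a one-dimensional least-squares problem. The toy data, learning rate, batch size and function names are all assumptions chosen for the example.

```python
import random

def batch_gradient(w, b, batch):
    """Average gradient of the squared error 0.5*(w*x + b - y)^2 over a batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = w * x + b - y
        gw += err * x
        gb += err
    return gw / len(batch), gb / len(batch)

def sgd(data, lr=0.1, epochs=200, batch_size=4, seed=0):
    """Minibatch SGD: shuffle, compute the batch gradient, take a step."""
    rng = random.Random(seed)
    data = list(data)
    w = b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            gw, gb = batch_gradient(w, b, data[i:i + batch_size])
            w -= lr * gw   # first-order update: only the gradient is used
            b -= lr * gb
    return w, b

# Toy data generated from y = 2x + 1 (an assumed example, no noise).
data = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
w, b = sgd(data)
```

On this noiseless toy problem the estimates approach the generating values w = 2 and b = 1. Real networks add the questions listed on the slide: initialization, regularization and the choice of optimizer.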

How does initialization affect learning?

$P(y \mid x) = \mathrm{softmax}(W g(V f(x)))$

‣ f(x): n features; V: d x n matrix; g: nonlinearity (tanh, relu, …); z: d hidden units; W: m x d matrix
‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
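To make the saturation point concrete, this small sketch evaluates the derivative of tanh and of ReLU at a few pre-activation values. The chosen values and the function names are assumptions for illustration. Taking the ReLU derivative to be 0 at exactly 0 is a convention, not something the slides specify.

```python
import math

def tanh_grad(a):
    """Derivative of tanh at pre-activation a: 1 - tanh(a)^2."""
    return 1.0 - math.tanh(a) ** 2

def relu_grad(a):
    """Derivative of ReLU at pre-activation a (taken as 0 for a <= 0)."""
    return 1.0 if a > 0 else 0.0

for a in (0.0, 2.0, 10.0, -10.0):
    print(f"a={a:6.1f}  tanh'={tanh_grad(a):.2e}  relu'={relu_grad(a):.0f}")
```

The tanh derivative is 1 at zero and shrinks toward 0 as |a| grows, which is the "gradients are small" case. The ReLU derivative stays 1 for any positive input but is 0 for negative ones, which is how a unit can stop learning if everything is too negative.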

Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated
‣ Can do random uniform / normal initialization with appropriate scale
‣ Xavier initializer: $U\left[-\sqrt{\dfrac{6}{\text{fan-in} + \text{fan-out}}},\ +\sqrt{\dfrac{6}{\text{fan-in} + \text{fan-out}}}\right]$
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift + rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
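The Xavier range above can be written out directly. The sketch below samples the d x n matrix V from U[-limit, +limit] with limit = sqrt(6 / (fan-in + fan-out)). The sizes n = 100 and d = 50, the fixed seed and the function name `xavier_uniform` are assumptions for this example. Deep learning libraries ship their own initializers, which should be preferred in practice.

```python
import math
import random

def xavier_uniform(fan_out, fan_in, rng):
    """Sample a fan_out x fan_in matrix from U[-limit, +limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)          # fixed seed, only for reproducibility
n, d = 100, 50                  # assumed sizes: n features, d hidden units
V = xavier_uniform(d, n, rng)   # V is d x n, so fan-in = n and fan-out = d
limit = math.sqrt(6.0 / (n + d))
```

With these sizes the limit is sqrt(6 / 150) = 0.2, so larger layers get proportionally smaller initial weights, which is what keeps the cells away from saturation.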

Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use whole network at test time
‣ Form of stochastic regularization
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ One line in Pytorch/Tensorflow

Srivastava et al. (2014)
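In a framework this is a single layer, but it may help to see what such a layer does. The following is a minimal plain-Python sketch of the common "inverted" variant, which rescales the surviving units during training so that nothing needs to change at test time. That scaling choice, the drop probability 0.5 and the function name are assumptions for illustration; the slides do not say which variant is meant.

```python
import random

def dropout(values, p, training, rng):
    """Inverted dropout: zero each unit with probability p while training.

    Survivors are scaled by 1/(1-p) so the expected activation is unchanged,
    which lets the whole network be used as-is at test time.
    """
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
hidden = [0.5, -1.0, 2.0, 0.25]
train_out = dropout(hidden, p=0.5, training=True, rng=rng)
test_out = dropout(hidden, p=0.5, training=False, rng=rng)
```

At training time each entry of `train_out` is either 0 or the original value divided by 0.5. At test time `test_out` equals `hidden` unchanged.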

Optimizer
‣ Adam (Kingma and Ba, ICLR 2015) is very widely used
‣ Adaptive step size like Adagrad, incorporates momentum
‣ Wilson et al. NIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)
‣ Check dev set periodically, decrease learning rate if not making progress
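The two ingredients named above, a momentum term and an adaptive per-parameter step size, show up as the two running averages in the Adam update. The sketch below applies the standard update rule to a one-dimensional toy objective. The objective, learning rate and step count are assumptions picked for the example.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Minimize a 1-D function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of g
        v = beta2 * v + (1 - beta2) * g * g    # running mean of g^2
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive step size
    return x

# Toy objective (assumed): f(x) = (x - 3)^2 with gradient 2(x - 3).
x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Dividing by the square root of the running mean of g^2 is what makes the step size adaptive, in the spirit of Adagrad. The running mean of g plays the role of momentum.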

Structured Prediction
‣ Four elements of a machine learning method:
‣ Model: feedforward, RNNs, CNNs can be defined in a uniform framework
‣ Objective: many loss functions look similar, just changes the last layer of the neural network
‣ Inference: define the network, your library of choice takes care of it (mostly…)
‣ Training: lots of choices for optimization/hyperparameters

Word Representations
‣ Neural networks work very well at continuous data, but words are discrete
‣ Continuous model <-> expects continuous semantics from input
‣ "You shall know a word by the company it keeps" Firth (1957)

slide credit: Dan Klein

Discrete Word Representations
‣ Brown clusters: hierarchical agglomerative hard clustering (each word has one cluster, not some posterior distribution like in mixture models)
[Figure: binary cluster tree with branches labeled 0/1; leaves include "good", "enjoyable", "great", "fish", "cat", "dog", …, "is", "go"]
‣ Maximize $P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$
‣ Useful features for tasks like NER, not suitable for NNs

Brown et al. (1992)
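The quantity being maximized factors a word bigram into a cluster transition and a word emission. As a rough sketch, the code below evaluates that factorization from raw counts for a fixed hard clustering. It does not perform the agglomerative clustering itself. The cluster names, the toy corpus and the function name are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def class_bigram_prob(prev_word, word, tokens, cluster):
    """P(w_i | w_{i-1}) = P(c_i | c_{i-1}) * P(w_i | c_i) from raw counts."""
    classes = [cluster[t] for t in tokens]
    class_pairs = Counter(zip(classes, classes[1:]))   # count(c_{i-1}, c_i)
    prev_counts = Counter(classes[:-1])                # count(c_{i-1}) as left context
    class_counts = Counter(classes)                    # count(c_i)
    word_counts = Counter(tokens)                      # count(w_i)

    c_prev, c_cur = cluster[prev_word], cluster[word]
    p_class = Fraction(class_pairs[(c_prev, c_cur)], prev_counts[c_prev])
    p_word = Fraction(word_counts[word], class_counts[c_cur])
    return p_class * p_word

# Hypothetical hard clustering: every word belongs to exactly one cluster.
cluster = {"the": "DET", "a": "DET", "dog": "NOUN", "cat": "NOUN",
           "fish": "NOUN", "is": "VERB", "bit": "VERB"}
tokens = "the dog bit the cat a fish is a fish".split()
p = class_bigram_prob("the", "dog", tokens, cluster)
```

Because the probability of "dog" depends on "the" only through the two cluster labels, words placed in the same cluster share statistics, which is what makes the cluster IDs useful as features.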

Word Embeddings
‣ Part-of-speech tagging with FFNNs

Fed raises interest rates in order to…

f(x) = ??
‣ Word embeddings for each word form input:
  emb(raises): previous word
  emb(interest): curr word
  emb(rates): next word
  other words, feats, etc.
‣ What properties should these vectors have?

Botha et al. (2017)
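One way to read the diagram is that f(x) is the concatenation of the embeddings of the previous, current and next word. The sketch below builds such a vector. The 2-dimensional embedding values, the zero padding vector and the function name are hypothetical, and the "other words, feats, etc." part of the input is left out.

```python
def window_features(tokens, i, emb, pad):
    """Concatenate emb(previous word), emb(curr word), emb(next word)."""
    def lookup(j):
        if 0 <= j < len(tokens):
            return emb.get(tokens[j], pad)   # unknown words fall back to pad
        return pad                           # positions off the sentence edge
    return lookup(i - 1) + lookup(i) + lookup(i + 1)

# Hypothetical 2-dimensional embeddings, for illustration only.
emb = {"Fed": [0.1, 0.2], "raises": [0.3, -0.1],
       "interest": [0.0, 0.5], "rates": [-0.2, 0.4]}
pad = [0.0, 0.0]
tokens = ["Fed", "raises", "interest", "rates"]
fx = window_features(tokens, 2, emb, pad)   # features for tagging "interest"
```

The resulting `fx` has length 3 x 2 = 6 and can be fed to the feedforward network in place of sparse indicator features.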

Word Embeddings
[Figure: embedding space with points "good", "enjoyable", "great", "bad", "dog", "is"]
‣ Want a vector space where similar words have similar embeddings

the movie was great ≈ the movie was good

‣ Goal: come up with a way to produce these embeddings
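"Similar embeddings" is usually made precise with a similarity measure between vectors. The sketch below uses cosine similarity, a common choice that the slides do not name here, so treat it as an assumption. The hand-picked 2-D vectors are hypothetical and only mimic the layout of the picture.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D embeddings chosen by hand to mimic the picture.
emb = {"good": [0.9, 0.8], "great": [1.0, 0.9],
       "bad": [-0.8, 0.7], "dog": [0.1, -1.0]}
sim_good_great = cosine(emb["good"], emb["great"])
sim_good_dog = cosine(emb["good"], emb["dog"])
```

With vectors like these, "good" and "great" score close to 1 while "good" and "dog" score much lower, so a model that sees "the movie was great" in training can generalize to "the movie was good".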

word2vec/GloVe

Continuous Bag-of-Words
‣ Predict word from context

the dog bit the man

‣ d-dimensional word embeddings of the context words "dog" and "the" are summed (+) into a vector of size d, multiplied by W (size |V| x d), then passed through softmax
$P(w \mid w_{-1}, w_{+1}) = \mathrm{softmax}\left(W\left(c(w_{-1}) + c(w_{+1})\right)\right)$
‣ gold label = bit, no manual labeling required!
‣ Parameters: d x |V| (one d-length vector per voc word), |V| x d output parameters (W)

Mikolov et al. (2013)
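The CBOW prediction can be traced step by step: look up the two context vectors, add them, multiply by W, apply softmax. The sketch below does this for the example sentence with randomly initialized parameters. The embedding size d = 3, the random values and the function names are assumptions, and no training is performed.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_probs(prev_word, next_word, c, W, vocab):
    """P(w | w_-1, w_+1) = softmax(W (c(w_-1) + c(w_+1)))."""
    ctx = [a + b for a, b in zip(c[prev_word], c[next_word])]        # size d
    scores = [sum(w_k * h_k for w_k, h_k in zip(row, ctx)) for row in W]
    return dict(zip(vocab, softmax(scores)))                          # size |V|

rng = random.Random(0)
vocab = ["the", "dog", "bit", "man"]
d = 3                                                   # assumed embedding size
c = {w: [rng.uniform(-0.5, 0.5) for _ in range(d)] for w in vocab}   # d x |V|
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in vocab]      # |V| x d
probs = cbow_probs("dog", "the", c, W, vocab)
loss = -math.log(probs["bit"])   # gold label = bit, read off the raw text
```

Training would lower `loss` by gradient steps on both `c` and `W`. The gold label comes for free from the sentence itself, which is why no manual labeling is needed.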

Skip-Gram

the dog bit the man

‣ Predict one word of context from word
‣ The embedding of "bit" is multiplied by W and passed through softmax; gold = dog
$P(w' \mid w) = \mathrm{softmax}(W e(w))$
‣ Another training example: bit -> the
‣ Parameters: d x |V| vectors, |V| x d output parameters (W) (also usable as vectors!)

Mikolov et al. (2013)
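The training examples mentioned above (bit -> dog, bit -> the) can be enumerated mechanically from raw text. This is a small sketch assuming a symmetric window of one word on each side. The window size and function name are choices made for the example.

```python
def skipgram_pairs(tokens, window=1):
    """Return (word, context word) training pairs within a symmetric window."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

pairs = skipgram_pairs("the dog bit the man".split(), window=1)
bit_pairs = [p for p in pairs if p[0] == "bit"]
```

Each pair is one training example for $P(w' \mid w)$: the first element is the input word and the second is the gold context word.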

Page 79: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmaxP (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 80: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 81: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 82: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 83: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 84: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

thea

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 85: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

thea

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 86: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

thea

‣ Huffmanencodevocabulary,usebinaryclassifierstodecidewhichbranchtotake

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

Page 87: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

thea

‣ Huffmanencodevocabulary,usebinaryclassifierstodecidewhichbranchtotake

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

‣ log(|V|)binarydecisions

Page 88: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

‣ HierarchicalsoGmax:

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd

thea

‣ Huffmanencodevocabulary,usebinaryclassifierstodecidewhichbranchtotake

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

‣ log(|V|)binarydecisions

Page 89: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

HierarchicalSoGmax

‣Matmul+soGmaxover|V|isveryslowtocomputeforCBOWandSG

‣ HierarchicalsoGmax:

P (w|w�1, w+1) = softmax (W (c(w�1) + c(w+1)))

‣ StandardsoGmax:[|V|xd]xd log(|V|)dotproductsofsized,

thea

‣ Huffmanencodevocabulary,usebinaryclassifierstodecidewhichbranchtotake

Mikolovetal.(2013)

P (w0|w) = softmax(We(w))

‣ log(|V|)binarydecisions

Page 90: Lecture 7: Tricks + Word Embeddingsaritter.github.io/courses/5525_slides_v2/lec7-nn2.pdf · 2020-07-11 · 0 cat fish ‣ Brown clusters: hierarchical agglomeraJve hard clustering

Hierarchical Softmax

Mikolov et al. (2013)

$$P(w \mid w_{-1}, w_{+1}) = \mathrm{softmax}\big(W(c(w_{-1}) + c(w_{+1}))\big)$$

$$P(w' \mid w) = \mathrm{softmax}(W e(w))$$

‣ Matmul + softmax over |V| is very slow to compute for CBOW and SG

‣ Huffman encode vocabulary, use binary classifiers to decide which branch to take

[Figure: tree over the vocabulary; visible leaf labels: the, a]

‣ log(|V|) binary decisions

‣ Standard softmax: [|V| x d] x d

‣ Hierarchical softmax: log(|V|) dot products of size d, |V| x d parameters
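
The branch-by-branch computation can be sketched as follows. This is a minimal illustration, not the word2vec implementation: the four-word tree, the hand-written paths (a real system would derive them from Huffman coding over word frequencies), and the names `PATHS`, `NODE_VECS`, and `hs_prob` are all assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical binary tree over a 4-word vocabulary.
# Each word maps to its path: a list of (internal_node_id, branch) pairs,
# where branch is 0 (left) or 1 (right). Frequent words get short paths.
PATHS = {
    "the":  [(0, 0)],
    "a":    [(0, 1), (1, 0)],
    "cat":  [(0, 1), (1, 1), (2, 0)],
    "fish": [(0, 1), (1, 1), (2, 1)],
}

# One d-dimensional parameter vector per internal node (|V| - 1 = 3 nodes).
NODE_VECS = [
    [0.5, -0.2, 0.1],
    [-0.3, 0.8, 0.0],
    [0.2, 0.2, -0.6],
]

def hs_prob(word, h):
    """P(word | h): product of binary classifier decisions along the path."""
    p = 1.0
    for node, branch in PATHS[word]:
        p_right = sigmoid(dot(NODE_VECS[node], h))
        p *= p_right if branch == 1 else 1.0 - p_right
    return p

# h stands in for the input representation, e.g. c(w_-1) + c(w_+1) for CBOW.
h = [1.0, 0.5, -1.0]
probs = {w: hs_prob(w, h) for w in PATHS}
total = sum(probs.values())  # leaf probabilities of a full binary tree sum to 1
```

Each word costs one dot product per node on its path, so a balanced tree needs about log(|V|) dot products instead of |V|.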

Skip-Gram with Negative Sampling

Mikolov et al. (2013)

‣ Take (word, context) pairs and classify them as “real” or not. Create random negative examples by sampling from the unigram distribution

(bit, the) => +1
(bit, cat) => -1
(bit, a) => -1
(bit, fish) => -1

$$P(y = 1 \mid w, c) = \frac{e^{w \cdot c}}{e^{w \cdot c} + 1}$$

Words in similar contexts select for similar c vectors.

‣ d x |V| vectors, d x |V| context vectors (same # of params as before)

‣ Objective $= \log P(y = 1 \mid w, c) + \frac{1}{k} \sum_{i=1}^{n} \log P(y = 0 \mid w_i, c)$, where the $w_i$ are sampled
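
A small sketch of the classifier and the objective, under stated assumptions: the vectors and unigram counts are made up, the objective is treated as a quantity to maximize, k is taken to be the number of sampled negatives (so n = k), and the vector for "bit" is the one held fixed across the real and sampled pairs. The function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def p_real(w, c):
    """P(y = 1 | w, c) = e^{w.c} / (e^{w.c} + 1), i.e. the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-dot(w, c)))

def sample_negatives(unigram_counts, k, rng):
    """Draw k negative words in proportion to their unigram counts."""
    words = list(unigram_counts)
    weights = [unigram_counts[w] for w in words]
    return rng.choices(words, weights=weights, k=k)

def sgns_objective(c, w_pos, w_negs):
    """log P(y=1 | w, c) + (1/k) * sum_i log P(y=0 | w_i, c), with k = len(w_negs)."""
    pos_term = math.log(p_real(w_pos, c))
    neg_term = sum(math.log(1.0 - p_real(w_neg, c)) for w_neg in w_negs)
    return pos_term + neg_term / len(w_negs)

rng = random.Random(0)
counts = {"the": 50, "a": 30, "cat": 5, "fish": 2}  # made-up unigram counts
vecs = {                                            # made-up 2-d vectors
    "the": [1.0, 0.5],
    "a": [-0.5, 0.2],
    "cat": [-1.0, -0.3],
    "fish": [-0.8, -0.6],
}
bit = [0.9, 0.4]  # vector held fixed for every pair in this example

negatives = sample_negatives(counts, k=3, rng=rng)
obj = sgns_objective(bit, vecs["the"], [vecs[w] for w in negatives])
```

Training would adjust the vectors by gradient ascent on this quantity. Note that this sketch does not filter the true word out of the sampled negatives.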

Connections with Matrix Factorization

Levy et al. (2014)

‣ Skip-gram model looks at word-word co-occurrences and produces two types of vectors

[Figure: word pair counts matrix (|V| x |V|) next to word vecs (|V| x d) and context vecs (d x |V|)]

‣ Looks almost like a matrix factorization… can we interpret it this way?

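Skip-Gram as Matrix Factorization

Levy et al. (2014)

Skip-gram objective exactly corresponds to factoring this |V| x |V| matrix:

$$M_{ij} = \mathrm{PMI}(w_i, c_j) - \log k$$

where k is the number of negative samples and

$$\mathrm{PMI}(w_i, c_j) = \log \frac{P(w_i, c_j)}{P(w_i)\,P(c_j)} = \log \frac{\frac{\mathrm{count}(w_i, c_j)}{D}}{\frac{\mathrm{count}(w_i)}{D} \cdot \frac{\mathrm{count}(c_j)}{D}}$$

‣ If we sample negative examples from the unigram distribution over words

‣ …and it's a weighted factorization problem (weighted by word freq)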
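
To make the matrix concrete, here is a sketch that builds the shifted PMI entries from a table of pair counts. The counts are invented, `shifted_pmi` is a hypothetical helper, PMI is computed with the standard log-ratio definition, and D is taken to be the total number of observed (word, context) pairs.

```python
import math

def shifted_pmi(pair_counts, k):
    """Return {(w, c): PMI(w, c) - log k} for every observed pair.

    Unobserved pairs are skipped: their PMI would be log 0.
    """
    D = sum(pair_counts.values())  # total number of (w, c) pairs
    w_counts, c_counts = {}, {}
    for (w, c), n in pair_counts.items():
        w_counts[w] = w_counts.get(w, 0) + n
        c_counts[c] = c_counts.get(c, 0) + n
    M = {}
    for (w, c), n in pair_counts.items():
        pmi = math.log((n / D) / ((w_counts[w] / D) * (c_counts[c] / D)))
        M[(w, c)] = pmi - math.log(k)
    return M

# Made-up co-occurrence counts for two words and two contexts.
pair_counts = {
    ("bit", "the"): 4,
    ("bit", "a"): 2,
    ("cat", "the"): 2,
    ("cat", "a"): 8,
}
M = shifted_pmi(pair_counts, k=5)
```

Pairs that co-occur more often than their marginals predict get larger entries, and increasing k shifts every entry down by the same amount.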

GloVe (Global Vectors)

Pennington et al. (2014)

‣ Also operates on counts matrix, weighted regression on the log co-occurrence matrix

‣ Loss $= \sum_{i,j} f(\mathrm{count}(w_i, c_j)) \left( w_i^\top c_j + a_i + b_j - \log \mathrm{count}(w_i, c_j) \right)^2$

‣ Constant in the dataset size (just need counts), quadratic in voc size

‣ By far the most common word vectors used today (5000+ citations)

[Figure: word pair counts matrix (|V| x |V|)]
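
The loss can be written out directly. In this sketch the weighting function f is not taken from the slide: it assumes the clipped power-law form from Pennington et al. (2014) with x_max = 100 and alpha = 0.75. The counts, vectors, and biases are made up, and `glove_loss` is a hypothetical name.

```python
import math

def weight(x, x_max=100.0, alpha=0.75):
    """Assumed weighting f: grows as a power law, then is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(counts, W, C, a, b):
    """Sum over nonzero pairs of f(count) * (w_i . c_j + a_i + b_j - log count)^2."""
    total = 0.0
    for (i, j), n in counts.items():
        pred = sum(x * y for x, y in zip(W[i], C[j])) + a[i] + b[j]
        total += weight(n) * (pred - math.log(n)) ** 2
    return total

# Made-up nonzero co-occurrence counts, indexed by (word id, context id).
counts = {(0, 0): 10, (0, 1): 1, (1, 1): 200}
W = [[0.1, 0.2], [0.3, -0.1]]   # word vectors
C = [[0.5, 0.0], [-0.2, 0.4]]   # context vectors
a = [0.0, 0.1]                  # word biases
b = [0.2, 0.0]                  # context biases
loss = glove_loss(counts, W, C, a, b)
```

The loop runs over entries of the counts table rather than over the corpus, which is why the cost does not grow with dataset size once the counts are collected.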

Preview: Context-dependent Embeddings

Peters et al. (2018)

‣ How to handle different word senses? One vector for balls

they hit the balls
they dance at balls

‣ Train a neural language model to predict the next word given previous words in the sentence, use its internal representations as word vectors

‣ Context-sensitive word embeddings: depend on rest of the sentence

‣ Huge improvements across nearly all NLP tasks over GloVe

Evaluation

Evaluating Word Embeddings

‣ What properties of language should word embeddings capture?

[Figure: words plotted in embedding space: good, enjoyable, great, bad, dog, cat, wolf, tiger, is, was]

‣ Similarity: similar words are close to each other

‣ Analogy:

  Paris is to France as Tokyo is to ???

  good is to best as smart is to ???
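
"Close to each other" is usually measured with cosine similarity. A minimal sketch with invented 2-d vectors, chosen only so that related words point in similar directions; `nearest` is a hypothetical helper.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: sentiment words along one axis, animals along the other.
vecs = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "bad": [0.7, -0.6],
    "dog": [-0.2, 0.9],
    "cat": [-0.3, 0.8],
    "wolf": [-0.1, 0.7],
}

def nearest(word, k=2):
    """Return the k words whose vectors have the highest cosine with `word`."""
    others = [w for w in vecs if w != word]
    ranked = sorted(others, key=lambda w: cosine(vecs[word], vecs[w]), reverse=True)
    return ranked[:k]

neighbors_of_dog = nearest("dog")
neighbors_of_good = nearest("good")
```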

Similarity

Levy et al. (2015)

‣ SVD = singular value decomposition on PMI matrix

‣ GloVe does not appear to be the best when experiments are carefully controlled, but it depends on hyperparameters + these distinctions don't matter in practice

Hypernymy Detection

Chang et al. (2017)

‣ Hypernyms: detective is a person, dog is an animal

‣ Do word vectors encode these relationships?

‣ word2vec (SGNS) works barely better than random guessing here

Analogies

[Figure: vectors for king, queen, man, woman]

(king - man) + woman = queen

king + (woman - man) = queen

‣ Why would this be?

‣ woman - man captures the difference in the contexts that these occur in

‣ Dominant change: more “he” with man and “she” with woman, similar to the difference between king and queen

Analogies

Levy et al. (2015)

‣ These methods can perform well on analogies on two different datasets using two different methods

Maximizing for $b_2$:

$$\text{Add} = \cos(b_2,\, a_2 - a_1 + b_1) \qquad \text{Mul} = \frac{\cos(b_2, a_2)\,\cos(b_2, b_1)}{\cos(b_2, a_1) + \epsilon}$$
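
The two scoring rules can be compared on a toy analogy a1 : a2 :: b1 : b2. Everything concrete here is an assumption for illustration: the 3-d vectors are invented, all entries are non-negative so that every cosine is non-negative (raw negative cosines would make the Mul ratio unstable), and the query words are excluded from the candidates.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cos_add(b2, a1, a2, b1):
    """Add: cos(b2, a2 - a1 + b1)."""
    target = [y - x + z for x, y, z in zip(a1, a2, b1)]
    return cosine(b2, target)

def cos_mul(b2, a1, a2, b1, eps=1e-3):
    """Mul: cos(b2, a2) * cos(b2, b1) / (cos(b2, a1) + eps)."""
    return cosine(b2, a2) * cosine(b2, b1) / (cosine(b2, a1) + eps)

def solve(a1, a2, b1, vecs, score):
    """Return the candidate b2 (excluding the query words) that maximizes score."""
    candidates = [w for w in vecs if w not in (a1, a2, b1)]
    return max(candidates, key=lambda w: score(vecs[w], vecs[a1], vecs[a2], vecs[b1]))

# Made-up vectors; dimensions loosely read as (royalty, male, female).
vecs = {
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "prince": [0.7, 0.8, 0.1],
    "apple": [0.1, 0.4, 0.4],
}

answer_add = solve("man", "woman", "king", vecs, cos_add)
answer_mul = solve("man", "woman", "king", vecs, cos_mul)
```

Add rewards closeness to a single offset vector, while Mul rewards being close to a2 and b1 and penalizes closeness to a1 multiplicatively.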

Using Semantic Knowledge

Faruqui et al. (2015)

‣ Structure derived from a resource like WordNet

[Figure: original vector for "false" and adapted vector for "false"]

‣ Doesn't help most problems

Using Word Embeddings

‣ Approach 1: learn embeddings as parameters from your data

  ‣ Often works pretty well

‣ Approach 2: initialize using GloVe/ELMo, keep fixed

  ‣ Faster because no need to update these parameters

‣ Approach 3: initialize using GloVe, fine-tune

  ‣ Works best for some tasks, but not used for ELMo

Compositional Semantics

‣ What if we want embedding representations for whole sentences?

‣ Skip-thought vectors (Kiros et al., 2015), similar to skip-gram generalized to a sentence level (more later)

‣ Is there a way we can compose vectors to make sentence representations? Summing?

‣ Will return to this in a few weeks as we move on to syntax and semantics
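
Summing is the simplest composition to try. The sketch below uses invented 2-d vectors and a hypothetical helper `sum_vectors`; it is only meant to show one consequence of summing, namely that word order is discarded.

```python
def sum_vectors(tokens, vecs):
    """Sum the vectors of the known tokens; unknown tokens are skipped."""
    dim = len(next(iter(vecs.values())))
    total = [0.0] * dim
    for tok in tokens:
        if tok in vecs:
            total = [t + x for t, x in zip(total, vecs[tok])]
    return total

# Made-up vectors (values chosen to be exactly representable as floats).
vecs = {
    "they": [0.25, 0.0],
    "hit": [0.5, 0.25],
    "the": [0.0, 0.125],
    "balls": [0.25, 0.75],
}

s1 = sum_vectors("they hit the balls".split(), vecs)
s2 = sum_vectors("balls hit they the".split(), vecs)
```

Both orderings produce the same sentence vector, so any model built on a plain sum cannot tell them apart.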

Takeaways

‣ Lots to tune with neural networks

  ‣ Training: optimizer, initializer, regularization (dropout), …

  ‣ Hyperparameters: dimensionality of word embeddings, layers, …

‣ Word vectors: learning word -> context mappings has given way to matrix factorization approaches (constant in dataset size)

‣ Lots of pretrained embeddings work well in practice, they capture some desirable properties

‣ Even better: context-sensitive word embeddings (ELMo)

‣ Next time: RNNs and CNNs