Distributional semantics, embeddings (et dong)
felipe@iro.umontreal.ca
RALI, Dept. Informatique et Recherche Opérationnelle, Université de Montréal
V0.1, last compiled: 24 November 2018



Plan

- (Before Deep): the vector space model
- And then came the "Deep": Word2Vec; analogies; meta-embeddings; evaluation; interesting ideas; the bilingual case
- Evaluation


- "If A and B have almost identical environments, we say that they are synonyms." (Harris, 1954)
- "You shall know a word by the company it keeps." (Firth, 1957)
- "Words which are similar in meaning occur in similar contexts." (Rubenstein & Goodenough, 1965)
- "In other words, difference of meaning correlates with difference of distribution." (Harris, 1970, p. 786)
- "Words with similar meanings will occur with similar neighbors if enough text material is available." (Schütze & Pedersen, 1995)
- "A representation that captures much of how words are used in natural context will capture much of what we mean by meaning." (Landauer & Dumais, 1997)
- "In the proposed model, it will so generalize because 'similar' words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability." (Bengio et al., 2003)

The vector space model

- read [Turney and Pantel, 2010] for an introduction
- read [Baroni and Lenci, 2010] for a generalization (tensor-based)

A typical pipeline:
1. a matrix of co-occurrence "counts"
2. a weighting scheme (PMI, LLR, etc.)
3. a dimensionality-reduction policy:
   - singular value decomposition [Golub and Van Loan, 1996]
   - non-negative matrix factorization [Lee and Seung, 1999]
   - none (a very good baseline)
   - etc.

- DISSECT provides steps 2 and 3 (a numpy sketch of the full pipeline is given below)
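As an illustration of this count / weight / reduce pipeline, here is a minimal numpy sketch (the toy corpus, window choice and target dimension are arbitrary assumptions for illustration, not taken from the slides): it builds a term-term co-occurrence matrix, re-weights it with PPMI, and reduces it with a truncated SVD.

```python
import numpy as np
from itertools import combinations

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# 1. co-occurrence counts, here within each sentence (a toy "window")
tokens = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(tokens)}
C = np.zeros((len(tokens), len(tokens)))
for s in corpus:
    for w, c in combinations(s.split(), 2):
        C[idx[w], idx[c]] += 1
        C[idx[c], idx[w]] += 1

# 2. weighting scheme: positive PMI
total = C.sum()
pw = C.sum(axis=1, keepdims=True) / total
pc = C.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((C / total) / (pw * pc))
ppmi = np.maximum(pmi, 0)
ppmi[~np.isfinite(ppmi)] = 0.0

# 3. dimensionality reduction: truncated SVD
U, S, Vt = np.linalg.svd(ppmi)
k = 2
word_vectors = U[:, :k] * S[:k]          # dense k-dimensional word vectors
print(dict(zip(tokens, word_vectors.round(2))))
```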

Term × document co-occurrence matrix

- document similarity
- bag-of-words hypothesis: if a query and a document have similar representations (columns), then they convey the same information [Salton, 1975]
- implemented (for example) in Lucene
- (matrix figure taken from [Jurafsky and Martin, 2015])

Term × term co-occurrence matrix

- term similarity
- distributional hypothesis: if two terms have similar representations (rows), then they are similar
- (matrix figure taken from [Jurafsky and Martin, 2015])

Term × rel co-occurrence matrix

- term similarity
- distributional hypothesis: if two terms have similar representations (rows), then they are similar
- (matrix figure taken from [Jurafsky and Martin, 2015])

Term pair × pattern co-occurrence matrix [Turney, 2005]

- relation similarity
- hypothesis: if two word pairs have similar representations (rows), then they are similar; patterns: "X of Y", "Y of X", "X for Y", "Y for X", "X to Y" and "Y to X"
- a list of 64 words such as of, for or to, yielding 128 patterns (columns) containing the pair (X, Y)
- (a cosine-similarity sketch over rows of such matrices is given below)
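All four matrices are used the same way: similarity is a cosine between rows (or columns). A minimal sketch, assuming a small term × term count matrix whose words and counts are made up for illustration:

```python
import numpy as np

words = ["digital", "information", "cherry", "pie"]
# rows = target words, columns = context words (toy counts)
M = np.array([
    [5, 4, 0, 0],
    [4, 6, 0, 1],
    [0, 0, 3, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

for i, w in enumerate(words):
    sims = [(cosine(M[i], M[j]), words[j]) for j in range(len(words)) if j != i]
    print(w, "->", max(sims)[1])   # nearest neighbour by cosine
```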


On the menu

- a star model: Word2Vec [Mikolov et al., 2013a]
- properties of the embeddings [Mikolov et al., 2013d, Mikolov et al., 2013c]
- results: glory [Baroni et al., 2014], moderation [Levy et al., 2015]
- cool works [Faruqui and Dyer, 2015, Faruqui et al., 2015b, Faruqui et al., 2015a]
- bilingual models [Mikolov et al., 2013b, Chandar et al., 2014, Gouws et al., 2015, Coulmance et al., 2016]

A revolution among the "distributionalists": Word2Vec [Mikolov et al., 2013a]

- a fast toolkit implementing two models:
  - https://code.google.com/archive/p/word2vec
  - https://radimrehurek.com/gensim/models/word2vec.html
  - https://github.com/dav/word2vec
- pre-trained embeddings available, trained on 6B words of Google News (180K words), dimension = 300
- directly usable in many applications

The 2 Word2Vec models [Mikolov et al., 2013a]

- Skip-gram is the most popular (more reliable for "small" corpora)
- CBOW is faster (fine for large corpora)

Skip-gram [Mikolov et al., 2013a]

- C: a training corpus, i.e. a set D of pairs (w, c) where w is a word of C and c is a word seen in its context; note: the model represents context words separately from vocabulary words
- does a pair (w, c) come from C? p(D = 1 | w, c; \theta) is the associated probability
- optimized by gradient descent:

  L = \arg\max_\theta \prod_{(w,c) \in D} p(D = 1 | w, c; \theta) \prod_{(w,c) \in D'} (1 - p(D = 1 | w, c; \theta))

  where v_c (resp. v_w) is the vector of c (resp. w)
- D' is built by randomly drawing k pairs according to the unigram distributions (of words and of context words)

Skip-gram [Mikolov et al., 2013a]

- setting \sigma(x) = 1 / (1 + e^{-x}) and p(D = 1 | w, c; \theta) = \sigma(v_c \cdot v_w), this becomes

  L = \arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)

- contexts are defined by a window centred on the word w under consideration, whose size is drawn at random (uniformly over a fixed interval)
- the most frequent words are subsampled (randomly removed from C) and infrequent words are discarded (cut-off)
- it works (read [Levy and Goldberg, 2014] for an explanation)

(A small numpy sketch of this objective is given below.)
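A minimal numpy sketch of this skip-gram negative-sampling objective; the toy corpus, window, negative count and learning rate are arbitrary assumptions, and real implementations (the original word2vec C tool, gensim) are far more optimized:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat the dog sat on the log".split()
vocab = sorted(set(corpus))
V, dim, window, k, lr = len(vocab), 10, 2, 3, 0.05
idx = {w: i for i, w in enumerate(vocab)}

# separate matrices for target words (W) and context words (C)
W = rng.normal(scale=0.1, size=(V, dim))
C = rng.normal(scale=0.1, size=(V, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(200):
    for pos, word in enumerate(corpus):
        w = idx[word]
        lo, hi = max(0, pos - window), min(len(corpus), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos == pos:
                continue
            c = idx[corpus[ctx_pos]]
            negatives = rng.integers(0, V, size=k)   # crude unigram sampling, unsmoothed
            # gradient ascent on log σ(v_c·v_w) + Σ log σ(-v_n·v_w)
            g = 1.0 - sigmoid(C[c] @ W[w])
            grad_w = g * C[c]
            C[c] += lr * g * W[w]
            for n in negatives:
                gn = sigmoid(C[n] @ W[w])
                grad_w -= gn * C[n]
                C[n] -= lr * gn * W[w]
            W[w] += lr * grad_w

def nearest(word):
    v = W[idx[word]]
    sims = (W @ v) / (np.linalg.norm(W, axis=1) * np.linalg.norm(v))
    return [vocab[i] for i in np.argsort(-sims)[1:4]]

print("cat ->", nearest("cat"))
```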

Other pre-trained embeddings

- Polyglot [Al-Rfou et al., 2013]
  - 100 languages (Wikipedia)
  - trained to score corpus sentences higher than sentences in which one word has been replaced
- FastText [Bojanowski et al., 2016]
  - 294 languages (Wikipedia)
  - skip-gram where words are represented as bags of character n-grams, so an embedding can be computed for an unknown word (see the sketch below)
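A sketch of the subword idea behind FastText: a word vector is the sum of the vectors of its character n-grams, so an out-of-vocabulary word still gets a representation. The hashing scheme, bucket count and n-gram range below are simplifying assumptions, not the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_buckets = 50, 2000
ngram_vectors = rng.normal(scale=0.1, size=(n_buckets, dim))  # learned in practice

def char_ngrams(word, n_min=3, n_max=5):
    w = f"<{word}>"                        # boundary markers, as in FastText
    return [w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def word_vector(word):
    grams = char_ngrams(word)
    rows = [hash(g) % n_buckets for g in grams]   # toy hashing into buckets
    return ngram_vectors[rows].sum(axis=0)

# an unseen word still receives a vector built from its n-grams
print(char_ngrams("where")[:5])
print(word_vector("unseenword").shape)
```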

Other pre-trained embeddings

- GloVe [Pennington et al., 2014]
  - glove.6B.zip (Wikipedia 2014 + Gigaword, |V| = 400K, d ∈ {50, 100, 200, 300}, 822 MB)
  - glove.42B.300d.zip (Common Crawl, |V| = 1.9M, uncased, d = 300, 1.75 GB)
  - glove.840B.300d.zip (Common Crawl, |V| = 2.2M, cased, d = 300, 2.03 GB)
  - glove.twitter.27B.zip (2B tweets, |V| = 1.2M, uncased, d ∈ {25, 50, 100, 200}, 1.42 GB)

(A small loading sketch follows below.)
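These archives contain plain-text files, one word per line followed by its vector, so loading them needs nothing beyond numpy; a minimal sketch (the file path is an assumption, and gensim also offers loaders for this format):

```python
import numpy as np

def load_glove(path):
    """Load a GloVe .txt file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

# embeddings = load_glove("glove.6B.300d.txt")   # extracted from glove.6B.zip
# print(embeddings["king"].shape)                # (300,)
```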

Analogical arithmetic on the representations [Mikolov et al., 2013d]

- vec(Madrid) − vec(Spain) ≈ vec(Paris) − vec(France)
- this makes it possible to solve analogy equations [x, y, z]:
  1. compute t = vec(y) − vec(x) + vec(z), the target vector
  2. search V for the word t̂ closest to t:

     t̂ = \arg\max_w \frac{vec(w) \cdot vec(t)}{||vec(w)|| \times ||vec(t)||}

(A small search sketch follows below.)
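A minimal sketch of this analogy search over a {word: vector} dictionary (the dictionary interface is an assumption for illustration; with gensim one would call most_similar(positive=..., negative=...)):

```python
import numpy as np

def solve_analogy(x, y, z, embeddings):
    """Return the word w maximizing cos(vec(w), vec(y) - vec(x) + vec(z))."""
    t = embeddings[y] - embeddings[x] + embeddings[z]
    t /= np.linalg.norm(t)
    best, best_sim = None, -np.inf
    for w, v in embeddings.items():
        if w in (x, y, z):              # exclude the query words, as is customary
            continue
        sim = v @ t / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# e.g. solve_analogy("Spain", "Madrid", "France", embeddings)  ->  "Paris", hopefully
```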

[Mikolov et al., 2013d]

- RNN trained on 320M words (V = 82k)
- test set of 8k analogies involving the most frequent words

[Mikolov et al., 2013c]

- 6B words of Google News, 1M most frequent words
- the syntactic test is the same as in [Mikolov et al., 2013d]

[Mikolov et al., 2013c]

- comparison to other proposed models

[Mikolov et al., 2013c]

- Big Data (more data, higher dimension)

Meta-embeddings

- idea: can several vector representations be combined to create new, more effective ones?
- 2 simple but nonetheless useful approaches (better results than the individual representations); a sketch of both follows below:
  - concatenate the representations [Bollegala and Bao, 2018]
  - average them (normalize, and pad the lower-dimensional representations with 0s) [Coates and Bollegala, 2018]
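A minimal sketch of the two combinations, assuming two {word: vector} sources of possibly different dimensions (toy data below):

```python
import numpy as np

def concat_meta(word, sources):
    """Meta-embedding by concatenation of (L2-normalized) source vectors."""
    parts = [s[word] / np.linalg.norm(s[word]) for s in sources]
    return np.concatenate(parts)

def average_meta(word, sources):
    """Meta-embedding by averaging: normalize, zero-pad to the largest dimension."""
    dim = max(len(s[word]) for s in sources)
    padded = []
    for s in sources:
        v = s[word] / np.linalg.norm(s[word])
        padded.append(np.pad(v, (0, dim - len(v))))
    return np.mean(padded, axis=0)

# toy usage with two tiny "embedding spaces"
a = {"cat": np.array([1.0, 0.0, 2.0])}
b = {"cat": np.array([0.5, 0.5, 0.5, 0.5])}
print(concat_meta("cat", [a, b]).shape)    # (7,)
print(average_meta("cat", [a, b]).shape)   # (4,)
```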

Don't count, predict! [Baroni et al., 2014]

- many tasks, a study of the meta-parameters of each method

Don't count, predict! [Baroni et al., 2014]

- cnt = count vectors, pre = word2vec, dm = [Baroni and Lenci, 2010], cw = [Collobert et al., 2011]

Don't count, predict! [Baroni et al., 2014]

"We set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- built from linguistic resources (WordNet, PTB, FrameNet, etc.)
- very sparse vectors
- comparable in performance to state-of-the-art distributional models trained on billions of words
- vectors available (for English): https://github.com/mfaruqui/non-distributional

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- (binary) features induced for film, e.g.: SYNSET.FILM.V.01, SYNSET.FILM.N.01, HYPO.COLLAGE_FILM.N.01, HYPER.SHEET.N.06 (a WordNet-based sketch of such features is given below)
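A minimal sketch of how such binary WordNet features can be induced with NLTK; the feature naming is illustrative, not the paper's exact scheme:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

def wordnet_features(word):
    """Binary features: the word's synsets plus their hypernyms and hyponyms."""
    feats = set()
    for syn in wn.synsets(word):
        feats.add(f"SYNSET.{syn.name().upper()}")
        feats.update(f"HYPER.{h.name().upper()}" for h in syn.hypernyms())
        feats.update(f"HYPO.{h.name().upper()}" for h in syn.hyponyms())
    return feats

# each word becomes a huge, very sparse 0/1 vector indexed by such features
print(sorted(wordnet_features("film"))[:5])
```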

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- supersenses: for nouns, verbs and adjectives; e.g. lioness ⇒ SS.NOUN.ANIMAL
- color: word-colour lexicon built by crowdsourcing [Mohammad, 2011]; e.g. blood ⇒ COLOR.RED
- emotion: lexicon associating a word with its polarity (positive/negative) and with emotions (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney, 2013]; e.g. cannibal ⇒ POL.NEG, EMO.DISGUST and EMO.FEAR
- pos: PTB part-of-speech tags; e.g. love ⇒ PTB.NOUN, PTB.VERB

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- note: difficult to do for all languages

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- Skip-Gram pre-trained on 300B words [Mikolov et al., 2013a]
- GloVe pre-trained on 6B words [Pennington et al., 2014]
- LSA obtained from a co-occurrence matrix computed on 1B words of Wikipedia [Turney and Pantel, 2010]
- Ling Dense: the binary vectors after dimensionality reduction with SVD
- tasks: similarity, sentiment analysis (positive/negative), NP-bracketing (local (phone company) versus (local phone) company)

Retrofitting vectors to a lexico-semantic resource [Faruqui et al., 2015a]

- a post-processing step applicable to any word vector representation
- fast (5 seconds for 100k words and dimension 300)
- idea: use the lexico-semantic information of a resource to improve an existing representation
- how: keep each word vector close to its original (learned) vector while pulling it toward the vectors of its neighbours in the resource (encoded as a graph); see the sketch below
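A sketch of the retrofitting update: each vector is repeatedly replaced by a weighted average of its original vector and its neighbours in the lexicon graph. Uniform weights are assumed here; the paper's alpha/beta weighting is slightly richer.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """vectors: {word: np.array}; lexicon: {word: [neighbour words]}."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for w, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if w not in new or not nbrs:
                continue
            # weighted average of the original vector and the neighbours' vectors
            new[w] = (vectors[w] + sum(new[n] for n in nbrs)) / (1 + len(nbrs))
    return new

# toy usage: pull "movie" and "film" toward each other
vecs = {"movie": np.array([1.0, 0.0]), "film": np.array([0.0, 1.0])}
lex = {"movie": ["film"], "film": ["movie"]}
print(retrofit(vecs, lex))
```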

A community getting organized [Faruqui and Dyer, 2014]

- pre-trained embeddings
- a suite of tests that can be run (similarity, analogy, completion, etc.)
- a visualization interface
- note: not certain the site is very popular (or kept up to date) at the moment
- http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al., 2013b]

- a linear transformation (rotation + scaling) from one embedding space to another can be learned from a bilingual lexicon (x_i, z_i):

  Ŵ = \arg\min_W \sum_i ||W x_i - z_i||^2

  where x_i and z_i denote, respectively, the source-side vector of x_i and the target-side vector of z_i
- W is optimized by gradient descent on a lexicon of about 5k word pairs
- at test time, a word x is translated by ẑ:

  ẑ = \arg\max_z \cos(z, W x)

(A least-squares sketch follows below.)
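A minimal sketch of this bilingual mapping, solving the same objective in closed form with least squares rather than gradient descent; the {word: vector} dictionaries for each language are assumed:

```python
import numpy as np

def learn_mapping(pairs, src_vecs, tgt_vecs):
    """Least-squares W minimizing sum_i ||W x_i - z_i||^2 over a seed lexicon."""
    X = np.stack([src_vecs[x] for x, _ in pairs])      # n x d_src
    Z = np.stack([tgt_vecs[z] for _, z in pairs])      # n x d_tgt
    B, *_ = np.linalg.lstsq(X, Z, rcond=None)          # d_src x d_tgt
    return B.T                                         # so that z ≈ W @ x

def translate(word, W, src_vecs, tgt_vecs):
    """Return the target word whose vector has the highest cosine with W x."""
    q = W @ src_vecs[word]
    q /= np.linalg.norm(q)
    scores = {z: v @ q / np.linalg.norm(v) for z, v in tgt_vecs.items()}
    return max(scores, key=scores.get)

# usage: W = learn_mapping(seed_pairs, en_vectors, fr_vectors)
#        translate("dog", W, en_vectors, fr_vectors)
```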

Mikolov strikes again [Mikolov et al., 2013b]

- the 6K most frequent source words, translated with Google Translate
- the first 5K entries are used to learn W
- the next 1K are used for testing
- baselines: edit distance, ε-Rapp
- (results figures not reproduced here)

More data (Google News)

- same split: 5K train, 1K test


On the difficulty of unbiased evaluation [Levy et al., 2015]

- compares 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
- studies their parameters in detail
- adapts choices made in Skip-Gram to the other methods where possible
- take-away:
  - a performance draw (no clear advantage of one approach over another)
  - Skip-Gram behaves better (time/memory) than the other approaches

Example of an observation [Levy et al., 2015]

- in the co-occurrence-matrix approach, a word w and its context c are scored by

  PMI(w, c) = \log \frac{p(w, c)}{p(w)\,p(c)}

- a common practice is to set the PMI to 0 when count(w, c) = 0 (rather than -\infty)
- another is to take PPMI(w, c) = max(PMI(w, c), 0)
- adaptations of choices made in Skip-Gram:
  - shifted PPMI: SPPMI(w, c) = max(PMI(w, c) - \log k, 0)
  - sampling of the k negative examples, smoothed with α = 0.75:

    PMI_α(w, c) = \log \frac{p(w, c)}{p(w)\,P_α(c)}, with P_α(c) = \frac{\#(c)^α}{\sum_{c'} \#(c')^α}

(A short PPMI/SPPMI sketch follows below.)
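A small numpy sketch of these variants on a word-by-context count matrix (the toy counts are made up; k and α follow the slide):

```python
import numpy as np

def ppmi_matrix(C, k=1, alpha=1.0):
    """PPMI of a count matrix C; k > 1 gives shifted PPMI (SPPMI),
    alpha < 1 smooths the context distribution as in skip-gram negative sampling."""
    total = C.sum()
    pw = C.sum(axis=1, keepdims=True) / total
    pc_alpha = C.sum(axis=0, keepdims=True) ** alpha
    pc_alpha = pc_alpha / pc_alpha.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((C / total) / (pw * pc_alpha))
    pmi[~np.isfinite(pmi)] = 0.0          # count(w,c) = 0  ->  PMI set to 0
    return np.maximum(pmi - np.log(k), 0)

C = np.array([[5.0, 1.0, 0.0], [1.0, 4.0, 2.0], [0.0, 2.0, 6.0]])
print(ppmi_matrix(C, k=5, alpha=0.75).round(2))
```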

[Schnabel et al., 2015]

- recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno, 2018]

- word2vec skip-gram re-run several times with the same parameters (the study measures how stable the resulting word similarities are)

And what about infrequent words? [Jakubina and Langlais, 2017]

                1k-low                  1k-high
            TOP1  TOP5  TOP20       TOP1  TOP5  TOP20
embedding    2.2   6.1   11.9       21.7  34.2   44.9
context      2.0   4.3    7.6       19.0  32.7   44.3
document     0.7   2.3    5.0         —     —      —
oracle       4.6    —    19.0       31.8    —    57.6

- Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
- |V_EN| = 7.3M, |V_FR| = 3.6M
- 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
- rare = freq < 26 (92% of the words of V_EN)

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Page 2: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

I If A and B have almost identical environments we say thatthey are synonyms (Harris 1954)

I you shall know a word by the company it keeps (Firth 1957)I words which are similar in meaning occur in similar contexts

(Rubenstein amp Goodenough 1965)I In other words difference of meaning correlates with

difference of distribution (Harris 1970 p786)I words with similar meanings will occur with similar neighbors if

enough text material is available (Schutze amp Pedersen 1995)I a representation that captures much of how words are used in

natural context will capture much of what we mean by meaning(Landauer amp Dumais 1997)

I in the proposed model it will so generalize because ldquosimilarrdquowords are expected to have a similar feature vector andbecause the probability function is a smooth function of thesefeature values a small change in the features will induce a smallchange in the probability (Bengio et al 2003)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Modele vectoriel (Vector Space model)

I lire [Turney and Pantel 2010] pour une introduction

I lire [Baroni and Lenci 2010] pour une generalisation (tenseur)

1 une matrice de ldquocomptesrdquo de co-occurences2 un schema de ponderation (PMI LLR etc)3 une politique de reduction de dimensionnalite

singular value decomposition [Golub and Van Loan 1996]non-negative matrix factorization [Lee and Seung 1999]aucune (tres bon baseline)etc

I DISSECT offre les etapes 2 et 3

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesdocument

I similarite de documents

I hypothese bag of word si une requete et un document ont desrepresentations (colonnes) similaires alors ils vehiculent lameme information [Salton 1975]

I implemente (par exemple) dans Lucene

Pris de [Jurafsky and Martin 2015]felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesterme

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesrel

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termestimespatron

I similarite de relations

I hypothese si deux paires de mots ont des representations(lignes) similaires alors elles sont similaires X of Y Y of X X forY Y for X X to Y et Y to X

I une liste de 64 mots comme of for ou toI formant 128 patrons (colonnes) contenant la paire (XY)

[Turney 2005]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 3: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

I If A and B have almost identical environments we say thatthey are synonyms (Harris 1954)

I you shall know a word by the company it keeps (Firth 1957)I words which are similar in meaning occur in similar contexts

(Rubenstein amp Goodenough 1965)I In other words difference of meaning correlates with

difference of distribution (Harris 1970 p786)I words with similar meanings will occur with similar neighbors if

enough text material is available (Schutze amp Pedersen 1995)I a representation that captures much of how words are used in

natural context will capture much of what we mean by meaning(Landauer amp Dumais 1997)

I in the proposed model it will so generalize because ldquosimilarrdquowords are expected to have a similar feature vector andbecause the probability function is a smooth function of thesefeature values a small change in the features will induce a smallchange in the probability (Bengio et al 2003)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Modele vectoriel (Vector Space model)

I lire [Turney and Pantel 2010] pour une introduction

I lire [Baroni and Lenci 2010] pour une generalisation (tenseur)

1 une matrice de ldquocomptesrdquo de co-occurences2 un schema de ponderation (PMI LLR etc)3 une politique de reduction de dimensionnalite

singular value decomposition [Golub and Van Loan 1996]non-negative matrix factorization [Lee and Seung 1999]aucune (tres bon baseline)etc

I DISSECT offre les etapes 2 et 3

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesdocument

I similarite de documents

I hypothese bag of word si une requete et un document ont desrepresentations (colonnes) similaires alors ils vehiculent lameme information [Salton 1975]

I implemente (par exemple) dans Lucene

Pris de [Jurafsky and Martin 2015]felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesterme

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesrel

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termestimespatron

I similarite de relations

I hypothese si deux paires de mots ont des representations(lignes) similaires alors elles sont similaires X of Y Y of X X forY Y for X X to Y et Y to X

I une liste de 64 mots comme of for ou toI formant 128 patrons (colonnes) contenant la paire (XY)

[Turney 2005]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Don't count, predict! [Baroni et al 2014]

- many tasks, and a study of the meta-parameters of each method

Don't count, predict! [Baroni et al 2014]

- cnt = count vectors, pre = word2vec, dm = [Baroni and Lenci 2010], cw = [Collobert et al 2011]

Don't count, predict! [Baroni et al 2014]

"we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

- built from linguistic resources (WordNet, PTB, FrameNet, etc.)
- very sparse vectors
- comparable in performance to state-of-the-art distributional models trained on billions of words
- vectors available (for English): https://github.com/mfaruqui/non-distributional


Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

- (binary) features induced for film:
  SYNSET.FILM.V.01, SYNSET.FILM.N.01, HYPO:COLLAGE_FILM.N.01, HYPER:SHEET.N.06

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

- supersenses: for nouns, verbs and adjectives
  e.g. lioness ⇒ SS.NOUN.ANIMAL
- color: word-colour lexicon built by crowdsourcing [Mohammad 2011]
  e.g. blood ⇒ COLOR.RED
- emotion: lexicon associating a word with its polarity (positive/negative) and with emotions (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney 2013]
  e.g. cannibal ⇒ POL.NEG, EMO.DISGUST, EMO.FEAR, COLOR.RED
- pos: PTB part-of-speech tags
  e.g. love ⇒ PTB.NOUN, PTB.VERB
  (a toy sketch of assembling such features follows)
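A toy Python sketch of how such binary feature vectors can be assembled from lexicons; the lexicon dicts and entries below are illustrative assumptions, not the actual resources used in the paper.

def binary_features(word, lexicons):
    """Collect the set of binary features a word receives from several lexicons,
    each mapping a word to a list of feature names."""
    feats = set()
    for lex in lexicons:
        feats.update(lex.get(word, []))
    return feats

# toy usage with made-up lexicon entries:
# pos_lex = {"love": ["PTB.NOUN", "PTB.VERB"]}
# emo_lex = {"cannibal": ["POL.NEG", "EMO.DISGUST", "EMO.FEAR"]}
# binary_features("love", [pos_lex, emo_lex])   # {"PTB.NOUN", "PTB.VERB"}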

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

- note: hard to build for every language

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

- Skip-Gram pre-trained on 300B words [Mikolov et al 2013a]
- GloVe pre-trained on 6B words [Pennington et al 2014]
- LSA obtained from a co-occurrence matrix computed on 1B words of Wikipedia [Turney and Pantel 2010]
- Ling Dense: dimensionality reduction (SVD) of the binary vectors
- tasks: similarity, sentiment analysis (positive/negative), NP-bracketing (local (phone company) versus (local phone) company)

Retrofitting vectors to a lexico-semantic resource [Faruqui et al 2015a]

- a post-processing step applicable to any word vector representation
- fast (5 seconds for 100k words at dimension 300)
- idea: use the lexico-semantic information of a resource to improve an existing representation
- how: encourage each word's new vector to stay close to its learned vector while also being close to the vectors of its neighbours in the resource (encoded as a graph); a sketch of the iterative update follows
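A compact Python sketch of the retrofitting iteration in the spirit of Faruqui et al. (2015a): each vector is repeatedly moved toward the average of its graph neighbours while being pulled back toward its original value. The weights (alpha = 1, beta = 1/degree) and the number of iterations are the usual defaults, assumed here.

import numpy as np

def retrofit(vectors, neighbours, iterations=10, alpha=1.0):
    """vectors: dict word -> np.array (original embeddings);
    neighbours: dict word -> list of related words from the lexical resource."""
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for w, nbrs in neighbours.items():
            nbrs = [n for n in nbrs if n in new]
            if w not in new or not nbrs:
                continue
            beta = 1.0 / len(nbrs)
            num = alpha * vectors[w] + beta * sum(new[n] for n in nbrs)
            new[w] = num / (alpha + beta * len(nbrs))
    return new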

A community getting organised [Faruqui and Dyer 2014]

- pre-trained embeddings
- a suite of tests that can be run (similarity, analogy, completion, etc.)
- a visualisation interface
- note: not certain the site is very popular (or kept up to date) at the moment
- http://wordvectors.org/demo.php


Mikolov strikes again [Mikolov et al 2013b]

- a linear transformation (rotation + scaling) from one space to the other can be learned from a bilingual lexicon {(x_i, z_i)} (a sketch follows):
  W* = argmin_W Σ_i ‖W x_i − z_i‖²
  where x_i and z_i denote the source-side vector of x_i and the target-side vector of z_i, respectively
- W is optimised by gradient descent on a lexicon of about 5k word pairs
- at test time, a word x is translated as z*:
  z* = argmax_z cos(z, W x)
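A Python sketch of this mapping approach; the paper optimises W by stochastic gradient descent, whereas the sketch below uses the equivalent least-squares solution for brevity, and the tgt_vectors dict is an assumption.

import numpy as np

def learn_mapping(X, Z):
    """X: (n, d_src) source vectors, Z: (n, d_tgt) target vectors of the seed
    lexicon. Returns W of shape (d_tgt, d_src) such that W @ x ~ z."""
    B, *_ = np.linalg.lstsq(X, Z, rcond=None)   # solves X @ B ~ Z
    return B.T

def translate(x, W, tgt_vectors):
    """Target word whose vector has the highest cosine similarity with W x."""
    y = W @ x
    y /= np.linalg.norm(y)
    best, best_sim = None, -np.inf
    for w, v in tgt_vectors.items():
        sim = np.dot(v, y) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best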

Mikolov strikes again [Mikolov et al 2013b]

- the 6K most frequent source words, translated with Google Translate
- the first 5K entries are used to estimate W
- the next 1K are used for testing
- baselines: edit distance, ε-Rapp


More data (Google News)

- same split: 5K train, 1K test

Plan

- (Before Deep) the vector-space model
- And then came the "Deep": Word2Vec, Analogy, Meta-embeddings, Evaluation, Interesting ideas, The bilingual case
- Evaluation

On the difficulty of evaluating without bias [Levy et al 2015]

- compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
- study their parameters in detail
- adapt choices made in Skip-Gram to the other methods whenever possible
- take-away:
  - a draw in performance (no clear advantage of one approach over another)
  - Skip-Gram behaves better (time/memory) than the other approaches


An example observation [Levy et al 2015]

- in the co-occurrence matrix approach, a word w and its context c are scored as
  PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]
- a common practice is to set PMI values to 0 when #(w, c) = 0 (rather than −∞)
- another is to take PPMI(w, c) = max(PMI(w, c), 0)
- adapting choices made in Skip-Gram (a numpy sketch of these variants follows):
  - shifted PPMI: SPPMI(w, c) = max(PMI(w, c) − log k, 0)
  - sampling of the k negative examples (smoothed with α = 0.75):
    PMI_α(w, c) = log [ p(w, c) / (p(w) P_α(c)) ]  with  P_α(c) = #(c)^α / Σ_c #(c)^α
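A numpy sketch computing a (smoothed) PPMI matrix from a word-by-context co-occurrence count matrix; treating undefined cells as 0 and the α = 0.75 smoothing follow the choices listed above.

import numpy as np

def ppmi(counts, alpha=None):
    """counts: 2-D array of co-occurrence counts (words x contexts).
    With alpha set (e.g. 0.75), the context distribution is smoothed."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    ctx = counts.sum(axis=0, keepdims=True)
    p_c = (ctx ** alpha) / (ctx ** alpha).sum() if alpha else ctx / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0       # cells with #(w,c) = 0 get PMI 0
    return np.maximum(pmi, 0.0)        # PPMI = max(PMI, 0)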

[Schnabel et al 2015]

- recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno 2018]

- word2vec skip-gram re-run several times with the same parameters, to measure how stable the induced word similarities are

And what about infrequent words? [Jakubina and Langlais 2017]

And what about infrequent words?

                 1k-low                  1k-high
                 TOP1   TOP5   TOP20     TOP1   TOP5   TOP20
    embedding    2.2    6.1    11.9      21.7   34.2   44.9
    context      2.0    4.3    7.6       19.0   32.7   44.3
    document     0.7    2.3    5.0       —      —      —
    oracle       4.6    —      19.0      31.8   —      57.6

- Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
- |V_EN| = 7.3M, |V_FR| = 3.6M
- 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
- rare = frequency < 26 (92% of the words of V_EN)

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183-192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107-119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238-247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673-721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650-1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding - computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194-198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493-2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, Short Papers, pages 605-611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177-2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211-225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97-106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298-307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141-188.

  • (Before Deep) the vector-space model
  • And then came the "Deep"
    • Word2Vec
    • Analogy
    • Meta-embeddings
    • Evaluation
    • Interesting ideas
    • The bilingual case
      • Evaluation
Page 4: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

I If A and B have almost identical environments we say thatthey are synonyms (Harris 1954)

I you shall know a word by the company it keeps (Firth 1957)I words which are similar in meaning occur in similar contexts

(Rubenstein amp Goodenough 1965)I In other words difference of meaning correlates with

difference of distribution (Harris 1970 p786)I words with similar meanings will occur with similar neighbors if

enough text material is available (Schutze amp Pedersen 1995)I a representation that captures much of how words are used in

natural context will capture much of what we mean by meaning(Landauer amp Dumais 1997)

I in the proposed model it will so generalize because ldquosimilarrdquowords are expected to have a similar feature vector andbecause the probability function is a smooth function of thesefeature values a small change in the features will induce a smallchange in the probability (Bengio et al 2003)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Modele vectoriel (Vector Space model)

I lire [Turney and Pantel 2010] pour une introduction

I lire [Baroni and Lenci 2010] pour une generalisation (tenseur)

1 une matrice de ldquocomptesrdquo de co-occurences2 un schema de ponderation (PMI LLR etc)3 une politique de reduction de dimensionnalite

singular value decomposition [Golub and Van Loan 1996]non-negative matrix factorization [Lee and Seung 1999]aucune (tres bon baseline)etc

I DISSECT offre les etapes 2 et 3

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesdocument

I similarite de documents

I hypothese bag of word si une requete et un document ont desrepresentations (colonnes) similaires alors ils vehiculent lameme information [Salton 1975]

I implemente (par exemple) dans Lucene

Pris de [Jurafsky and Martin 2015]felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesterme

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesrel

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termestimespatron

I similarite de relations

I hypothese si deux paires de mots ont des representations(lignes) similaires alors elles sont similaires X of Y Y of X X forY Y for X X to Y et Y to X

I une liste de 64 mots comme of for ou toI formant 128 patrons (colonnes) contenant la paire (XY)

[Turney 2005]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 5: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Modele vectoriel (Vector Space model)

I lire [Turney and Pantel 2010] pour une introduction

I lire [Baroni and Lenci 2010] pour une generalisation (tenseur)

1 une matrice de ldquocomptesrdquo de co-occurences2 un schema de ponderation (PMI LLR etc)3 une politique de reduction de dimensionnalite

singular value decomposition [Golub and Van Loan 1996]non-negative matrix factorization [Lee and Seung 1999]aucune (tres bon baseline)etc

I DISSECT offre les etapes 2 et 3

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesdocument

I similarite de documents

I hypothese bag of word si une requete et un document ont desrepresentations (colonnes) similaires alors ils vehiculent lameme information [Salton 1975]

I implemente (par exemple) dans Lucene

Pris de [Jurafsky and Martin 2015]felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesterme

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termetimesrel

I similarite de termes

I hypothese distributionnelle si deux termes ont desrepresentations (lignes) similaires alors il sont similaires

a

a Pris de [Jurafsky and Martin 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

matrice de co-occurence termestimespatron

I similarite de relations

I hypothese si deux paires de mots ont des representations(lignes) similaires alors elles sont similaires X of Y Y of X X forY Y for X X to Y et Y to X

I une liste de 64 mots comme of for ou toI formant 128 patrons (colonnes) contenant la paire (XY)

[Turney 2005]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval



term × relation co-occurrence matrix

- term similarity
- distributional hypothesis: if two terms have similar representations (rows), then they are similar

(Figure taken from [Jurafsky and Martin, 2015].)

terms × pattern co-occurrence matrix

- relation similarity
- hypothesis: if two word pairs have similar representations (rows), then the pairs are similar; patterns: X of Y, Y of X, X for Y, Y for X, X to Y and Y to X
- a list of 64 words such as of, for or to
- forming 128 patterns (columns) that contain the pair (X, Y)

[Turney, 2005]

Plan

- (Before Deep) the vector space model
- And then came the "Deep":
  - Word2Vec
  - Analogy
  - Meta-embeddings
  - Evaluation
  - Interesting ideas
  - The bilingual case
- Evaluation

On the menu

- a star model: Word2Vec [Mikolov et al., 2013a]
- properties of the embeddings [Mikolov et al., 2013d, Mikolov et al., 2013c]
- results: glory [Baroni et al., 2014], moderation [Levy et al., 2015]
- cool works [Faruqui and Dyer, 2015, Faruqui et al., 2015b, Faruqui et al., 2015a]
- bilingual models [Mikolov et al., 2013b, Chandar et al., 2014, Gouws et al., 2015, Coulmance et al., 2016]

A revolution among the "distributionalists": Word2Vec [Mikolov et al., 2013a]

- a fast toolkit implementing two models
  - https://code.google.com/archive/p/word2vec
  - https://radimrehurek.com/gensim/models/word2vec.html
  - https://github.com/dav/word2vec
- pre-trained embeddings available, trained on 6B words of Google News (vocabulary of 180K words), dimension 300
- directly usable in many applications
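To make the toolkit concrete, here is a minimal sketch (not part of the original slides) using the gensim re-implementation listed above, assuming gensim >= 4; the toy corpus and hyper-parameter values are illustrative only, not the settings of the released Google News vectors.

    # Assumed example: train skip-gram vectors with gensim on a toy corpus.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
    ]

    model = Word2Vec(
        sentences=corpus,
        vector_size=300,  # embedding dimension
        window=5,         # context window size
        sg=1,             # 1 = skip-gram, 0 = CBOW
        negative=5,       # number of negative samples
        min_count=1,      # keep every word of this toy corpus
    )

    vec = model.wv["king"]                        # a 300-dimensional numpy vector
    print(model.wv.most_similar("king", topn=3))  # cosine-nearest neighbours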

The 2 models of Word2Vec [Mikolov et al., 2013a]

- Skip-gram is the most popular (more reliable on "small" corpora)
- CBOW is faster (a good fit for large corpora)

Skip-gram [Mikolov et al., 2013a]

- C: a training corpus, i.e. a set D of pairs (w, c) where w is a word of C and c is a word seen in one of its contexts; note: the model represents context words separately from vocabulary words
- for a pair (w, c), let p(D = 1 | w, c; \theta) be the probability that it comes from C
- optimized by gradient descent:

  L = \arg\max_\theta \prod_{(w,c) \in D} p(D=1 \mid w,c;\theta) \; \prod_{(w,c) \in D'} \bigl(1 - p(D=1 \mid w,c;\theta)\bigr)

  where v_c (resp. v_w) denotes the vector of c (resp. w)
- D' is built by sampling k pairs at random according to the unigram distributions (of words and of context words)

Skip-gram [Mikolov et al., 2013a]

- letting \sigma(x) = 1 / (1 + e^{-x}) and p(D = 1 \mid w,c;\theta) = \sigma(v_c \cdot v_w), the objective becomes

  L = \arg\max_\theta \sum_{(w,c) \in D} \log \sigma(v_c \cdot v_w) + \sum_{(w,c) \in D'} \log \sigma(-v_c \cdot v_w)

- contexts are defined by a window centered on the word w under consideration, whose size is drawn at random (uniformly over a fixed interval)
- the most frequent words are subsampled (randomly removed from C) and infrequent words are dropped (cut-off)
- it works (see [Levy and Goldberg, 2014] for an explanation); a toy computation of the objective follows below
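A toy computation of this objective for a single (w, c) pair with k sampled negatives; random vectors stand in for the learned parameters, so this illustrates the formula rather than the word2vec implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_pair_term(v_w, v_c, neg_contexts):
        """Contribution of one positive pair and its negatives to L."""
        pos = np.log(sigmoid(v_c @ v_w))                                # log sigma(v_c . v_w)
        neg = sum(np.log(sigmoid(-v_n @ v_w)) for v_n in neg_contexts)  # sum of log sigma(-v_n . v_w)
        return pos + neg

    rng = np.random.default_rng(0)
    d, k = 50, 5
    v_w = rng.normal(size=d)             # word vector (placeholder)
    v_c = rng.normal(size=d)             # context vector (placeholder)
    negatives = rng.normal(size=(k, d))  # in practice: drawn from the unigram distribution
    print(sgns_pair_term(v_w, v_c, negatives))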

Other pre-trained embeddings

- Polyglot [Al-Rfou et al., 2013]
  - 100 languages (Wikipedia)
  - trained to score sentences from the corpus higher than sentences in which one word has been replaced
- FastText [Bojanowski et al., 2016]
  - 294 languages (Wikipedia)
  - skip-gram in which words are represented by bags of character n-grams, so an embedding can be computed even for an unknown word (see the sketch below)
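A small sketch of that property with gensim's FastText on an assumed toy corpus: a word never seen at training time still receives a vector, built from its character n-grams.

    from gensim.models import FastText

    corpus = [["distributional", "semantics", "with", "subword", "units"],
              ["character", "ngrams", "help", "rare", "words"]]

    model = FastText(sentences=corpus, vector_size=100, window=3,
                     min_count=1, sg=1, min_n=3, max_n=6)

    print("distributionally" in model.wv.key_to_index)  # False: not in the training vocabulary
    vec = model.wv["distributionally"]                   # still defined, from its n-grams
    print(vec.shape)                                     # (100,)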

Other pre-trained embeddings

- GloVe [Pennington et al., 2014]
  - glove.6B.zip (Wikipedia + Gigaword 2014, |V| = 400K, d ∈ {50, 100, 200, 300}, 822 MB)
  - glove.42B.300d.zip (Common Crawl, |V| = 1.9M, uncased, d = 300, 1.75 GB)
  - glove.840B.300d.zip (Common Crawl, |V| = 2.2M, cased, d = 300, 2.03 GB)
  - glove.twitter.27B.zip (2B tweets, |V| = 1.2M, uncased, d ∈ {25, 50, 100, 200}, 1.42 GB)
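The glove.6B archives contain plain-text files, one word per line followed by its coefficients, so they can be loaded without a dedicated library; the file name below assumes the 300-dimensional file has been unpacked.

    import numpy as np

    def load_glove(path):
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *coefs = line.rstrip().split(" ")
                vectors[word] = np.asarray(coefs, dtype=np.float32)
        return vectors

    # vectors = load_glove("glove.6B.300d.txt")  # assumed path
    # print(vectors["king"].shape)               # (300,)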

Analogical arithmetic on the representations [Mikolov et al., 2013d]

- vec(Madrid) - vec(Spain) ≈ vec(Paris) - vec(France)
- this makes it possible to solve analogy equations [x : y :: z : ?]
  1. compute the target vector t = vec(y) - vec(x) + vec(z)
  2. search V for the word t^* closest to t:

     t^* = \arg\max_{w \in V} \frac{vec(w) \cdot t}{\lVert vec(w) \rVert \, \lVert t \rVert}
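A direct transcription of the two steps in numpy; `vectors` is assumed to be a word-to-vector dictionary such as the GloVe loader above returns.

    import numpy as np

    def solve_analogy(vectors, x, y, z):
        t = vectors[y] - vectors[x] + vectors[z]      # step 1: target vector
        best, best_cos = None, -np.inf
        for w, v in vectors.items():                  # step 2: cosine-nearest word
            if w in (x, y, z):                        # usual convention: exclude the query words
                continue
            cos = v @ t / (np.linalg.norm(v) * np.linalg.norm(t))
            if cos > best_cos:
                best, best_cos = w, cos
        return best

    # Expected with good vectors: solve_analogy(vectors, "Spain", "Madrid", "France") -> "Paris"

With gensim-loaded vectors the equivalent query is model.wv.most_similar(positive=["Madrid", "France"], negative=["Spain"]).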

[Mikolov et al., 2013d]

- RNN trained on 320M words (V = 82k)
- test set of 8k analogies involving the most frequent words

[Mikolov et al., 2013c]

- 6B words of Google News, the 1M most frequent words
- the syntactic test is the same as in [Mikolov et al., 2013d]

[Mikolov et al., 2013c]

- comparison with other proposed models

[Mikolov et al., 2013c]

- Big Data (more data, higher dimension)

Meta-embeddings

- idea: can we combine several vector representations to create new, more effective ones?
- 2 simple yet useful approaches (better results than the individual representations); a sketch of both follows below:
  - concatenate the representations [Bollegala and Bao, 2018]
  - average them (normalize, and pad the lower-dimensional representations with 0s) [Coates and Bollegala, 2018]
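A sketch of both recipes; the normalisation and zero-padding follow the description above, and the exact choices of the cited papers may differ.

    import numpy as np

    def concat_meta(vecs):
        """Meta-embedding by concatenation."""
        return np.concatenate(vecs)

    def average_meta(vecs):
        """Meta-embedding by averaging L2-normalised, zero-padded sources."""
        d = max(v.shape[0] for v in vecs)
        padded = []
        for v in vecs:
            v = v / np.linalg.norm(v)                      # normalise each source space
            padded.append(np.pad(v, (0, d - v.shape[0])))  # pad shorter vectors with 0s
        return np.mean(padded, axis=0)

    e1 = np.random.randn(300)  # e.g. a word2vec vector
    e2 = np.random.randn(100)  # e.g. a lower-dimensional GloVe vector
    print(concat_meta([e1, e2]).shape)   # (400,)
    print(average_meta([e1, e2]).shape)  # (300,)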

Don't count, predict! [Baroni et al., 2014]

- many tasks, a study of the meta-parameters of each method

Don't count, predict! [Baroni et al., 2014]

- cnt = count vector, pre = word2vec, dm = [Baroni and Lenci, 2010], cw = [Collobert et al., 2011]

Don't count, predict! [Baroni et al., 2014]

  "we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- built from linguistic resources (WordNet, PTB, FrameNet, etc.)
- very sparse vectors
- comparable in performance to state-of-the-art distributional models trained on billions of words
- vectors available (for English): https://github.com/mfaruqui/non-distributional

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- (binary) features induced for film:
  SYNSET.FILM.V.01, SYNSET.FILM.N.01,
  HYPO.COLLAGE_FILM.N.01, HYPER.SHEET.N.06

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- supersenses: for nouns, verbs and adjectives
  e.g. lioness ⇒ SS.NOUN.ANIMAL
- color: word–colour lexicon built by crowdsourcing [Mohammad, 2011]
  e.g. blood ⇒ COLOR.RED
- emotion: lexicon associating a word with its polarity (positive/negative) and with emotions (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney, 2013]
  e.g. cannibal ⇒ POL.NEG, EMO.DISGUST and EMO.FEAR
- pos: PTB part-of-speech tags
  e.g. love ⇒ PTB.NOUN, PTB.VERB
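A sketch of how binary features of this kind can be induced from WordNet with nltk (requires the WordNet data, e.g. nltk.download("wordnet")); the feature names are illustrative and are not guaranteed to match the identifiers of the released vectors.

    from nltk.corpus import wordnet as wn

    def wordnet_features(word):
        feats = set()
        for syn in wn.synsets(word):
            feats.add("SYNSET." + syn.name().upper())      # e.g. SYNSET.FILM.N.01
            for hyper in syn.hypernyms():
                feats.add("HYPER." + hyper.name().upper())
            for hypo in syn.hyponyms():
                feats.add("HYPO." + hypo.name().upper())
        return feats

    # Each feature is one dimension of the sparse binary vector, set to 1.
    print(sorted(wordnet_features("film"))[:5])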

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- note: hard to build for every language

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- Skip-Gram pre-trained on 300B words [Mikolov et al., 2013a]
- GloVe pre-trained on 6B words [Pennington et al., 2014]
- LSA obtained from a co-occurrence matrix computed on 1B words of Wikipedia [Turney and Pantel, 2010]
- Ling Dense: dimensionality reduction with SVD
- tasks: similarity, sentiment analysis (positive/negative), NP-bracketing (local (phone company) versus (local phone) company)

Retrofitting vectors to a lexico-semantic resource [Faruqui et al., 2015a]

- a post-processing step applicable to any word vector representation
- fast (5 seconds for 100k words at dimension 300)
- idea: use the lexico-semantic information of a resource to improve an existing representation
- how: encourage each word vector to stay close to its original (learned) value while moving closer to the vectors of its neighbours in the resource (encoded as a graph); a sketch of the update follows below
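A sketch of the iterative update of [Faruqui et al., 2015a] with the common choice alpha_i = 1 and beta_ij = 1/degree(i); the lexicon format (word -> list of neighbours) is an assumption for the example.

    import numpy as np

    def retrofit(vectors, lexicon, n_iters=10):
        """vectors: dict word -> np.array; lexicon: dict word -> list of neighbour words."""
        new = {w: v.copy() for w, v in vectors.items()}
        for _ in range(n_iters):
            for w, neighbours in lexicon.items():
                neighbours = [n for n in neighbours if n in new]
                if w not in vectors or not neighbours:
                    continue
                beta = 1.0 / len(neighbours)
                # pull toward the original vector and toward the neighbours in the graph
                numerator = vectors[w] + beta * sum(new[n] for n in neighbours)
                new[w] = numerator / (1.0 + beta * len(neighbours))
        return new

    # lexicon = {"happy": ["glad", "cheerful"], ...}   # e.g. synonymy edges from WordNet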

A community getting organized [Faruqui and Dyer, 2014]

- already-trained embeddings
- a suite of tests that can be run (similarity, analogy, completion, etc.)
- a visualization interface
- note: not certain that the site is very popular (or kept up to date) at the moment
- http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al., 2013b]

- a linear transformation (rotation + scaling) from one space to the other can be learned from a bilingual lexicon (x_i, z_i):

  W^* = \arg\min_W \sum_i \lVert W x_i - z_i \rVert^2

  where x_i and z_i denote the source vector of x_i and the target vector of z_i, respectively
- W is optimized by gradient descent on a lexicon of about 5k word pairs
- at test time, a word x is translated by z^*:

  z^* = \arg\max_z \cos(z, W x)
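A sketch assuming the seed lexicon is given as two aligned matrices; it solves the same least-squares objective in closed form (the paper uses gradient descent) and adopts the row-vector convention x_i W.

    import numpy as np

    def learn_mapping(X, Z):
        """X: (n, d_src) source vectors, Z: (n, d_tgt) target vectors of the seed lexicon."""
        W, *_ = np.linalg.lstsq(X, Z, rcond=None)   # minimises sum_i ||x_i W - z_i||^2
        return W

    def translate(x, W, tgt_words, tgt_matrix):
        """Return the target word whose vector is cosine-closest to x W."""
        proj = x @ W
        sims = tgt_matrix @ proj / (np.linalg.norm(tgt_matrix, axis=1) * np.linalg.norm(proj))
        return tgt_words[int(np.argmax(sims))]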

Mikolov strikes again [Mikolov et al., 2013b]

- the 6K most frequent source words, translated with Google Translate
- the first 5K entries are used to compute W
- the next 1K are used for testing
- baselines: edit distance, ε-Rapp

More data (Google News)

- same split: 5K train, 1K test

Plan

- (Before Deep) the vector space model
- And then came the "Deep":
  - Word2Vec
  - Analogy
  - Meta-embeddings
  - Evaluation
  - Interesting ideas
  - The bilingual case
- Evaluation

On the difficulty of evaluating without bias [Levy et al., 2015]

- compares 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
- studies their hyper-parameters in detail
- adapts choices made in Skip-Gram to the other methods where possible
- takeaway:
  - a draw in performance (no clear advantage of one approach over another)
  - Skip-Gram behaves better (time/memory) than the other approaches

Example of an observation [Levy et al., 2015]

- in the co-occurrence matrix approach, the association between a word w and a context c is scored by

  PMI(w, c) = \log \frac{p(w, c)}{p(w)\, p(c)}

- a common choice is to set PMI to 0 when count(w, c) = 0 (rather than -\infty)
- another is to use PPMI(w, c) = \max(PMI(w, c), 0)
- adaptations of choices made in Skip-Gram (see the sketch below):
  - shifted PPMI, with k the number of negative samples:

    SPPMI(w, c) = \max(PMI(w, c) - \log k,\, 0)

  - sampling the k negative examples from a smoothed context distribution (\alpha = 0.75):

    PMI_\alpha(w, c) = \log \frac{p(w, c)}{p(w)\, P_\alpha(c)} \quad \text{with} \quad P_\alpha(c) = \frac{count(c)^\alpha}{\sum_{c'} count(c')^\alpha}
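A small numpy sketch of these variants on a toy count matrix (illustrative only).

    import numpy as np

    def ppmi_variants(C, k=5, alpha=0.75):
        """C[w, c]: co-occurrence counts. Returns PPMI, shifted PPMI and smoothed PPMI."""
        total = C.sum()
        p_wc = C / total
        p_w = C.sum(axis=1, keepdims=True) / total
        p_c = C.sum(axis=0, keepdims=True) / total
        p_c_alpha = C.sum(axis=0, keepdims=True) ** alpha
        p_c_alpha = p_c_alpha / p_c_alpha.sum()

        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_wc / (p_w * p_c))
            pmi_alpha = np.log(p_wc / (p_w * p_c_alpha))

        pmi[~np.isfinite(pmi)] = 0.0          # the "PMI = 0 when count(w, c) = 0" convention
        pmi_alpha[~np.isfinite(pmi_alpha)] = 0.0
        ppmi = np.maximum(pmi, 0.0)
        sppmi = np.maximum(pmi - np.log(k), 0.0)   # shifted by log k
        return ppmi, sppmi, np.maximum(pmi_alpha, 0.0)

    C = np.array([[10, 0, 3], [2, 5, 0], [0, 1, 8]], dtype=float)
    print(ppmi_variants(C)[0])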

[Schnabel et al., 2015]

- recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno, 2018]

- word2vec skip-gram re-run several times with the same parameters, to measure how stable the resulting word similarities are (a small check in this spirit is sketched below)
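A sketch in the spirit of that experiment: retrain skip-gram with identical hyper-parameters but different seeds and measure how much a query word's nearest neighbours overlap across runs; the corpus is assumed to be available as a list of tokenised sentences.

    from gensim.models import Word2Vec

    def top_neighbours(corpus, query, seed, topn=10):
        model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1,
                         seed=seed, workers=1)   # workers=1 limits extra nondeterminism
        return {w for w, _ in model.wv.most_similar(query, topn=topn)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # runs = [top_neighbours(corpus, "film", seed=s) for s in range(5)]
    # print([jaccard(runs[0], r) for r in runs[1:]])   # often well below 1.0 on small corpora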

And what about infrequent words? [Jakubina and Langlais, 2017]

                 1k-low                   1k-high
                 TOP1   TOP5   TOP20      TOP1   TOP5   TOP20
    embedding     2.2    6.1    11.9      21.7   34.2   44.9
    context       2.0    4.3     7.6      19.0   32.7   44.3
    document      0.7    2.3     5.0        —      —      —
    oracle        4.6     —     19.0      31.8     —    57.6

- Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
- |V_EN| = 7.3M, |V_FR| = 3.6M
- 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
- rare = frequency < 26 (92% of the words of V_EN)

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P S, Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.


Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 9: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

matrice de co-occurence termestimespatron

I similarite de relations

I hypothese si deux paires de mots ont des representations(lignes) similaires alors elles sont similaires X of Y Y of X X forY Y for X X to Y et Y to X

I une liste de 64 mots comme of for ou toI formant 128 patrons (colonnes) contenant la paire (XY)

[Turney 2005]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 10: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 11: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Au menu

I un modele vedette Word2Vec [Mikolov et al 2013a]

I proprietes des embeddings [Mikolov et al 2013d Mikolov et al 2013c]

I des resultats glory [Baroni et al 2014] moderation[Levy et al 2015]

I cool works [Faruqui and Dyer 2015 Faruqui et al 2015bFaruqui et al 2015a]

I modeles bilingues [Mikolov et al 2013b Chandar et al 2014Gouws et al 2015 Coulmance et al 2016]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une revolution chez les ldquodistributionnalistesrdquo Word2Vec [Mikolov et al 2013a]

I un toolkit rapide implementant deux modeles

I httpscodegooglecomarchivepword2vecI https

radimrehurekcomgensimmodelsword2vechtmlI httpsgithubcomdavword2vec

I des embeddings disponibles entraınes sur 6B de mots deGoogle News (180K mots) - dimension = 300

I directement utilisable dans de nombreuses applications

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Les 2 modeles de Word2Vec[Mikolov et al 2013a]

I Skip-gram est le plus populaire (plus fiable pour les ldquopetitsrdquocorpus)

I CBOW est plus rapide (bien pour les grands corpus)felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

• Skip-Gram pre-trained on 300B words [Mikolov et al., 2013a]
• GloVe pre-trained on 6B words [Pennington et al., 2014]
• LSA obtained from a co-occurrence matrix computed on 1B words of Wikipedia [Turney and Pantel, 2010]
• Ling Dense: dimensionality reduction of the linguistic vectors with SVD
• tasks: similarity, sentiment analysis (positive/negative), NP-bracketing (local (phone company) versus (local phone) company)


Retrofitting vectors to a lexico-semantic resource [Faruqui et al., 2015a]

• a post-processing step applicable to any word vector representation
• fast (5 seconds for 100k words at dimension 300)
• idea: use the lexico-semantic information of a resource to improve an existing representation
• how: encourage each word to stay close to its learned vector while also staying close to the vectors of its neighbours in the resource (encoded as a graph); see the sketch below
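Here is a minimal numpy sketch of that idea, assuming the usual iterative formulation in which each vector is pulled toward its original value and toward its neighbours in the resource graph (the weights and the toy graph below are illustrative):

import numpy as np

def retrofit(Q_hat, edges, n_iters=10, alpha=1.0, beta=1.0):
    # Q_hat: (V, d) original embeddings; edges: word index -> list of neighbour indices
    Q = Q_hat.copy()
    for _ in range(n_iters):
        for i, neigh in edges.items():
            if not neigh:
                continue
            # coordinate update balancing the original vector (alpha) and the neighbours (beta)
            Q[i] = (alpha * Q_hat[i] + beta * Q[neigh].sum(axis=0)) / (alpha + beta * len(neigh))
    return Q

# toy usage: 5 words in dimension 3; word 0 is linked to words 1 and 2 in the resource
Q_hat = np.random.randn(5, 3)
Q = retrofit(Q_hat, {0: [1, 2], 1: [0], 2: [0], 3: [], 4: []})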


A community getting organized [Faruqui and Dyer, 2014]

• already-trained embeddings
• a suite of tests that can be run (similarity, analogy, completion, etc.)
• a visualization interface
• note: not clear that the site is very popular (or kept up to date) at the moment
• http://wordvectors.org/demo.php


Mikolov strikes again [Mikolov et al., 2013b]


Mikolov strikes again [Mikolov et al., 2013b]

• a linear transformation (rotation + scaling) from one space to another can be learned from a bilingual lexicon (x_i, z_i):

  W* = argmin_W Σ_i ‖W x_i − z_i‖²

  where x_i and z_i denote the vector representations of the source word x_i and of the target word z_i, respectively

• W is optimized by gradient descent on a lexicon of about 5k word pairs

• at test time, a word x is translated by ẑ:

  ẑ = argmax_z cos(z, W x)
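A small numpy sketch of this pipeline; it uses the closed-form least-squares solution instead of the paper's gradient descent, and all shapes and data below are toy placeholders:

import numpy as np

def learn_mapping(X, Z):
    # X: (n, d_src) source vectors, Z: (n, d_tgt) target vectors of the seed lexicon
    # least-squares W minimizing sum_i ||W x_i - z_i||^2  (solve X W^T ~ Z)
    return np.linalg.pinv(X).dot(Z).T

def translate(x, W, Z_vocab):
    # index of the target word whose vector has the highest cosine with W x
    y = W.dot(x)
    sims = Z_vocab.dot(y) / (np.linalg.norm(Z_vocab, axis=1) * np.linalg.norm(y) + 1e-12)
    return int(np.argmax(sims))

X, Z = np.random.randn(5000, 300), np.random.randn(5000, 200)   # 5k seed pairs
W = learn_mapping(X, Z)                                         # (200, 300)
best = translate(np.random.randn(300), W, np.random.randn(50000, 200))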


Mikolov strikes again [Mikolov et al., 2013b]

• the 6K most frequent source words, translated with Google Translate
• the first 5K entries are used to estimate W
• the next 1K are used for testing
• baselines: edit distance, ε-Rapp


Mikolov strikes again [Mikolov et al., 2013b]


More data (Google News)

• same split: 5K train, 1K test


Plan

(Before Deep) the vector space model

And then came the "Deep": Word2Vec; Analogy; Meta-embeddings; Evaluation; Interesting ideas; The bilingual case

Evaluation


On the difficulty of evaluating without bias [Levy et al., 2015]

• compares 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
• studies their parameters in detail
• adapts choices made in Skip-Gram to the other methods when possible
• take-away:
  • a tie in performance (no clear advantage of one approach over another)
  • Skip-Gram behaves better (time/memory) than the other approaches


On the difficulty of evaluating without bias [Levy et al., 2015]


Example observation [Levy et al., 2015]

• in the co-occurrence matrix approach, the association between a word w and a context c is scored by

  PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]

• a common choice is to set PMI to 0 when count(w, c) = 0 (rather than −∞)
• another is to use PPMI(w, c) = max(PMI(w, c), 0)
• adaptations of choices made in Skip-Gram:
  • shifted PPMI: SPPMI(w, c) = max(PMI(w, c) − log k, 0)
  • sampling of the k negative examples (smoothed with α = 0.75):

    PMI_α(w, c) = log [ p(w, c) / (p(w) P_α(c)) ]   with   P_α(c) = #(c)^α / Σ_c #(c)^α

(a small sketch of these variants follows)
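A compact numpy sketch of these variants (the function and its arguments are mine; unseen pairs are mapped to 0 as described above):

import numpy as np

def ppmi(C, k=1, alpha=None):
    # C: word-by-context co-occurrence count matrix
    # k: number of negative samples (k=1 -> plain PPMI, k>1 -> shifted SPPMI)
    # alpha: optional context-distribution smoothing exponent (e.g. 0.75)
    total = C.sum()
    p_wc = C / total
    p_w = C.sum(axis=1, keepdims=True) / total
    c_counts = C.sum(axis=0, keepdims=True)
    p_c = (c_counts ** alpha) / (c_counts ** alpha).sum() if alpha else c_counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0            # PMI of unseen pairs set to 0 rather than -inf
    return np.maximum(pmi - np.log(k), 0)   # max(PMI - log k, 0)

C = np.array([[10., 0., 2.],
              [ 3., 5., 0.]])
print(ppmi(C))                    # plain PPMI
print(ppmi(C, k=5, alpha=0.75))   # SPPMI with smoothed context distribution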


[Schnabel et al., 2015]

• recommend not using an extrinsic task to evaluate pre-trained embeddings


[Antoniak and Mimno, 2018]

• word2vec skip-gram re-run several times with the same parameters: the nearest neighbours of a word can vary from run to run (a simple overlap measure is sketched below)
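A minimal numpy sketch of one way to quantify this instability, via the Jaccard overlap of nearest-neighbour sets across two runs (an illustrative measure, not the exact protocol of the paper):

import numpy as np

def top_k_neighbours(E, i, k=10):
    # indices of the k nearest neighbours of word i by cosine similarity in embedding matrix E
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = En @ En[i]
    sims[i] = -np.inf                 # exclude the word itself
    return set(np.argsort(-sims)[:k])

def neighbour_overlap(E1, E2, i, k=10):
    # Jaccard overlap of the k-NN sets of word i in two embedding runs
    a, b = top_k_neighbours(E1, i, k), top_k_neighbours(E2, i, k)
    return len(a & b) / len(a | b)

# toy usage: two embeddings of the same 1000-word vocabulary from two training runs
E1, E2 = np.random.randn(1000, 50), np.random.randn(1000, 50)
print(neighbour_overlap(E1, E2, i=42))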


And what about low-frequency words? [Jakubina and Langlais, 2017]


And what about low-frequency words?

                 1k-low                    1k-high
             TOP1   TOP5   TOP20      TOP1   TOP5   TOP20
embedding     2.2    6.1    11.9      21.7   34.2    44.9
context       2.0    4.3     7.6      19.0   32.7    44.3
document      0.7    2.3     5.0        —      —       —
oracle        4.6     —     19.0      31.8     —     57.6

• Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
• |V_EN| = 7.3M, |V_FR| = 3.6M
• 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
• rare = frequency < 26 (which covers 92% of the words of V_EN)


Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

  • (Before Deep) vector space model
  • And then came the "Deep"
    • Word2Vec
    • Analogy
    • Meta-embeddings
    • Evaluation
    • Interesting ideas
    • The bilingual case
      • Evaluation
Page 14: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I C un corpus drsquoentraınement aka un ensemble D de paires(w c) ou w est un mot de C et c est un mot vu dans un contextenote le modele represente differemment les mots de contextedes mots du vocabulaire

I Soit (w c) appartient-elle a C p(D = 1|w c θ) la probabiliteassociee

I Optimise par descente de gradient

L = argmaxθ

prod(wc)isinD

p(D = 1|w c θ)prod

(wc)isinDprime1minus p(D = 1|w c θ)

ou vc (resp vw) est le vecteur de c (resp w)

I Dprime est construit en choisissant k paires aleatoirement selon lesdistributions unigrammes (des mots et des mots de contextes)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Skip-gram [Mikolov et al 2013a]

I en posant σ(x) = 11+eminusx p(D = 1|w c θ) = σ(vcvw) alors

L = argmaxθ

sum(wc)isinD

log σ(vcvw) +sum

(wc)isinDprimelog σ(minusvcvw)

I les contextes sont definis par une fenetre centree autours dumot w considere et dont la taille est tiree aleatoirement (etuniformement sur un intervalle fixe)

I les mots les plus frequents sont sous-echantillonnes (retiresaleatoirement de C) et les mots peu frequents sont elimines(cut-off)

I ca marche (lire [Levy and Goldberg 2014] pour une explication)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Polyglot [Al-Rfou et al 2013]I 100 langues (Wikipedia)I entraıne a scorer des phrases du corpus mieux que des phrases

dans lesquelles ont a remplace un mot

I FastText [Bojanowski et al 2016]I 294 langues (Wikipedia)I skip-gram ou les mots sont representes par des sacs de n-grams

(caractere) Un embedding pour un mot inconnu peut donc etrecalcule

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Autres embeddings pre-entraınes

I Glove [Pennington et al 2014]

glove6Bzip (Wikipedia+GigaWord 2014 |V |=400Kd isin 50 100 200 300 822Mo)

glove42B300dzip (Common Crawl |V |=19M uncasedd = 300 175 Go)

glove840B300dzip (Common Crawl |V |=22M casedd = 300 203 Go)

glovetwitter27Bzip (2B tweets |V |=12M uncasedd isin 25 50 100 200 142 Go)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Arithmetique analogique des representations[Mikolov et al 2013d]

I vec(Madrid) - vec(Spain) vec(Paris) - vec(France)

I permet de resoudre des equations analogiques [x y z ]

1 calculer t = vec(y)minus vec(x) + vec(z) le vecteur cible2 rechercher dans V le mot t le plus proche de t

t = argmaxw

vec(w)vec(t)

||vec(w)|| times ||vec(t)||

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013d]

I RNN entraıne sur 320M de mots (V = 82k)

I test set de 8k analogies impliquant les mots les plus frequents

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I 6B de mots de Google News 1M de mots les plus frequents

I le test syntaxique est le meme que dans [Mikolov et al 2013d]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Comparaison a drsquoautres modeles proposes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al 2013c]

I Big Data (plus de donnees dimension plus elevee)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

A community getting organised [Faruqui and Dyer 2014]

- already-trained embeddings

- a suite of tests that can be run (similarity, analogy, completion, etc.)

- a visualisation interface

- note: not certain the site is very popular (or kept up to date) at the moment

- http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al 2013b]

- a linear transformation (rotation + scaling) from one embedding space to another can be learned from a bilingual lexicon (x_i, z_i):

  W* = argmin_W Σ_i || W x_i − z_i ||²

  where x_i and z_i denote the source-language vector of x_i and the target-language vector of z_i, respectively

- W is optimised by gradient descent on a lexicon of about 5k word pairs

- at test time, a word x is translated by

  ẑ = argmax_z cos(z, W x)
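A minimal sketch of this mapping on made-up source/target matrices; it uses a direct least-squares solve rather than the paper's gradient descent, which reaches the same minimiser:

    import numpy as np

    rng = np.random.default_rng(0)
    d_src, d_tgt, n_pairs = 4, 3, 50

    # rows of X are source-language vectors x_i, rows of Z the corresponding target vectors z_i
    X = rng.normal(size=(n_pairs, d_src))
    Z = rng.normal(size=(n_pairs, d_tgt))

    # W minimising sum_i ||W x_i - z_i||^2, solved as a linear least-squares problem
    W, *_ = np.linalg.lstsq(X, Z, rcond=None)
    W = W.T                                   # shape (d_tgt, d_src), so that z ≈ W @ x

    def translate(x, target_matrix, k=1):
        """Indices of the k target rows whose vectors are most cosine-similar to W x."""
        y = W @ x
        sims = target_matrix @ y / (np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(y) + 1e-12)
        return np.argsort(-sims)[:k]

    print(translate(X[0], Z))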

Mikolov strikes again [Mikolov et al 2013b]

- the 6K most frequent source words, translated with Google Translate

- the first 5K entries are used to estimate W

- the next 1K are used for testing

- baselines: edit distance, ε-Rapp

More data (Google News)

- same split: 5K train, 1K test

Plan

(Before Deep) the vector-space model

And then came the "Deep"
  Word2Vec
  Analogy
  Meta-embeddings
  Evaluation
  Interesting ideas
  The bilingual case

Evaluation

On the difficulty of evaluating without bias [Levy et al 2015]

- compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe

- study their parameters in detail

- adapt choices made in Skip-Gram to the other methods when possible

- Take-away:
  - a draw in performance (no clear advantage of one approach over the others)
  - Skip-Gram behaves better (time/memory) than the other approaches

Example of an observation [Levy et al 2015]

- in the co-occurrence-matrix approach, a word w and a context c are scored by

  PMI(w, c) = log [ p(w, c) / ( p(w) p(c) ) ]

- a common practice is to set the PMI to 0 when #(w, c) = 0 (rather than −∞)

- another is to use PPMI(w, c) = max( PMI(w, c), 0 )

- adaptations of choices made in Skip-Gram:

  - shifted PPMI: SPPMI(w, c) = max( PMI(w, c) − log k, 0 )

  - smoothing of the k negative examples (with α = 0.75):

    PMI_α(w, c) = log [ p(w, c) / ( p(w) P_α(c) ) ]   with   P_α(c) = #(c)^α / Σ_c #(c)^α
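A minimal sketch of these weightings on a toy count matrix (the counts are made up; k and α follow the values quoted above):

    import numpy as np

    # toy word-context co-occurrence counts: rows = words, columns = contexts
    counts = np.array([[10., 0., 2.],
                       [ 3., 5., 0.],
                       [ 0., 1., 8.]])

    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[counts == 0] = 0.0                      # common practice: 0 instead of -inf

    ppmi = np.maximum(pmi, 0.0)                 # PPMI
    k = 5
    sppmi = np.maximum(pmi - np.log(k), 0.0)    # shifted PPMI (Skip-Gram-inspired)

    alpha = 0.75
    p_c_alpha = counts.sum(axis=0) ** alpha
    p_c_alpha /= p_c_alpha.sum()                # smoothed context distribution P_alpha(c)
    with np.errstate(divide="ignore"):
        pmi_alpha = np.log(p_wc / (p_w * p_c_alpha))
    pmi_alpha[counts == 0] = 0.0

    print(ppmi.round(2))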

[Schnabel et al 2015]

- recommend not relying on an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno 2018]

- word2vec skip-gram re-run several times with the same parameters (to measure the stability of the induced word similarities)

And what about low-frequency words? [Jakubina and Langlais 2017]

                1k-low                  1k-high
             TOP1  TOP5  TOP20      TOP1  TOP5  TOP20
  embedding   2.2   6.1   11.9      21.7  34.2   44.9
  context     2.0   4.3    7.6      19.0  32.7   44.3
  document    0.7   2.3    5.0        —     —      —
  oracle      4.6    —    19.0      31.8    —    57.6

- Wikipedia dump of June 2013 (EN 3.5M, FR 1.3M articles)

- |V_EN| = 7.3M, |V_FR| = 3.6M

- 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)

- rare = frequency < 26 (92% of the words of V_EN)


Skip-gram [Mikolov et al 2013a]

- writing σ(x) = 1 / (1 + e^(−x)) and p(D = 1 | w, c; θ) = σ(v_c · v_w), the objective is

  L = argmax_θ  Σ_{(w,c) ∈ D} log σ(v_c · v_w)  +  Σ_{(w,c) ∈ D'} log σ(−v_c · v_w)

- contexts are defined by a window centred on the word w under consideration, whose size is drawn at random (uniformly over a fixed interval)

- the most frequent words are sub-sampled (randomly removed from C) and low-frequency words are eliminated (cut-off)

- it works (see [Levy and Goldberg 2014] for an explanation); a small sketch of the objective follows below
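A minimal sketch of that negative-sampling objective on made-up vectors (this is not the word2vec implementation; D and D_prime are toy positive and negative (word, context) pairs):

    import numpy as np

    rng = np.random.default_rng(0)
    dim, vocab = 8, 5
    V_w = rng.normal(scale=0.1, size=(vocab, dim))   # word vectors
    V_c = rng.normal(scale=0.1, size=(vocab, dim))   # context vectors

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # toy positive pairs observed in the corpus and sampled negative pairs
    D       = [(0, 1), (0, 2), (3, 4)]
    D_prime = [(0, 4), (3, 1)]

    def objective():
        pos = sum(np.log(sigmoid( V_c[c] @ V_w[w])) for w, c in D)
        neg = sum(np.log(sigmoid(-V_c[c] @ V_w[w])) for w, c in D_prime)
        return pos + neg          # maximised w.r.t. the vectors during training

    print(objective())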

Other pre-trained embeddings

- Polyglot [Al-Rfou et al 2013]
  - 100 languages (Wikipedia)
  - trained to score corpus sentences higher than sentences in which one word has been replaced

- FastText [Bojanowski et al 2016]
  - 294 languages (Wikipedia)
  - skip-gram where words are represented by bags of character n-grams; an embedding can therefore be computed even for an unknown word (see the sketch below)
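A minimal sketch of the subword idea, assuming made-up n-gram vectors (this is not the fastText library; it only shows how an out-of-vocabulary word can still get a vector):

    import numpy as np

    def char_ngrams(word, n_min=3, n_max=6):
        """Character n-grams of '<word>', with boundary markers as in fastText."""
        w = f"<{word}>"
        return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

    dim = 4
    rng = np.random.default_rng(0)
    ngram_vectors = {}                      # toy n-gram vector table, filled lazily

    def vector(word):
        """Sum of the vectors of the word's character n-grams (OOV words still get one)."""
        grams = char_ngrams(word)
        for g in grams:
            ngram_vectors.setdefault(g, rng.normal(size=dim))
        return sum(ngram_vectors[g] for g in grams)

    print(char_ngrams("where", 3, 3))       # ['<wh', 'whe', 'her', 'ere', 're>']
    print(vector("unseenword").shape)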

Other pre-trained embeddings

- GloVe [Pennington et al 2014]
  - glove.6B.zip (Wikipedia + Gigaword 2014, |V| = 400K, d ∈ {50, 100, 200, 300}, 822 MB)
  - glove.42B.300d.zip (Common Crawl, |V| = 1.9M, uncased, d = 300, 1.75 GB)
  - glove.840B.300d.zip (Common Crawl, |V| = 2.2M, cased, d = 300, 2.03 GB)
  - glove.twitter.27B.zip (2B tweets, |V| = 1.2M, uncased, d ∈ {25, 50, 100, 200}, 1.42 GB)
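The unzipped archives contain plain-text files, one word per line followed by its values; a minimal loading sketch (the file name below is an assumption, here the 100-dimensional file of glove.6B):

    import numpy as np

    def load_glove(path):
        """Read a GloVe text file into a {word: vector} dictionary."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                vectors[word] = np.asarray(values, dtype=np.float32)
        return vectors

    # vectors = load_glove("glove.6B.100d.txt")
    # print(vectors["king"][:5])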

Analogical arithmetic on representations [Mikolov et al 2013d]

- vec(Madrid) − vec(Spain) ≈ vec(Paris) − vec(France)

- this makes it possible to solve analogy equations [x, y, z]:
  1. compute t = vec(y) − vec(x) + vec(z), the target vector
  2. search V for the word t̂ closest to t:

     t̂ = argmax_w  [ vec(w) · vec(t) ] / ( ||vec(w)|| × ||vec(t)|| )
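A minimal sketch of this search over a toy vocabulary matrix (words and vectors are made up, so the answer is not meaningful; it only shows the cosine-based argmax):

    import numpy as np

    rng = np.random.default_rng(0)
    words = ["madrid", "spain", "paris", "france", "berlin"]
    E = rng.normal(size=(len(words), 50))            # toy embedding matrix, one row per word
    idx = {w: i for i, w in enumerate(words)}

    def analogy(x, y, z, exclude_inputs=True):
        """Return the word w maximising cos(vec(w), vec(y) - vec(x) + vec(z))."""
        t = E[idx[y]] - E[idx[x]] + E[idx[z]]
        sims = E @ t / (np.linalg.norm(E, axis=1) * np.linalg.norm(t))
        if exclude_inputs:                            # usual practice: never return x, y or z
            sims[[idx[x], idx[y], idx[z]]] = -np.inf
        return words[int(np.argmax(sims))]

    print(analogy("spain", "madrid", "france"))      # hoped-for answer: "paris" (toy vectors!)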

[Mikolov et al 2013d]

- an RNN trained on 320M words (V = 82k)

- test set of 8k analogies involving the most frequent words

[Mikolov et al 2013c]

- 6B words of Google News, the 1M most frequent words

- the syntactic test is the same as in [Mikolov et al 2013d]

[Mikolov et al 2013c]

- comparison with other proposed models

[Mikolov et al 2013c]

- Big Data (more data, higher dimensionality)

Meta-embeddings

- idea: can several vector representations be combined to create new, more effective ones?

- 2 simple but nonetheless useful approaches (better results than the individual representations), sketched below:
  - concatenate the representations [Bollegala and Bao 2018]
  - average them (normalise, and pad the lower-dimensional representations with 0s) [Coates and Bollegala 2018]
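A minimal sketch of both combinations on two toy representations of different dimensions (vectors made up; normalisation and zero-padding as described above):

    import numpy as np

    rng = np.random.default_rng(0)
    emb_a = {"cat": rng.normal(size=4), "dog": rng.normal(size=4)}   # toy source embedding, d=4
    emb_b = {"cat": rng.normal(size=6), "dog": rng.normal(size=6)}   # toy source embedding, d=6

    def unit(v):
        return v / (np.linalg.norm(v) + 1e-12)

    def concat_meta(word):
        """Meta-embedding by concatenation of the (normalised) source vectors."""
        return np.concatenate([unit(emb_a[word]), unit(emb_b[word])])

    def average_meta(word):
        """Meta-embedding by averaging: normalise, zero-pad to the largest dimension, average."""
        d = max(len(emb_a[word]), len(emb_b[word]))
        padded = [np.pad(unit(e[word]), (0, d - len(e[word]))) for e in (emb_a, emb_b)]
        return np.mean(padded, axis=0)

    print(concat_meta("cat").shape, average_meta("cat").shape)   # (10,) (6,)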

Don't count, predict! [Baroni et al 2014]

- many tasks; a study of the meta-parameters of each method

- cnt = count vectors, pre = word2vec, dm = [Baroni and Lenci 2010], cw = [Collobert et al 2011]


References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188.

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 18: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Analogical arithmetic on representations [Mikolov et al 2013d]

• vec(Madrid) − vec(Spain) ≈ vec(Paris) − vec(France)

• this makes it possible to solve analogy equations [x, y, z]:

  1. compute t = vec(y) − vec(x) + vec(z), the target vector
  2. search the vocabulary V for the word t̂ closest to t (a small sketch follows below):

     t̂ = argmax_w  ( vec(w) · vec(t) ) / ( ||vec(w)|| × ||vec(t)|| )
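As a concrete illustration, here is a minimal sketch (toy random vectors and hypothetical helper names, not code from the paper) of answering such an analogy query with cosine similarity:

```python
import numpy as np

# Hypothetical toy embeddings; in practice these would be learned vectors.
vocab = ["madrid", "spain", "paris", "france", "berlin", "germany"]
E = np.random.RandomState(0).randn(len(vocab), 50)
E /= np.linalg.norm(E, axis=1, keepdims=True)            # unit-normalize rows
word2id = {w: i for i, w in enumerate(vocab)}

def solve_analogy(x, y, z, exclude_queries=True):
    """Return the word t maximizing cos(vec(t), vec(y) - vec(x) + vec(z))."""
    target = E[word2id[y]] - E[word2id[x]] + E[word2id[z]]
    scores = E @ target / (np.linalg.norm(target) + 1e-12)  # cosine (rows already unit norm)
    if exclude_queries:                                      # usual convention: skip x, y, z
        for w in (x, y, z):
            scores[word2id[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

print(solve_analogy("madrid", "spain", "paris"))  # hopefully "france" with real embeddings
```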

[Mikolov et al 2013d]

• RNN trained on 320M words (V = 82k)

• test set of 8k analogies involving the most frequent words

[Mikolov et al 2013c]

• 6B words of Google News, 1M most frequent words

• the syntactic test is the same as in [Mikolov et al 2013d]

[Mikolov et al 2013c]

• comparison with other proposed models

[Mikolov et al 2013c]

• Big Data (more data, higher dimensionality)

Meta-embeddings

• idea: can several vector representations be combined to create new, more effective ones?

• 2 simple yet useful approaches (better results than the individual representations); a sketch of both follows below:

  • concatenate the representations [Bollegala and Bao 2018]
  • average them (normalize, and pad the lower-dimensional representations with 0s) [Coates and Bollegala 2018]
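A minimal sketch of the two combination schemes, assuming two pre-trained embedding matrices E1 and E2 whose rows are aligned on the same vocabulary (toy random data here):

```python
import numpy as np

def concat_meta(E1, E2):
    """Concatenation meta-embedding: one row per word, dimensions added up."""
    return np.hstack([E1, E2])

def average_meta(E1, E2):
    """Averaging meta-embedding: L2-normalize each source, zero-pad the
    lower-dimensional one, then average row by row."""
    def normalize(E):
        return E / (np.linalg.norm(E, axis=1, keepdims=True) + 1e-12)
    E1, E2 = normalize(E1), normalize(E2)
    d = max(E1.shape[1], E2.shape[1])
    pad = lambda E: np.pad(E, ((0, 0), (0, d - E.shape[1])))
    return (pad(E1) + pad(E2)) / 2.0

# Toy example: 5 words, one 100-d source and one 50-d source.
rng = np.random.RandomState(0)
E1, E2 = rng.randn(5, 100), rng.randn(5, 50)
print(concat_meta(E1, E2).shape)   # (5, 150)
print(average_meta(E1, E2).shape)  # (5, 100)
```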

Don't count, predict! [Baroni et al 2014]

• many tasks, and a study of the meta-parameters of each method

Don't count, predict! [Baroni et al 2014]

• cnt = count vector, pre = word2vec, dm = [Baroni and Lenci 2010], cw = [Collobert et al 2011]

Don't count, predict! [Baroni et al 2014]

"we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts. [...] we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

• built from linguistic resources (WordNet, PTB, FrameNet, etc.)

• very sparse vectors

• comparable in performance to state-of-the-art distributional models trained on billions of words

• vectors available (for English): https://github.com/mfaruqui/non-distributional

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

(binary) features induced for film:

  SYNSET.FILM.V.01, SYNSET.FILM.N.01
  HYPO.COLLAGE.FILM.N.01, HYPER.SHEET.N.06

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

  supersenses   for nouns, verbs and adjectives
                e.g. lioness ⇒ SS.NOUN.ANIMAL

  color         word-colour lexicon built by crowdsourcing [Mohammad 2011]
                e.g. blood ⇒ COLOR.RED

  emotion       lexicon associating a word with its polarity (positive/negative) and with emotions
                (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney 2013]
                e.g. cannibal ⇒ POL.NEG, EMO.DISGUST, EMO.FEAR, COLOR.RED

  pos           PTB part-of-speech tags
                e.g. love ⇒ PTB.NOUN, PTB.VERB

(a toy sketch of building such binary vectors follows below)
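A toy sketch of how such binary, lexicon-derived vectors can be assembled; the lexicon entries below are made up for illustration and are not Faruqui and Dyer's actual resources:

```python
from scipy.sparse import dok_matrix

# Hypothetical toy lexicons (the real ones come from WordNet, the PTB,
# crowdsourced colour and emotion lexicons, etc.).
lexicons = {
    "SS.NOUN.ANIMAL": {"lioness", "cat"},
    "COLOR.RED":      {"blood"},
    "POL.NEG":        {"cannibal"},
    "EMO.FEAR":       {"cannibal"},
    "PTB.NOUN":       {"love", "blood", "cat"},
    "PTB.VERB":       {"love"},
}

vocab = sorted({w for ws in lexicons.values() for w in ws})
features = sorted(lexicons)
word2id = {w: i for i, w in enumerate(vocab)}
feat2id = {f: j for j, f in enumerate(features)}

# One sparse binary row per word: 1 iff the word carries the feature.
X = dok_matrix((len(vocab), len(features)), dtype=int)
for f, words in lexicons.items():
    for w in words:
        X[word2id[w], feat2id[f]] = 1

print(X.toarray()[word2id["cannibal"]])  # binary feature vector for "cannibal"
```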

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

• note: hard to do for every language

Binary (non-distributional) vector representations [Faruqui and Dyer 2015]

• Skip-Gram pre-trained on 300B words [Mikolov et al 2013a]
• Glove pre-trained on 6B words [Pennington et al 2014]
• LSA obtained from a co-occurrence matrix computed on 1B words of Wikipedia [Turney and Pantel 2010]
• Ling Dense: dimensionality reduction with SVD (a small SVD sketch follows below)
• tasks: similarity, sentiment analysis (positive/negative), NP-bracketing
  (local (phone company) versus (local phone) company)
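A minimal sketch of the SVD densification step ("Ling Dense"), assuming a binary word-by-feature matrix X like the one in the earlier sketch (random toy data here):

```python
import numpy as np

# Hypothetical sparse binary word-by-feature matrix.
rng = np.random.RandomState(0)
X = (rng.rand(1000, 500) < 0.01).astype(float)   # 1000 words, 500 binary features

# Truncated SVD: keep the top-k singular directions to get dense vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 300                                          # target dimensionality
X_dense = U[:, :k] * S[:k]                       # one k-dim dense vector per word

print(X_dense.shape)   # (1000, 300)
```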

Retrofitting vectors to a lexico-semantic resource [Faruqui et al 2015a]

• a post-processing step applicable to any vector representation of words

• fast (5 seconds for 100k words and dimension 300)

• idea: use the lexico-semantic information of a resource to improve an existing representation

• how? encourage each word vector to stay close to its learned representation while moving closer to the vectors of its neighbours in the resource (encoded as a graph); a small sketch of the update follows below
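A small sketch of the retrofitting update under simplifying assumptions (uniform weights alpha = beta = 1 and a plain neighbour dictionary); the actual method of Faruqui et al. uses resource-specific weights:

```python
import numpy as np

def retrofit(Q_hat, neighbours, n_iters=10, alpha=1.0, beta=1.0):
    """Iteratively move each vector towards the mean of its neighbours in the
    lexical resource while keeping it close to its original vector Q_hat[i]."""
    Q = Q_hat.copy()
    for _ in range(n_iters):
        for i, nbrs in neighbours.items():
            if not nbrs:
                continue
            # weighted average of the original vector and the current neighbour vectors
            num = alpha * Q_hat[i] + beta * sum(Q[j] for j in nbrs)
            Q[i] = num / (alpha + beta * len(nbrs))
    return Q

# Toy example: 4 words, words 0 and 1 are synonyms in the resource.
rng = np.random.RandomState(0)
Q_hat = rng.randn(4, 50)
neighbours = {0: [1], 1: [0], 2: [], 3: []}
Q = retrofit(Q_hat, neighbours)
print(Q.shape)  # (4, 50)
```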

A community getting organized [Faruqui and Dyer 2014]

• pre-trained embeddings

• a suite of tests that can be run (similarity, analogy, completion, etc.)

• a visualization interface

• note: not certain the site is very popular (or kept up to date) at the moment

• http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al 2013b]

Mikolov strikes again [Mikolov et al 2013b]

• a linear transformation (rotation + scaling) from one space to the other can be learned from a bilingual lexicon (xi, zi):

    W* = argmin_W  Σ_i || W xi − zi ||²

  where xi and zi here denote the source-language vector of word xi and the target-language vector of word zi

• W is optimized by gradient descent on a lexicon of about 5k word pairs (a least-squares sketch follows below)

• at test time, a word x is translated by ẑ:

    ẑ = argmax_z  cos(z, W x)
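A minimal sketch of this bilingual mapping with random stand-in data; ordinary least squares is used here in place of the paper's gradient descent, since it minimizes the same objective:

```python
import numpy as np

rng = np.random.RandomState(0)
d_src, d_tgt, n_pairs, V_tgt = 100, 50, 5000, 20000

# Hypothetical data: source/target vectors for a 5k-pair seed lexicon,
# plus a target-side vocabulary matrix to search at test time.
X = rng.randn(n_pairs, d_src)      # source vectors x_i
Z = rng.randn(n_pairs, d_tgt)      # target vectors z_i
T = rng.randn(V_tgt, d_tgt)        # all target-language word vectors

# Fit W minimizing sum_i ||x_i W - z_i||^2 (W applied on the right here,
# i.e. the transpose of the paper's convention).
W, *_ = np.linalg.lstsq(X, Z, rcond=None)   # shape (d_src, d_tgt)

def translate(x, k=5):
    """Indices of the k target words whose vectors are most cosine-similar to W x."""
    proj = x @ W
    sims = (T @ proj) / (np.linalg.norm(T, axis=1) * np.linalg.norm(proj) + 1e-12)
    return np.argsort(-sims)[:k]

print(translate(X[0]))
```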

Mikolov strikes again [Mikolov et al 2013b]

• the 6K most frequent source words translated with Google Translate

• first 5K entries used to compute W

• next 1K used for testing

• baselines: edit distance, ε-Rapp

Mikolov strikes again [Mikolov et al 2013b]

More data (Google News)

• same split: 5K train, 1K test

Plan

(Before Deep) vector space model

And then came the "Deep"
  Word2Vec
  Analogy
  Meta-embeddings
  Evaluation
  Interesting ideas
  The bilingual case

Evaluation

On the difficulty of unbiased evaluation [Levy et al 2015]

• compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe

• study their parameters in detail

• adapt choices made in Skip-Gram to the other methods where possible

• Takeaway:

  • a performance tie (no clear advantage of one approach over another)
  • Skip-Gram behaves better (time/memory) than the other approaches

On the difficulty of unbiased evaluation [Levy et al 2015]

Example of an observation [Levy et al 2015]

• in the co-occurrence matrix approach, a word w and its context c are scored by

    PMI(w, c) = log [ p(w, c) / ( p(w) p(c) ) ]

• a common practice is to set the PMI value to 0 when #(w, c) = 0 (rather than −∞)

• another is to take PPMI(w, c) = max(PMI(w, c), 0)

• adapting choices made in Skip-Gram (a small computational sketch follows below):

  • shifting by the number k of negative examples:

      SPPMI(w, c) = max(PMI(w, c) − log k, 0)

  • smoothing the context distribution used for negative sampling (with α = 0.75):

      PMI_α(w, c) = log [ p(w, c) / ( p(w) P_α(c) ) ]   with   P_α(c) = #(c)^α / Σ_c' #(c')^α
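A small computational sketch of these PPMI variants from a toy co-occurrence count matrix (not the exact implementation evaluated by Levy et al.):

```python
import numpy as np

def ppmi_variants(C, k=5, alpha=0.75):
    """From a word-by-context co-occurrence count matrix C, compute the PPMI
    matrix and the smoothed, shifted SPPMI matrix described above."""
    total = C.sum()
    p_wc = C / total
    p_w = C.sum(axis=1, keepdims=True) / total
    p_c = C.sum(axis=0, keepdims=True) / total
    # Smoothed context distribution P_alpha(c), proportional to count(c)^alpha.
    c_counts = C.sum(axis=0, keepdims=True)
    p_c_alpha = c_counts**alpha / (c_counts**alpha).sum()

    with np.errstate(divide="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))             # -inf where the count is 0
        pmi_alpha = np.log(p_wc / (p_w * p_c_alpha))

    ppmi = np.maximum(pmi, 0)                         # zero out negatives (and -inf)
    sppmi = np.maximum(pmi_alpha - np.log(k), 0)      # shift by log k, then clip
    return ppmi, sppmi

# Toy 3-word x 4-context count matrix.
C = np.array([[10, 0, 2, 1],
              [ 3, 4, 0, 0],
              [ 0, 1, 5, 2]], dtype=float)
ppmi, sppmi = ppmi_variants(C)
print(np.round(ppmi, 2))
print(np.round(sppmi, 2))
```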

[Schnabel et al 2015]

• recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno 2018]

• word2vec skip-gram relaunched several times with the same parameters (stability of the resulting word similarities)

And what about low-frequency words? [Jakubina and Langlais 2017]

And what about low-frequency words?

                    1k-low                     1k-high
              TOP1   TOP5   TOP20       TOP1   TOP5   TOP20
  embedding    2.2    6.1    11.9       21.7   34.2   44.9
  context      2.0    4.3     7.6       19.0   32.7   44.3
  document     0.7    2.3     5.0        —      —      —
  oracle       4.6     —     19.0       31.8    —     57.6

• Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)

• |V_EN| = 7.3M, |V_FR| = 3.6M

• 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)

• rare = freq < 26 (92% of the words of V_EN)

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.


Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 22: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

[Mikolov et al., 2013c]

- Big Data (more data, higher dimensionality)

Meta-embeddings

- idea: can several vector representations be combined to build new, more effective ones?
- 2 simple yet useful approaches (better results than the individual representations); a minimal sketch follows:
  - concatenate the representations [Bollegala and Bao, 2018]
  - average them (normalize, then pad the lower-dimensional representations with 0s) [Coates and Bollegala, 2018]
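A minimal numpy sketch of these two combination schemes; the toy random matrices stand in for real source embeddings and are not part of the cited papers.

import numpy as np

def concat_meta(embs):
    # meta-embedding by concatenation of the L2-normalized sources
    return np.concatenate(
        [e / np.linalg.norm(e, axis=1, keepdims=True) for e in embs], axis=1)

def average_meta(embs):
    # meta-embedding by averaging: normalize, zero-pad to the largest dimension, average
    d_max = max(e.shape[1] for e in embs)
    padded = []
    for e in embs:
        e = e / np.linalg.norm(e, axis=1, keepdims=True)
        padded.append(np.pad(e, ((0, 0), (0, d_max - e.shape[1]))))
    return np.mean(padded, axis=0)

# two toy sources over the same 1000-word vocabulary
src_a = np.random.randn(1000, 300)
src_b = np.random.randn(1000, 100)
print(concat_meta([src_a, src_b]).shape)   # (1000, 400)
print(average_meta([src_a, src_b]).shape)  # (1000, 300)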

Don't count, predict! [Baroni et al., 2014]

- many tasks, and a study of the meta-parameters of each method

Don't count, predict! [Baroni et al., 2014]

- cnt = count vectors, pre = word2vec, dm = [Baroni and Lenci, 2010], cw = [Collobert et al., 2011]

Don't count, predict! [Baroni et al., 2014]

"we set out to conduct this study because we were annoyed by the triumphalist overtones often surrounding predict models, despite the almost complete lack of proper comparison to count vectors. Our secret wish was to discover that it is all hype, and count vectors are far superior to their predictive counterparts ... we found that the predict models are so good that, while the triumphalist overtones still sound excessive, there are very good reasons to switch to the new architecture."

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- built from linguistic resources (WordNet, PTB, FrameNet, etc.)
- very sparse vectors
- comparable in performance to state-of-the-art distributional models trained on billions of words
- vectors available (for English): https://github.com/mfaruqui/non-distributional

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

[figure]

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

(binary) features induced for "film":

SYNSET.FILM.V.01, SYNSET.FILM.N.01,
HYPO:COLLAGE_FILM.N.01, HYPER:SHEET.N.06

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

supersenses: for nouns, verbs and adjectives (e.g. lioness ⇒ SS.NOUN.ANIMAL)
color: word-colour lexicon built by crowdsourcing [Mohammad, 2011] (e.g. blood ⇒ COLOR.RED)
emotion: lexicon associating a word with its polarity (positive/negative) and with emotions (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney, 2013] (e.g. cannibal ⇒ POL.NEG, EMO.DISGUST and EMO.FEAR)
pos: PTB part-of-speech tags (e.g. love ⇒ PTB.NOUN, PTB.VERB)

(a toy construction of such sparse binary vectors is sketched below)
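A small sketch of how such sparse binary vectors can be assembled; the tiny hand-written feature lists below are made-up placeholders, not the resources used in the paper.

from scipy.sparse import lil_matrix

# made-up lexical features for a 3-word toy vocabulary
vocab = ["film", "lioness", "love"]
features_of = {
    "film":    ["SYNSET.FILM.N.01", "HYPER:SHEET.N.06"],
    "lioness": ["SS.NOUN.ANIMAL"],
    "love":    ["PTB.NOUN", "PTB.VERB"],
}

# global feature inventory -> column indices
feat_index = {f: j for j, f in
              enumerate(sorted({f for fs in features_of.values() for f in fs}))}

# one binary row per word, one column per feature (very sparse in practice)
X = lil_matrix((len(vocab), len(feat_index)), dtype=int)
for i, w in enumerate(vocab):
    for f in features_of[w]:
        X[i, feat_index[f]] = 1

print(X.toarray())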

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- note: hard to replicate for every language (it requires rich lexical resources)

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

Comparison systems and tasks:
- Skip-Gram pre-trained on 300B words [Mikolov et al., 2013a]
- GloVe pre-trained on 6B words [Pennington et al., 2014]
- LSA obtained from a co-occurrence matrix computed over 1B words of Wikipedia [Turney and Pantel, 2010]
- Ling Dense: dimensionality reduction with SVD
- tasks: similarity, sentiment analysis (positive/negative), NP-bracketing (local (phone company) versus (local phone) company)

Retrofitting vectors to a lexico-semantic resource [Faruqui et al., 2015a]

- a post-processing step applicable to any vector representation of words
- fast (5 seconds for 100k words in dimension 300)
- idea: use the lexico-semantic information of a resource to improve an existing representation
- how: encourage each word vector to stay close to its learned representation while also being close to the vectors of its neighbours in the resource (encoded as a graph); a sketch of the update follows
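A minimal sketch of the iterative retrofitting update, assuming all weights in the objective are set to 1 (a simplification of the weighting used in the paper); emb maps words to numpy vectors and lexicon maps a word to its neighbours in the resource.

import numpy as np

def retrofit(emb, lexicon, iters=10):
    # move each vector toward the average of its original vector and its neighbours
    new = {w: v.copy() for w, v in emb.items()}
    for _ in range(iters):
        for w, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if w not in new or not nbrs:
                continue
            # one term per neighbour edge + one term for the original vector
            new[w] = (sum(new[n] for n in nbrs) + emb[w]) / (len(nbrs) + 1)
    return new

# toy example: pull "movie" and "film" toward each other
emb = {"film": np.array([1.0, 0.0]), "movie": np.array([0.0, 1.0])}
lexicon = {"film": ["movie"], "movie": ["film"]}
print(retrofit(emb, lexicon)["film"])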

A community getting organized [Faruqui and Dyer, 2014]

- pre-trained embeddings
- a suite of tests that can be run (similarity, analogy, completion, etc.)
- a visualization interface
- note: not certain the site is very popular (or kept up to date) at the moment
- http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al., 2013b]

[figure]

Mikolov strikes again [Mikolov et al., 2013b]

- a linear transformation (rotation + scaling) from one embedding space to another can be learned from a bilingual lexicon {(x_i, z_i)} (sketched below):

      W = argmin_W  Σ_i || W x_i - z_i ||^2

  where x_i and z_i denote the source-side and target-side vector representations, respectively
- W is optimized by gradient descent on a lexicon of about 5k word pairs
- at test time, a word x is translated by z*:

      z* = argmax_z  cos(z, W x)
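A small numpy sketch of this mapping; for brevity W is obtained here in closed form with least squares rather than by gradient descent as in the paper, and the seed-lexicon matrices X (source) and Z (target) are assumed to hold one word pair per row.

import numpy as np

def learn_mapping(X, Z):
    # least-squares solution of min_W sum_i ||W x_i - z_i||^2
    # X: (n, d_src) source vectors, Z: (n, d_tgt) target vectors
    Wt, *_ = np.linalg.lstsq(X, Z, rcond=None)   # solves X @ W^T = Z
    return Wt.T                                  # W has shape (d_tgt, d_src)

def translate(x, W, Z_vocab):
    # index of the target vector with highest cosine similarity to W x
    proj = W @ x
    scores = (Z_vocab @ proj) / (np.linalg.norm(Z_vocab, axis=1)
                                 * np.linalg.norm(proj) + 1e-12)
    return int(np.argmax(scores))

# toy dimensions: 5000 seed pairs, 300-d source and 200-d target vectors
rng = np.random.default_rng(0)
X, Z = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 200))
W = learn_mapping(X, Z)
print(W.shape, translate(X[0], W, Z))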

Mikolov strikes again [Mikolov et al., 2013b]

- the 6K most frequent source words, translated with Google Translate
- the first 5K entries are used to estimate W
- the next 1K are used for testing
- baselines: edit distance, ε-Rapp

Mikolov strikes again [Mikolov et al., 2013b]

[figure]

More data (Google News)

- same split: 5K train, 1K test

Plan

(Before Deep) the vector-space model

And then came the "Deep"
  Word2Vec
  Analogy
  Meta-embeddings
  Evaluation
  Interesting ideas
  The bilingual case

Evaluation

On the difficulty of unbiased evaluation [Levy et al., 2015]

- compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
- study their hyper-parameters in detail
- adapt choices made in Skip-Gram to the other methods when possible
- Take-away:
  - a performance tie (no clear advantage of one approach over the others)
  - Skip-Gram behaves better (time/memory) than the other approaches

On the difficulty of unbiased evaluation [Levy et al., 2015]

[figure]

Example of an observation [Levy et al., 2015]

- in the co-occurrence matrix approach, a word w and its context c are scored by

      PMI(w, c) = log [ p(w, c) / ( p(w) p(c) ) ]

- a common practice is to set the PMI values to 0 when count(w, c) = 0 (rather than -inf)
- another is to take PPMI(w, c) = max(PMI(w, c), 0)
- adaptations of choices made in Skip-Gram (sketched below):
  - shifted PPMI with k negative samples: SPPMI(w, c) = max(PMI(w, c) - log k, 0)
  - sampling of the k negative examples from a smoothed distribution (α = 0.75):

      PMI_α(w, c) = log [ p(w, c) / ( p(w) P_α(c) ) ]   with   P_α(c) = count(c)^α / Σ_c' count(c')^α
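A compact numpy sketch of these quantities, assuming a dense word-by-context co-occurrence count matrix C (rows = words, columns = contexts).

import numpy as np

def ppmi(C, k=1.0, alpha=None):
    # PPMI matrix from a count matrix C; k > 1 gives shifted PPMI (SPPMI),
    # alpha (e.g. 0.75) smooths the context distribution as in negative sampling
    total = C.sum()
    p_wc = C / total
    p_w = C.sum(axis=1, keepdims=True) / total
    ctx_counts = C.sum(axis=0, keepdims=True)
    if alpha is not None:
        p_c = ctx_counts ** alpha / (ctx_counts ** alpha).sum()
    else:
        p_c = ctx_counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c)) - np.log(k)
    pmi[~np.isfinite(pmi)] = 0.0          # zero out the -inf cells (count = 0)
    return np.maximum(pmi, 0.0)           # keep the positive part

C = np.array([[10, 0, 2],
              [3, 5, 0]], dtype=float)
print(ppmi(C))                   # plain PPMI
print(ppmi(C, k=5, alpha=0.75))  # SPPMI with smoothed contexts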

[Schnabel et al., 2015]

- recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno, 2018]

- word2vec skip-gram re-run several times with the same hyper-parameters, to measure how stable the embedding-based word similarities are across runs; a sketch of such a comparison follows
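One simple way to quantify that stability, sketched here under the assumption that two runs produced embedding matrices over the same vocabulary: compare the nearest-neighbour sets of each word.

import numpy as np

def topk_neighbours(E, k=10):
    # indices of the k nearest neighbours (cosine) of every row of E
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    np.fill_diagonal(sims, -np.inf)   # a word is not its own neighbour
    return np.argsort(-sims, axis=1)[:, :k]

def neighbour_overlap(A, B, k=10):
    # average Jaccard overlap of the k-NN sets produced by the two runs
    na, nb = topk_neighbours(A, k), topk_neighbours(B, k)
    scores = [len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in zip(na, nb)]
    return float(np.mean(scores))

# toy example: two "runs" that only differ by noise
rng = np.random.default_rng(1)
run_a = rng.normal(size=(500, 50))
run_b = run_a + 0.1 * rng.normal(size=run_a.shape)
print(neighbour_overlap(run_a, run_b, k=10))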

And what about low-frequency words? [Jakubina and Langlais, 2017]

[figure]

And what about low-frequency words?

                 1k-low               1k-high
            TOP1  TOP5  TOP20    TOP1  TOP5  TOP20
embedding    2.2   6.1   11.9    21.7  34.2   44.9
context      2.0   4.3    7.6    19.0  32.7   44.3
document     0.7   2.3    5.0      -     -      -
oracle       4.6    -    19.0    31.8    -    57.6

- Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
- |V_EN| = 7.3M, |V_FR| = 3.6M
- 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
- rare = frequency < 26 (92% of the words of V_EN)

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Comput. Linguist., 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188.

  • (Before Deep) the vector-space model
  • And then came the "Deep"
    • Word2Vec
    • Analogy
    • Meta-embeddings
    • Evaluation
    • Interesting ideas
    • The bilingual case
      • Evaluation
Page 23: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Embeddings meta

I idee peut-on combiner plusieurs representations vectoriellespour en creer de nouvelles plus efficaes

I 2 approches simples mais neanmoins utiles (meilleurs resultatsque les representations isolees)

I concatener les representations [Bollegala and Bao 2018]I les moyenner (normaliser padder les representations de plus

faible dimension avec des 0) [Coates and Bollegala 2018]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 24: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I plein de taches une etude des meta-parametres de chaquemethode

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 25: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

I cnt = count vector pre = word2Vec dm =[Baroni and Lenci 2010] cw = [Collobert et al 2011]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

A community getting organised [Faruqui and Dyer, 2014]

• pre-trained embeddings
• a suite of tests that can be run (similarity, analogy, completion, etc.)
• a visualisation interface
• note: not clear the site is very popular (nor kept up to date) at the moment
• http://wordvectors.org/demo.php

Mikolov strikes again [Mikolov et al., 2013b]

• a linear transformation (rotation + scaling) from one embedding space to another can be learned from a bilingual lexicon of pairs (x_i, z_i):

    W = argmin_W  Σ_i ‖W x_i − z_i‖²

  where x_i and z_i denote, respectively, the source-language vector of x_i and the target-language vector of z_i
• W is optimised by gradient descent on a lexicon of about 5k word pairs
• at test time, a word x is translated by the word z whose vector maximises

    ẑ = argmax_z  cos(z, W x)

  (a numpy sketch follows below)
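As an illustration (not the original implementation, which uses gradient descent), the same objective can be fitted in closed form by least squares. The sketch below assumes two numpy matrices X (source vectors) and Z (target vectors) whose rows are aligned by the seed lexicon; all function names are mine.

```python
# Sketch: learn a linear map W between two embedding spaces from a seed lexicon
# [Mikolov et al., 2013b], then translate by cosine similarity in the target space.
import numpy as np

def fit_mapping(X, Z):
    """Least-squares solution of min_W sum_i ||W x_i - z_i||^2."""
    B, *_ = np.linalg.lstsq(X, Z, rcond=None)   # solves X @ B ~= Z
    return B.T                                   # W so that W @ x lives in target space

def translate(x, W, target_matrix, target_words, topn=5):
    """Return the topn target words whose vectors are most cosine-similar to W @ x."""
    query = W @ x
    sims = target_matrix @ query / (
        np.linalg.norm(target_matrix, axis=1) * np.linalg.norm(query) + 1e-12)
    best = np.argsort(-sims)[:topn]
    return [(target_words[i], float(sims[i])) for i in best]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 300))      # toy stand-ins for real source embeddings
    true_W = rng.normal(size=(300, 300))
    Z = X @ true_W.T                      # toy target embeddings
    W = fit_mapping(X, Z)
    print(translate(X[0], W, Z, [f"w{i}" for i in range(len(Z))]))
```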

Experimental setup:

• the 6K most frequent source words, translated with Google Translate
• the first 5K entries are used to estimate W
• the next 1K are used for testing
• baselines: edit distance, ε-Rapp

More data (Google News)

• same split: 5K pairs for training, 1K for testing

Plan

(Before Deep) the vector-space model

And then came the "Deep"
    Word2Vec
    Analogy
    Meta-embeddings
    Evaluation
    Interesting ideas
    The bilingual case

Evaluation

On the difficulty of unbiased evaluation [Levy et al., 2015]

• compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram and GloVe
• study their hyper-parameters in detail
• adapt design choices made in Skip-Gram to the other methods whenever possible
• Takeaway:
    • roughly a tie in performance (no clear advantage of one approach over another)
    • Skip-Gram behaves better (time/memory) than the other approaches

Example of an observation [Levy et al., 2015]

• in the co-occurrence-matrix approach, the association between a word w and a context c is scored by

    PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]

• a common choice is to set PMI(w, c) to 0 when the count #(w, c) = 0 (rather than −∞)
• another is positive PMI: PPMI(w, c) = max(PMI(w, c), 0)
• adapting choices made in Skip-Gram:
    • shifted PPMI, mirroring the k negative samples: SPPMI(w, c) = max(PMI(w, c) − log k, 0)
    • context-distribution smoothing of the negative samples (α = 0.75):

        PMI_α(w, c) = log [ p(w, c) / (p(w) P_α(c)) ]    with    P_α(c) = #(c)^α / Σ_c' #(c')^α

  (a small numpy sketch computing PPMI and SPPMI follows below)
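For concreteness, here is a small numpy sketch (my own illustration, using a dense matrix that only makes sense for toy vocabularies) that turns a raw word-by-context count matrix into PPMI and shifted PPMI (SPPMI) matrices, with optional context-distribution smoothing.

```python
# Sketch: PPMI and shifted PPMI (SPPMI) from a word-by-context count matrix.
# counts[i, j] = number of times word i was observed with context j.
import numpy as np

def ppmi_matrix(counts, shift_k=1, alpha=None):
    counts = counts.astype(float)
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    if alpha is None:
        p_c = counts.sum(axis=0, keepdims=True) / total
    else:
        # context-distribution smoothing: P_alpha(c) = #(c)^alpha / sum_c' #(c')^alpha
        c_counts = counts.sum(axis=0, keepdims=True) ** alpha
        p_c = c_counts / c_counts.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero out cells with #(w, c) = 0
    return np.maximum(pmi - np.log(shift_k), 0.0)     # k = 1 gives plain PPMI

if __name__ == "__main__":
    toy = np.array([[10, 0, 2], [3, 7, 0], [0, 1, 9]])
    print(ppmi_matrix(toy))                           # PPMI
    print(ppmi_matrix(toy, shift_k=5, alpha=0.75))    # SPPMI with smoothing
```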

[Schnabel et al., 2015]

• recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno, 2018]

• word2vec skip-gram re-run several times with the same hyper-parameters, to measure how stable the resulting word similarities are across runs (a small sketch for measuring this follows below)
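A simple way to quantify this instability, sketched below with plain numpy (my own illustration, not the authors' protocol), is to compare the top-k neighbour sets of each word across two independently trained embedding matrices over the same vocabulary.

```python
# Sketch: measure stability of word similarities across two embedding runs
# by the Jaccard overlap of each word's top-k nearest neighbours.
# emb_a, emb_b: (vocab_size, dim) matrices from two runs on the same corpus.
import numpy as np

def topk_neighbours(emb, idx, k=10):
    norms = np.linalg.norm(emb, axis=1) + 1e-12
    sims = emb @ emb[idx] / (norms * norms[idx])
    sims[idx] = -np.inf                       # exclude the word itself
    return set(np.argsort(-sims)[:k])

def neighbour_overlap(emb_a, emb_b, k=10):
    """Mean Jaccard overlap of top-k neighbour sets over the whole vocabulary."""
    scores = []
    for i in range(emb_a.shape[0]):
        a, b = topk_neighbours(emb_a, i, k), topk_neighbours(emb_b, i, k)
        scores.append(len(a & b) / len(a | b))
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run1 = rng.normal(size=(200, 50))
    run2 = run1 + 0.1 * rng.normal(size=(200, 50))   # stand-ins for two training runs
    print(neighbour_overlap(run1, run2, k=10))
```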

And what about rare words? [Jakubina and Langlais, 2017]

                 1k-low                  1k-high
            TOP1   TOP5   TOP20     TOP1   TOP5   TOP20
embedding    2.2    6.1   11.9      21.7   34.2   44.9
context      2.0    4.3    7.6      19.0   32.7   44.3
document     0.7    2.3    5.0       —      —      —
oracle       4.6     —    19.0      31.8    —     57.6

• Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
• |V_EN| = 7.3M, |V_FR| = 3.6M
• 2 test sets: 1k-low (1k rare words), 1k-high (1k non-rare words)
• rare = frequency < 26 (92% of the words of V_EN)

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Comput. Linguist., 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A. P., S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding: computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188.

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 26: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Donrsquot count predict [Baroni et al 2014]

we set out to conduct this study because we were annoyed bythe triumphalist overtones often surrounding predict modelsdespite the almost complete lack of proper comparison to countvectors Our secret wish was to discover that it is all hype andcount vectors are far superior to their predictive counterparts we found that the predict models are so good that while thetriumphalist overtones still sound excessive there are verygood reasons to switch to the new architecture

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 27: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I en utilisant des ressources linguistiques (WordNet PTBFrameNet etc)

I vecteurs tres creux

I comparables en performance aux modeles distributionnels etatde lrsquoart entraınes sur des billions de mots

I vecteurs disponibles (pour lrsquoanglais) httpsgithubcommfaruquinon-distributional

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 28: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 29: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

features (binaires) induitspour film

SYNSETFILMV01SYNSETFILMN01

HYPOCOLLAGEFILMN01HYPER SHEETN06

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

supersenses pour les noms les verbes et les adjectifsex lioness rArr SSNOUNANIMAL

color lexique mot-couleur elabore par crowdsourcing[Mohammad 2011]ex blood rArr COLORRED

emotion lexique associant un mot a sa polarite(positifnegatif) et aux emotions (joie peurtristesse etc) elabore par crowdsourcing[Mohammad and Turney 2013]ex cannibal rArr POLNEG EMODISGUST etEMOFEARCOLORRED

pos PTB part-of-speech tagsex loverArr PTBNOUN PTBVERB

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I note difficile a faire pour toutes les langues

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Representation vectorielle binaire (nondistributionnelle) [Faruqui and Dyer 2015]

I Skip-Gram pre-entraıne sur 300B de mots[Mikolov et al 2013a]

I Glove pre-entraıne sur 6B de mots [Pennington et al 2014]I LSA obtenue a partir drsquoune matrice de co-occurrence calculee

sur 1B de mots de Wikipedia [Turney and Pantel 2010]I Ling Dense reduction de dimensionnalite avec SVDI taches similarite sent analysis (positifnegatif) NP-bracketing

(local (phone company) versus (local phone) company )felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Retrofitting de vecteurs a une ressourcelexico-semantique [Faruqui et al 2015a]

I etape de post-traitement applicable a nrsquoimporte quellerepresentation vectorielle de mots

I rapide (5 secondes pour 100k mots et dimension 300)

I idee utiliser les informations lexico-semantiques drsquouneressource pour ameliorer une representation existante

I comment encourager que les mots de distance similaire dansla representation apprise soit proche de la representation induitede la ressource (encodee sous forme de graphe)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Une communaute qui srsquoorganise[Faruqui and Dyer 2014]

I des embeddings deja entraınes

I une suite de tests qui peuvent srsquoexecuter (similarite analogiecompletion etc)

I une interface de visualisation

I note pas certain que le site soit tres populaire (ni mis a jour)pour le moment

I httpwordvectorsorgdemophp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]

- supersenses: for nouns, verbs and adjectives
  e.g. lioness ⇒ SS.NOUN.ANIMAL
- color: a word-colour lexicon built by crowdsourcing [Mohammad, 2011]
  e.g. blood ⇒ COLOR.RED
- emotion: a lexicon associating a word with its polarity (positive/negative) and with emotions (joy, fear, sadness, etc.), built by crowdsourcing [Mohammad and Turney, 2013]
  e.g. cannibal ⇒ POL.NEG, EMO.DISGUST, EMO.FEAR, COLOR.RED
- pos: PTB part-of-speech tags
  e.g. love ⇒ PTB.NOUN, PTB.VERB
- note: such resources are difficult to build for all languages

A small sketch of how these binary feature vectors can be assembled follows below.
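As an illustration (not from the slides), here is one way to assemble such sparse binary vectors from hand-built lexicons; the tiny `lexicon` dictionary and its feature names are made up for the example.

```python
import numpy as np

# toy lexicon: word -> set of linguistic features (made-up subset for illustration)
lexicon = {
    "lioness": {"SS.NOUN.ANIMAL", "PTB.NOUN"},
    "blood":   {"COLOR.RED", "PTB.NOUN"},
    "love":    {"PTB.NOUN", "PTB.VERB", "POL.POS"},
}

# the global feature inventory defines the (interpretable) dimensions
features = sorted({f for feats in lexicon.values() for f in feats})
feat_index = {f: i for i, f in enumerate(features)}

def binary_vector(word: str) -> np.ndarray:
    """One binary dimension per linguistic feature; 1 if the lexicon assigns it."""
    vec = np.zeros(len(features))
    for f in lexicon.get(word, ()):
        vec[feat_index[f]] = 1.0
    return vec

print(features)
print(binary_vector("love"))
```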

Binary (non-distributional) vector representations [Faruqui and Dyer, 2015]: comparison

- Skip-Gram pre-trained on 300B words [Mikolov et al., 2013a]
- GloVe pre-trained on 6B words [Pennington et al., 2014]
- LSA obtained from a co-occurrence matrix computed over 1B words of Wikipedia [Turney and Pantel, 2010]
- Ling Dense: dimensionality reduction with SVD
- tasks: similarity, sentiment analysis (positive/negative), NP-bracketing
  (local (phone company) versus (local phone) company)

Retrofitting vectors to a lexico-semantic resource [Faruqui et al., 2015a]

- a post-processing step applicable to any word vector representation
- fast (5 seconds for 100k words at dimension 300)
- idea: use the lexico-semantic information of a resource to improve an existing representation
- how: encourage each word to stay close to its learned vector while also staying close to the representation induced from the resource (encoded as a graph); see the update-rule sketch below
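A minimal sketch of the retrofitting idea (my reconstruction of the usual iterative update, not code from the slides): each retrofitted vector is pulled towards its original vector and towards its neighbours in the resource graph. The `alpha` weight and the uniform neighbour weights are assumptions corresponding to common defaults.

```python
import numpy as np

def retrofit(emb: np.ndarray, graph: dict, n_iters: int = 10,
             alpha: float = 1.0) -> np.ndarray:
    """emb: (V, d) original vectors; graph: word index -> list of neighbour indices."""
    new = emb.copy()
    for _ in range(n_iters):
        for i, neighbours in graph.items():
            if not neighbours:
                continue
            beta = 1.0 / len(neighbours)        # uniform weight over graph neighbours
            # weighted average of the original vector and the neighbours' current vectors
            num = alpha * emb[i] + beta * sum(new[j] for j in neighbours)
            new[i] = num / (alpha + beta * len(neighbours))
    return new

# toy usage: 4 words in 3 dimensions, a graph saying word 0 ~ word 1 and word 2 ~ word 3
emb = np.random.default_rng(0).normal(size=(4, 3))
graph = {0: [1], 1: [0], 2: [3], 3: [2]}
retrofitted = retrofit(emb, graph)
```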

A community getting organised [Faruqui and Dyer, 2014]

- already-trained embeddings
- a suite of tests that can be run (similarity, analogy, completion, etc.)
- a visualisation interface
- note: not certain the site is very popular (or kept up to date) at the moment
- http://wordvectors.org/demo.php


Mikolov strikes again [Mikolov et al., 2013b]

- a linear transformation (rotation + scaling) from one embedding space to another can be learned from a bilingual lexicon {(x_i, z_i)}:

    W^{*} = \arg\min_{W} \sum_i \lVert W\,\mathbf{x}_i - \mathbf{z}_i \rVert^2

  where \mathbf{x}_i and \mathbf{z}_i denote the source-side vector of x_i and the target-side vector of z_i, respectively
- W is optimised by gradient descent on a lexicon of about 5k word pairs
- at test time, a word x is translated by the word z such that

    \hat{z} = \arg\max_{z} \cos(\mathbf{z}, W\,\mathbf{x})

(A NumPy sketch of this pipeline follows below.)
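A compact sketch of the pipeline (my illustration; it solves the least-squares problem in closed form with `np.linalg.lstsq` instead of the gradient descent used in the paper): learn W on seed translation pairs, then translate by cosine-nearest neighbour in the target space. The embedding matrices and the seed lexicon are assumed to be given; random matrices stand in for them here.

```python
import numpy as np

def learn_mapping(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Least-squares W such that W @ x_i ~ z_i for each seed pair (rows of the inputs)."""
    # solve min_W sum_i ||x_i W^T - z_i||^2 in closed form
    W_t, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W_t.T

def translate(x: np.ndarray, W: np.ndarray, tgt_matrix: np.ndarray, topn: int = 5):
    """Indices of the target words whose vectors are cosine-closest to W x."""
    mapped = W @ x
    mapped /= np.linalg.norm(mapped)
    tgt_normed = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    scores = tgt_normed @ mapped
    return np.argsort(-scores)[:topn]

# toy usage with random stand-ins for real source/target embeddings
rng = np.random.default_rng(0)
src_seed, tgt_seed = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W = learn_mapping(src_seed, tgt_seed)
candidates = translate(rng.normal(size=300), W, tgt_seed, topn=5)
```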

Mikolov strikes again [Mikolov et al., 2013b]: experimental setup

- the 6K most frequent source words, translated with Google Translate
- the first 5K entries are used to estimate W
- the next 1K are used for testing
- baselines: edit distance, ε-Rapp


More data (Google News)

- same split: 5K train, 1K test

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2, Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 35: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

Mikolov strikes again [Mikolov et al., 2013b]

▶ a linear transformation (rotation + scaling) from one embedding space to the other can be learned from a bilingual lexicon of pairs (x_i, z_i):

    W* = argmin_W Σ_i ‖ W x_i − z_i ‖²

  where x_i and z_i denote, respectively, the source-language vector representation of x_i and the target-language vector representation of z_i

▶ W is optimized by gradient descent on a lexicon of about 5k word pairs

▶ at test time, a word x is translated by

    ẑ = argmax_z cos(z, W x)
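A minimal numpy sketch of this bilingual mapping, assuming paired matrices X (source vectors) and Z (target vectors) built row-by-row from the seed lexicon; a closed-form least-squares fit stands in for the gradient descent used in the paper (same convex objective).

# Sketch: learn W minimizing sum_i ||W x_i - z_i||^2, then translate a source
# word by cosine similarity against its mapped vector. X: (n, d_src) and
# Z: (n, d_tgt) are built from the seed bilingual lexicon (assumed given).
import numpy as np

def fit_mapping(X, Z):
    # Least-squares solution of X M ~ Z; returning M.T gives W with
    # shape (d_tgt, d_src), so that W @ x_i approximates z_i.
    M, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return M.T

def translate(x, W, tgt_vectors, tgt_words, topn=5):
    q = W @ x                                          # map the source vector
    sims = tgt_vectors @ q / (np.linalg.norm(tgt_vectors, axis=1)
                              * np.linalg.norm(q))     # cosine with every target word
    best = np.argsort(-sims)[:topn]
    return [tgt_words[i] for i in best]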

Mikolov strikes again [Mikolov et al., 2013b]

▶ the 6K most frequent source words, translated with Google Translate

▶ the first 5K entries are used to compute W

▶ the next 1K are used for testing

▶ baselines: edit distance, Rapp-style co-occurrence

Mikolov strikes again [Mikolov et al., 2013b] (results figure)

More data (Google News)

▶ same split: 5K train, 1K test

Outline

(Before Deep) the vector-space model

And then came the "Deep"
  Word2Vec
  Analogy
  Meta-embeddings
  Evaluation
  Interesting ideas
  The bilingual case

Evaluation

On the difficulty of unbiased evaluation [Levy et al., 2015]

▶ they compare 4 approaches: co-occurrence matrix (PMI), SVD, Skip-Gram, and GloVe

▶ they study the hyperparameters of each in detail

▶ they adapt design choices made in Skip-Gram to the other methods when possible

▶ Takeaways:

  ▶ a tie in performance (no clear advantage of one approach over another)

  ▶ Skip-Gram behaves better (time/memory) than the other approaches

On the difficulty of unbiased evaluation [Levy et al., 2015] (results figure)

Example observation [Levy et al., 2015]

▶ in the co-occurrence-matrix approach, the association between a word w and a context c is scored as

    PMI(w, c) = log [ p(w, c) / ( p(w) p(c) ) ]

▶ a common practice is to set the PMI to 0 when #(w, c) = 0 (rather than −∞)

▶ another is to use PPMI(w, c) = max( PMI(w, c), 0 )

▶ adaptations of choices made in Skip-Gram:

  ▶ the shift by the number k of negative samples:

      SPPMI(w, c) = max( PMI(w, c) − log k, 0 )

  ▶ the smoothed context distribution used to sample the k negative examples (α = 0.75):

      PMI_α(w, c) = log [ p(w, c) / ( p(w) P_α(c) ) ]   with   P_α(c) = #(c)^α / Σ_c' #(c')^α
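A small numpy sketch of these association measures, assuming a dense word-by-context matrix C of raw co-occurrence counts; with k = 1 it reduces to plain PPMI, and alpha smooths the context distribution as in the last formula above.

# Sketch: PPMI and shifted PPMI (SPPMI) from a word-by-context count matrix C,
# with the context distribution smoothed by alpha. C is assumed to be a dense
# (n_words, n_contexts) numpy array of co-occurrence counts.
import numpy as np

def sppmi(C, k=1, alpha=0.75):
    total = C.sum()
    p_wc = C / total                                   # joint p(w, c)
    p_w = C.sum(axis=1, keepdims=True) / total         # marginal p(w)
    ctx = C.sum(axis=0) ** alpha
    p_c_alpha = (ctx / ctx.sum())[None, :]             # smoothed P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero out cells where #(w, c) = 0
    return np.maximum(pmi - np.log(k), 0.0)            # k = 1 gives plain PPMI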


References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

  • (Before Deep) the vector-space model
  • And then came the "Deep"
    • Word2Vec
    • Analogy
    • Meta-embeddings
    • Evaluation
    • Interesting ideas
    • The bilingual case
  • Evaluation
Page 36: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I on peut apprendre une transformation lineaire (rotation +scaling) drsquoun espace vers un autre avec un lexique bilingue(xi zi)

W = minW

Σi Wxi minus zi2

ou xi et zi designent respectivement la representationvectorielle source de xi et cible de zi

I W optimisee par descente de gradient sur un lexique drsquoenviron5k paires de mots

I au moment du test traduire un mot x par z

z = argmaxz

cos(z Wx)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 37: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

I 6K des most sources lesplus frequents traduits parGoogleTrans

I premieres 5K entreespour calculer W

I 1K suivantes pour lestests

I baselines edit-distanceεminusRapp

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 38: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Mikolov strikes again [Mikolov et al 2013b]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 39: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval W2V Ana Meta Eval Cool Bi

Plus de donnees (Google News)

I meme split 5K train 1Ktest

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 40: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Plan

(Before Deep) modele vectoriel

And then came the ldquoDeeprdquoWord2VecAnalogieMeta-embeddingsEvaluationIdees interessantesLe cas bilingue

Evaluation

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 41: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

I comparent 4 approches matrice de co-occurrence (PMI) SVDSkip-Gram et GloVe

I etudient leurs parametres en detail

I adaptent des choix faits dans Skip-Gram a drsquoautres methodeslorsque possible

I Bilan

I match nul en performance (pas drsquoavantage clair drsquoune approchesur une autre)

I Skip-Gram se comporte mieux (tempsmemoire) que les autresapproches

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

An example observation [Levy et al., 2015]

  • in the co-occurrence-matrix approach, the association between a word w and a context c is written

      PMI(w, c) = log( p(w, c) / (p(w) p(c)) )

  • a common practice is to set the PMI value to 0 when #(w, c) = 0 (rather than −∞)
  • another is to take PPMI(w, c) = max(PMI(w, c), 0)
  • adaptation of choices made in Skip-Gram:
    • shifted PPMI: SPPMI(w, c) = max(PMI(w, c) − log k, 0)
    • sampling of the k negative examples (smoothed with α = 0.75):

      PMI_α(w, c) = log( p(w, c) / (p(w) P_α(c)) )    with    P_α(c) = #(c)^α / Σ_c' #(c')^α
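To make these variants concrete, here is a minimal sketch (not from the slides; the toy counts and variable names are illustrative) that derives PMI, PPMI, shifted PPMI and the α-smoothed variant from a small word-context count matrix:

```python
import numpy as np

# Toy word-context co-occurrence counts #(w, c): rows = words, columns = contexts.
# Real counts would come from a corpus; these values are illustrative only.
counts = np.array([
    [10.0, 2.0, 0.0],
    [ 3.0, 8.0, 1.0],
    [ 0.0, 1.0, 6.0],
])

total = counts.sum()
p_wc = counts / total                               # joint p(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total     # marginal p(w)
p_c = counts.sum(axis=0, keepdims=True) / total     # marginal p(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
pmi[counts == 0] = 0.0                              # convention: 0 instead of -inf when #(w, c) = 0

ppmi = np.maximum(pmi, 0.0)                         # PPMI(w, c) = max(PMI(w, c), 0)

k = 5                                               # number of negative samples in Skip-Gram
sppmi = np.maximum(pmi - np.log(k), 0.0)            # shifted PPMI

alpha = 0.75                                        # context-distribution smoothing
c_counts = counts.sum(axis=0)
p_c_alpha = (c_counts ** alpha) / (c_counts ** alpha).sum()
with np.errstate(divide="ignore", invalid="ignore"):
    pmi_alpha = np.log(p_wc / (p_w * p_c_alpha[np.newaxis, :]))
pmi_alpha[counts == 0] = 0.0

print(np.round(ppmi, 2))
print(np.round(sppmi, 2))
```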

[Schnabel et al., 2015]

  • recommend not using an extrinsic task to evaluate pre-trained embeddings

[Antoniak and Mimno, 2018]

  • word2vec skip-gram re-trained several times with the same hyperparameters, to assess the stability of the resulting word similarities
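A quick way to quantify this kind of instability is to compare the nearest-neighbour lists a query word gets in two runs trained with identical settings; the sketch below is illustrative (the neighbour lists, and the gensim call in the comment, are assumptions rather than material from the paper):

```python
def neighbour_overlap(neighbours_a, neighbours_b):
    """Jaccard overlap between two nearest-neighbour lists of the same query word."""
    a, b = set(neighbours_a), set(neighbours_b)
    return len(a & b) / len(a | b)

# The two lists would typically come from two word2vec skip-gram models trained on the
# same corpus with the same hyperparameters but different random seeds, e.g. with gensim:
#   run = Word2Vec(corpus, vector_size=100, sg=1, seed=s)
#   [w for w, _ in run.wv.most_similar("french", topn=20)]
print(neighbour_overlap(["paris", "france", "wine"], ["france", "cheese", "paris"]))  # 0.5
```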


And what about infrequent words? [Jakubina and Langlais, 2017]

                     1k-low                  1k-high
              TOP1   TOP5   TOP20     TOP1   TOP5   TOP20
  embedding    2.2    6.1    11.9     21.7   34.2   44.9
  context      2.0    4.3     7.6     19.0   32.7   44.3
  document     0.7    2.3     5.0       —      —      —
  oracle       4.6     —     19.0     31.8     —     57.6

  • Wikipedia dump of June 2013 (EN: 3.5M articles, FR: 1.3M articles)
  • |V_EN| = 7.3M, |V_FR| = 3.6M
  • 2 test sets: 1k-low (1k rare words) and 1k-high (1k non-rare words)
  • rare = frequency < 26 (92% of the words in V_EN)
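The TOP1/TOP5/TOP20 figures above are accuracies of ranked translation-candidate lists against a reference lexicon. A minimal sketch of that metric, with made-up function and data names:

```python
def top_k_accuracy(ranked_candidates, gold, ks=(1, 5, 20)):
    """ranked_candidates: dict source word -> list of candidate translations, best first.
    gold: dict source word -> set of reference translations."""
    scores = {}
    for k in ks:
        hits = sum(
            1 for w, cands in ranked_candidates.items()
            if gold.get(w) and set(cands[:k]) & gold[w]
        )
        scores[f"TOP{k}"] = 100.0 * hits / len(ranked_candidates)
    return scores

# Illustrative usage with made-up data.
ranked = {"chien": ["dog", "hound", "puppy"], "fenêtre": ["door", "window"]}
gold = {"chien": {"dog"}, "fenêtre": {"window"}}
print(top_k_accuracy(ranked, gold, ks=(1, 2)))  # {'TOP1': 50.0, 'TOP2': 100.0}
```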

References

Al-Rfou, R., Perozzi, B., and Skiena, S. (2013). Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183–192, Sofia, Bulgaria. Association for Computational Linguistics.

Antoniak, M. and Mimno, D. (2018). Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics, 6:107–119.

Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland. Association for Computational Linguistics.

Baroni, M. and Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673–721.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Bollegala, D. and Bao, C. (2018). Learning word meta-embeddings by autoencoding. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1650–1661. Association for Computational Linguistics.

Chandar A P, S., Lauly, S., Larochelle, H., Khapra, M. M., Ravindran, B., Raykar, V. C., and Saha, A. (2014). An autoencoder approach to learning bilingual word representations. CoRR.

Coates, J. and Bollegala, D. (2018). Frustratingly easy meta-embedding – computing meta-embeddings by averaging source word embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 194–198.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Coulmance, J., Marty, J., Wenzek, G., and Benhalloum, A. (2016). Trans-gram, fast cross-lingual word-embeddings. CoRR, abs/1601.02502.

Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A. (2015a). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL.

Faruqui, M. and Dyer, C. (2014). Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations.

Faruqui, M. and Dyer, C. (2015). Non-distributional word vector representations. In Proceedings of ACL.

Faruqui, M., Tsvetkov, Y., Yogatama, D., Dyer, C., and Smith, N. A. (2015b). Sparse overcomplete word vector representations. In Proceedings of ACL.

Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations (3rd Ed.). Johns Hopkins University Press.

Gouws, S., Bengio, Y., and Corrado, G. (2015). BilBOWA: Fast bilingual distributed representations without word alignments. In ICML.

Jakubina, L. and Langlais, P. (2017). Reranking translation candidates produced by several bilingual word similarity sources. In 15th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 605–611.

Jurafsky, D. and Martin, J. H. (2015). Speech and Language Processing (3rd ed. draft).

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791.

Levy, O. and Goldberg, Y. (2014). Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems 27, pages 2177–2185.

Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mikolov, T., Le, Q. V., and Sutskever, I. (2013b). Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013c). Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Mikolov, T., Yih, W.-t., and Zweig, G. (2013d). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT-2013).

Mohammad, S. (2011). Colourful language: Measuring word-colour associations. In 2nd Workshop on Cognitive Modeling and Computational Linguistics, CMCL '11, pages 97–106.

Mohammad, S. and Turney, P. D. (2013). Crowdsourcing a word-emotion association lexicon. CoRR.

Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Salton, G. (1975). Dynamic Information and Library Processing. Prentice-Hall, Englewood Cliffs, NJ.

Schnabel, T., Labutov, I., Mimno, D. M., and Joachims, T. (2015). Evaluation methods for unsupervised word embeddings. In Màrquez, L., Callison-Burch, C., Su, J., Pighin, D., and Marton, Y., editors, EMNLP, pages 298–307. The Association for Computational Linguistics.

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

  • (Before Deep) vector space model
  • And then came the "Deep"
    • Word2Vec
    • Analogy
    • Meta-embeddings
    • Evaluation
    • Interesting ideas
    • The bilingual case
      • Evaluation
Page 42: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Sur la difficulte drsquoevaluer sans biais[Levy et al 2015]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 43: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Exemple drsquoobservation [Levy et al 2015]

I dans lrsquoapproche matrice de co-occurences un mot w et soncontexte c est note

PMI(w c) = logp(w c)

p(w)p(c)

I une approche courante est de mettre a 0 les valeurs de PMIlorsque (w c) = 0 (plutot que minusinfin)

I une autre est de prendre PPMI(w c) = max(PMI(w c) 0)

I adaptation de choix faits dans Skip-Gram

I

SPPMI(w c) = max(PMI(w c)minus logk 0)I sampling des k examples negatifs (lisses avec α = 075)

PMIα(w c) = logP (w c)

p(w)Pα(c)avec Pα(c) =

(c)αsumc(c)α

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 44: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

[Schnabel et al 2015]

I recommandent de ne pas utiliser une tache extrinseque pourevaluer des embeddings pre-entraınes

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 45: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

[Antoniak and Mimno 2018]

I word2vec skipgram relance plusieurs fois avec les memesparametres

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 46: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Et pour les mots peu frequents[Jakubina and Langlais 2017]

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 47: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Et pour les mots peu frequents

1k-low 1k-highTOP1 TOP5 TOP20 TOP1 TOP5 TOP20

embedding 22 61 119 217 342 449context 20 43 76 190 327 443document 07 23 50 mdash mdash mdash

oracle 46 mdash 190 318 mdash 576

I Wikipedia dump de juin 2013 (EN 35M FR 13M articles)

I VEN = 73M VFR = 36M

I 2 test sets 1k-low (1k mots rares) 1k-high (1k mots non rares)

I rare = freq lt 26 (92 des mots de VEN)

felipeiroumontrealca Semantique distributionnelle embeddings (et dong)

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 48: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Al-Rfou R Perozzi B and Skiena S (2013)Polyglot Distributed word representations for multilingual nlpIn Proceedings of the Seventeenth Conference onComputational Natural Language Learning pages 183ndash192Sofia Bulgaria Association for Computational Linguistics

Antoniak M and Mimno D (2018)Evaluating the stability of embedding-based word similaritiesTransactions of the Association for Computational Linguistics6 107ndash119

Baroni M Dinu G and Kruszewski G (2014)Donrsquot count predict a systematic comparison ofcontext-counting vs context-predicting semantic vectorsIn Proceedings of the 52nd Annual Meeting of the Associationfor Computational Linguistics (Volume 1 Long Papers) pages238ndash247 Baltimore Maryland Association for ComputationalLinguistics

Baroni M and Lenci A (2010)

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney P D (2005)Measuring semantic similarity by latent relational analysisCoRR

Turney P D and Pantel P (2010)From frequency to meaning Vector space models of semantics

J Artif Int Res 37(1) 141ndash188

  • (Before Deep) modegravele vectoriel
  • And then came the ``Deep
    • Word2Vec
    • Analogie
    • Meta-embeddings
    • Eacutevaluation
    • Ideacutees inteacuteressantes
    • Le cas bilingue
      • Eacutevaluation
Page 49: Sémantique distributionnelle, embeddings (et dong ...felipe/IFT6285-Automne2018/Transp/di… · BDDeepEval Semantique distributionnelle, embeddings (et dong)´ felipe@iro.umontreal.ca

BD Deep Eval

Distributional memory A general framework for corpus-basedsemanticsComput Linguist 36(4) 673ndash721

Bojanowski P Grave E Joulin A and Mikolov T(2016)Enriching word vectors with subword informationarXiv preprint arXiv 160704606

Bollegala D and Bao C (2018)Learning word meta-embeddings by autoencodingIn Proceedings of the 27th International Conference onComputational Linguistics pages 1650ndash1661 Association forComputational Linguistics

Chandar A P S Lauly S Larochelle H KhapraM M Ravindran B Raykar V C and Saha A (2014)An autoencoder approach to learning bilingual wordrepresentationsCoRR

Coates J and Bollegala D (2018)

BD Deep Eval

Frustratingly easy meta-embedding ndash computingmeta-embeddings by averaging source word embeddingsIn Conference of the North American Chapter of the Associationfor Computational Linguistics Human Language TechnologiesVolume 2 (Short Papers) pages 194ndash198

Collobert R Weston J Bottou L Karlen MKavukcuoglu K and Kuksa P (2011)Natural language processing (almost) from scratchJournal of Machine Learning Research 12 2493ndash2537

Coulmance J Marty J Wenzek G and BenhalloumA (2016)Trans-gram fast cross-lingual word-embeddingsCoRR abs160102502

Faruqui M Dodge J Jauhar S K Dyer C Hovy Eand Smith N A (2015a)Retrofitting word vectors to semantic lexiconsIn Proceedings of NAACL

Faruqui M and Dyer C (2014)

BD Deep Eval

Community evaluation and exchange of word vectors atwordvectorsorgIn Proceedings of ACL System Demonstrations

Faruqui M and Dyer C (2015)Non-distributional word vector representationsIn Proceedings of ACL

Faruqui M Tsvetkov Y Yogatama D Dyer C andSmith N A (2015b)Sparse overcomplete word vector representationsIn Proceedings of ACL

Golub G H and Van Loan C F (1996)Matrix Computations (3rd Ed)Johns Hopkins University Press

Gouws S Bengio Y and Corrado G (2015)Bilbowa Fast bilingual distributed representations without wordalignmentsIn ICML

BD Deep Eval

Jakubina L and Langlais P (2017)Reranking translation candidates produced by several bilingualword similarity sourcesIn 15th Conference of the European Chapter of the Associationfor Computational Linguitics volume 2 Short Papers pages605ndash611

Jurafsky D and Martin J H (2015)Speech and language processing(3rd ed draft)

Lee D D and Seung H S (1999)Learning the parts of objects by non-negative matrixfactorizationNature 401(6755) 788ndash791

Levy O and Goldberg Y (2014)Neural word embedding as implicit matrix factorizationIn Advances in Neural Information Processing Systems 27pages 2177ndash2185

BD Deep Eval

Levy O Goldberg Y and Dagan I (2015)Improving distributional similarity with lessons learned from wordembeddingsTransactions of the Association for Computational Linguistics3 211ndash225

Mikolov T Chen K Corrado G and Dean J (2013a)

Efficient estimation of word representations in vector spaceCoRR abs13013781

Mikolov T Le Q V and Sutskever I (2013b)Exploiting similarities among languages for machine translationCoRR abs13094168

Mikolov T Sutskever I Chen K Corrado G andDean J (2013c)Distributed representations of words and phrases and theircompositionalityCoRR abs13104546

BD Deep Eval

Mikolov T tau Yih W and Zweig G (2013d)Linguistic regularities in continuous space word representationsIn Proceedings of the 2013 Conference of the North AmericanChapter of the Association for Computational Linguistics Human Language Technologies (NAACL-HLT-2013)

Mohammad S (2011)Colourful language Measuring word-colour associationsIn 2Nd Workshop on Cognitive Modeling and ComputationalLinguistics CMCL rsquo11 pages 97ndash106

Mohammad S and Turney P D (2013)Crowdsourcing a word-emotion association lexiconCoRR

Pennington J Socher R and Manning C D (2014)Glove Global vectors for word representationIn Empirical Methods in Natural Language Processing (EMNLP)pages 1532ndash1543

Salton G (1975)

BD Deep Eval

Dynamic information and library processing Gerard SaltonPrentice-Hall Englewood Cliffs NJ

Schnabel T Labutov I Mimno D M and JoachimsT (2015)Evaluation methods for unsupervised word embeddingsIn Marquez L Callison-Burch C Su J Pighin D andMarton Y editors EMNLP pages 298ndash307 The Associationfor Computational Linguistics

Turney, P. D. (2005). Measuring semantic similarity by latent relational analysis. CoRR.

Turney, P. D. and Pantel, P. (2010). From frequency to meaning: Vector space models of semantics. J. Artif. Int. Res., 37(1):141–188.
