TRANSCRIPT
A practical introduction to distributional semantics
PART I: Co-occurrence matrix models
Marco Baroni
Center for Mind/Brain Sciences, University of Trento
Symposium on Semantic Text Processing, Bar Ilan University, November 2014
Acknowledging...
Georgiana Dinu
COMPOSES: COMPositional Operations in SEmantic Space
The vastness of word meaning
The distributional hypothesis
Harris, Charles and Miller, Firth, Wittgenstein? ...

The meaning of a word is (can be approximated by, learned from) the set of contexts in which it occurs in texts

We found a little, hairy wampimuk sleeping behind the tree
See also MacDonald & Ramscar CogSci 2001
Distributional semantic models in a nutshell
“Co-occurrence matrix” models, see Yoav’s part for neural models

- Represent words through vectors recording their co-occurrence counts with context elements in a corpus
- (Optionally) apply a re-weighting scheme to the resulting co-occurrence matrix
- (Optionally) apply dimensionality reduction techniques to the co-occurrence matrix
- Measure geometric distance of word vectors in “distributional space” as a proxy to semantic similarity/relatedness
Co-occurrence
[Concordance sample for the target word moon:]
...the curtains open and the moon shining in on the barely...
...ars and the cold, close moon". And neither of the w...
...rough the night with the moon shining so brightly, it...
...made in the light of the moon. It all boils down, wr...
...surely under a crescent moon, thrilled by ice-white...
...sun, the seasons of the moon? Home, alone, Jay pla...
...m is dazzling snow, the moon has risen full and cold...
...un and the temple of the moon, driving out of the hug...
...in the dark and now the moon rises, full and amber a...
...bird on the shape of the moon over the trees in front...
...But I could n't see the moon or the stars, only the...
...rning, with a sliver of moon hanging among the stars...
...they love the sun, the moon and the stars. None of...
...the light of an enormous moon. The plash of flowing w...
...man's first step on the moon; various exhibits, aer...
...the inevitable piece of moon rock. Housing The Airsh...
...oud obscured part of the moon. The Allied guns behind...
Extracting co-occurrence counts
Variations in context features

Documents as contexts:
        Doc1   Doc2   Doc3
stars   38     45     2

Lexical patterns as contexts:
        "The nearest • to Earth"   "stories of • and their"
stars   12                         10
Syntactic dependencies as contexts (parse of "see bright shiny stars", with relations dobj, mod, mod):

        see (dobj)   bright (mod)   shiny (mod)
stars   38           45             44
Extracting co-occurrence counts
Variations in the definition of co-occurrence

E.g.: co-occurrence with words, window of size 2, scaling by distance to target:

... two [intensely bright stars in the] night sky ...

        intensely   bright   in   the
stars   0.5         1        1    0.5
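A minimal sketch of this counting scheme (window of 2 words, counts scaled by 1/distance to the target); the tokenization is simplified for illustration:

from collections import defaultdict

def window_cooccurrences(tokens, window=2):
    """Count co-occurrences within a +/- `window` word window,
    scaling each count by 1/distance to the target."""
    counts = defaultdict(float)
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(target, tokens[j])] += 1.0 / abs(i - j)
    return counts

tokens = "two intensely bright stars in the night sky".split()
counts = window_cooccurrences(tokens)
print(counts[("stars", "bright")], counts[("stars", "intensely")])   # 1.0 0.5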
Same corpus (BNC), different window sizes
Nearest neighbours of dog

2-word window: cat, horse, fox, pet, rabbit, pig, animal, mongrel, sheep, pigeon
30-word window: kennel, puppy, pet, bitch, terrier, rottweiler, canine, cat, to bark, Alsatian
From co-occurrences to vectors
        bright   in   sky
stars   8        10   6
sun     10       15   4
dog     2        20   1
Weighting
Re-weight the counts using corpus-level statistics to reflect co-occurrence significance
Positive Pointwise Mutual Information (PPMI)
PPMI(target, ctxt) = max(0, log( P(target, ctxt) / (P(target) P(ctxt)) ))
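A minimal sketch of PPMI re-weighting for a dense count matrix (rows are targets, columns are contexts); one straightforward implementation, not the exact code behind the numbers on the next slide:

import numpy as np

def ppmi(counts):
    """counts: 2D array of raw co-occurrence counts (targets x contexts)."""
    total = counts.sum()
    p_joint = counts / total                                 # P(target, ctxt)
    p_target = counts.sum(axis=1, keepdims=True) / total     # P(target)
    p_ctxt = counts.sum(axis=0, keepdims=True) / total       # P(ctxt)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_joint / (p_target * p_ctxt))
    pmi[~np.isfinite(pmi)] = 0.0                             # zero counts -> 0
    return np.maximum(pmi, 0.0)                              # keep only positive PMI

counts = np.array([[8., 10., 6.], [10., 15., 4.], [2., 20., 1.]])
print(ppmi(counts).round(2))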
Weighting
Adjusting raw co-occurrence counts:

        bright   in      ...
stars   385      10788   ...   ← Counts
stars   43.6     5.3     ...   ← PPMI

Other weighting schemes: TF-IDF, Local Mutual Information, Dice

See Ch. 4 of J.R. Curran's thesis (2004) and S. Evert's thesis (2007) for surveys of weighting methods
Dimensionality reduction
- Vector spaces often range from tens of thousands to millions of context dimensions
- Some of the methods to reduce dimensionality:
  - Select context features based on various relevance criteria
  - Random indexing
  - The following are also claimed to have a beneficial smoothing effect:
    - Singular Value Decomposition
    - Non-negative matrix factorization
    - Probabilistic Latent Semantic Analysis
    - Latent Dirichlet Allocation
The SVD factorization
[Figure: SVD factorization of a word-context matrix; image courtesy of Yoav]
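A rough numpy sketch of SVD-based reduction: keep the top k singular components and use the reduced rows as word vectors (an illustration, not the exact pipeline used in the experiments cited later):

import numpy as np

def svd_reduce(matrix, k=300):
    """Return rank-k word vectors from a (targets x contexts) matrix."""
    U, S, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * S[:k]            # each row: a k-dimensional word vector

X = np.random.rand(1000, 5000)          # stand-in for a weighted co-occurrence matrix
word_vectors = svd_reduce(X, k=300)
print(word_vectors.shape)               # (1000, 300)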
Dimensionality reduction as “smoothing”
[Figure: words plotted along buy and sell context dimensions]
From geometry to similarity in meaning
[Figure: stars and sun plotted as points in a 2D space]

Vectors:
stars   2.5   2.1
sun     2.9   3.1
Cosine similarity
cos(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖) = Σ_{i=1..n} x_i y_i / ( √(Σ_{i=1..n} x_i²) √(Σ_{i=1..n} y_i²) )

Other similarity measures: Euclidean Distance, Dice, Jaccard, Lin...
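As a quick sketch in numpy (using the stars and sun vectors above):

import numpy as np

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

stars = np.array([2.5, 2.1])
sun = np.array([2.9, 3.1])
print(round(cosine(stars, sun), 3))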
Geometric neighbours ≈ semantic neighbours
rhino        fall          good        sing
woodpecker   rise          bad         dance
rhinoceros   increase      excellent   whistle
swan         fluctuation   superb      mime
whale        drop          poor        shout
ivory        decrease      improved    sound
plover       reduction     perfect     listen
elephant     logarithm     clever      recite
bear         decline       terrific    play
satin        cut           lucky       hear
sweatshirt   hike          smashing    hiss
Benchmarks
Similarity/relatedness

E.g.: Rubenstein and Goodenough, WordSim-353, MEN, SimLex-999...
MEN
chapel   church       0.45
eat      strawberry   0.33
jump     salad        0.06
bikini   pizza        0.01
How: Measure correlation of model cosines with human similarity/relatedness judgments

Top MEN Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.72
Benchmarks
Categorization

E.g.: Almuhareb/Poesio, ESSLLI 2008 Shared Task, Battig set
ESSLLI
VEHICLE      MAMMAL
helicopter   dog
motorcycle   elephant
car          cat
How: Feed model-produced similarity matrix to a clustering algorithm, look at overlap between clusters and gold categories

Top ESSLLI cluster purity for co-occurrence matrix models (Baroni et al. ACL 2014): 0.84
Benchmarks
Selectional preferences

E.g.: Ulrike Padó, Ken McRae et al.'s data sets
Padó
eat   villager   obj    1.7
eat   pizza      obj    6.8
eat   pizza      subj   1.1
How (Erk et al. CL 2010): 1) Create a "prototype" argument vector by averaging vectors of nouns typically occurring as argument fillers (e.g., frequent objects of to eat); 2) measure cosine of target noun with prototype (e.g., cosine of villager vector with eat-object prototype vector); 3) correlate with human scores

Top Padó Spearman correlation for co-occurrence matrix models (Baroni et al. ACL 2014): 0.41
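A sketch of steps 1) and 2) above, assuming vec maps a word to its distributional vector; the toy vectors here are purely illustrative:

import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def thematic_fit(candidate, typical_fillers, vec):
    """Cosine of the candidate with the average vector of typical role fillers."""
    prototype = np.mean([vec[w] for w in typical_fillers], axis=0)
    return cosine(vec[candidate], prototype)

# Toy vectors for illustration only; real vectors come from the co-occurrence model.
vec = {"pizza": np.array([8., 1.]), "bread": np.array([7., 2.]),
       "apple": np.array([6., 1.]), "villager": np.array([2., 6.])}
print(thematic_fit("villager", ["pizza", "bread", "apple"], vec))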
Selectional preferences
Examples from Baroni/Lenci implementation

To kill...
object         cosine      with         cosine
kangaroo       0.51        hammer       0.26
person         0.45        stone        0.25
robot          0.15        brick        0.18
hate           0.11        smile        0.15
flower         0.11        flower       0.12
stone          0.05        antibiotic   0.12
fun            0.05        person       0.12
book           0.04        heroin       0.12
conversation   0.03        kindness     0.07
sympathy       0.01        graduation   0.04
Benchmarks
Analogy
Method and data sets from Mikolov and collaborators
syntactic analogy      semantic analogy
work     speak         brother   grandson
works    speaks        sister    granddaughter

vec(speaks) ≈ vec(works) − vec(work) + vec(speak)
How: Response counts as hit only if nearest neighbour (in large vocabulary) of vector obtained with subtraction and addition operations above is the intended one

Top accuracy for co-occurrence matrix models (Baroni et al. ACL 2014): 0.49
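A sketch of this evaluation step, assuming a matrix W of row-normalized word vectors, a parallel array words, and a word-to-index dict w2i (names are illustrative):

import numpy as np

def analogy(a, b, c, W, words, w2i):
    """Return the nearest neighbour of vec(b) - vec(a) + vec(c),
    excluding the three query words (Mikolov-style evaluation)."""
    target = W[w2i[b]] - W[w2i[a]] + W[w2i[c]]
    target /= np.linalg.norm(target)
    sims = W.dot(target)                  # cosines, assuming rows of W are normalized
    for i in (w2i[a], w2i[b], w2i[c]):    # the query words cannot be the answer
        sims[i] = -np.inf
    return words[int(sims.argmax())]

# analogy("work", "works", "speak", W, words, w2i) should ideally return "speaks".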
Distributional semantics: A general-purpose representation of lexical meaning
Baroni and Lenci 2010

- Similarity (cord-string vs. cord-smile)
- Synonymy (zenith-pinnacle)
- Concept categorization (car ISA vehicle; banana ISA fruit)
- Selectional preferences (eat topinambur vs. *eat sympathy)
- Analogy (mason is to stone like carpenter is to wood)
- Relation classification (exam-anxiety are in CAUSE-EFFECT relation)
- Qualia (TELIC ROLE of novel is to entertain)
- Salient properties (car-wheels, dog-barking)
- Argument alternations (John broke the vase - the vase broke, John minces the meat - *the meat minced)
Practical recommendations
Mostly from Baroni et al. ACL 2014, see more evaluation work in reading list below

- Narrow context windows are best (1, 2 words left and right)
- Full matrix better than dimensionality reduction
- PPMI weighting best
- Dimensionality reduction with SVD better than with NMF
An example application
Bilingual lexicon/phrase table induction from monolingual resources

Saluja et al. (ACL 2014) obtain significant improvements in English-Urdu and English-Arabic BLEU scores using phrase tables enlarged with pairs induced by exploiting distributional similarity structure in source and target languages
Figure credit: Mikolov et al 2013
The infinity of sentence meaning
Compositionality
The meaning of an utterance is a function of the meaning of its parts and their composition rules (Frege 1892)
Compositional distributional semantics: What for?
Word meaning in context (Mitchell and Lapata ACL 2008)
Paraphrase detection (Blacoe and Lapata EMNLP 2012)
[Figure: 2D plot (dim 1 vs. dim 2) of composed phrase vectors, e.g. "cookie dwarfs hop under the crimson planet", "gingerbread gnomes dance under the red moon", "red gnomes love gingerbread cookies", "students eat cup noodles"]
Compositional distributional semantics: How?
From: Simple functions

vec(very) + vec(good) + vec(movie) ≈ vec(very good movie)

Mitchell and Lapata ACL 2008

To: Complex composition operations

Socher et al. EMNLP 2013
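A minimal sketch of the simple additive approach above, assuming a word-to-vector lookup vec (illustrative, not Mitchell and Lapata's full model):

import numpy as np

def additive_phrase_vector(phrase, vec):
    """Compose a phrase vector by summing the vectors of its words."""
    return np.sum([vec[w] for w in phrase.split()], axis=0)

# e.g. very_good_movie = additive_phrase_vector("very good movie", vec)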
Some references
- Classics:
  - Schütze's 1997 CSLI book
  - Landauer and Dumais PsychRev 1997
  - Griffiths et al. PsychRev 2007
- Overviews:
  - Turney and Pantel JAIR 2010
  - Erk LLC 2012
  - Baroni LLC 2013
  - Clark, to appear in Handbook of Contemporary Semantics
- Evaluation:
  - Sahlgren's 2006 thesis
  - Bullinaria and Levy BRM 2007, 2012
  - Baroni, Dinu and Kruszewski ACL 2014
  - Kiela and Clark CVSC 2014
Fun with distributional semantics!
http://clic.cimec.unitn.it/infomap-query/
From Distributional to Distributed Semantics
The new kid on the block
- Deep learning / neural networks
- “Distributed” word representations
- Feed text into a neural net. Get back “word embeddings”.
- Each word is represented as a low-dimensional vector.
- Vectors capture “semantics”
- word2vec (Mikolov et al.)
From Distributional to Distributed Semantics
This part of the talk
- word2vec as a black box
- a peek inside the black box
- relation between word embeddings and the distributional representation
- tailoring word embeddings to your needs using word2vecf
word2vec
- dog: cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
- sheep: cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
- november: october, december, april, june, february, july, september, january, august, march
- jerusalem: tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
- teva: pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia
Working with Dense Vectors
Word Similarity
- Similarity is calculated using cosine similarity:

  sim(~dog, ~cat) = (~dog · ~cat) / (‖~dog‖ ‖~cat‖)

- For normalized vectors (‖x‖ = 1), this is equivalent to a dot product:

  sim(~dog, ~cat) = ~dog · ~cat

- Normalize the vectors when loading them.
Working with Dense Vectors
Finding the most similar words to ~dog
- Compute the similarity from word ~v to all other words.
- This is a single matrix-vector product: W · ~v⊤
- Result is a |V|-sized vector of similarities.
- Take the indices of the k highest values.
- FAST! For 180k words, d=300: ∼30ms
Working with Dense Vectors
Most Similar Words, in python+numpy code
W, words = load_and_normalize_vectors("vecs.txt")
# W and words are numpy arrays.
w2i = {w: i for i, w in enumerate(words)}

dog = W[w2i["dog"]]   # get the dog vector
sims = W.dot(dog)     # compute similarities to all words

most_similar_ids = sims.argsort()[-1:-10:-1]   # indices of the 9 highest similarities
sim_words = words[most_similar_ids]
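A possible implementation of the loader used above, assuming a plain-text vector file with one "word v1 v2 ... vd" entry per line (adjust if your file has a header line):

import numpy as np

def load_and_normalize_vectors(path):
    """Read 'word v1 v2 ... vd' lines and L2-normalize each vector."""
    words, rows = [], []
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) < 3:      # skip empty lines and a possible "count dim" header
                continue
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=float))
    W = np.array(rows)
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-length rows
    return W, np.array(words)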
Working with Dense Vectors
Similarity to a group of words
- “Find me words most similar to cat, dog and cow”.
- Calculate the pairwise similarities and sum them:

  W · ~cat + W · ~dog + W · ~cow

- Now find the indices of the highest values as before.
- Matrix-vector products are wasteful. Better option:

  W · (~cat + ~dog + ~cow)
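A sketch of the efficient variant, reusing W, words and w2i from the snippet above (illustrative):

query = W[w2i["cat"]] + W[w2i["dog"]] + W[w2i["cow"]]   # sum of the three query vectors
sims = W.dot(query)                                      # one matrix-vector product
top_ids = sims.argsort()[::-1][:10]                      # indices of the 10 highest values
print(words[top_ids])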
Working with dense word vectors can be very efficient.
But where do these vectors come from?
How does word2vec work?
word2vec implements several different algorithms:
Two training methods
- Negative Sampling
- Hierarchical Softmax

Two context representations
- Continuous Bag of Words (CBOW)
- Skip-grams

We'll focus on skip-grams with negative sampling; the intuitions apply to the other models as well.
How does word2vec work?
- Represent each word as a d-dimensional vector.
- Represent each context as a d-dimensional vector.
- Initialize all vectors to random weights.
- Arrange vectors in two matrices, W and C.
How does word2vec work?
While more text:

- Extract a word window:
  A springer is [ a cow or heifer close to calving ] .
                  c1  c2 c3   w      c4    c5 c6
- w is the focus word vector (row in W).
- ci are the context word vectors (rows in C).
- Try setting the vector values such that

  σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)

  is high.
- Create a corrupt example by choosing a random word w′:
  [ a cow or comet close to calving ]
    c1  c2 c3  w′    c4    c5 c6
- Try setting the vector values such that

  σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)

  is low.
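A toy sketch of one such update step with plain SGD on the logistic objective; this is a simplification (single negative sample, no subsampling or learning-rate schedule), not the actual word2vec code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W, C, w_id, ctx_ids, neg_id, lr=0.025):
    """One stochastic step: push the true word towards its contexts (sigma(w.c) high)
    and the randomly drawn word away from them (sigma(w'.c) low)."""
    for c_id in ctx_ids:
        c, w, wn = C[c_id].copy(), W[w_id].copy(), W[neg_id].copy()
        g_pos = 1.0 - sigmoid(np.dot(w, c))   # gradient weight for the true pair
        g_neg = -sigmoid(np.dot(wn, c))       # gradient weight for the corrupt pair
        W[w_id]   += lr * g_pos * c
        W[neg_id] += lr * g_neg * c
        C[c_id]   += lr * (g_pos * w + g_neg * wn)

# Toy usage: vocabulary of 10 "words", dimension 5, one window with 6 contexts.
W = (np.random.rand(10, 5) - 0.5) / 5
C = (np.random.rand(10, 5) - 0.5) / 5
sgns_update(W, C, w_id=0, ctx_ids=[1, 2, 3, 4, 5, 6], neg_id=7)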
How does word2vec work?
The training procedure results in:
- w · c for good word-context pairs is high
- w · c for bad word-context pairs is low
- w · c for ok-ish word-context pairs is neither high nor low

As a result:
- Words that share many contexts get close to each other.
- Contexts that share many words get close to each other.

At the end, word2vec throws away C and returns W.
Reinterpretation
Imagine we didn’t throw away C. Consider the product W·C⊤.

The result is a matrix M in which:
- Each row corresponds to a word.
- Each column corresponds to a context.
- Each cell corresponds to w · c, an association measure between a word and a context.
Reinterpretation
Does this remind you of something?
Very similar to SVD over the distributional representation.
Relation between SVD and word2vec
SVD
- Begin with a word-context matrix.
- Approximate it with a product of low-rank (thin) matrices.
- Use the thin matrix as the word representation.

word2vec (skip-grams, negative sampling)
- Learn thin word and context matrices.
- These matrices can be thought of as approximating an implicit word-context matrix.
- In Levy and Goldberg (NIPS 2014) we show that this implicit matrix is related to the well-known PPMI matrix.
Relation between SVD and word2vec
word2vec is a dimensionality reduction technique over an (implicit) word-context matrix.

Just like SVD.

With a few tricks (Levy, Goldberg and Dagan, in submission) we can get SVD to perform just as well as word2vec.

However, word2vec...
- ...works without building / storing the actual matrix in memory.
- ...is very fast to train, can use multiple threads.
- ...can easily scale to huge data and very large word and context vocabularies.
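Putting the two halves of the talk together, one can sketch the explicit-matrix route in a couple of lines, reusing the ppmi and svd_reduce sketches from Part I (a recipe in the spirit of the comparison above, not the exact setup of Levy, Goldberg and Dagan):

# `counts` is a word-context count matrix; `ppmi` and `svd_reduce` are the
# illustrative sketch functions defined earlier in this transcript.
word_vectors = svd_reduce(ppmi(counts), k=300)   # dense vectors from an explicit matrix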
Beyond word2vec
- word2vec is factorizing a word-context matrix.
- The content of this matrix affects the resulting similarities.
- word2vec allows you to specify a window size.
- But what about other types of contexts?
- Example: dependency contexts (Levy and Goldberg, ACL 2014)
Australian scientist discovers star with telescope

Bag of Words (BoW) Context

[Figure: the example sentence with window-based contexts of the target word highlighted]

Syntactic Dependency Context

[Figure: dependency parse of the example sentence, with arcs nsubj, dobj, prep_with marking the target's syntactic contexts]
Embedding Similarity with Different Contexts
Target Word               Bag of Words (k=5)   Dependencies
Hogwarts                  Dumbledore           Sunnydale
(Harry Potter’s school)   hallows              Collinwood
                          halfblood            Calarts
                          Malfoy               Greendale
                          Snape                Millfield

BoW neighbours: related to Harry Potter. Dependency neighbours: schools.
Embedding Similarity with Different Contexts
Target Word            Bag of Words (k=5)   Dependencies
Turing                 nondeterministic     Pauling
(computer scientist)   non-deterministic    Hotelling
                       computability        Heting
                       deterministic        Lessing
                       finite-state         Hamming

BoW neighbours: related to computability. Dependency neighbours: scientists.
Online Demo!
Embedding Similarity with Different Contexts
Target Word       Bag of Words (k=5)   Dependencies
dancing           singing              singing
(dance gerund)    dance                rapping
                  dances               breakdancing
                  dancers              miming
                  tapdancing           busking

BoW neighbours: related to dance. Dependency neighbours: gerunds.
Context matters

Choose the correct contexts for your application
- larger window sizes – more topical
- dependency relations – more functional
  - only noun-adjective relations
  - only verb-subject relations
- context: time of the current message
- context: user who wrote the message
- ...
- the sky is the limit
Software
word2vecf
https://bitbucket.org/yoavgo/word2vecf

- Extension of word2vec.
- Allows saving the context matrix.
- Allows using arbitrary contexts.
- Input is a (large) file of word context pairs.
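For illustration, a hypothetical helper that writes such a pair file from dependency triples; the triple format and the context naming scheme here are assumptions, loosely following the dependency contexts discussed earlier:

# Hypothetical generator of word2vecf input: one "word context" pair per line.

def write_pairs(parsed_sentences, out_path):
    with open(out_path, "w") as out:
        for triples in parsed_sentences:
            for head, rel, dep in triples:
                out.write(f"{head} {dep}/{rel}\n")     # e.g. "discovers star/dobj"
                out.write(f"{dep} {head}/{rel}-1\n")   # inverse-relation context for the dependent

write_pairs([[("discovers", "nsubj", "scientist"),
              ("discovers", "dobj", "star"),
              ("discovers", "prep_with", "telescope")]], "pairs.txt")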
Software
hyperwords
https://bitbucket.org/omerlevy/hyperwords/

- Python library for working with either sparse or dense word vectors (similarity, analogies).
- Scripts for creating dense representations using word2vecf or SVD.
- Scripts for creating sparse distributional representations.
Software
dissect
http://clic.cimec.unitn.it/composes/toolkit/

- Given vector representations of words...
- ...derive vector representations of phrases/sentences
- Implements various composition methods
Summary
Distributional Semantics

- Words in similar contexts have similar meanings.
- Represent a word by the contexts it appears in.
- But what is a context?

Neural Models (word2vec)

- Represent each word as a dense, low-dimensional vector.
- Same intuitions as in distributional vector-space models.
- Efficient to run, scales well, modest memory requirement.
- Dense vectors are convenient to work with.
- Still helpful to think of the context types.

Software
- Build your own word representations.