
Page 1

This work is licensed under a Creative Commons Attribution 4.0 International License.

Semantic analysis

Marco Kuhlmann, Department of Computer and Information Science

Natural Language Processing (2019)

Page 2

Semantic analysis

• Semantic analysis is the task of mapping a word, sentence, or text to a formal representation of its meaning.

• In this section we shall focus on the meaning of words.

Page 3

The Principle of Compositionality

• Syntax provides the scaffolding for semantic composition.

The brown dog on the mat saw the striped cat through the window.
The brown cat saw the striped dog through the window on the mat.

• Principle of Compositionality (Frege, 1848–1925): The meaning of a complex expression is determined by its structure and the meanings of its parts. (Challenges include idiomatic expressions.)

Page 4

Words and contexts

What do the following sentences tell us about garrotxa?

• Garrotxa is made from milk.

• Garrotxa pairs well with crusty country bread.

• Garrotxa is aged in caves to enhance mold development.

Sentences taken from the English Wikipedia

Page 5

The distributional principle

• The distributional principle states that words that occur in similar contexts tend to have similar meanings.

• ‘You shall know a word by the company it keeps.’ Firth (1957)

Page 6

Word embeddings

• A word embedding is a mapping of words to points in a vector space such that nearby words (points) are similar in terms of their distributional properties, and hence, by the distributional principle, similar in meaning.

• This idea is similar to the vector space model of information retrieval, where the dimensions of the vector space correspond to the terms that occur in a document. points = documents, nearby points = similar topic

Lin et al. (2015)

Page 7

Co-occurrence matrix

Rows are target words, columns are context words:

             butter  cake  cow  deer
cheese           12     2    1     0
bread             5     5    0     0
goat              0     0    6     1
sheep             0     0    7     5
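A matrix like this can be built by sliding a fixed-size window over a corpus and counting which words fall near which targets. A minimal sketch, assuming a whitespace-tokenised corpus and a symmetric window of two words; the toy corpus and window size are illustrative, not from the lecture:

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within `window`
    positions of each target word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = ["cheese is made from the milk of cows",
          "goat and sheep farmers also make cheese"]
counts = cooccurrence_counts(corpus)
print(counts["cheese"])   # context counts for the target word 'cheese'
```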


Page 9

From co-occurrences to word vectors

[Figure: the target words cheese, bread, and sheep plotted as vectors in a two-dimensional space whose axes are the 'cake' context and the 'cow' context.]

Page 10

Sparse vectors versus dense vectors

• The rows of co-occurrence matrices are long and sparse: their length corresponds to the number of context words, on the order of 10⁴.

• State-of-the-art word embeddings are short and dense, with length on the order of 10².

• The intuition is that such vectors may be better at capturing generalisations, and easier to use in machine learning.

Page 11

Simple applications of word embeddings

• finding similar words

• answering ‘odd one out’-questions

• computing the similarity of short documents
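With pre-trained vectors, the applications listed above are essentially one-liners in, for example, Gensim (mentioned later in the lecture). A sketch, assuming a word2vec-format vector file whose path is hypothetical:

```python
from gensim.models import KeyedVectors

# hypothetical path to pre-trained vectors in word2vec text format
vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# finding similar words
print(vectors.most_similar("cheese", topn=5))

# answering 'odd one out' questions
print(vectors.doesnt_match(["bread", "cheese", "butter", "tractor"]))

# computing the similarity of short documents (all words must be in the vocabulary)
doc1 = ["doctors", "perform", "surgery"]
doc2 = ["surgeons", "operate", "on", "a", "patient"]
print(vectors.n_similarity(doc1, doc2))
```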

Page 12

Recognising textual entailment

Premise: Two doctors perform surgery on patient.

Entailment: Doctors are performing surgery.

Neutral: Two doctors are performing surgery on a man.

Contradiction: Two surgeons are having lunch.

Example from Bowman et al. (2015)

Page 13

Obtaining word embeddings

• Word embeddings can be easily trained from any text corpus using available tools such as word2vec, Gensim, and GloVe (see the sketch below).

• Pre-trained word vectors for English, Swedish, and various other languages are available for download. word2vec, Swectors, Polyglot project, spaCy
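As a concrete illustration of the first point, here is how embeddings might be trained with Gensim's word2vec implementation; the corpus and hyperparameters are placeholders, not the lecture's settings:

```python
from gensim.models import Word2Vec

# toy corpus: one tokenised sentence per list
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram algorithm; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["cat"][:5])                 # first dimensions of the vector for 'cat'
print(model.wv.similarity("cat", "dog"))   # cosine similarity of two words
```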

Page 14

Compositional structure of word embeddings

[Figure: the words man, woman, king, and queen in the embedding space; the offset from man to woman is roughly parallel to the offset from king to queen.]
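The parallelogram in the figure can be probed with vector arithmetic: the vector king − man + woman should lie close to queen. A sketch using Gensim's built-in analogy query, assuming pre-trained vectors at a hypothetical path:

```python
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)  # hypothetical path

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # ideally [('queen', ...)]
```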

Page 15

Limitations of word embeddings

• There are many different facets of ‘similarity’. Is a cat more similar to a dog or to a tiger?

• Text data does not reflect many 'trivial' properties of words. For example, texts mention 'black sheep' more often than 'white sheep', even though most sheep are white.

• Text data does reflect human biases in the real world. king – man + woman = queen, doctor – man + woman = nurse

Goldberg (2017)

Page 16

Questions about word embeddings

• How to measure association strength? positive pointwise mutual information

• How to measure similarity? cosine similarity

• How to learn word embeddings from text? matrix factorisation, direct learning of the low-dimensional vectors

Page 17

Pointwise mutual information

• Raw counts favour pairs that involve very common contexts. the cat, a cat will receive higher weight than cute cat, small cat

• We want a measure that favours contexts in which the target word occurs more often than other words.

• A suitable measure is pointwise mutual information (PMI):

$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$

Example from Goldberg (2017)

Page 18

Pointwise mutual information

• We want to use PMI to measure the associative strength between a word 𝑤 and a context 𝑐 in a data set 𝐷:

• We can estimate the relevant probabilities by counting:

$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}$

$\mathrm{PMI}(w, c) = \log \frac{\#(w, c)/|D|}{(\#(w)/|D|) \cdot (\#(c)/|D|)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)}$

Page 19

Positive pointwise mutual information

• Note that PMI is infinitely small for unseen word–context pairs, and undefined for unseen target words.

• In positive pointwise mutual information (PPMI), all negative and undefined values are replaced by zero:

$\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0)$

• Because PPMI assigns high values to rare events, it is advisable to apply a count threshold or smooth the probabilities.

Page 20

Computing PPMI on a word–context matrix

             aardvark  computer  data  pinch  result  sugar
apricot             0         0     0      1       0      1
pineapple           0         0     0      1       0      1
digital             0         2     1      0       1      0
information         0         1     6      0       4      0

Page 21

Computing PPMI on a word–context matrix

Each count #(w, c) is first turned into the ratio P(w, c) / (P(w) · P(c)), estimated from the counts above with |D| = 19. For example:

apricot, pinch:      (1/19) / ((2/19) · (2/19))  = 4.75
digital, computer:   (2/19) / ((4/19) · (3/19))  = 3.17
information, data:   (6/19) / ((11/19) · (7/19)) = 1.48

Page 22

Computing PPMI on a word–context matrix

PMI is the (base-2) logarithm of this ratio; the aardvark column is undefined because that context never occurs:

             aardvark   computer  data      pinch     result    sugar
apricot      undefined  log 0.00  log 0.00  log 4.75  log 0.00  log 4.75
pineapple    undefined  log 0.00  log 0.00  log 4.75  log 0.00  log 4.75
digital      undefined  log 3.17  log 0.68  log 0.00  log 0.95  log 0.00
information  undefined  log 0.58  log 1.48  log 0.00  log 1.38  log 0.00

Page 23

Computing PPMI on a word–context matrix

             aardvark  computer  data  pinch  result  sugar
apricot          0.00      0.00  0.00   2.25    0.00   2.25
pineapple        0.00      0.00  0.00   2.25    0.00   2.25
digital          0.00      1.66  0.00   0.00    0.00   0.00
information      0.00      0.00  0.57   0.00    0.47   0.00
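The whole computation on the last few slides can be reproduced in a few lines of NumPy. This is a sketch assuming base-2 logarithms, which is what the numbers above imply:

```python
import numpy as np

# counts from the word–context matrix (rows: apricot, pineapple, digital,
# information; columns: aardvark, computer, data, pinch, result, sugar)
C = np.array([[0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 2, 1, 0, 1, 0],
              [0, 1, 6, 0, 4, 0]], dtype=float)

total = C.sum()                               # |D| = 19
p_wc = C / total                              # joint probabilities P(w, c)
p_w = C.sum(axis=1, keepdims=True) / total    # marginals P(w)
p_c = C.sum(axis=0, keepdims=True) / total    # marginals P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
    # replace negative and undefined values by zero
    ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

print(np.round(ppmi, 2))   # matches the table above, e.g. 2.25 for apricot–pinch
```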

Page 24

Questions about word embeddings

• How to measure association strength? positive pointwise mutual information

• How to measure similarity? cosine similarity

• How to learn word embeddings from text? matrix factorisation, direct learning of the low-dimensional vectors

Page 25

Distance-based similarity

• If we can represent words as vectors, then we can measure word similarity as the distance between the vectors.

• Most measures of vector similarity are based on the dot product or inner product from linear algebra.

Page 26

The dot product

v = (2, 2), w = (2, 1):    v · w = +6
v = (2, 2), w = (−2, 2):   v · w = 0
v = (2, 2), w = (−2, −1):  v · w = −6

Page 27

Problems with the dot product

• The dot product will be higher for vectors that represent words that have high co-occurrence counts or PPMI values.

• This means that, all other things being equal, the dot product of two words will be greater if the words are frequent.

• This makes the dot product problematic because we would like a similarity metric that is independent of frequency.

Page 28

Cosine similarity

• We can fix the dot product as a similarity measure by normalising each vector to unit length:

$\cos(\mathbf{v}, \mathbf{w}) = \frac{\mathbf{v}}{|\mathbf{v}|} \cdot \frac{\mathbf{w}}{|\mathbf{w}|} = \frac{\mathbf{v} \cdot \mathbf{w}}{|\mathbf{v}|\,|\mathbf{w}|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$

• This length-normalised dot product is the cosine similarity, whose values range from −1 (opposite) to +1 (identical); it is the cosine of the angle between the two vectors.
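A minimal NumPy sketch of this length-normalised dot product; the example vectors are illustrative:

```python
import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product of the length-normalised vectors."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([2.0, 2.0])
w = np.array([2.0, 1.0])
print(cosine(v, w))        # ≈ 0.95
print(cosine(10 * v, w))   # same value: scaling a vector does not change the cosine
```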

Page 29

This lecture

• Introduction to word embeddings

• Learning word embeddings

Page 30

Learning word embeddings

Page 31

Learning word embeddings

• Word embeddings from neural language models

• word2vec: continuous bag-of-words and skip-gram

• Word embeddings via singular value decomposition

• Contextualised embeddings – ELMo and BERT

Page 32

Neural networks as language models

[Figure: a neural language model reads the words 'the cat sat on the' at the input layer, passes them through a hidden layer, and predicts the next word 'mat' at the softmax layer.]

Page 33

Word embeddings via neural language models

• The neural language model is trained to predict the probability of the next word being 𝑤, given the preceding words:

• Each column of the matrix 𝑾 is a dim(𝒉)-dimensional vector that is associated with some vocabulary item 𝑤.

• We can view this vector as a representation of 𝑤 that captures its compatibility with the context represented by the vector 𝒉.

$P(w \mid \text{preceding words}) = [\mathrm{softmax}(\mathbf{W}\mathbf{h})]_{w}$
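A minimal PyTorch sketch of such a language model (a hypothetical toy, not the lecture's exact network). The output layer's weight matrix plays the role of 𝑾 above: it holds one dim(𝒉)-dimensional vector per vocabulary item, which can be read off as word representations.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Predict the next word from a fixed window of preceding words."""
    def __init__(self, vocab_size, emb_dim=50, context_size=4, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(context_size * emb_dim, hidden_dim)
        # each row of self.out.weight is a dim(h)-dimensional vector associated
        # with one vocabulary item (a column of the matrix W on the slide)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):               # context: (batch, context_size)
        e = self.emb(context).flatten(1)      # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))        # context representation h
        return torch.log_softmax(self.out(h), dim=-1)   # log P(w | preceding words)

model = FeedForwardLM(vocab_size=1000)
logp = model(torch.randint(0, 1000, (1, 4)))   # log-probabilities over the vocabulary
word_vectors = model.out.weight                # one vector per vocabulary item
```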

Page 34

Network weights = word embeddings

[Figure: the hidden layer holds the context representation; the weight matrix between the hidden layer and the softmax layer holds one word representation per vocabulary item (cat, mat, on, sat, the).]

Intuitively, words that occur in similar contexts will have similar word representations.

Page 35

Training word embeddings using a language model

• Initialise the word vectors with random values. typically by uniform sampling from an interval around 0

• Train the network on large volumes of text. word2vec: 100 billion words

• Word vectors will be optimised to the prediction task. Words that tend to precede the same words will get similar vectors.

Page 36

Google’s word2vec

• Google’s word2vec implements two different training algorithms for word embeddings: continuous bag-of-words and skip-gram.

• Both algorithms obtain word embeddings from a binary prediction task: ‘Is this an actual word–context pair?’

• Positive examples are generated from a corpus. Negative examples are generated by taking 𝑘 copies of a positive example and randomly replacing the target word with some other word.

Page 37

The continuous bag-of-words model

[Figure: the context vectors 𝑐1 and 𝑐2 are summed; the dot product of this sum with the target word vector 𝑤 is passed through a sigmoid, yielding 𝑃(observed? | 𝑐1 𝑤 𝑐2).]

Page 38

The skip-gram model

[Figure: the target word vector 𝑤 is paired with each context vector 𝑐1 and 𝑐2 separately; each pair is scored by a dot product followed by a sigmoid, and the product of the two scores gives 𝑃(observed? | 𝑐1 𝑤 𝑐2).]
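Both figures boil down to scoring a word–context pair with a sigmoid over dot products of target and context vectors. A sketch of this binary prediction task with toy dimensions; it illustrates the objective and is not word2vec's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
W = rng.normal(scale=0.1, size=(vocab_size, dim))   # target word vectors
C = rng.normal(scale=0.1, size=(vocab_size, dim))   # context word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_score(w, c):
    """P(observed? | w, c): is (w, c) a real word–context pair?"""
    return sigmoid(W[w] @ C[c])

def cbow_score(w, context):
    """CBOW variant: sum the context vectors before taking the dot product."""
    return sigmoid(C[context].sum(axis=0) @ W[w])

# positive pair from the corpus vs. a negative pair with a randomly replaced target word
print(skipgram_score(w=17, c=42))
print(skipgram_score(w=rng.integers(vocab_size), c=42))
print(cbow_score(w=17, context=[3, 42]))
```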

Page 39

Word embeddings via singular value decomposition

• The rows of co-occurrence matrices are long and sparse. Instead, we would like to have word vectors that are short and dense, with length on the order of 10² instead of 10⁴.

• One idea is to approximate the co-occurrence matrix by another matrix with fewer columns. in practice, use tf-idf or PPMI instead of raw counts

• This problem can be solved by computing the singular value decomposition of the co-occurrence matrix.

Page 40

Singular value decomposition

[Figure: the word–context matrix 𝑴 (target words × context words, dimensions 𝑤 × 𝑐) is factorised as 𝑴 = 𝑼 𝜮 𝑽ᵀ, where 𝑼 is 𝑤 × 𝑤, 𝜮 is 𝑤 × 𝑐, and 𝑽 is 𝑐 × 𝑐.]

Landauer and Dumais (1997)

Page 41

Truncated singular value decomposition

[Figure: keeping only the first 𝑘 singular values truncates the factorisation: 𝑼 becomes 𝑤 × 𝑘 and 𝜮 becomes 𝑘 × 𝑐, while 𝑴 stays 𝑤 × 𝑐 and 𝑽 stays 𝑐 × 𝑐; 𝑘 = embedding width.]
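A NumPy sketch of truncating the SVD of a (PPMI-weighted) word–context matrix to width 𝑘 and using the truncated 𝑼 as the embedding matrix; the matrix here is a random placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((500, 1000))   # placeholder word–context matrix (e.g. PPMI values)
k = 50                        # embedding width

# full_matrices=False returns U with min(w, c) columns; keep only the first k
U, S, Vt = np.linalg.svd(M, full_matrices=False)
embeddings = U[:, :k]         # one k-dimensional row vector per target word
# optionally weight the columns by the singular values:
# embeddings = U[:, :k] * S[:k]

approx = U[:, :k] * S[:k] @ Vt[:k, :]   # rank-k approximation of M
```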

Page 42

Truncated singular value decomposition

[Figure: the same photograph reconstructed from truncated SVDs of widths 200, 100, 50, 20, 10, and 5. Photo credit: UMass Amherst.]

Page 43

Word embeddings via singular value decomposition

• Each row of the (truncated) matrix 𝑼 is a 𝑘-dimensional vector that represents the 'most important' information about a word; the columns are ordered in decreasing order of importance.

• A practical problem is that computing the singular value decomposition for large matrices is expensive. but has to be done only once!

Page 44

Connecting the two worlds

• The two algorithmic approaches that we have seen take two seemingly very different perspectives: ‘count-based’ and ‘neural’.

• However, a careful analysis reveals that the skip-gram model is implicitly factorising the PPMI matrix. Levy and Goldberg (2014)

Page 45

Contextualised embeddings: ELMo and BERT

• In the continuous bag-of-words and skip-gram models, each word vector is obtained from a local prediction task.

• In ELMo and BERT, each token is assigned a representation that is a function of the entire input sentence.

• The final vector for a word is a linear combination of the internal layers of a deep bidirectional recurrent neural network (LSTM).
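The lecture does not name a toolkit for this; purely as an illustration and an assumption on my part, contextual token vectors can be obtained from a pre-trained BERT model via the Hugging Face transformers library:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The play premiered yesterday", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# one vector per (sub)word token; each vector depends on the entire sentence
token_vectors = outputs.last_hidden_state[0]
print(token_vectors.shape)   # (number of tokens, hidden size)
```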


Page 46

The ELMo architecture

[Figure: the input sentence <BOS> The play premiered yesterday <EOS> is mapped to token embeddings and then passed through two bi-LSTM layers, each consisting of a forward and a backward LSTM running over the whole sentence.]

Page 47

Learning word embeddings

• Word embeddings from neural language models

• word2vec: continuous bag-of-words and skip-gram

• Word embeddings via singular value decomposition

• Contextualised embeddings – ELMo and BERT

Page 48

This lecture

• Introduction to word embeddings

• Learning word embeddings