
Page 1: Word sense induction using continuous vector space models

Mikael Kågebäck, Fredrik Johansson, Richard Johansson*, Devdatt Dubhashi

LAB, Chalmers University of Technology; *Språkbanken, University of Gothenburg

Page 2: Word Sense Induction (WSI)

• Automatic discovery of word senses.
  – Given a corpus, discover the senses of a given word, e.g. rock.

Page 3: Applications of WSI

• Novel sense detection
• Temporal/geographical word sense drift
• Localized word sense lexicons
  – Machine translation
  – Text understanding
  – more…

Page 4: Context clustering

1. Compute embeddings for word instances in a corpus, based on their context.
2. Cluster the space.
3. Let the centroids represent the senses.

• Pioneered by Hinrich Schütze (1998).
• Assumption: the distributional hypothesis holds.
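A minimal sketch of the three steps, assuming `embed` maps a word to a dense vector and `instances` is a list of (tokens, target position) pairs; both names are placeholders, not from the slides, and later slides replace the plain mean with ICE and choose the cluster count nonparametrically:

```python
import numpy as np
from sklearn.cluster import KMeans

def context_vectors(instances, embed, window=5):
    """One vector per instance of the target word: the mean embedding of
    the words in a symmetric window around it. embed(word) -> np.ndarray."""
    vecs = []
    for tokens, pos in instances:  # pos = index of the target word in tokens
        ctx = tokens[max(0, pos - window):pos] + tokens[pos + 1:pos + 1 + window]
        vecs.append(np.mean([embed(w) for w in ctx], axis=0))
    return np.vstack(vecs)

def induce_senses(instances, embed, n_senses):
    """Cluster the instance space; each centroid represents one sense."""
    X = context_vectors(instances, embed)
    return KMeans(n_clusters=n_senses, n_init=10).fit(X).cluster_centers_
```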

Page 5: Instance-context Embeddings (ICE)

• Based on word embeddings computed using the skip-gram model.
  – A low-rank approximate factorization of a normalized co-occurrence matrix C.
  – Context word embeddings in V, word embeddings in U.
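Concretely, Levy and Goldberg (2014) show that skip-gram with negative sampling (k negative samples per pair) implicitly factorizes a shifted PMI matrix:

```latex
% Optimum of the skip-gram negative-sampling objective
% (Levy & Goldberg, 2014): the word vector u_w (row of U) and the
% context vector v_c (row of V) satisfy
u_w^\top v_c \;\approx\; \mathrm{PMI}(w, c) - \log k
            \;=\; \log\frac{p(w, c)}{p(w)\,p(c)} - \log k .
```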

Page 6: Instance-context Embeddings (ICE)

Let the mean skip-gram vector over the context form the instance vector, but:
1. Apply a triangular window function.
2. Weight each context word.
   – This naturally removes stop words.
   – Related to PMI; Levy and Goldberg (2014).
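A sketch of the construction; `context_vec(w)` returns a row of the skip-gram context matrix V, and `weight(w)` stands in for the PMI-related weighting, whose exact form is not recoverable from this transcript:

```python
import numpy as np

def ice_vector(tokens, pos, context_vec, weight, window=10):
    """ICE-style instance vector: a weighted mean of skip-gram *context*
    embeddings around position pos. The triangular window gives the words
    closest to the target the largest weight; weight(w) is a placeholder
    for the paper's PMI-related term, which also downweights stop words."""
    acc, norm = 0.0, 0.0
    for offset in range(1, window + 1):
        tri = (window - offset + 1) / float(window)  # triangular window
        for i in (pos - offset, pos + offset):
            if 0 <= i < len(tokens):
                w = tri * weight(tokens[i])
                acc = acc + w * context_vec(tokens[i])
                norm += w
    return acc / norm if norm > 0 else None
```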

Page 7: Plotted instances for 'paper'

[Figure: instance vectors for 'paper', mean vector vs. ICE, projected with t-SNE.]
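For reference, a projection like the one on this slide can be produced with scikit-learn's t-SNE; X (the instance vectors) and labels (the induced senses) are assumed inputs:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_instances(X, labels, title="Instances for 'paper'"):
    """Project instance vectors to 2-D with t-SNE, colored by induced sense."""
    proj = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.scatter(proj[:, 0], proj[:, 1], c=labels, s=12)
    plt.title(title)
    plt.show()
```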

Page 8: Proposed algorithm

1. Train a skip-gram model on the corpus.
2. Compute instance representations using ICE.
   – One for each instance of a word in the corpus.
3. Cluster using (nonparametric) k-means.
   – The number of clusters is selected with the criterion of Pham et al. (2005).

• (Evaluation) Disambiguate test data using the obtained cluster centroids.
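The k-means step is nonparametric in the sense that the number of clusters is not fixed in advance. A sketch of the f(K) selection criterion, assuming the standard form given in Pham et al. (2005); the constants come from that paper, not from these slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k(X, k_max=10):
    """Select the number of senses with the f(K) criterion of
    Pham et al. (2005): f(K) compares the k-means distortion S_K with
    what S_{K-1} predicts; the smallest f(K) (typically < 0.85) wins."""
    d = X.shape[1]
    S, f, alpha = {}, {1: 1.0}, {}
    for k in range(1, k_max + 1):
        S[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        if k == 2:
            alpha[k] = 1.0 - 3.0 / (4.0 * d)
        elif k > 2:
            alpha[k] = alpha[k - 1] + (1.0 - alpha[k - 1]) / 6.0
        if k > 1:
            f[k] = S[k] / (alpha[k] * S[k - 1]) if S[k - 1] > 0 else 1.0
    return min(f, key=f.get)
```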

Page 9: SemEval 2013 task 13

• WSI: Identify senses in ukWaC.
• WSD: Disambiguate test words to one of the induced senses.
• Evaluation: Compare to the annotated WordNet labels.
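Disambiguation then reduces to assigning each test instance's vector to the closest induced centroid; cosine similarity is an assumption here, as the slides do not name the measure:

```python
import numpy as np

def disambiguate(instance_vec, centroids):
    """Return the index of the induced sense whose centroid is most
    similar to the instance vector (cosine similarity assumed)."""
    v = instance_vec / np.linalg.norm(instance_vec)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return int(np.argmax(C @ v))
```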

Page 10: Detailed results, SemEval 2013 task 13

System                                   Fuzzy B-Cubed   Fuzzy NMI
Best baseline (FBC): one sense           57%              0%
Best baseline (FNMI): one per instance    0%              5%
Topic-modeling-based WSI (Unimelb)       44%              4%
Language-modeling-based WSI (AI-KU)      35%              5%
Multi-sense skip-gram (MSSG)             46%              4%
MSSG + ICE weights                       49%              6%
ICE-kmeans                               51%              6%

Page 11: Detailed results, SemEval 2013 task 13

[Figure: harmonic mean of Fuzzy B-Cubed and Fuzzy NMI (y-axis 0-12%) for the same systems: the two best baselines, Unimelb, AI-KU, MSSG, MSSG + ICE weights, and ICE-kmeans.]

Page 12: Detailed results, SemEval 2013 task 13

[Figure: total relative improvement (y-axis -20% to 40%) across Unimelb, AI-KU, MSSG, MSSG + ICE weights, and ICE-kmeans; annotated "Total relative improvement: 33%".]

Page 13: Conclusions

• Using skip-gram word embeddings clearly boosts the performance of WSI.
• Provides a semantic representation for each word.
• Shows which context words are most important.

Page 14: ICE profile

[Figure: ICE profile.]

Page 15: Evaluation

• SemEval 2013, task 13
  – ukWaC corpus
  – 50 lemmas, with 100 instances per lemma
  – Annotated with WordNet senses