
COMP90042 1

WSTA Lecture 17 Word Sense Disambiguation

1. Word meaning

2. WordNet Lexical Resource

3. Word sense disambiguation

Slide credits: Steven Bird


Polysemy and Homonymy

•  Polysemy: one word with several related senses
   -  e.g. plain: clear, undecorated, unattractive, level area of land

•  Homonymy: distinct words that share a form
   -  Homograph: different words with same orthography
      •  e.g. dove: dive into water (past tense), white bird
      •  e.g. deal: distribute cards (verb), an agreement (noun)
   -  Homophone: different words with same sound
      •  e.g. see, sea
      •  e.g. French: vert, verre, vers, ver (green, glass, towards, worm)


What is a word’s meaning?

•  Meaning is hard to pin down
   -  as referent: words stand for objects/concepts in the world
   -  as mental image: words convey conceptual grouping
   -  as context: meaning as the set of contexts in which a word can occur
   -  as dictionary entry: let a lexicographer decide

•  Dictionary definitions
   -  all definitions are ultimately circular: dictionaries just give paraphrases
   -  what meaning is really contained in a dictionary entry?
      •  cf. bilingual dictionary, giving foreign translations

•  Explicit semantic databases
   -  SQL statements, first order logic, …


Hypernyms and Hyponyms

•  Hypernym/Hyponym = generic/specific
•  e.g. fork is a kind of cutlery
   -  “fork” is a hyponym of “cutlery”
   -  “cutlery” is a hypernym of “fork”
•  Induces forest structure on our set of words
•  Also gives a measure of semantic distance
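The idea that a hypernym hierarchy yields a distance can be sketched with a toy example. The links below are made up for illustration (they are not read from WordNet), and `path_distance` is a hypothetical helper that counts hypernym edges through the closest shared ancestor. Words in different trees of the forest have no path, so the helper returns `None`.

```python
# Toy hypernym links (child -> parent); illustrative only, not taken from WordNet.
HYPERNYM = {
    "fork": "cutlery",
    "spoon": "cutlery",
    "cutlery": "tableware",
    "plate": "crockery",
    "crockery": "tableware",
    "stroll": "walk",  # a separate tree, so the structure is a forest
}

def ancestors(word):
    """Return the chain [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def path_distance(a, b):
    """Count hypernym edges between a and b via their closest shared ancestor.

    Returns None when the two words live in different trees.
    """
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None

print(path_distance("fork", "spoon"))   # 2
print(path_distance("fork", "plate"))   # 4
print(path_distance("fork", "stroll"))  # None
```

Siblings under the same hypernym come out closer than words that only meet higher up, which is the intuition behind path-based similarity measures.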


Holonyms and Meronyms

•  Holonym/Meronym (whole/part), 3 subtypes:
   1.  Part: bone is part of arm
   2.  Member: arm is member of body
   3.  Substance: bone is substance of horn


Other Lexical Relationships

•  Synonym/Antonym:
   -  same vs complementary referential meanings
•  Hypernym/Troponym:
   -  walk is a hypernym of stroll
      •  walking is the more general action
   -  stroll is a troponym of walk
      •  to stroll is a particular way to walk
•  Entails:
   -  walking entails stepping
   -  snoring entails sleeping
•  Many more lexical relationships exist...


Data for Exploring Lexical Semantics

•  Thesaurus:
   -  Synonyms and Antonyms
•  WordNet:
   -  Synonyms and Antonyms
   -  Hypernyms and Hyponyms, Hypernyms and Troponyms
   -  Meronyms and Holonyms
   -  Entails


WordNet: Introduction

•  A lexical database
   -  Inspired by psycholinguistic theories of human lexical memory
   -  Establishes a massive network of lexical items and lexical relationships
   -  English WordNet
      •  Four categories: noun, verb, adjective, adverb
      •  Nouns: 120,000; Verbs: 12,000; Adjectives: 21,000; Adverbs: 4,000
   -  WordNets in other languages [www.globalwordnet.org]
      •  WordNets exist or are in preparation for: Afrikaans, Albanian, Arabic, Assamese, Bantu, Basque, Bengali, Bodo, Bulgarian, Burmese, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Kannada, Kashmiri, Konkani, Korean, Kurdish, Lao, Latin, Latvian, Macedonian, Malayalam, Malaysian, Maltese, Marathi, Meitei, Moldavian, Mongolian, Nepali, Norwegian, Oriya, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Serbian, Sinhala, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese


Synonym Sets - Synsets

•  Words are ambiguous
   -  e.g. “fork” in earlier slide
   -  the different senses participate in different lexical relations
•  Nodes in WordNet represent “synonym sets”, or synsets
   -  e.g. {chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug} (a person who is gullible and easy to take advantage of)
•  Applications:
   -  Overcome limitations in other data (e.g. NLU)
   -  Implement selectional restrictions (use WordNet categories on grammar productions, e.g., can only “eat” with a certain sense of “fork”)


NLTK WordNet Interface

>>> import nltk
>>> # all synsets (senses) whose lemmas include 'fork'
>>> nltk.corpus.wordnet.synsets('fork')
[Synset('fork.n.01'), Synset('branching.n.01'), Synset('fork.n.03'), Synset('fork.n.04'), Synset('crotch.n.02'), Synset('pitchfork.v.01'), Synset('fork.v.02'), Synset('branch.v.02'), Synset('fork.v.04')]
>>> # the synonyms grouped in the second synset
>>> nltk.corpus.wordnet.synsets('fork')[1].lemma_names()
['branching', 'ramification', 'fork', 'forking']
>>> # its dictionary-style gloss
>>> nltk.corpus.wordnet.synsets('fork')[1].definition()
'the act of branching out or dividing into branches'
>>> # its more generic parent synset(s)
>>> nltk.corpus.wordnet.synsets('fork')[1].hypernyms()
[Synset('division.n.03')]


Word Sense Disambiguation

•  Applications:
   -  semantic analysis, machine translation, information retrieval, homograph resolution in text-to-speech, sentence boundary detection, restoring accents and capitals, automatic diacritics while typing
•  Example
•  Definition and application
•  Training data: SENSEVAL, SEMCOR
•  Methods for robust WSD
   -  Supervised classifiers
   -  Semi-supervised method
   -  Unsupervised clustering
•  Issues


Example

•  “The US puts a new face on the chase for Saddam”
   -  US/n
      •  First person plural inclusive pronoun
      •  Abbrev. The United States of America
   -  Put/v.t.
      •  Transfer to a specified place
      •  Express in words
      •  Propel from hand with pushing motion
   -  Put/n
      •  Throw of shot
      •  Option of selling stock at a certain date


Example (cont)

•  “The US puts a new face on the chase for Saddam”
   -  new/a
      •  Invented, discovered, previously unknown
      •  Fresh, further, additional
      •  Different, changed, substituted for old
      •  Of recent growth, origin or manufacture
   -  new/adv ...
   -  face/n
      •  Front of head
      •  Expression, grimace
      •  Aspect (on the face of it...)
      •  ....


The Context

•  An avalanche of competing interpretations
   -  5,760 different sense combinations in the example
   -  as sentences grow, the number of interpretations grows exponentially
•  The task: disambiguate two or more semantically distinct forms which have been conflated into the same representation in some medium (Yarowsky)
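The combinatorial blow-up is just a product of per-word sense counts. The sketch below uses hypothetical counts (the slides do not list the numbers behind the 5,760 figure, so these values are assumptions chosen only to show the arithmetic).

```python
from math import prod

# Hypothetical number of senses per ambiguous word. These values are
# illustrative only and are NOT the counts behind the 5,760 figure.
sense_counts = {"US": 2, "puts": 5, "new": 4, "face": 6, "chase": 3}

def num_interpretations(counts):
    """Number of sense combinations: the product of per-word sense counts."""
    return prod(counts.values())

short_total = num_interpretations(sense_counts)

# Adding one more ambiguous word multiplies the total by its sense count.
longer_counts = dict(sense_counts, hunt=3)
longer_total = num_interpretations(longer_counts)

print(short_total, longer_total)  # 720 2160
```

Each additional ambiguous word multiplies the total, which is why the number of readings grows exponentially with sentence length.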


Methods for Robust WSD

•  Simple n-gram methods won't work - why not?
   -  disambiguating context
   -  the tag on the word
•  Popular approaches
   -  Feature based classifiers
   -  Unsupervised methods
•  Supervised training data
   -  lexical sample: many labelled examples for a single polysemous word
   -  all-words: sense annotations for all ambiguous words in documents


Training Data for WSD
SENSEVAL ‘lexical sample’

<instance id="hard-a.br-a06:">
<answer instance="hard-a.br-a06:" senseid="HARD1"/>
<context>
<s snum=29> `` <p="``"/> i <p="PRP"/> find <p="VBP"/> it <p="PRP"/> <head>hard <p="JJ"/></head> to <p="TO"/> understand <p="VB"/> how <p="WRB"/> anyone <p="NN"/> seeking <p="VBG"/> a <p="DT"/> position <p="NN"/> in <p="IN"/> public <p="JJ"/> life <p="NN"/> could <p="MD"/> demonstrate <p="VB"/> such <p="JJ"/> poor <p="JJ"/> judgment <p="NN"/> and <p="CC"/> bad <p="JJ"/> taste <p="NN"/> . <p="."/>
</context>
</instance>

•  Competition data from SENSEVAL events
   •  SENSEVAL-1: 35 words, 12k instances
   •  SENSEVAL-2: 73 words, 12k instances
   •  SENSEVAL-3, SEMEVAL…
•  Many datasets available: http://www.d.umn.edu/~tpederse/data.html


Training Data for WSD
SEMCOR ‘all-words’

<context filename="br-a01" paras="yes">
<p pnum="1">
<s snum="1">
<wf cmd="ignore" pos="DT">The</wf>
<wf cmd="done" rdf="group" pos="NNP" lemma="group" wnsn="1" lexsn="1:03:00::" pn="group">Fulton_County_Grand_Jury</wf>
<wf cmd="done" pos="VB" lemma="say" wnsn="1" lexsn="2:32:00::">said</wf>
<wf cmd="done" pos="NN" lemma="friday" wnsn="1" lexsn="1:28:00::">Friday</wf>
<wf cmd="ignore" pos="DT">an</wf>
<wf cmd="done" pos="NN" lemma="investigation" wnsn="1" lexsn="1:09:00::">investigation</wf>
<wf cmd="ignore" pos="IN">of</wf>
<wf cmd="done" pos="NN" lemma="atlanta" wnsn="1" lexsn="1:15:00::">Atlanta</wf>
<wf cmd="ignore" pos="POS">'s</wf>
<wf cmd="done" pos="JJ" lemma="recent" wnsn="2" lexsn="5:00:00:past:00">recent</wf>
…

Semcor
•  352 documents from Brown corpus manually tagged for WordNet senses
•  Several versions available from http://web.eecs.umich.edu/~mihalcea/downloads.html


Sense tagged corpora in NLTK

>>> from nltk.corpus import senseval, semcor
>>> # SENSEVAL lexical sample: one file per target word
>>> senseval.fileids()
['serve.pos', 'interest.pos', 'hard.pos', 'line.pos']
>>> # each instance holds a POS-tagged context and its sense label(s)
>>> senseval.instances('serve.pos')
[SensevalInstance(word='serve-v', position=42, context=[('some', 'DT'), ('tart', 'JJ'), ('fruits', 'NNS'), …, ('seems', 'VBZ'), ('wiser', 'JJR'), ('to', 'TO'), ('serve', 'VB'), ('it', 'PRP'), ('plain', 'JJ'), ('with', 'IN'), ('a', 'DT'), ('good', 'JJ'), ('sharp', 'JJ'), ('cheese', 'NN'), …], senses=('SERVE10',)), …]
>>> # SemCor all-words: first sentence, with WordNet sense annotations
>>> semcor.tagged_sents(tag='sem')[0]
[['The'], Tree(Lemma('group.n.01.group'), [Tree('NE', ['Fulton', 'County', 'Grand', 'Jury'])]), Tree(Lemma('state.v.01.say'), ['said']), Tree(Lemma('friday.n.01.Friday'), ['Friday']), ['an'], Tree(Lemma('probe.n.01.investigation'), ['investigation']), ['of'], Tree(Lemma('atlanta.n.01.Atlanta'), ['Atlanta']), ["'s"], Tree(Lemma('late.s.03.recent'), ['recent']), Tree(Lemma('primary.n.01.primary_election'), ['primary', 'election']), Tree(Lemma('produce.v.04.produce'), ['produced']), ['``'], ['no'], Tree(Lemma('evidence.n.01.evidence'), ['evidence']), ["''"], ['that'], ['any'], Tree(Lemma('abnormality.n.04.irregularity'), ['irregularities']), Tree(Lemma('happen.v.01.take_place'), ['took', 'place']), ['.']]


Inputs: Feature Vectors 1

•  What contextual features have good predictive value?
•  Local context
   -  e.g. (target word: bars)
         Reid  saw  me   looking  at  the  iron  bars  .
         NNP   VBD  PRP  VBG      IN  DT   NN    NNS   .
   -  local POS around the word
      •  P0 = NNS; P-1 = NN; P1 = .; P-2 = DT; ...
   -  unigrams and collocations
      •  nearby word ‘iron’; nearby word ‘gin’; …
      •  ‘the iron X’; ‘iron X .’; ‘the __ X’
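The local-context features above can be sketched as a small extraction function. `local_features` and its feature names (`P-1`, `W-1`, `C-2,-1`, and so on) are hypothetical conventions chosen to mirror the slide's notation, and the window size of 2 is an assumption.

```python
def local_features(tagged, i, window=2):
    """Local-context features for the target token at index i.

    tagged is a list of (word, POS) pairs for one sentence.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tagged):
            word, pos = tagged[j]
            feats[f"P{offset}"] = pos               # POS at this offset
            if offset != 0:
                feats[f"W{offset}"] = word.lower()  # nearby word (unigram)

    def word_at(j):
        # Pad beyond the sentence boundary so collocations are always defined.
        return tagged[j][0].lower() if 0 <= j < len(tagged) else "<pad>"

    # Collocations around the target, written with X in the target position.
    feats["C-2,-1"] = f"{word_at(i - 2)} {word_at(i - 1)} X"
    feats["C-1,1"] = f"{word_at(i - 1)} X {word_at(i + 1)}"
    return feats

sentence = [("Reid", "NNP"), ("saw", "VBD"), ("me", "PRP"), ("looking", "VBG"),
            ("at", "IN"), ("the", "DT"), ("iron", "NN"), ("bars", "NNS"), (".", ".")]

features = local_features(sentence, 7)  # index 7 is the target word 'bars'
for name in sorted(features):
    print(name, "=", features[name])
```

A dictionary of named features like this can be passed to most feature-based classifiers after one-hot encoding.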


Inputs: Feature Vectors 2

•  Syntactic relations
   -  E.g., he turned his attention to the workbench (target: turned)
      •  subject = he/PRP; object = attention/NN; active voice
   -  E.g., he turned his attention to the workbench (target: attention)
      •  head = turned/VBD; active voice; head to the left
   -  E.g., the modern tram is a green machine . (target: green)
      •  head = machine/NN
•  Combine all these inputs in a classifier


WSD as Classification

•  Supervised classification
   -  given feature vectors for each occurrence of our word in context
   -  and its label (e.g., bass/fish)
   -  use to train a classifier, e.g.,
      •  logistic regression
      •  support vector machine
      •  naïve Bayes, k-NN, etc
•  Evaluate performance on test data
   -  baseline = most frequent sense (hard to beat!)
   -  measure accuracy, precision and recall
•  See Lee & Ng (2002) for overview
   -  best accuracy for SVMs vs several other classifiers compared on SENSEVAL-1 and 2
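As a minimal sketch of this setup, the block below trains a naïve Bayes classifier with add-one smoothing on bag-of-words contexts and compares it with the most-frequent-sense baseline. The training and held-out contexts for "bass" are invented toy data, and `train_nb` and `predict_nb` are hypothetical helper names, not part of any library.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """Collect counts from (context_words, sense) training pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        for w in words:
            word_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, word_counts, vocab

def predict_nb(model, words):
    """Return the sense with the highest add-one-smoothed log posterior."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(n / total)
        for w in words:
            if w in vocab:  # ignore words never seen in training
                score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy contexts for the ambiguous word "bass" (invented for illustration).
train = [
    (["caught", "river", "fishing"], "fish"),
    (["grilled", "sea", "fishing"], "fish"),
    (["lake", "caught", "boat"], "fish"),
    (["guitar", "band", "played"], "music"),
    (["played", "jazz", "band"], "music"),
]
held_out = [
    (["fishing", "boat"], "fish"),
    (["jazz", "guitar"], "music"),
    (["river", "lake"], "fish"),
]

model = train_nb(train)
nb_accuracy = sum(predict_nb(model, w) == gold for w, gold in held_out) / len(held_out)

# Baseline: always predict the most frequent sense in the training data.
most_frequent = model[0].most_common(1)[0][0]
baseline_accuracy = sum(gold == most_frequent for _, gold in held_out) / len(held_out)

print(nb_accuracy, baseline_accuracy)
```

On this toy data the classifier beats the baseline easily. With real, skewed sense distributions the most-frequent-sense baseline is much harder to beat.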


Other approaches

•  Lesk’s dictionary based method
   •  one of the earliest approaches
   •  look for terms from the dictionary definition
      •  difficult, hard -- (not easy; requiring great physical or mental effort to accomplish or comprehend or endure; "a difficult task"; "nesting places on the cliffs are difficult of access"; "difficult times"; "a difficult child"; "found himself in a difficult situation"; "why is it so hard for you to keep a secret?")
      •  e.g., search for effort, endure, easy, etc in the context
   •  no need for explicit supervision, but depends on comprehensive definitions
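A simplified version of the overlap idea can be sketched as follows. This is a toy variant, not Lesk's original formulation: the "difficult" gloss is the one quoted above, the second gloss is a hypothetical paraphrase added for contrast, and the stopword list is an ad hoc assumption.

```python
import re

# Small ad hoc stopword list, assumed for this example only.
STOPWORDS = {"a", "and", "his", "it", "not", "of", "or", "the", "to", "under", "was"}

def content_words(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:  # ties keep the earlier sense
            best_sense, best_overlap = sense, overlap
    return best_sense

GLOSSES = {
    # Gloss quoted from the dictionary entry shown above.
    "difficult": "not easy; requiring great physical or mental effort "
                 "to accomplish or comprehend or endure",
    # Hypothetical gloss for a second sense of 'hard', for illustration only.
    "solid": "resisting weight or pressure; firm and solid",
}

print(simplified_lesk("It took great effort to endure the hard climb", GLOSSES))
print(simplified_lesk("The hard bench was firm and solid under his weight", GLOSSES))
```

The method needs no labelled data, but its quality depends on how much vocabulary the glosses share with real contexts.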


Yarowsky ‘boot-strap’ algorithm

•  Assumes two properties
   •  One sense per collocation (OSPC): local contexts (collocations) are highly informative of word sense (this is exploited by the classifier’s features)
   •  One sense per discourse (OSPD): documents tend to use only one sense of an ambiguous word (holds 99% of the time)
•  Example
   •  d1
         living    … close-up studies of plant life and natural ...
         living    … many dangers to plant and animal life …
         ???       … cell types found in the plant kingdom are …
   •  d2
         factory   … discovered at a St. Louis manufacturing plant .
         ???       … computer disk drive plant located in …
•  OSPC: highly informative context words: life, manufacturing
•  OSPD: resolves the difficult instances based on easy ones in same document


Yarowsky algorithm

•  Proceed as follows, for a given focus word
   1.  Sense label a small “seed” collection, to use as training
       •  automatically from dictionary
       •  dream up one or two good examples per sense
       •  annotate a few corpus examples
   2.  Repeat
       1.  learn a classifier on the training set
       2.  predict senses for the remaining unlabelled text
       3.  add the most confident predictions to training, excluding documents that don’t obey the ‘one-sense-per-discourse’ heuristic
   3.  Apply final classifier to test data
       •  use OSPD to vote for the best sense
•  Rivals supervised classification (Yarowsky, 1995)
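The loop above can be sketched in a much simplified form. This is not Yarowsky's actual method: the classifier here is a plain co-occurrence vote rather than a decision list, the filter that excludes documents violating one-sense-per-discourse is omitted, and `bootstrap`, `min_margin` and the third toy document are assumptions made for illustration. Contexts are given as pre-filtered content words without the target word "plant".

```python
from collections import Counter, defaultdict

def bootstrap(docs, seeds, rounds=5, min_margin=1):
    """Label contexts of one focus word by seed-driven bootstrapping.

    docs:  list of documents, each a list of contexts (lists of words).
    seeds: dict mapping a seed collocate to a sense.
    Returns labels[d][c], the sense for context c of document d (or None).
    """
    labels = [[None] * len(doc) for doc in docs]

    # Step 1: seed labelling with unambiguous seed collocates.
    for d, doc in enumerate(docs):
        for c, ctx in enumerate(doc):
            hits = {seeds[w] for w in ctx if w in seeds}
            if len(hits) == 1:
                labels[d][c] = hits.pop()

    # Step 2: repeatedly learn from labelled contexts, then label confident ones.
    for _ in range(rounds):
        counts = defaultdict(Counter)  # word -> sense -> co-occurrence count
        for d, doc in enumerate(docs):
            for c, ctx in enumerate(doc):
                if labels[d][c] is not None:
                    for w in set(ctx):
                        counts[w][labels[d][c]] += 1
        changed = False
        for d, doc in enumerate(docs):
            for c, ctx in enumerate(doc):
                if labels[d][c] is None:
                    votes = Counter()
                    for w in set(ctx):
                        votes.update(counts[w])
                    ranked = votes.most_common(2)
                    if ranked:
                        runner_up = ranked[1][1] if len(ranked) > 1 else 0
                        if ranked[0][1] - runner_up >= min_margin:
                            labels[d][c] = ranked[0][0]
                            changed = True
        if not changed:
            break

    # Step 3: one sense per discourse, the majority sense fills the gaps.
    for d in range(len(docs)):
        known = Counter(label for label in labels[d] if label is not None)
        if known:
            majority = known.most_common(1)[0][0]
            labels[d] = [label if label is not None else majority for label in labels[d]]
    return labels

docs = [
    [["studies", "life", "natural"], ["dangers", "animal", "life"],
     ["cell", "types", "found", "kingdom"]],                        # d1
    [["discovered", "louis", "manufacturing"],
     ["computer", "disk", "drive", "located"]],                     # d2
    [["animal", "species"], ["species", "habitat"]],                # invented extra document
]
seeds = {"life": "living", "manufacturing": "factory"}

result = bootstrap(docs, seeds)
for doc_labels in result:
    print(doc_labels)
```

In this toy run the third document is labelled over two rounds (first through "animal", then through "species"), while the "???" contexts in d1 and d2 are resolved only by the final one-sense-per-discourse vote.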


Unsupervised 1

•  Represent context of target word 'suit' as vectors <F1, F2, ... Fv>
   -  The suit against the union was successful and many workers lost their homes to pay off the judgment. [1]
   -  Mantle, more concerned with dress, buys his suits four at a time at Neiman-Marcus in Dallas and pays as much as $250 each. [2]

[Figure: contexts 1 and 2 plotted as points in a space with axes “Pay” and “Union”]


Unsupervised 2

•  Cluster context vectors
   -  Cosine, Euclidean distance
   -  Hierarchical clustering, K-means, EM
   -  Dimensionality reduction, e.g., latent semantic indexing (LSI)

[Figure: context vectors on axes “Pay” and “Union”, grouped into clusters SENSE1 and SENSE2]
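Clustering context vectors can be sketched with a tiny k-means over two features. The vectors below are invented counts for the features "pay" and "union", and the initial centroids are fixed by hand so the run is deterministic. Practical implementations usually use random restarts and many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def kmeans(points, centroids, iters=10):
    """Plain k-means with Euclidean distance and caller-supplied initial centroids."""
    assign = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                updated.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                updated.append(centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return assign, centroids

# Invented (pay, union) counts for six contexts of 'suit'.
contexts = [(1, 2), (2, 3), (1, 3),   # lawsuit-like contexts: 'union' is frequent
            (3, 0), (2, 0), (3, 1)]   # clothing-like contexts: 'union' is rare

assignment, centres = kmeans(contexts, centroids=[contexts[0], contexts[3]])
print(assignment)  # [0, 0, 0, 1, 1, 1]
print(round(cosine(contexts[0], contexts[3]), 3))
```

The clusters are unlabelled: mapping cluster 0 and cluster 1 to named senses still requires a human or an external sense inventory.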


Problems

•  Knowledge acquisition bottleneck
   -  Supervised methods need marked up data
•  Performance measurement?
   -  Non-uniform confusability
•  Incorporation into larger tasks
   -  e.g. parsing, translation, document retrieval
•  What is a word sense?
   -  Dictionaries as sense inventories
   -  Bilingual texts as sense inventories
   -  Dependent on contextual usage
•  Dealing with rapid language change...


Readings

•  One of the following:
   -  MS 7.1, 7.3-7.4: Word Sense Disambiguation
   -  JM 20.1-20.4.1, 20.6-20.7: Word Sense Disambiguation
•  Optional, for a more recent overview of the field:
   -  Roberto Navigli (2009), Word Sense Disambiguation: A Survey, ACM Computing Surveys, 41(2), pp. 1–69
•  Optional, for more details on lexical semantics:
   -  JM 19.1-19.3: Lexical Semantics