Language and Statistics II (nasmith/LS2.F06/lecture21.pdf)
TRANSCRIPT
Language and Statistics II
Lecture 21: Bootstrapping
Noah Smith
So Far …
• We’ve talked mainly about building models from either annotated data or unannotated data.
• We’ve focused on classes of models that predict different kinds of structure.
• We’ve explored different ways to estimate those models.
• Today, we focus on mixing labeled and unlabeled data.
Word Sense Disambiguation
• Can we disambiguate word senses?
• Homographs:
– park the car vs. walk in the park
– water the plant vs. work at the plant
– the x and y axes vs. chopping down trees with axes
– palm of my hand vs. palm tree
• Assume we know the set of senses for a word type. Can we pick the right one for ambiguous tokens in text?
• Note: the “output variable” ranges over a small, finite set. So machine learning people love WSD.
One Sense Per Discourse
p(most frequent sense | more than one occurrence)
p(more than one occurrence)
One Sense Per Discourse
• This is a fancy way of saying that, within a discourse (e.g., document), ambiguous tokens of the same type tend to be correlated.
One Sense Per Collocation
• Certain features of the context are very strong predictors for one sense or another.
– … power plant …
– … palm of …
– … the park …
• This is a fancy way of saying that (some) collocations are excellent features.
The Yarowsky Algorithm
• Given: ambiguous word type w, lots of text
1. Choose a few seed collocations for each sense and label data in those collocations.
The Yarowsky Algorithm
• Given: ambiguous word type w, lots of text
1. Choose a few seed collocations for each sense and label data in those collocations.
2. Train a supervised classifier on the labeled examples. (Yarowsky used a decision list.)
3. Label all examples. Keep the labels about which the supervised classifier was highly confident (above threshold).
• Optionally, exploit one-sense-per-discourse to “spread” a label throughout the discourse.
4. Go to 2.
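The loop above can be sketched in a few lines of Python. This is a toy sketch, not Yarowsky’s implementation: the decision list is approximated by smoothed per-feature log-odds rules, and the seed data, the threshold value, and names like `yarowsky` and `train_decision_list` are all illustrative.

```python
import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    """Rank collocation features by smoothed log-odds of sense given feature."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for feats, sense in labeled:
        senses.add(sense)
        for f in feats:
            counts[f][sense] += 1.0
    rules = []  # (score, feature, sense), strongest evidence first
    for f, by_sense in counts.items():
        total = sum(by_sense.values())
        for s in senses:
            p = (by_sense.get(s, 0.0) + alpha) / (total + alpha * len(senses))
            rules.append((math.log(p / (1.0 - p)), f, s))
    rules.sort(reverse=True)
    return rules

def classify(rules, feats):
    """Apply the highest-ranked rule whose feature occurs in the context."""
    for score, f, s in rules:
        if f in feats:
            return s, score
    return None, float("-inf")

def yarowsky(seeds, unlabeled, threshold=0.5, iters=5):
    labeled = list(seeds)                      # step 1: seed-labeled data
    for _ in range(iters):
        rules = train_decision_list(labeled)   # step 2: train classifier
        labeled = list(seeds)                  # non-seed labels may change
        for feats in unlabeled:                # step 3: keep confident labels
            sense, score = classify(rules, feats)
            if sense is not None and score > threshold:
                labeled.append((feats, sense))
    return train_decision_list(labeled)        # step 4 loops back to step 2

# Toy example: disambiguating "plant" from context words, one seed per sense.
seeds = [({"water", "plant"}, "living"), ({"power", "plant"}, "factory")]
unlabeled = [{"water", "plant", "grow"}, {"power", "plant", "worker"},
             {"grow", "plant", "leaf"}, {"worker", "plant", "shift"}]
rules = yarowsky(seeds, unlabeled)
```

The optional one-sense-per-discourse step would additionally spread a confident label to other tokens of w in the same document; it is omitted here for brevity.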
Whence Seeds?
• Yarowsky suggests:
– dictionary definitions
– single defining collocate (e.g., from WordNet)
– label extremely common collocations
• See Eisner & Karakos (2005) for more about seeds.
Experimental Results
Several Ways to Think About This
• Like Viterbi EM, but new features induced on each iteration.
– Yarowsky didn’t use a probability model in the conventional way; he used a decision list.
• Leveraging several assumptions about the data to help each other:
– One sense per collocation (inside the decision list)
– One sense per discourse (finding new collocations)
• Meta-learner in which any supervised method can be nested!
Important Note
• Yarowsky’s algorithm is not just for word sense! Similar algorithms have been applied to diverse problems:
– Named entity recognition
– Grammatical gender prediction
– Morphology learning
– Bilingual lexicon induction
– Parsing
Cotraining (Blum and Mitchell, 1998)
• Rather difficult paper, but rather elegant idea.
• Input is x; suppose it can be broken into x1 and x2, disjoint “views” of x.
• Cotraining iteratively builds two classifiers (one on x1 and one on x2) and uses each to help improve the other.
Cotraining
• Given labeled examples L, unlabeled examples U
1. Train c1 on x1 from L, and train c2 on x2 from L. (B&M used Naïve Bayes.)
2. Label examples in U using c1; add those it’s most confident about for each class to L.
3. Ditto (c2).
4. Go to 1.
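The four steps can be sketched as follows. This is a toy sketch under simplifying assumptions, not Blum and Mitchell’s code: the classifier is a minimal multinomial Naïve Bayes over bags of words, and names such as `cotrain` and `add_per_class` are illustrative.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes over bag-of-words sets."""
    def fit(self, examples):  # examples: list of (word set, label)
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for words, y in examples:
            self.label_counts[y] += 1
            for w in words:
                self.word_counts[y][w] += 1
                self.vocab.add(w)
        return self

    def score(self, words, y):  # log p(y) + sum_w log p(w|y), add-one smoothed
        total = sum(self.word_counts[y].values()) + len(self.vocab)
        s = math.log(self.label_counts[y] / sum(self.label_counts.values()))
        for w in words:
            s += math.log((self.word_counts[y][w] + 1) / total)
        return s

    def predict(self, words):  # returns (label, confidence margin)
        scored = sorted((self.score(words, y), y) for y in self.label_counts)
        return scored[-1][1], scored[-1][0] - scored[0][0]

def cotrain(L, U, rounds=3, add_per_class=1):
    """L: list of ((view1, view2), label); U: list of (view1, view2)."""
    U = list(U)
    for _ in range(rounds):
        c1 = NaiveBayes().fit([(x1, y) for (x1, x2), y in L])  # step 1
        c2 = NaiveBayes().fit([(x2, y) for (x1, x2), y in L])
        for c, view in ((c1, 0), (c2, 1)):                     # steps 2-3
            by_label = defaultdict(list)
            for x in U:
                y, conf = c.predict(x[view])
                by_label[y].append((conf, x))
            for y, cands in by_label.items():
                cands.sort(key=lambda t: t[0], reverse=True)
                for conf, x in cands[:add_per_class]:          # most confident
                    L = L + [(x, y)]
                    U.remove(x)
    return c1, c2

# Toy data in the WebKB spirit: view 1 = page words, view 2 = hyperlink words.
L = [(({"course", "syllabus"}, {"cs101", "class"}), "course"),
     (({"research", "publications"}, {"lab", "group"}), "other")]
U = [({"course", "homework"}, {"class", "fall"}),
     ({"publications", "papers"}, {"group", "people"})]
c1, c2 = cotrain(L, U, rounds=2)
```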
WebKB-Course Data
• Data: CS department sites from four universities
• Task: Is a given page a course web page or not?
• X1: bag of words in the page
• X2: bag of words in hyperlinks to the page
What’s Different?
• The “view” formulation.
– Yarowsky has one classifier; B&M have two.
• Yarowsky allows relabeling of unlabeled examples; B&M do not.
• Yarowsky (1995) focused on particular properties of the data and exploited them. No general claims.
• B&M (1998) were seeking a general meta-learner that could leverage unlabeled examples; they actually gave PAC-style learnability results under an assumption that X1 and X2 were conditionally independent given Y.
• Unlike EM, neither of these methods maintains posterior distributions over the labels.
Nigam and Ghani (2000)
• Compare EM and cotraining, with the same model/features. On the WebKB-Course dataset:
Nigam and Ghani (2000)
• Ceiling effects?
• Are the content/hyperlink views really independent? (Probably not.) Semi-synthetic experiment:
• EM > Cotraining
Hybrids
• EM: ||: A softly labels data; A trains :||
• Co-EM: ||: A softly labels data; B trains; B softly labels data; A trains :||
• Co-training: ||: A, B label a few examples; A, B train :||
• Self-training: ||: A labels a few examples; A trains :||
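To make the co-EM variant concrete, here is a minimal sketch: classifier A soft-labels the unlabeled data with posterior distributions, B trains on those fractional counts, then the roles swap. The fractional-count Naïve Bayes and all names (`co_em`, `fit`, `posterior`) are illustrative assumptions, not from the cited papers.

```python
import math
from collections import defaultdict

def fit(examples):
    """examples: list of (word set, {label: weight}); weights may be fractional."""
    word = defaultdict(lambda: defaultdict(float))
    prior = defaultdict(float)
    vocab = set()
    for words, dist in examples:
        for y, p in dist.items():
            prior[y] += p
            for w in words:
                word[y][w] += p
                vocab.add(w)
    return {"word": word, "prior": prior, "vocab": vocab}

def posterior(model, words):
    """Naive Bayes posterior over labels, with add-one smoothing."""
    logs = {}
    for y in model["prior"]:
        total = sum(model["word"][y].values()) + len(model["vocab"])
        s = math.log(model["prior"][y])
        for w in words:
            s += math.log((model["word"][y][w] + 1.0) / total)
        logs[y] = s
    m = max(logs.values())
    z = sum(math.exp(v - m) for v in logs.values())
    return {y: math.exp(v - m) / z for y, v in logs.items()}

def co_em(L, U, rounds=3):
    """L: list of ((view1, view2), label); U: list of (view1, view2)."""
    hard = lambda view: [(x[view], {y: 1.0}) for x, y in L]
    A = fit(hard(0))
    for _ in range(rounds):
        soft = [(x, posterior(A, x[0])) for x in U]      # A softly labels data
        B = fit(hard(1) + [(x[1], d) for x, d in soft])  # B trains
        soft = [(x, posterior(B, x[1])) for x in U]      # B softly labels data
        A = fit(hard(0) + [(x[0], d) for x, d in soft])  # A trains
    return A, B

# Toy data in the WebKB spirit: view 1 = page words, view 2 = hyperlink words.
L = [(({"course", "syllabus"}, {"cs101", "class"}), "course"),
     (({"research", "publications"}, {"lab", "group"}), "other")]
U = [({"course", "homework"}, {"class", "fall"}),
     ({"publications", "papers"}, {"group", "people"})]
A, B = co_em(L, U, rounds=2)
```

Unlike the cotraining sketch, no example is ever hard-labeled or removed from U; each pass re-estimates the posteriors, which is what makes this the EM-style member of the family.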
Results (Synthetic Data)
[Figure: accuracy plots comparing cotraining, self-training, co-EM, and EM]
More Results
• If no natural feature split is available, can split features randomly.
• On synthetic data, that actually worked better than the smart split!
• On real data, best results came from self-training (!?!?)
– Hard to draw any firm conclusions.
– Possibly has to do with the supervised learner (why not use something more powerful than Naïve Bayes?).
– Ng and Cardie (2003): more mixed results, but come out in favor of “single-view” algorithms.
– Critical comment: go back to the objective function!
Abney (2004)
• “Understanding the Yarowsky Algorithm”
• Entirely under-appreciated paper!
• Demonstrates that certain variants of the Yarowsky algorithm are actually optimizing likelihood. Others are optimizing a bound on likelihood.
• Likelihood under what model?
Understanding the Abney Understanding of the Yarowsky Algorithm
• Modifications:
– Once an originally-unlabeled example is labeled, it stays labeled.
– Fix threshold at 1/(# classes).
• Assumption: base learner improves KL divergence between the empirical distribution and the base model.
– Either on labeled examples only,
– or overall (assuming unlabeled examples have a uniform empirical distribution)
• Yarowsky’s base learner doesn’t do this; Abney gives variants that do.
– The “DL-EM” base learners he describes essentially amount to a single step of the EM algorithm.
• The proofs are involved; the insight (I believe) is that the algorithm starts to look more like (Viterbi) EM with some labels fixed so they can’t change.
Cotraining for Parsing?
• Steedman et al. (2003) cotrained two parsers.
Parser Self-training
Parser Cotraining
Steedman et al., 2003
• Also showed cross-domain improvement (WSJ and Brown corpus).
• If you start with “enough” labeled data, cotraining doesn’t help.
– Similar to many other results: Merialdo (1994), Elworthy (1994), Smith (2006), …
Semisupervised Learning: Hot
• Adaptation to new domains
– Or languages! Hwa et al., 2002; Wicentowski et al., 2001; Smith and Smith, 2004, …
• Ando and Zhang (2005): use multiple tasks to leverage unlabeled data
• Lessen the cost of annotation projects (annotate fewer examples)
• Interesting theoretical topic (many papers lately)
• So much unlabeled data, how could we not want to learn from it!
Two Important Lessons
• There usually is no unqualified “best” method. All kinds of things affect this. More subtle questions than, “does A beat B”:
– What conditions lead to better performance for A vs. B?
– What kinds of errors is A more susceptible to than B?
• Nifty ideas can often be shown (sometimes years later) to have solid mathematical underpinnings.