Language and Statistics II (nasmith/LS2.F06/lecture21.pdf)
TRANSCRIPT
Language and Statistics II
Lecture 21: Bootstrapping
Noah Smith
So Far …
• We’ve talked mainly about building models from either annotated data or unannotated data.
• We’ve focused on classes of models that predict different kinds of structure.
• We’ve explored different ways to estimate those models.
• Today, we focus on mixing labeled and unlabeled data.
Word Sense Disambiguation
• Can we disambiguate word senses?
• Homographs:
– park the car vs. walk in the park
– water the plant vs. work at the plant
– the x and y axes vs. chopping down trees with axes
– palm of my hand vs. palm tree
• Assume we know the set of senses for a word type. Can we pick the right one for ambiguous tokens in text?
• Note: the “output variable” ranges over a small, finite set. So machine learning people love WSD.
One Sense Per Discourse
p(most frequent sense | more than one occurrence)
p(more than one occurrence)
One Sense Per Discourse
• This is a fancy way of saying that, within a discourse (e.g., document), ambiguous tokens of the same type tend to be correlated.
One Sense Per Collocation
• Certain features of the context are very strong predictors for one sense or another.
– … power plant …
– … palm of …
– … the park …
• This is a fancy way of saying that (some) collocations are excellent features.
The Yarowsky Algorithm
• Given: ambiguous word type w, lots of text
1. Choose a few seed collocations for each sense and label data in those collocations.
The Yarowsky Algorithm
• Given: ambiguous word type w, lots of text
1. Choose a few seed collocations for each sense and label data in those collocations.
2. Train a supervised classifier on the labeled examples. (Yarowsky used a decision list.)
3. Label all examples. Keep the labels about which the supervised classifier was highly confident (above threshold).
• Optionally, exploit one-sense-per-discourse to “spread” a label throughout the discourse.
4. Go to 2.
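The loop above can be sketched in a few lines of Python. This is a toy sketch, not Yarowsky’s implementation: the decision list is approximated by smoothed per-feature log-odds rules, and the seed data, the threshold value, and names like `yarowsky` and `train_decision_list` are all illustrative.

```python
import math
from collections import defaultdict

def train_decision_list(labeled, alpha=0.1):
    """Rank collocation features by smoothed log-odds of sense given feature."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set()
    for feats, sense in labeled:
        senses.add(sense)
        for f in feats:
            counts[f][sense] += 1.0
    rules = []  # (score, feature, sense), strongest evidence first
    for f, by_sense in counts.items():
        total = sum(by_sense.values())
        for s in senses:
            p = (by_sense.get(s, 0.0) + alpha) / (total + alpha * len(senses))
            rules.append((math.log(p / (1.0 - p)), f, s))
    rules.sort(reverse=True)
    return rules

def classify(rules, feats):
    """Apply the highest-ranked rule whose feature occurs in the context."""
    for score, f, s in rules:
        if f in feats:
            return s, score
    return None, float("-inf")

def yarowsky(seeds, unlabeled, threshold=0.5, iters=5):
    labeled = list(seeds)                      # step 1: seed-labeled data
    for _ in range(iters):
        rules = train_decision_list(labeled)   # step 2: train classifier
        labeled = list(seeds)                  # non-seed labels may change
        for feats in unlabeled:                # step 3: keep confident labels
            sense, score = classify(rules, feats)
            if sense is not None and score > threshold:
                labeled.append((feats, sense))
    return train_decision_list(labeled)        # step 4 loops back to step 2

# Toy example: disambiguating "plant" from context words, one seed per sense.
seeds = [({"water", "plant"}, "living"), ({"power", "plant"}, "factory")]
unlabeled = [{"water", "plant", "grow"}, {"power", "plant", "worker"},
             {"grow", "plant", "leaf"}, {"worker", "plant", "shift"}]
rules = yarowsky(seeds, unlabeled)
```

The optional one-sense-per-discourse step would additionally spread a confident label to other tokens of w in the same document; it is omitted here for brevity.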
Whence Seeds?
• Yarowsky suggests:
– dictionary definitions
– single defining collocate (e.g., from WordNet)
– label extremely common collocations
• See Eisner & Karakos (2005) for more about seeds.
Experimental Results
Several Ways to Think About This
• Like Viterbi EM, but new features induced on each iteration.
– Yarowsky didn’t use a probability model in the conventional way; he used a decision list.
• Leveraging several assumptions about the data to help each other:
– One sense per collocation (inside the decision list)
– One sense per discourse (finding new collocations)
• Meta-learner in which any supervised method can be nested!
Important Note
• Yarowsky’s algorithm is not just for word sense! Similar algorithms have been applied to diverse problems:
– Named entity recognition
– Grammatical gender prediction
– Morphology learning
– Bilingual lexicon induction
– Parsing
Cotraining (Blum and Mitchell, 1998)
• Rather difficult paper, but rather elegant idea.
• Input is x; suppose it can be broken into x1 and x2, disjoint “views” of x.
• Cotraining iteratively builds two classifiers (one on x1 and one on x2) and uses each to help improve the other.
Cotraining
• Given labeled examples L, unlabeled examples U
1. Train c1 on x1 from L, and train c2 on x2 from L. (B&M used Naïve Bayes.)
2. Label examples in U using c1; add those it’s most confident about for each class to L.
3. Ditto (c2).
4. Go to 1.
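The four steps can be sketched as follows. This is a toy sketch under simplifying assumptions, not Blum and Mitchell’s code: the classifier is a minimal multinomial Naïve Bayes over bags of words, and names such as `cotrain` and `add_per_class` are illustrative.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes over bag-of-words sets."""
    def fit(self, examples):  # examples: list of (word set, label)
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()
        for words, y in examples:
            self.label_counts[y] += 1
            for w in words:
                self.word_counts[y][w] += 1
                self.vocab.add(w)
        return self

    def score(self, words, y):  # log p(y) + sum_w log p(w|y), add-one smoothed
        total = sum(self.word_counts[y].values()) + len(self.vocab)
        s = math.log(self.label_counts[y] / sum(self.label_counts.values()))
        for w in words:
            s += math.log((self.word_counts[y][w] + 1) / total)
        return s

    def predict(self, words):  # returns (label, confidence margin)
        scored = sorted((self.score(words, y), y) for y in self.label_counts)
        return scored[-1][1], scored[-1][0] - scored[0][0]

def cotrain(L, U, rounds=3, add_per_class=1):
    """L: list of ((view1, view2), label); U: list of (view1, view2)."""
    U = list(U)
    for _ in range(rounds):
        c1 = NaiveBayes().fit([(x1, y) for (x1, x2), y in L])  # step 1
        c2 = NaiveBayes().fit([(x2, y) for (x1, x2), y in L])
        for c, view in ((c1, 0), (c2, 1)):                     # steps 2-3
            by_label = defaultdict(list)
            for x in U:
                y, conf = c.predict(x[view])
                by_label[y].append((conf, x))
            for y, cands in by_label.items():
                cands.sort(key=lambda t: t[0], reverse=True)
                for conf, x in cands[:add_per_class]:          # most confident
                    L = L + [(x, y)]
                    U.remove(x)
    return c1, c2

# Toy data in the WebKB spirit: view 1 = page words, view 2 = hyperlink words.
L = [(({"course", "syllabus"}, {"cs101", "class"}), "course"),
     (({"research", "publications"}, {"lab", "group"}), "other")]
U = [({"course", "homework"}, {"class", "fall"}),
     ({"publications", "papers"}, {"group", "people"})]
c1, c2 = cotrain(L, U, rounds=2)
```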
WebKB-Course Data
• Data: CS department sites from four universities
• Task: Is a given page a course web page or not?
• X1: bag of words in the page
• X2: bag of words in hyperlinks to the page
What’s Different?
• The “view” formulation.
– Yarowsky has one classifier; B&M have two.
• Yarowsky allows relabeling of unlabeled examples; B&M do not.
• Yarowsky (1995) focused on particular properties of the data and exploited them. No general claims.
• B&M (1998) were seeking a general meta-learner that could leverage unlabeled examples; they actually gave PAC-style learnability results under an assumption that X1 and X2 were conditionally independent given Y.
• Unlike EM, neither of these methods maintains posterior distributions over the labels.
Nigam and Ghani (2000)
• Compare EM and cotraining, with the same model/features. On the WebKB-Course dataset:
Nigam and Ghani (2000)
• Ceiling effects?
• Are the content/hyperlink views really independent? (Probably not.) Semi-synthetic experiment:
• EM > Cotraining
Hybrids
• EM: ||: A softly labels data; A trains :||
• Co-EM: ||: A softly labels data; B trains; B softly labels data; A trains :||
• Co-training: ||: A, B label a few examples; A, B train :||
• Self-training: ||: A labels a few examples; A trains :||
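To make the co-EM variant concrete, here is a minimal sketch: classifier A soft-labels the unlabeled data with posterior distributions, B trains on those fractional counts, then the roles swap. The fractional-count Naïve Bayes and all names (`co_em`, `fit`, `posterior`) are illustrative assumptions, not from the cited papers.

```python
import math
from collections import defaultdict

def fit(examples):
    """examples: list of (word set, {label: weight}); weights may be fractional."""
    word = defaultdict(lambda: defaultdict(float))
    prior = defaultdict(float)
    vocab = set()
    for words, dist in examples:
        for y, p in dist.items():
            prior[y] += p
            for w in words:
                word[y][w] += p
                vocab.add(w)
    return {"word": word, "prior": prior, "vocab": vocab}

def posterior(model, words):
    """Naive Bayes posterior over labels, with add-one smoothing."""
    logs = {}
    for y in model["prior"]:
        total = sum(model["word"][y].values()) + len(model["vocab"])
        s = math.log(model["prior"][y])
        for w in words:
            s += math.log((model["word"][y][w] + 1.0) / total)
        logs[y] = s
    m = max(logs.values())
    z = sum(math.exp(v - m) for v in logs.values())
    return {y: math.exp(v - m) / z for y, v in logs.items()}

def co_em(L, U, rounds=3):
    """L: list of ((view1, view2), label); U: list of (view1, view2)."""
    hard = lambda view: [(x[view], {y: 1.0}) for x, y in L]
    A = fit(hard(0))
    for _ in range(rounds):
        soft = [(x, posterior(A, x[0])) for x in U]      # A softly labels data
        B = fit(hard(1) + [(x[1], d) for x, d in soft])  # B trains
        soft = [(x, posterior(B, x[1])) for x in U]      # B softly labels data
        A = fit(hard(0) + [(x[0], d) for x, d in soft])  # A trains
    return A, B

# Toy data in the WebKB spirit: view 1 = page words, view 2 = hyperlink words.
L = [(({"course", "syllabus"}, {"cs101", "class"}), "course"),
     (({"research", "publications"}, {"lab", "group"}), "other")]
U = [({"course", "homework"}, {"class", "fall"}),
     ({"publications", "papers"}, {"group", "people"})]
A, B = co_em(L, U, rounds=2)
```

Unlike the cotraining sketch, no example is ever hard-labeled or removed from U; each pass re-estimates the posteriors, which is what makes this the EM-style member of the family.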
Results (Synthetic Data)
[Figure: accuracy plots comparing cotraining, self-training, co-EM, and EM]
More Results
• If no natural feature split is available, can split features randomly.
• On synthetic data, that actually worked better than the smart split!
• On real data, best results came from self-training (!?!?)
– Hard to draw any firm conclusions.
– Possibly has to do with the supervised learner (why not use something more powerful than Naïve Bayes?).
– Ng and Cardie (2003): more mixed results, but come out in favor of “single-view” algorithms.
– Critical comment: go back to the objective function!
Abney (2004)
• “Understanding the Yarowsky Algorithm”
• Entirely under-appreciated paper!
• Demonstrates that certain variants of the Yarowsky algorithm are actually optimizing likelihood. Others are optimizing a bound on likelihood.
• Likelihood under what model?
Understanding the Abney Understanding of the Yarowsky Algorithm
• Modifications:
– Once an originally-unlabeled example is labeled, it stays labeled.
– Fix threshold at 1/(# classes).
• Assumption: base learner improves KL divergence between the empirical distribution and the base model.
– Either on labeled examples only,
– or overall (assuming unlabeled examples have a uniform empirical distribution)
• Yarowsky’s base learner doesn’t do this; Abney gives variants that do.
– The “DL-EM” base learners he describes essentially amount to a single step of the EM algorithm.
• The proofs are involved; the insight (I believe) is that the algorithm starts to look more like (Viterbi) EM with some labels fixed so they can’t change.
Cotraining for Parsing?
• Steedman et al. (2003) cotrained two parsers.
Parser Self-training
Parser Cotraining
Steedman et al., 2003
• Also showed cross-domain improvement (WSJ and Brown corpus).
• If you start with “enough” labeled data, cotraining doesn’t help.
– Similar to many other results: Merialdo (1994), Elworthy (1994), Smith (2006), …
Semisupervised Learning: Hot
• Adaptation to new domains
– Or languages! Hwa et al., 2002; Wicentowski et al., 2001; Smith and Smith, 2004, …
• Ando and Zhang (2005): use multiple tasks to leverage unlabeled data
• Lessen the cost of annotation projects (annotate fewer examples)
• Interesting theoretical topic (many papers lately)
• So much unlabeled data, how could we not want to learn from it!
Two Important Lessons
• There usually is no unqualified “best” method. All kinds of things affect this. More subtle questions than, “does A beat B”:
– What conditions lead to better performance for A vs. B?
– What kinds of errors is A more susceptible to than B?
• Nifty ideas can often be shown (sometimes years later) to have solid mathematical underpinnings.