Statistical NLP, Spring 2011
Lecture 6: POS / Phrase MT
Dan Klein – UC Berkeley
Parts-of-Speech (English)

One basic kind of linguistic structure: syntactic word classes.
Open class (lexical) words:
- Nouns – Proper (IBM, Italy); Common (cat / cats, snow)
- Verbs – Main (see, registered); Modals (can, had)
- Adjectives (yellow)
- Adverbs (slowly)
- Numbers (122,312, one)
- … more

Closed class (functional) words:
- Prepositions (to, with)
- Particles (off, up)
- Determiners (the, some)
- Conjunctions (and, or)
- Pronouns (he, its)
- … more
| Tag | Description | Examples |
| --- | --- | --- |
| CC | conjunction, coordinating | and, both, but, either, or |
| CD | numeral, cardinal | mid-1890, nine-thirty, 0.5, one |
| DT | determiner | a, all, an, every, no, that, the |
| EX | existential there | there |
| FW | foreign word | gemeinschaft, hund, ich, jeux |
| IN | preposition or conjunction, subordinating | among, whether, out, on, by, if |
| JJ | adjective or numeral, ordinal | third, ill-mannered, regrettable |
| JJR | adjective, comparative | braver, cheaper, taller |
| JJS | adjective, superlative | bravest, cheapest, tallest |
| MD | modal auxiliary | can, may, might, will, would |
| NN | noun, common, singular or mass | cabbage, thermostat, investment, subhumanity |
| NNP | noun, proper, singular | Motown, Cougar, Yvette, Liverpool |
| NNPS | noun, proper, plural | Americans, Materials, States |
| NNS | noun, common, plural | undergraduates, bric-a-brac, averages |
| POS | genitive marker | ', 's |
| PRP | pronoun, personal | hers, himself, it, we, them |
| PRP$ | pronoun, possessive | her, his, mine, my, our, ours, their, thy, your |
| RB | adverb | occasionally, maddeningly, adventurously |
| RBR | adverb, comparative | further, gloomier, heavier, less-perfectly |
| RBS | adverb, superlative | best, biggest, nearest, worst |
| RP | particle | aboard, away, back, by, on, open, through |
| TO | "to" as preposition or infinitive marker | to |
| UH | interjection | huh, howdy, uh, whammo, shucks, heck |
| VB | verb, base form | ask, bring, fire, see, take |
| VBD | verb, past tense | pleaded, swiped, registered, saw |
| VBG | verb, present participle or gerund | stirring, focusing, approaching, erasing |
| VBN | verb, past participle | dilapidated, imitated, reunified, unsettled |
| VBP | verb, present tense, not 3rd person singular | twist, appear, comprise, mold, postpone |
| VBZ | verb, present tense, 3rd person singular | bases, reconstructs, marks, uses |
| WDT | WH-determiner | that, what, whatever, which, whichever |
| WP | WH-pronoun | that, what, whatever, which, who, whom |
| WP$ | WH-pronoun, possessive | whose |
| WRB | WH-adverb | however, whenever, where, why |
Part-of-Speech Ambiguity

Words can have multiple parts of speech:

    Fed    raises  interest  rates  0.5  percent
    NNP    NNS     NN        NNS    CD   NN
    VBN    VBZ     VBP       VBZ
    VBD    VB

Two basic sources of constraint:
- Grammatical environment
- Identity of the current word

Many more possible features: suffixes, capitalization, name databases (gazetteers), etc.
Why POS Tagging?

Useful in and of itself (more than you'd think):
- Text-to-speech: record, lead
- Lemmatization: saw[v] → see, saw[n] → saw
- Quick-and-dirty NP-chunk detection: grep {JJ | NN}* {NN | NNS}

Useful as a pre-processing step for parsing:
- Less tag ambiguity means fewer parses
- However, some tag choices are better decided by parsers:

      DT  NN      IN  NN        VBD     NNS    VBD
      The average of  interbank offered rates  plummeted …   (offered: VBD vs. VBN)

      DT  NNP     NN     VBD  VBN   RP  NN    NNS
      The Georgia branch had  taken on  loan  commitments …  (on: RP vs. IN)
Classic Solution: HMMs

We want a model of sequences s and observations w:

    s0 → s1 → s2 → … → sn
          ↓    ↓         ↓
          w1   w2        wn

Assumptions:
- States are tag n-grams
- Usually a dedicated start and end state / word
- Tag/state sequence is generated by a Markov model
- Words are chosen independently, conditioned only on the tag/state

These are totally broken assumptions: why?
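The generative assumptions above can be sketched as a toy scorer. This is a minimal illustration, not a trained model: the transition/emission tables and their probabilities are invented for the example.

```python
# Sketch of the HMM scoring assumption: a tag sequence is scored by a
# Markov chain over tags times independent per-tag word emissions.
# All probability values below are made-up toy numbers.

def hmm_joint_prob(tags, words, trans, emit, start="<s>", stop="</s>"):
    """P(tags, words) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i), times stop."""
    p = 1.0
    prev = start
    for t, w in zip(tags, words):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p * trans.get((prev, stop), 0.0)

trans = {("<s>", "NNP"): 0.4, ("NNP", "VBZ"): 0.3, ("VBZ", "</s>"): 0.2}
emit = {("NNP", "Fed"): 0.01, ("VBZ", "raises"): 0.05}

score = hmm_joint_prob(["NNP", "VBZ"], ["Fed", "raises"], trans, emit)
```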
States

States encode what is relevant about the past. Transitions P(s|s') encode well-formed tag sequences.

In a bigram tagger, states = tags:

    <◊>, <t1>, <t2>, …, <tn>

In a trigram tagger, states = tag pairs:

    <◊,◊>, <◊,t1>, <t1,t2>, …, <tn-1,tn>
Estimating Transitions

Use standard smoothing methods to estimate transitions:

    P(t_i | t_{i-1}, t_{i-2}) = λ2 P̂(t_i | t_{i-1}, t_{i-2}) + λ1 P̂(t_i | t_{i-1}) + (1 − λ1 − λ2) P̂(t_i)

Can get a lot fancier (e.g. KN smoothing) or use higher orders, but in this case it doesn't buy much.

One option: encode more into the state, e.g. whether the previous word was capitalized [Brants 00].

BIG IDEA: The basic approach of state-splitting turns out to be very important in a range of tasks.
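The interpolated transition estimate is just a weighted sum of relative-frequency estimates at three orders. A minimal sketch, with toy count tables and fixed lambdas (TnT-style deleted interpolation would tune the lambdas from data):

```python
# Linearly interpolated trigram transition probability, as on the slide.
# tri/bi/uni hold relative-frequency estimates; keys and values are toy.

def interp_trans(t, t1, t2, tri, bi, uni, lam2=0.6, lam1=0.3):
    """P(t | t1, t2) where t1 = previous tag, t2 = tag before that."""
    return (lam2 * tri.get((t2, t1, t), 0.0)
            + lam1 * bi.get((t1, t), 0.0)
            + (1.0 - lam1 - lam2) * uni.get(t, 0.0))

tri = {("DT", "JJ", "NN"): 0.7}
bi = {("JJ", "NN"): 0.5}
uni = {"NN": 0.2}

p = interp_trans("NN", "JJ", "DT", tri, bi, uni)
```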
Estimating Emissions

Emissions are trickier:
- Words we've never seen before
- Words which occur with tags we've never seen them with
- One option: break out the Good-Turing smoothing
- Issue: unknown words aren't black boxes:

      343,127.23   11-year   Minteria   reintroducibly
      D+,D+.D+     D+-x+     Xx+        x+-"ly"

Basic solution: unknown word classes (affixes or shapes). [Brants 00] used a suffix trie as its emission model.
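A word-shape function in the spirit of the examples above can be a few regex substitutions. The exact shape alphabet here is an illustrative assumption (digits → `D+`, lowercase runs → `x+`, uppercase runs → `X`, with `-ly` kept literally), not the Brants (2000) model:

```python
import re

def word_shape(word):
    """Map a word to a coarse shape class for unknown-word handling."""
    if word.endswith("ly"):
        return 'x+-"ly"'
    shape = re.sub(r"[A-Z]+", "X", word)    # uppercase runs first
    shape = re.sub(r"[a-z]+", "x+", shape)  # then lowercase runs
    shape = re.sub(r"[0-9]+", "D+", shape)  # then digit runs
    return shape
```

Substitution order matters: replacing uppercase runs last would mangle the `D+` and `X` markers introduced by earlier passes.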
Disambiguation (Inference)

Problem: find the most likely (Viterbi) sequence under the model.

Given model parameters, we can score any tag sequence:

    <◊,◊>  <◊,NNP>  <NNP,VBZ>  <VBZ,NN>  <NN,NNS>  <NNS,CD>  <CD,NN>  <STOP>
           Fed      raises     interest  rates     0.5       percent  .

    P(NNP|<◊,◊>) · P(Fed|NNP) · P(VBZ|<◊,NNP>) · P(raises|VBZ) · P(NN|VBZ,NNP) · …

In principle, we're done – list all possible tag sequences, score each one, pick the best one (the Viterbi state sequence):

    NNP VBZ NN NNS CD NN   →  logP = -23
    NNP NNS NN NNS CD NN   →  logP = -29
    NNP VBZ VB NNS CD NN   →  logP = -27
Finding the Best Trajectory

Too many trajectories (state sequences) to list.

Option 1: Beam Search
- A beam is a set of partial hypotheses
- Start with just the single empty trajectory
- At each derivation step:
  - Consider all continuations of previous hypotheses
  - Discard most; keep top k, or those within a factor of the best

      <◊>  →  Fed:NNP, Fed:VBN, Fed:VBD
      Fed:NNP  →  Fed:NNP raises:NNS, Fed:NNP raises:VBZ
      Fed:VBN  →  Fed:VBN raises:NNS, Fed:VBN raises:VBZ

Beam search works OK in practice
- … but sometimes you want the optimal answer
- … and you need optimal answers to validate your beam search
- … and there's usually a better option than naïve beams
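The beam procedure above can be sketched in a few lines: expand every hypothesis by every tag, then keep only the top k. The scoring function here is a stand-in assumption; a real tagger would plug in HMM transition/emission log-scores.

```python
# Minimal beam-search tagger sketch: keep the top-k partial tag
# sequences at each word. `score(prev_tag, tag, word)` is a toy local
# log-score supplied by the caller.

def beam_search(words, tags, score, k=2):
    beam = [((), 0.0)]  # (partial tag tuple, total log-score)
    for w in words:
        candidates = []
        for seq, s in beam:
            prev = seq[-1] if seq else "<s>"
            for t in tags:
                candidates.append((seq + (t,), s + score(prev, t, w)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:k]  # discard all but the top k
    return list(beam[0][0])

# Toy local scores: prefer N after <s>, V after N, etc.
local = {("<s>", "N"): 0.0, ("<s>", "V"): -2.0,
         ("N", "V"): 0.0, ("N", "N"): -1.0,
         ("V", "N"): 0.0, ("V", "V"): -1.0}
best = beam_search(["Fed", "raises"], ["N", "V"],
                   lambda p, t, w: local[(p, t)])
```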
The State Lattice / Trellis

(Trellis diagram: states ^, N, V, J, D, $ at each position over the sentence "START Fed raises interest rates END"; edges connect states at adjacent positions, and a tag sequence is a path through the lattice.)
The Viterbi Algorithm

Dynamic program for computing the score of a best path up to position i ending in state s:

    δ_i(s) = max_{s_0 … s_{i-1}} P(s_0, …, s_{i-1}, s, w_1, …, w_i)

Iterative solution:

    δ_0(s) = 1 if s = <◊,◊>, 0 otherwise
    δ_i(s) = max_{s'} P(s|s') P(w_i|s) δ_{i-1}(s')

(Can also be computed memoized, top-down.)

Also can store a backtrace (but no one does):

    ψ_i(s) = argmax_{s'} P(s|s') P(w_i|s) δ_{i-1}(s')
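The recurrence above translates directly into code. A minimal sketch with toy deterministic tables (a real tagger would work in log space to avoid underflow):

```python
# Viterbi over an HMM: delta[s] tracks the best path score ending in
# state s; back stores per-position backpointers for recovery.

def viterbi(words, states, trans, emit, start="<s>"):
    delta = {start: 1.0}
    back = []
    for w in words:
        new_delta, ptr = {}, {}
        for s in states:
            best_prev, best = None, 0.0
            for sp, d in delta.items():
                score = d * trans.get((sp, s), 0.0) * emit.get((s, w), 0.0)
                if score > best:
                    best_prev, best = sp, score
            new_delta[s], ptr[s] = best, best_prev
        back.append(ptr)
        delta = new_delta
    # follow backpointers from the best final state
    s = max(delta, key=delta.get)
    seq = [s]
    for ptr in reversed(back[1:]):
        s = ptr[s]
        seq.append(s)
    return list(reversed(seq))

trans = {("<s>", "NNP"): 1.0, ("NNP", "VBZ"): 1.0}
emit = {("NNP", "Fed"): 1.0, ("VBZ", "raises"): 1.0}
tags = viterbi(["Fed", "raises"], ["NNP", "VBZ"], trans, emit)
```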
So How Well Does It Work?

- Choose the most common tag: 90.3% with a bad unknown-word model, 93.7% with a good one
- TnT (Brants, 2000): a carefully smoothed trigram tagger with suffix trees for emissions: 96.7% on WSJ text (state of the art is ~97.5%)

Noise in the data: there are many errors in the training and test corpora, e.g. inconsistent annotations like:

    chief/NN executive/NN officer/NN
    chief/JJ executive/NN officer/NN
    chief/JJ executive/JJ officer/NN
    chief/NN executive/JJ officer/NN

    The/DT average/NN of/IN interbank/NN offered/VBD rates/NNS plummeted/VBD …

Probably about 2% guaranteed error from noise (on this data).
Overview: Accuracies

Roadmap of (known / unknown) word accuracies:

- Most freq tag: ~90% / ~50%
- Trigram HMM: ~95% / ~55%
- TnT (HMM++): 96.2% / 86.0%
- Maxent P(t|w): 93.7% / 82.6%
- MEMM tagger: 96.9% / 86.9%
- Cyclic tagger: 97.2% / 89.0%
- Upper bound: ~98%

Most errors are on unknown words.
Common Errors

Common errors [from Toutanova & Manning 00]:

    NN/JJ    NN
    official knowledge

    VBD   RP/IN  DT   NN
    made  up     the  story

    RB        VBD/VBN  NNS
    recently  sold     shares
Corpus-Based MT

Modeling correspondences between languages.

Sentence-aligned parallel corpus:

    Yo lo haré mañana  →  I will do it tomorrow
    Hasta pronto       →  See you soon
    Hasta pronto       →  See you around

A machine translation system (a model of translation), given the novel sentence "Yo lo haré pronto", might produce:

    I will do it soon
    I will do it around
    See you tomorrow
Phrase-Based Systems

Sentence-aligned corpus → word alignments → phrase table (translation model):

    cat ||| chat ||| 0.9
    the cat ||| le chat ||| 0.8
    dog ||| chien ||| 0.8
    house ||| maison ||| 0.6
    my house ||| ma maison ||| 0.9
    language ||| langue ||| 0.9
    …

Many slides and examples from Philipp Koehn or John DeNero.
Phrase-Based Decoding

    这 7 人 中包括 来自 法国 和 俄罗斯 的 宇航 员 .
    (The seven include astronauts coming from France and Russia.)

Decoder design is important: [Koehn et al. 03]

The Pharaoh "Model" [Koehn et al, 2003]

Segmentation → Translation → Distortion

Where do we get these counts?

Phrase Weights

Phrase-Based Decoding
Monotonic Word Translation

Cost is LM × TM. It's an HMM?
- LM: P(e | e-1, e-2)
- TM: P(f | e)
- State includes the exposed English and the position in the foreign sentence

Illustrative scores and partial hypotheses:

    a ↔ to  0.8      a slap to  0.02     […, a slap, 5]   0.00001
    a ↔ by  0.1      a slap by  0.01     […, slap to, 6]  0.00000016
                                         […, slap by, 6]  0.00000001

Dynamic program loop:

    for (fPosition in 1…|f|)
      for (eContext in allEContexts)
        for (eOption in translations[fPosition])
          score = scores[fPosition-1][eContext] * LM(eContext) * TM(eOption, fWord[fPosition])
          scores[fPosition][eContext[2]+eOption] = max score
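The monotonic dynamic program can be made runnable with a simplified state (just the last English word as context). The tiny LM/TM tables and the bigram-only context are illustrative assumptions; a real decoder keeps more context and works in log space.

```python
# Toy monotonic word-translation decoder: translate each foreign word in
# order, scoring candidates by LM * TM and keeping the best hypothesis
# per English context word.

def decode_monotonic(fwords, translations, lm, tm):
    scores = {"<s>": 1.0}   # context word -> best prefix probability
    backs = {"<s>": []}     # context word -> best English prefix
    for f in fwords:
        new_scores, new_backs = {}, {}
        for ctx, p in scores.items():
            for e in translations[f]:
                q = p * lm.get((ctx, e), 1e-4) * tm.get((f, e), 0.0)
                if q > new_scores.get(e, 0.0):
                    new_scores[e] = q
                    new_backs[e] = backs[ctx] + [e]
        scores, backs = new_scores, new_backs
    return backs[max(scores, key=scores.get)]

translations = {"un": ["a", "one"], "gato": ["cat"]}
lm = {("<s>", "a"): 0.5, ("<s>", "one"): 0.1,
      ("a", "cat"): 0.3, ("one", "cat"): 0.3}
tm = {("un", "a"): 0.7, ("un", "one"): 0.2, ("gato", "cat"): 0.9}
out = decode_monotonic(["un", "gato"], translations, lm, tm)
```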
Beam Decoding

For real MT models, this kind of dynamic program is a disaster. (Why?)

Standard solution is beam search: for each position, keep track of only the best k hypotheses:

    for (fPosition in 1…|f|)
      for (eContext in bestEContexts[fPosition])
        for (eOption in translations[fPosition])
          score = scores[fPosition-1][eContext] * LM(eContext) * TM(eOption, fWord[fPosition])
          bestEContexts.maybeAdd(eContext[2]+eOption, score)

Still pretty slow… why?
Useful trick: cube pruning (Chiang 2005)

Example from David Chiang.
Phrase Translation

If monotonic, almost an HMM; technically a semi-HMM.

If distortion… now what?

    for (fPosition in 1…|f|)
      for (lastPosition < fPosition)
        for (eContext in eContexts)
          for (eOption in translations[fPosition])
            … combine hypothesis for (lastPosition ending in eContext) with eOption
Non-Monotonic Phrasal MT

Pruning: Beams + Forward Costs

Problem: easy partial analyses are cheaper.
- Solution 1: use beams per foreign subset
- Solution 2: estimate forward costs (A*-like)

The Pharaoh Decoder

Hypothesis Lattices
Better Features

Can do surprisingly well just looking at a word by itself:

- Word:            the → DT
- Lowercased word: Importantly → importantly → RB
- Prefixes:        unfathomable → un- → JJ
- Suffixes:        Surprisingly → -ly → RB
- Capitalization:  Meridian → CAP → NNP
- Word shapes:     35-year → d-x → JJ

Then build a maxent (or whatever) model to predict the tag. Maxent P(t|w): 93.7% / 82.6%
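The word-level feature templates above can be sketched as a simple extractor. The feature-name strings are invented for the sketch; any maxent toolkit would consume an equivalent representation.

```python
# Illustrative single-word feature extractor for a maxent P(t|w) tagger:
# word identity, lowercased form, affixes, and capitalization.

def word_features(word):
    feats = {"word=" + word, "lower=" + word.lower()}
    if word[0].isupper():
        feats.add("CAP")
    feats.add("prefix=" + word[:2])
    feats.add("suffix=" + word[-2:])
    return feats

f = word_features("Importantly")
```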
Why Linear Context is Useful

Lots of rich local information!

    They/PRP left/VBD as/IN soon/RB as/IN he/PRP arrived/VBD ./.
    (the first "as" should be RB; we could fix this with a feature that looked at the next word)

    Intrinsic/NNP flaws/NNS remained/VBD undetected/VBN ./.
    ("Intrinsic" should be JJ; we could fix this by linking capitalized words to their lowercase versions)

Solution: discriminative sequence models (MEMMs, CRFs)

Reality check:
- Taggers are already pretty good on WSJ journal text…
- What the world needs is taggers that work on other text!
- Though: other tasks like IE have used the same methods to good effect
Sequence-Free Tagging?

What about looking at a word and its environment, but no sequence information?

- Add in previous / next word:    the __
- Previous / next word shapes:    X __ X
- Occurrence pattern features:    [X: x X occurs]
- Crude entity detection:         __ ….. (Inc.|Co.)
- Phrasal verb in sentence?       put …… __
- Conjunctions of these things

All features except sequence: 96.6% / 86.8%
Uses lots of features: > 200K
Why isn't this the standard approach?
Feature-Rich Sequence Models

Problem: HMMs make it hard to work with arbitrary features of a sentence.

Example: named entity recognition (NER)

    Tim  Boon has signed a contract extension with Leicestershire which will keep him at Grace Road .
    PER  PER  O   O      O O        O         O    ORG            O     O    O    O   O  LOC   LOC  O

Local context:

| | Prev | Cur | Next |
| --- | --- | --- | --- |
| State | Other | ??? | ??? |
| Word | at | Grace | Road |
| Tag | IN | NNP | NNP |
| Sig | x | Xx | Xx |
MEMM Taggers

Idea: left-to-right local decisions, conditioning on previous tags and also the entire input.

Train up P(t_i | w, t_{i-1}, t_{i-2}) as a normal maxent model, then use it to score sequences.

This is referred to as an MEMM tagger [Ratnaparkhi 96]
- Beam search effective! (Why?)
- What about beam size 1?
Decoding

Decoding MEMM taggers:
- Just like decoding HMMs, but with different local scores
- Viterbi, beam search, posterior decoding

Maximum Entropy II

Remember: the maximum entropy objective.

Problem: lots of features allow a perfect fit to the training set.
Solution: regularization (compare to smoothing).
Derivative for Maximum Entropy

The derivative for each feature n balances the total count of feature n in correct candidates against the expected count of feature n in predicted candidates, minus a regularization term: big weights are bad.
Example: NER Regularization

Local context:

| | Prev | Cur | Next |
| --- | --- | --- | --- |
| State | Other | ??? | ??? |
| Word | at | Grace | Road |
| Tag | IN | NNP | NNP |
| Sig | x | Xx | Xx |

Feature weights:

| Feature Type | Feature | PERS | LOC |
| --- | --- | --- | --- |
| Previous word | at | -0.73 | 0.94 |
| Current word | Grace | 0.03 | 0.00 |
| Beginning bigram | <G | 0.45 | -0.04 |
| Current POS tag | NNP | 0.47 | 0.45 |
| Prev and cur tags | IN NNP | -0.10 | 0.14 |
| Previous state | Other | -0.70 | -0.92 |
| Current signature | Xx | 0.80 | 0.46 |
| Prev state, cur sig | O-Xx | 0.68 | 0.37 |
| Prev-cur-next sig | x-Xx-Xx | -0.69 | 0.37 |
| P. state - p-cur sig | O-x-Xx | -0.20 | 0.82 |
| … | | | |
| Total: | | -0.58 | 2.68 |

Because of the regularization term, the more common prefixes have larger weights even though entire-word features are more specific.
Perceptron Taggers

Linear models:
- … that decompose along the sequence
- … allow us to predict with the Viterbi algorithm
- … which means we can train with the perceptron algorithm (or related updates, like MIRA) [Collins 01]
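The perceptron update can be sketched on a toy scale. For simplicity the features here decompose per position, so the argmax is position-wise; in the full structured perceptron of [Collins 01] the argmax is a Viterbi decode. Feature templates are invented for the example.

```python
from collections import defaultdict

# Toy perceptron tagger: predict with the current weights, then move
# weights toward the gold features and away from the predicted ones.

def features(word, tag):
    return [("w,t", word, tag), ("suffix,t", word[-2:], tag)]

def predict(words, tags, w):
    # Per-position argmax stands in for a Viterbi decode here.
    return [max(tags, key=lambda t: sum(w[f] for f in features(x, t)))
            for x in words]

def perceptron_update(words, gold, tags, w):
    guess = predict(words, tags, w)
    for x, g, p in zip(words, gold, guess):
        if g != p:
            for f in features(x, g):
                w[f] += 1.0
            for f in features(x, p):
                w[f] -= 1.0

w = defaultdict(float)
for _ in range(3):
    perceptron_update(["Fed", "raises"], ["NNP", "VBZ"], ["NNP", "VBZ"], w)
```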
Conditional Random Fields

Make a maxent model over entire taggings:
- MEMM: a chain of locally normalized per-decision models
- CRF: one globally normalized model over the whole sequence
CRFs

Like any maxent model, the derivative is the difference between observed and expected feature counts. So all we need is to be able to compute the expectation of each feature under the model (for example, the number of times the label pair DT-NN occurs, or the number of times NN-interest occurs).

Critical quantity: counts of posterior marginals.

Computing Posterior Marginals

How many (expected) times is word w tagged with s? How to compute that marginal?
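These marginals come from the forward-backward algorithm: sum the weight of all lattice paths through each state and normalize. A minimal sketch over an HMM-style toy lattice (a CRF would use potentials instead of probabilities, but the recursions are the same):

```python
# Forward-backward posterior marginals: P(tag at position i = s | words).
# alpha sums path weight into each state; beta sums weight out of it.

def forward_backward(words, states, trans, emit, start="<s>"):
    n = len(words)
    alpha = [dict() for _ in range(n)]
    beta = [dict() for _ in range(n)]
    for s in states:  # forward pass
        alpha[0][s] = trans.get((start, s), 0.0) * emit.get((s, words[0]), 0.0)
    for i in range(1, n):
        for s in states:
            alpha[i][s] = emit.get((s, words[i]), 0.0) * sum(
                alpha[i - 1][sp] * trans.get((sp, s), 0.0) for sp in states)
    for s in states:  # backward pass
        beta[n - 1][s] = 1.0
    for i in range(n - 2, -1, -1):
        for s in states:
            beta[i][s] = sum(
                trans.get((s, sn), 0.0) * emit.get((sn, words[i + 1]), 0.0)
                * beta[i + 1][sn] for sn in states)
    z = sum(alpha[n - 1][s] for s in states)  # total lattice weight
    return [{s: alpha[i][s] * beta[i][s] / z for s in states}
            for i in range(n)]

trans = {("<s>", "N"): 1.0, ("N", "V"): 1.0}
emit = {("N", "Fed"): 1.0, ("V", "raises"): 1.0}
marg = forward_backward(["Fed", "raises"], ["N", "V"], trans, emit)
```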
(State lattice diagram over "START Fed raises interest rates END", as before: the marginal for a state sums the weight of all paths through it.)
TBL Tagger

[Brill 95] presents a transformation-based tagger:
- Label the training set with the most frequent tags:

      The/DT can/MD was/VBD rusted/VBD ./.

- Add transformation rules which reduce training mistakes:

      MD → NN : DT __
      VBD → VBN : VBD __ .

- Stop when no transformations do sufficient good

Does this remind anyone of anything?

Probably the most widely used tagger (esp. outside NLP)
… but definitely not the most accurate: 96.6% / 82.0%
TBL Tagger II

What gets learned? [from Brill 95]
EngCG Tagger

English constraint grammar tagger [Tapanainen and Voutilainen 94] – something else you should know about:
- Hand-written and knowledge driven
- "Don't guess if you know" (a general point about modeling more structure!)
- Tag set doesn't make all of the hard distinctions of the standard tag set (e.g. JJ/NN)
- They get stellar accuracies: 99% on their tag set

Linguistic representation matters…
… but it's easier to win when you make up the rules
Domain Effects

Accuracies degrade outside of the training domain:
- Up to triple the error rate
- You usually make the most errors on the things you care about in the domain (e.g. protein names)

Open questions:
- How to effectively exploit unlabeled data from a new domain (what could we gain?)
- How to best incorporate domain lexica in a principled way (e.g. UMLS specialist lexicon, ontologies)
Unsupervised Tagging?

AKA part-of-speech induction. Task: raw sentences in, tagged sentences out.

Obvious thing to do:
- Start with a (mostly) uniform HMM
- Run EM
- Inspect results
EM for HMMs: Process

Alternate between recomputing distributions over hidden variables (the tags) and re-estimating parameters.

Crucial step: we want to tally up how many (fractional) counts of each kind of transition and emission we have under the current parameters.

These are the same quantities we needed to train a CRF!
EM for HMMs: Quantities

Total path values (correspond to probabilities here): the forward scores α and backward scores β.

EM for HMMs: Process

From these quantities, we can compute expected transitions and emissions.
Merialdo: Setup

Some (discouraging) experiments [Merialdo 94].

Setup:
- You know the set of allowable tags for each word
- Fix k training examples to their true labels
  - Learn P(w|t) on these examples
  - Learn P(t|t-1,t-2) on these examples
- On n examples, re-estimate with EM

Note: we know the allowed tags but not their frequencies.

Merialdo: Results
Distributional Clustering

Words are clustered by the contexts they occur in:

    president:  the __ of,  the __ said
    governor:   the __ of,  the __ appointed
    said:       sources __,  president __ that
    reported:   sources __

    the president said that the downturn was over

Resulting clusters: {president, governor}, {said, reported}, {the, a}

[Finch and Chater 92, Schuetze 93, many others]
Distributional Clustering

Three main variants on the same idea:
- Pairwise similarities and heuristic clustering, e.g. [Finch and Chater 92]; produces dendrograms
- Vector space methods, e.g. [Schuetze 93]; models of ambiguity
- Probabilistic methods; various formulations, e.g. [Lee and Pereira 99]

Nearest Neighbors

Dendrograms
A Probabilistic Version?

    P(S, C) = ∏_i P(c_i | c_{i-1}) P(w_i | c_i)

    P(S, C) = ∏_i P(c_i) P(w_i | c_i) P(w_{i-1}, w_{i+1} | c_i)

    the  president  said  that  the  downturn  was  over
    c1   c2         c3    c4    c5   c6        c7   c8
What Else?

Various newer ideas:
- Context distributional clustering [Clark 00]
- Morphology-driven models [Clark 03]
- Contrastive estimation [Smith and Eisner 05]
- Feature-rich induction [Haghighi and Klein 06]

Also:
- What about ambiguous words?
- Using wider context signatures has been used for learning synonyms (what's wrong with this approach?)
- Can extend these ideas for grammar induction (later)