outline
Outline
Applications:
• Spelling correction
Formal Representation:
• Weighted FSTs
Algorithms:
• Bayesian Inference (Noisy channel model)
• Methods to determine weights
– Hand-coded
– Corpus-based estimation
• Dynamic Programming
– Shortest path
Detecting and Correcting Spelling Errors
Sources of lexical/spelling errors
• Speech: lexical access and recognition errors (more later)
• Text: typing and cognitive
• OCR: recognition errors
Applications:
• Spell checking
• Handwriting recognition of zip codes, signatures, Graffiti
Issues:
• Correct non-words in isolation (dg for dog, why not dig?)
• Correcting non-words could lead to valid words
– Homophone substitution: “parents love there children”; “Lets order a desert after dinner”
– Correcting words in context
Patterns of Error
Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based:
• Insertion: catt
• Deletion: ct
• Substitution: car
• Transposition: cta
Error classification II: cognitive
• People don’t know how to spell (nucular/nuclear; potatoe/potato)
• Homonymous errors (their/there)
Probability: Refresher
Population: 10 Princeton students
– 4 vegetarians
– 3 CS majors
– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
– That a rcs is a CS major? p(c) = 0.3
– That a rcs is a vegetarian and CS major? p(c,v) = 0.2
– That a vegetarian is a CS major? p(c|v) = 0.5
– That a CS major is a vegetarian? p(v|c) = 0.66
– That a non-CS major is a vegetarian? p(v|c’) = ??
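The conditional probabilities in this refresher can be checked by counting. The sketch below assumes the counts implied by the slide (10 students, 4 vegetarians, 3 CS majors, and 2 students who are both, since p(c,v) = 0.2); the variable names are illustrative only.

```python
from fractions import Fraction

# Counts implied by the slide: 10 students, 4 vegetarians, 3 CS majors,
# and 2 who are both (p(c,v) = 0.2). Exact fractions avoid rounding noise.
N = 10
n_veg, n_cs, n_both = 4, 3, 2

p_v = Fraction(n_veg, N)    # p(v)
p_c = Fraction(n_cs, N)     # p(c)
p_cv = Fraction(n_both, N)  # p(c,v)

p_c_given_v = p_cv / p_v                    # p(c|v) = p(c,v) / p(v)
p_v_given_c = p_cv / p_c                    # p(v|c) = p(c,v) / p(c)
p_v_given_not_c = (p_v - p_cv) / (1 - p_c)  # vegetarians among non-CS majors

print(p_c_given_v, p_v_given_c, p_v_given_not_c)
```

Under these counts, the open question p(v|c’) works out to 2 vegetarians among the 7 non-CS majors.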
Bayes Rule and Noisy Channel model
• We know the joint probabilities
– p(c,v) = p(c) p(v|c) (chain rule)
– p(v,c) = p(c,v) = p(v) p(c|v)
• So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c).

$$p(c \mid v) = \frac{p(v \mid c)\, p(c)}{p(v)}$$

• “Noisy channel” metaphor: channel corrupts the input; recover the original.
– think cell-phone conversations!!
– Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O).

$$w^* = \arg\max_{w \in V} P(w \mid O)$$

$$w^* = \arg\max_{w \in V} P(O \mid w)\, P(w)$$

Here P(w) is the source model and P(O|w) is the channel model.
How do we use this model to correct spelling errors?
• Simplifying assumptions
– We only have to correct non-word errors
– Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)
• Generate and Test Method: (Kernighan et al 1990)
– Generate a word using one of the substitution, deletion, insertion, or transposition operations
– Test if the resulting word is in the dictionary.
• Example:
Observation   Correct   Correct letter   Error letter   Position   Type of error
caat          cat       -                a              2          insertion
caat          carat     r                -              3          deletion
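The generate-and-test method can be sketched in a few lines. This is a minimal illustration, not the Kernighan et al. implementation. The function names and the four-word dictionary are assumptions made for the example. Note that undoing a typist's insertion means deleting a letter from the observed string, and vice versa.

```python
import string

def one_edit_candidates(observed, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution, or transposition
    away from `observed` (the observed string itself is excluded)."""
    splits = [(observed[:i], observed[i:]) for i in range(len(observed) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1}
    substitutes = {left + ch + right[1:]
                   for left, right in splits if right for ch in alphabet}
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    return (deletes | transposes | substitutes | inserts) - {observed}

def corrections(observed, dictionary):
    """Generate every one-edit candidate, then keep those found in the dictionary."""
    return sorted(w for w in one_edit_candidates(observed) if w in dictionary)

# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"cat", "carat", "dog", "dig"}

print(corrections("caat", TOY_DICTIONARY))  # ['carat', 'cat']
print(corrections("dg", TOY_DICTIONARY))    # ['dig', 'dog']
```

Both examples return more than one valid word, which is why a ranking function is needed next.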
How do we decide which correction is most likely?
Validate the generated word in a dictionary.
• But there may be multiple valid words, how to rank them?
• Rank them based on a scoring function
– P(w | typo) = P(typo | w) * P(w)
– Note there could be other scoring functions
• Propose n-best solutions
Estimate the likelihood P(typo|w) and the prior P(w)
• Count events from a corpus to estimate these probabilities
• Labeled versus unlabeled corpus
• For spelling correction, what do we need?
– Word occurrence information (unlabeled corpus)
– A corpus of labeled spelling errors
– Approximate word replacement by local letter replacement probabilities: confusion matrix on letters
Cat vs Carat
Estimating the prior: Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus
• cat occurs 6500 times, so p(cat) = .00013
• carat occurs 3000 times, so p(carat) = .00006
Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’ in a corrections corpus of 50K corrections (p(typo|word))
• suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion occurs 7500 times (p(-r) = .15)
Scoring function: p(word|typo) = p(typo|word) * p(word)
• p(cat|caat) = p(+a) * p(cat) = .1 * .00013 = .000013
• p(carat|caat) = p(-r) * p(carat) = .15 * .00006 = .000009
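The arithmetic of this example can be reproduced directly from the counts quoted on the slide. The sketch below only restates those numbers; the dictionary layout and function name are illustrative assumptions.

```python
CORPUS_SIZE = 50_000_000      # words in the news corpus (from the slide)
CORRECTIONS_SIZE = 50_000     # entries in the corrections corpus (from the slide)

word_counts = {"cat": 6500, "carat": 3000}
# Count of the single edit that turns each word into the typo "caat":
# 'a' inserted after 'a' (cat), 'r' deleted after 'a' (carat).
edit_counts = {"cat": 5000, "carat": 7500}

def score(word):
    """p(typo|word) * p(word), both estimated as relative frequencies."""
    prior = word_counts[word] / CORPUS_SIZE
    likelihood = edit_counts[word] / CORRECTIONS_SIZE
    return likelihood * prior

scores = {word: score(word) for word in word_counts}
best = max(scores, key=scores.get)
print(scores, best)
```

With these counts the higher prior of cat outweighs the higher likelihood of the ‘r’ deletion, so cat is ranked first.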
Encoding One-Error Correction as WFSTs
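Let Σ = {c,a,r,t}; One-edit model:
[Figure: one-edit transducer. Identity arcs c:c, a:a, r:r, t:t; insertion arcs (Ins) ε:c, ε:a, ε:r, ε:t; deletion arcs (Del) c:ε, a:ε, r:ε, t:ε; substitution arcs (Sub) c:a, c:r, c:t, a:c, a:t, …]
Dictionary model:
[Figure: dictionary acceptor with arcs labeled c, a, r, t]
One-Error spelling correction:
• Input ● Edit ● Dictionary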
Issues
What if there are no instances of carat in corpus?
• Smoothing algorithms
Estimate of P(typo|word) may not be accurate
• Training probabilities on typo/word pairs
What if there is more than one error per word?
Minimum Edit Distance
How can we measure how different one word is from another word?
• How many operations will it take to transform one word into another?
caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
• Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1)
• Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent
Computing Levenshtein Distance
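$$d[i,j] = \min \begin{cases} d[i-1,j] + \mathrm{del}(s_i) \\ d[i-1,j-1] + \mathrm{sub}(s_i,t_j) \\ d[i,j-1] + \mathrm{ins}(t_j) \end{cases}$$

$$\mathrm{Lev}(s,t) = d[\,|s|-1,\ |t|-1\,]$$

(with the characters of s and t indexed from 0)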
• Dynamic Programming algorithm
– Solution for a problem is a function of the solutions of subproblems
– d[i,j] contains the distance up to s_i and t_j
– d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion and substitution operations.
– The optimal edit operations are recovered by storing back-pointers.
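The dynamic program above can be sketched as follows. This is a minimal version that returns only the distance and omits back-pointers. The function name and the cost parameters are assumptions for illustration. The table carries one extra row and column for the empty prefix, so d[i][j] here is the cost of turning the first i characters of s into the first j characters of t.

```python
def edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of insertions, deletions and substitutions turning s into t."""
    n, m = len(s), len(t)
    # d[i][j]: cost of turning the first i characters of s into the first j of t.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost   # delete everything from s
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost   # insert everything from t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = s[i - 1] == t[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                        # deletion
                d[i][j - 1] + ins_cost,                        # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost),   # match or substitution
            )
    return d[n][m]

print(edit_distance("caat", "cat"))                          # 1
print(edit_distance("intention", "execution"))               # 5
print(edit_distance("intention", "execution", sub_cost=2))   # 8
```

Passing sub_cost=2 gives the variant in which a substitution costs as much as a deletion plus an insertion.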
Edit Distance Matrix
NB: errors
Cost=1 for insertions and deletions; Cost=2 for substitutions
Recompute the matrix: insertions=deletions=substitutions=1
Levenshtein Distance with WFSTs
Let Σ = {c,a,r,t}; Edit model:
The two sentences to be compared are encoded as FSTs.
Levenshtein distance between two sentences:
• Dist(s1,s2) = s1 ● Edit ● s2
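[Figure: edit transducer. Identity arcs c:c, a:a, r:r, t:t; insertion arcs (Ins) ε:c, ε:a, ε:r, ε:t; deletion arcs (Del) c:ε, a:ε, r:ε, t:ε; substitution arcs (Sub) c:a, c:r, c:t, a:c, a:t, …]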
Spelling Correction with WFSTs
Dictionary: FST representation of words
Isolated word spelling correction:
• AllCorrections(w) = w ● Edit ● Dictionary
• BestCorrection(w) = Bestpath (w ● Edit ● Dictionary)
Spelling correction in context: “parents love there children”
• S = w1, w2, … wn
• Spelling correction of wi
• Generate possible edits for wi
• Pick the edit that fits best in context
• Use an n-gram language model (LM) to rank the alternatives.
• “love there” vs “love their”; “there children” vs “their children”
• SentenceCorrection (S) = F(S) ● Edit ● LM
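Ranking alternatives in context with a bigram language model might look like the sketch below. It does not use FST composition. The bigram probabilities are invented purely for illustration (they do not come from any corpus), and the function names and the floor value for unseen bigrams are likewise assumptions.

```python
import math

# Hypothetical bigram probabilities, invented for this example only.
TOY_BIGRAM_P = {
    ("love", "their"): 0.010,
    ("love", "there"): 0.001,
    ("their", "children"): 0.020,
    ("there", "children"): 0.0001,
}
UNSEEN_P = 1e-6  # crude floor for bigrams missing from the toy table

def sentence_logprob(words):
    """Sum of log bigram probabilities over adjacent word pairs."""
    return sum(math.log(TOY_BIGRAM_P.get(pair, UNSEEN_P))
               for pair in zip(words, words[1:]))

def best_in_context(words, position, alternatives):
    """Return the alternative for words[position] that gives the most probable sentence."""
    def with_alt(alt):
        return words[:position] + [alt] + words[position + 1:]
    return max(alternatives, key=lambda alt: sentence_logprob(with_alt(alt)))

sentence = "parents love there children".split()
print(best_in_context(sentence, 2, ["there", "their"]))
```

Only the bigrams touching the edited word differ between the alternatives, so the rest of the sentence contributes the same score to each.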
• Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
Can humans understand ‘what is meant’ as opposed to ‘what is said/written’?
How?
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/
Summary
We can apply probabilistic modeling to NL problems like spell-checking
• Noisy channel model, Bayesian method
• Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
• e.g. Minimum Edit Distance algorithm
A number of Speech and Language tasks can be cast in this framework.
• Generate alternatives using a generator
• Select best/rank the alternatives using a model
• If the generator and the model are encodable as FSTs
– Decoding becomes composition followed by search for the best path.
Word Classes and Tagging
Words can be grouped into classes based on a number of criteria.
• Application independent criterion
– Syntactic class (Nouns, Verbs, Adjectives…)
– Proper names (People names, country names…)
– Dates, currencies
• Application specific criterion
– Product names (Ajax, Slurpee, Lexmark 3100)
– Service names (7-cents plan, GoldPass)
Tagging: Categorizing words of a sentence into one of the classes.
Syntactic Classes in English: Open Class Words
Nouns:
• Defined semantically: words for people, places, things
• Defined syntactically: words that take determiners
• Count nouns: nouns that can be counted
– One book, two computers, hundred men
• Mass nouns: nouns that represent homogeneous groups, can occur without articles.
– snow, salt, milk, water, hair
• Proper nouns; common nouns
Verbs: words for actions and processes
• Hit, love, run, fly, differ, go
Adjectives: words for describing qualities and properties (modifiers) of objects
• White, black, old, young, good, bad
Adverbs: words for describing modifiers of actions
• Unfortunately, John walked home extremely slowly yesterday
• Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)
Syntactic Classes in English: Closed Class Words
Closed Class words:
• fixed set for a language
• Typically high frequency words
Prepositions: relational words for describing relations among objects and events
• In, on, before, by
• Particles: looked up, throw out
Articles/Determiners: definite versus indefinite
• Indefinite: a, an
• Definite: the
Conjunctions: used to join two phrases, clauses, sentences.
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: that, since, because
Pronouns: shorthand to refer to objects and events.
• Personal pronouns: he, she, it, they, us
• Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s
• Wh-pronouns: whose, what, who, whom, whomever
Auxiliary verbs: used to mark tense, aspect, polarity, mood of an action
• Tense: past, present, future
• Aspect: completed or on-going
• Polarity: negation
• Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, do, have, may, might)
• Copula: “be” connects a subject to a predicate (John is a teacher)
Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry), greetings (hello, goodbye).
Tagset
Tagset: set of tags to use; depends on the application.
• Basic tags; tags with some morphology
• Composition of a number of subtags
– Agglutinative languages
Popular tagsets for English
• Penn Treebank Tagset: 45 tags
• CLAWS tagset: 61 tags
• C7 tagset: 146 tags
How do we decide how many tags to use?
• Application utility
• Ease of disambiguation
• Annotation consistency
– “IN” tag in Penn Treebank tagset covers both subordinating conjunctions and prepositions
– “TO” tag represents preposition “to” and infinitival marker “to read”
Supertags: fold in syntactic information into tagset
• of the order of 1000 tags
Tagging: Disambiguating Words
Three different models
• ENGTWOL model (Karlsson et al. 1995)
• Transformation-based model (Brill 1995)
• Hidden Markov Model tagger
ENGTWOL tagger
• Constraint-based tagger
• 1,100 hand-written constraints to rule out invalid combinations of tags.
– Use of probabilistic constraints and syntactic information
Transformation-based model
• Start with the most likely assignment
• Make note of the context when the most likely assignment is wrong.
• Induce a transformation rule that corrects the most likely assignment to the correct tag in that context.
• Rules can be seen as α → β | δ – γ
• Compilable into an FST
Again, the Noisy Channel Model
[Figure: Source → Noisy Channel → Decoder]
Input to channel: Part-of-speech sequence T
• Output from channel: a word sequence W
• Decoding task: find

$$T' = \arg\max_{T \in V} P(T \mid W)$$

• Using Bayes Rule

$$T' = \arg\max_{T \in V} \frac{P(W \mid T)\, P(T)}{P(W)}$$

• And since P(W) doesn’t change for any hypothetical T’

$$T' = \arg\max_{T \in V} P(W \mid T)\, P(T)$$

• P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability
Stochastic Tagging: Markov Assumption
• The tagging model is approximated using Markov assumptions.

$$T' = \arg\max_{T \in V} P(T)\, P(W \mid T)$$

– Markov (first-order) assumption:

$$P(T) = \prod_i P(t_i \mid t_{i-1})$$

– Independence assumption:

$$P(W \mid T) = \prod_i P(w_i \mid t_i)$$

– Thus:

$$T' = \arg\max_{T \in V} \prod_i P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

• The probability distributions are estimated from an annotated corpus.
– Maximum Likelihood Estimate
• P(w|t) = count(w,t)/count(t)
• P(ti|ti-1) = count(ti, ti-1)/count(ti-1)
• Don’t forget to smooth the counts!!
– There are other means of estimating these probabilities.
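The count-based estimates can be sketched as below. The three-sentence tagged corpus is hypothetical and exists only to make the counts concrete. Add-k smoothing of the transition counts is one possible choice among many, not necessarily the one intended on the slide, and the BOS/EOS pseudo-tags and function names are assumptions.

```python
from collections import Counter

# Hypothetical hand-tagged corpus, invented for illustration only.
TAGGED = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS")],
]

emit = Counter()       # count(w, t)
trans = Counter()      # count(t_prev, t)
tag_count = Counter()  # count(t), including the BOS pseudo-tag
for sentence in TAGGED:
    prev = "BOS"
    tag_count["BOS"] += 1
    for word, tag in sentence:
        emit[(word, tag)] += 1
        trans[(prev, tag)] += 1
        tag_count[tag] += 1
        prev = tag
    trans[(prev, "EOS")] += 1  # close each sentence with an EOS transition

NEXT_TAGS = sorted({t for (_, t) in trans})  # every tag that can follow another

def p_emit(word, tag):
    """Maximum likelihood estimate: count(w, t) / count(t)."""
    return emit[(word, tag)] / tag_count[tag]

def p_trans(tag, prev, k=1.0):
    """Add-k smoothed estimate of P(tag | prev); k=0 gives the plain MLE."""
    return (trans[(prev, tag)] + k) / (tag_count[prev] + k * len(NEXT_TAGS))

print(p_emit("dog", "NN"), p_trans("NN", "DT", k=0), p_trans("NN", "DT"))
```

Smoothing moves some probability mass to unseen transitions such as DT followed by VBZ, which would otherwise be scored as impossible.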
Best Path Search
Search for the best path pervades many Speech and NLP problems.
• ASR: best path through a composition of acoustic, pronunciation and language models
• Tagging: best path through a composition of lexicon and contextual model
• Edit distance: best path through a search space set up by insertion, deletion and substitution operations.
In general:
• Decisions/operations create a weighted search space
• Search for the best sequence of decisions
Dynamic programming solution
• Sometimes only the score is relevant.
• Most often the path (sequence of states; derivation) is relevant.
Multi-stage decision problems
[Figure: tag lattice from BOS to EOS for the sentence “The dog runs .”]
Words:           The    dog       runs        .
Candidate tags:  DT     NN, VB    NNS, VBZ    •
Emit probabilities P(w|t):
P(the|DT) = 0.999
P(dog|NN) = 0.99
P(dog|VB) = 0.01
P(runs|NNS) = 0.63
P(runs|VBZ) = 0.37
P(•|•) = 0.999
Contextual probabilities P(t|t-1):
P(DT|BOS) = 1
P(NN|DT) = 0.9
P(VB|DT) = 0.1
P(NNS|NN) = 0.3
P(VBZ|NN) = 0.7
P(NNS|VB) = 0.7
P(VBZ|VB) = 0.3
P(•|NNS) = 0.3
P(•|VBZ) = 0.7
P(EOS|•) = 1
Multi-stage decision problems
Find the state sequence through this space that maximizes P(w|t)*P(t|t-1)
cost(BOS, EOS) = 1*cost(DT, EOS)
cost(DT,EOS) = max{P(the|DT)*P(NN|DT)*cost(NN,EOS), P(the|DT)*P(VB|DT)*cost(VB,EOS)}
[Figure: tag lattice from BOS to EOS for “The dog runs .”, with candidate tags DT; NN, VB; NNS, VBZ; •]
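The recursion cost(state, EOS) can be sketched with memoization, using the probabilities from the lattice on the previous slide. The tag name PERIOD stands in for the “•” tag, and the table layout and function names are assumptions for this illustration.

```python
from functools import lru_cache

WORDS = ["the", "dog", "runs", "."]
CANDIDATES = [["DT"], ["NN", "VB"], ["NNS", "VBZ"], ["PERIOD"]]

# Probabilities copied from the lattice slide; PERIOD stands for the "•" tag.
EMIT = {("the", "DT"): 0.999, ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
        ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37, (".", "PERIOD"): 0.999}
TRANS = {("BOS", "DT"): 1.0, ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
         ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
         ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
         ("NNS", "PERIOD"): 0.3, ("VBZ", "PERIOD"): 0.7, ("PERIOD", "EOS"): 1.0}

@lru_cache(maxsize=None)
def cost(i, tag):
    """Best score from (position i, tag) to EOS, with the tag path achieving it."""
    emit = EMIT[(WORDS[i], tag)]
    if i == len(WORDS) - 1:
        return emit * TRANS[(tag, "EOS")], (tag,)
    best_score, best_path = max(
        (TRANS[(tag, nxt)] * cost(i + 1, nxt)[0], cost(i + 1, nxt)[1])
        for nxt in CANDIDATES[i + 1]
    )
    return emit * best_score, (tag,) + best_path

def best_tagging():
    """cost(BOS, EOS): best transition out of BOS times the best remaining cost."""
    score, path = max(
        (TRANS[("BOS", tag)] * cost(0, tag)[0], cost(0, tag)[1])
        for tag in CANDIDATES[0]
    )
    return score, list(path)

print(best_tagging())
```

Because each cost(i, tag) is cached, every lattice node is evaluated once, which is what makes this a dynamic program and not an enumeration of all paths.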
Two ways of reasoning
Forward approach (Backward reasoning)
• Compute the best way to get from a state to the goal state.
Backward approach (Forward reasoning)
• Compute the best way from the source state to get to a state.
A combination of these two approaches is used in unsupervised training of HMMs.
• Forward-backward algorithm (Appendix D)