outline

Outline
Applications:
• Spelling correction
Formal Representation:
• Weighted FSTs
Algorithms:
• Bayesian Inference (Noisy channel model)
• Methods to determine weights
  – Hand-coded
  – Corpus-based estimation
• Dynamic Programming
  – Shortest path

Detecting and Correcting Spelling Errors

Sources of lexical/spelling errors
• Speech: lexical access and recognition errors (more later)
• Text: typing and cognitive
• OCR: recognition errors
Applications:
• Spell checking
• Handwriting recognition of zip codes, signatures, Graffiti
Issues:
• Correct non-words in isolation (dg for dog, why not dig?)
• Correcting non-words could lead to valid words
  – Homophone substitution: “parents love there children”; “Lets order a desert after dinner”
  – Correcting words in context

Patterns of Error

Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based
• Insertion: catt
• Deletion: ct
• Substitution: car
• Transposition: cta
Error classification II: cognitive
• People don’t know how to spell (nucular/nuclear; potatoe/potato)
• Homonymous errors (their/there)

http://home.earthlink.net/~dcrehr/whyqwert.html

Probability: Refresher
Population: 10 Princeton students
  – 4 vegetarians
  – 3 CS majors
• What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4
• That a rcs is a CS major? p(c) = 0.3
• That a rcs is a vegetarian and CS major? p(c,v) = 0.2
• That a vegetarian is a CS major? p(c|v) = 0.5
• That a CS major is a vegetarian? p(v|c) = 0.67
• That a non-CS major is a vegetarian? p(v|c’) = ??
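The refresher values can be checked by counting. The sketch below assumes that 2 of the 10 students are both vegetarian and CS majors, which is what p(c,v) = 0.2 implies. The variable names are illustrative only. It also works out the question left open on the slide.

```python
from fractions import Fraction

N = 10        # population size (from the slide)
n_veg = 4     # vegetarians
n_cs = 3      # CS majors
n_both = 2    # vegetarian CS majors, implied by p(c,v) = 0.2

p_v = Fraction(n_veg, N)
p_c = Fraction(n_cs, N)
p_cv = Fraction(n_both, N)

# Conditional probability = joint probability / probability of the condition
p_c_given_v = p_cv / p_v
p_v_given_c = p_cv / p_c
# Vegetarians who are not CS majors, out of all non-CS majors
p_v_given_not_c = (p_v - p_cv) / (1 - p_c)

print(p_c_given_v, p_v_given_c, p_v_given_not_c)  # prints: 1/2 2/3 2/7
```

So p(v|c’) = 2/7, roughly 0.29: two of the seven non-CS majors are vegetarians.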

Bayes Rule and Noisy Channel model
• We know the joint probabilities
  – p(c,v) = p(c) p(v|c) (chain rule)
  – p(v,c) = p(c,v) = p(v) p(c|v)
• So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c):

$$p(c \mid v) = \frac{p(v \mid c)\, p(c)}{p(v)}$$

• “Noisy channel” metaphor: the channel corrupts the input; the task is to recover the original.
  – Think cell-phone conversations!
  – Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O).

$$w^* = \arg\max_{w \in V} P(w \mid O)$$

Applying Bayes rule and dropping P(O), which is the same for every candidate w:

$$w^* = \arg\max_{w \in V} P(O \mid w)\, P(w)$$

where P(O|w) is the channel model and P(w) is the source model.

How do we use this model to correct spelling errors?

• Simplifying assumptions
  – We only have to correct non-word errors
  – Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)
• Generate and Test Method: (Kernighan et al 1990)
  – Generate a word using one of the substitution, deletion, insertion or transposition operations
  – Test if the resulting word is in the dictionary.
• Example:

Observation   Correct   Correct letter   Error letter   Position   Type of error
caat          cat       -                a              2          insertion
caat          carat     r                -              3          deletion
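The generate-and-test step can be sketched as follows. This is a minimal illustration, not the Kernighan et al. implementation: the function names and the toy dictionary are assumptions made for the example. Note that the code works from the observation back to the word, so it applies the inverse of the typist's error (an insertion error is undone by deleting a letter).

```python
import string

def one_edit_candidates(observed, alphabet=string.ascii_lowercase):
    """All strings exactly one insertion, deletion, substitution or
    transposition away from `observed`."""
    splits = [(observed[:i], observed[i:]) for i in range(len(observed) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    inserts = {a + ch + b for a, b in splits for ch in alphabet}
    # Some operations reproduce the input (e.g. swapping "aa"); drop it.
    return (deletes | transposes | substitutions | inserts) - {observed}

def generate_and_test(observed, dictionary):
    """Keep only the generated strings that are dictionary words."""
    return sorted(one_edit_candidates(observed) & dictionary)

# Hypothetical toy dictionary, for illustration only
toy_dictionary = {"cat", "carat", "cart", "coat", "dog"}
print(generate_and_test("caat", toy_dictionary))
# prints: ['carat', 'cart', 'cat', 'coat']
```

With this toy dictionary the test step leaves four valid candidates, including the two from the table, so a ranking step is still needed.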

How do we decide which correction is most likely?

Validate the generated word in a dictionary.
• But there may be multiple valid words; how do we rank them?
• Rank them based on a scoring function
  – P(w | typo) ∝ P(typo | w) * P(w)
  – Note there could be other scoring functions
• Propose n-best solutions
Estimate the likelihood P(typo|w) and the prior P(w)
• Count events from a corpus to estimate these probabilities
• Labeled versus unlabeled corpus
• For spelling correction, what do we need?
  – Word occurrence information (unlabeled corpus)
  – A corpus of labeled spelling errors
  – Approximate word replacement by local letter replacement probabilities: confusion matrix on letters

Cat vs Carat

Estimating the prior: Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus
• cat occurs 6500 times, so p(cat) = .00013
• carat occurs 3000 times, so p(carat) = .00006
Estimating the likelihood: Now we need to find out whether inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’ in a corrections corpus of 50K corrections (p(typo|word))
• suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion occurs 7500 times (p(-r) = .15)
Scoring function: p(word|typo) ∝ p(typo|word) * p(word)
• p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013
• p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009
So cat is ranked above carat as the correction for caat.
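The cat/carat ranking can be reproduced from the counts on the slide. The sketch below uses only those counts; the dictionary layout and the function name `score` are assumptions made for illustration.

```python
CORPUS_SIZE = 50_000_000   # words in the AP news corpus
CORRECTIONS = 50_000       # entries in the corrections corpus

word_counts = {"cat": 6500, "carat": 3000}
# Count of the error that turns each candidate word into the typo "caat"
error_counts = {
    "cat": 5000,    # 'a' inserted after 'a'
    "carat": 7500,  # 'r' deleted after 'a'
}

def score(word):
    """Unnormalised posterior: p(typo | word) * p(word)."""
    prior = word_counts[word] / CORPUS_SIZE
    likelihood = error_counts[word] / CORRECTIONS
    return likelihood * prior

ranked = sorted(word_counts, key=score, reverse=True)
for word in ranked:
    print(word, score(word))
```

The scores are not normalised by p(typo), so they only support a ranking of the candidates, not a statement of the actual posterior probability.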

Encoding One-Error Correction as WFSTs

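Let Σ = {c,a,r,t}

One-edit model:
[Figure: WFST with identity arcs (c:c, a:a, r:r, t:t) around a single edit arc, which is one of Ins (ε:c, ε:a, ε:r, ε:t), Del (c:ε, a:ε, r:ε, t:ε) or Sub (c:a, c:r, c:t, a:c, a:t, …)]

Dictionary model:
[Figure: FST over the letters c, a, r, t encoding the dictionary words]

One-Error spelling correction:
• Input ● Edit ● Dictionary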

Issues
What if there are no instances of carat in the corpus?
• Smoothing algorithms
Estimate of P(typo|word) may not be accurate
• Training probabilities on typo/word pairs
What if there is more than one error per word?

Minimum Edit Distance

How can we measure how different one word is from another word?
• How many operations will it take to transform one word into another?
  caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
• Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1)
• Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent

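Computing Levenshtein Distance

$$d[i,j] = \min \begin{cases} d[i-1,\,j] + \mathrm{del}(s_i) \\ d[i-1,\,j-1] + \mathrm{sub}(s_i, t_j) \\ d[i,\,j-1] + \mathrm{ins}(t_j) \end{cases}$$

$$\mathrm{Lev}(s,t) = d[\,|s|-1,\ |t|-1\,]$$

(taking s_0 and t_0 as the first characters, so that s_{|s|-1} and t_{|t|-1} are the last)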

• Dynamic Programming algorithm
  – The solution to a problem is a function of the solutions to its subproblems
  – d[i,j] contains the distance up to s_i and t_j
  – d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion and substitution operations.
  – The optimal sequence of edit operations is recovered by storing back-pointers.
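The dynamic program with back-pointers can be sketched as below. This is one common formulation, not necessarily the exact indexing used on the slide: it assumes an (|s|+1) × (|t|+1) table in which row 0 and column 0 stand for the empty prefix, so the answer sits in d[|s|][|t|]. The function and parameter names are illustrative.

```python
def edit_distance(s, t, ins=1, dele=1, sub=1):
    """Return (distance, operations) for transforming s into t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]

    # Base cases: turning a prefix into the empty string and vice versa
    for i in range(1, n + 1):
        d[i][0], back[i][0] = d[i - 1][0] + dele, "del"
    for j in range(1, m + 1):
        d[0][j], back[0][j] = d[0][j - 1] + ins, "ins"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = s[i - 1] == t[j - 1]
            options = [
                (d[i - 1][j - 1] + (0 if same else sub), "match" if same else "sub"),
                (d[i - 1][j] + dele, "del"),
                (d[i][j - 1] + ins, "ins"),
            ]
            d[i][j], back[i][j] = min(options, key=lambda o: o[0])

    # Follow the back-pointers from the final cell to recover the edits
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = back[i][j]
        ops.append(op)
        if op in ("match", "sub"):
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return d[n][m], ops[::-1]

distance, operations = edit_distance("caat", "cat")
print(distance, operations)
```

Passing `sub=2` gives the variant in which a substitution costs as much as a deletion plus an insertion.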

Edit Distance Matrix
NB: the matrix contains errors; see http://www.cs.colorado.edu/~martin/SLP/slp-errata.html
• Cost=1 for insertions and deletions; Cost=2 for substitutions
• Recompute the matrix with insertions = deletions = substitutions = 1

Levenshtein Distance with WFSTs

Let Σ = {c,a,r,t}; Edit model:
[Figure: WFST with identity arcs (c:c, a:a, r:r, t:t) and edit arcs Ins (ε:c, ε:a, ε:r, ε:t), Del (c:ε, a:ε, r:ε, t:ε) and Sub (c:a, c:r, c:t, a:c, a:t, …)]

The two sentences to be compared are encoded as FSTs.
Levenshtein distance between two sentences:
• Dist(s1,s2) = s1 ● Edit ● s2

Spelling Correction with WFSTs
Dictionary: FST representation of words
Isolated word spelling correction:
• AllCorrections(w) = w ● Edit ● Dictionary
• BestCorrection(w) = Bestpath (w ● Edit ● Dictionary)
Spelling correction in context: “parents love there children”
• S = w1, w2, … wn
• Spelling correction of wi
• Generate possible edits for wi
• Pick the edit that fits best in context
• Use an n-gram language model (LM) to rank the alternatives.
• “love there” vs “love their”; “there children” vs “their children”
• SentenceCorrection (S) = F(S) ● Edit ● LM
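Ranking alternatives in context with a bigram language model can be sketched as follows. All counts below are invented for illustration, and the add-one smoothing and vocabulary size are assumptions. The sketch also ignores the edit (channel) weights and scores candidates with the LM only.

```python
import math

# Hypothetical bigram and unigram counts, invented purely for illustration
bigram_counts = {
    ("love", "their"): 40, ("their", "children"): 60,
    ("love", "there"): 2, ("there", "children"): 1,
}
unigram_counts = {"love": 100, "their": 200, "there": 300}
VOCAB_SIZE = 1000  # assumed vocabulary size for add-one smoothing

def bigram_logprob(prev, word):
    """Add-one smoothed log P(word | prev)."""
    numerator = bigram_counts.get((prev, word), 0) + 1
    denominator = unigram_counts.get(prev, 0) + VOCAB_SIZE
    return math.log(numerator / denominator)

def best_in_context(left, candidates, right):
    """Pick the candidate that best fits between `left` and `right`."""
    def score(candidate):
        return bigram_logprob(left, candidate) + bigram_logprob(candidate, right)
    return max(candidates, key=score)

print(best_in_context("love", ["there", "their"], "children"))  # prints: their
```

In the WFST formulation the same ranking comes out of composing the sentence with the Edit and LM transducers and taking the best path, with the edit weights included.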

• Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.

Can humans understand ‘what is meant’ as opposed to ‘what is said/written’? How?
http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/

Summary

We can apply probabilistic modeling to NL problems like spell-checking
• Noisy channel model, Bayesian method
• Training priors and likelihoods on a corpus

Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems
• e.g. Minimum Edit Distance algorithm

A number of Speech and Language tasks can be cast in this framework.
• Generate alternatives using a generator
• Select the best alternative, or rank the alternatives, using a model
• If the generator and the model are encodable as FSTs
  – Decoding becomes composition followed by a search for the best path.

Word Classes and Tagging


Words can be grouped into classes based on a number of criteria.
• Application independent criterion
  – Syntactic class (Nouns, Verbs, Adjectives…)
  – Proper names (People names, country names…)
  – Dates, currencies
• Application specific criterion
  – Product names (Ajax, Slurpee, Lexmark 3100)
  – Service names (7-cents plan, GoldPass)

Tagging: Categorizing words of a sentence into one of the classes.

Syntactic Classes in English: Open Class Words

Nouns:
• Defined semantically: words for people, places, things
• Defined syntactically: words that take determiners
• Count nouns: nouns that can be counted
  – One book, two computers, hundred men
• Mass nouns: nouns that represent homogeneous groups, can occur without articles.
  – snow, salt, milk, water, hair
• Proper nouns; common nouns
Verbs: words for actions and processes
• Hit, love, run, fly, differ, go
Adjectives: words for describing qualities and properties (modifiers) of objects
• White, black, old, young, good, bad
Adverbs: words for describing modifiers of actions
• Unfortunately, John walked home extremely slowly yesterday
• Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)

Syntactic Classes in English: Closed Class Words

Closed Class words:
• Fixed set for a language
• Typically high frequency words

Prepositions: relational words for describing relations among objects and events
• In, on, before, by
• Particles: looked up, throw out

Articles/Determiners: definite versus indefinite
• Indefinite: a, an
• Definite: the

Conjunctions: used to join two phrases, clauses, sentences.
• Coordinating conjunctions: and, or, but
• Subordinating conjunctions: that, since, because

Pronouns: shorthand to refer to objects and events.
• Personal pronouns: he, she, it, they, us
• Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s
• Wh-pronouns: whose, what, who, whom, whomever

Auxiliary verbs: used to mark tense, aspect, polarity, mood of an action
• Tense: past, present, future
• Aspect: completed or on-going
• Polarity: negation
• Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, do, have, may, might)
• Copula: “be” connects a subject to a predicate (John is a teacher)

Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry); greetings (hello, goodbye).

Tagset

Tagset: set of tags to use; depends on the application.
• Basic tags; tags with some morphology
• Composition of a number of subtags
  – Agglutinative languages
Popular tagsets for English
• Penn Treebank Tagset: 45 tags
• CLAWS tagset: 61 tags
• C7 tagset: 146 tags
How do we decide how many tags to use?
• Application utility
• Ease of disambiguation
• Annotation consistency
  – The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions
  – The “TO” tag represents both the preposition “to” and the infinitival marker, as in “to read”
Supertags: fold syntactic information into the tagset
• of the order of 1000 tags

Tagging: Disambiguating Words

Three different models
• ENGTWOL model (Karlsson et al. 1995)
• Transformation-based model (Brill 1995)
• Hidden Markov Model tagger
ENGTWOL tagger
• Constraint-based tagger
• 1,100 hand-written constraints to rule out invalid combinations of tags.
  – Use of probabilistic constraints and syntactic information
Transformation-based model
• Start with the most likely assignment
• Make note of the context when the most likely assignment is wrong.
• Induce a transformation rule that corrects the most likely assignment to the correct tag in that context.
• Rules can be seen as α → β | δ _ γ (rewrite tag α as β when it occurs between δ and γ)
• Compilable into an FST

Again, the Noisy Channel Model

[Figure: Source → Noisy Channel → Decoder]
• Input to channel: part-of-speech sequence T
• Output from channel: a word sequence W
• Decoding task: find

$$T' = \arg\max_{T \in V} P(T \mid W)$$

• Using Bayes Rule:

$$T' = \arg\max_{T \in V} \frac{P(W \mid T)\, P(T)}{P(W)}$$

• And since P(W) doesn’t change for any hypothetical T’:

$$T' = \arg\max_{T \in V} P(W \mid T)\, P(T)$$

• P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability

Stochastic Tagging: Markov Assumption

• The tagging model is approximated using Markov assumptions.

$$T' = \arg\max_{T \in V} P(T)\, P(W \mid T)$$

  – Markov (first-order) assumption:

$$P(T) = \prod_i P(t_i \mid t_{i-1})$$

  – Independence assumption:

$$P(W \mid T) = \prod_i P(w_i \mid t_i)$$

  – Thus:

$$T' = \arg\max_{T \in V} \prod_i P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

• The probability distributions are estimated from an annotated corpus.
  – Maximum Likelihood Estimate
    • P(w|t) = count(w,t)/count(t)
    • P(ti|ti-1) = count(ti-1, ti)/count(ti-1)
    • Don’t forget to smooth the counts!!
  – There are other means of estimating these probabilities.
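The maximum likelihood estimates can be sketched by counting over a tagged corpus. The three-sentence corpus below is hand-made for illustration, and the BOS/EOS boundary tags and function names are assumptions of this sketch. The counts are left unsmoothed to keep the example short.

```python
from collections import Counter

# Hypothetical hand-made tagged corpus, for illustration only
tagged_sentences = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP")],
]

emit_counts = Counter()    # count(w, t)
trans_counts = Counter()   # count(t_{i-1}, t_i)
tag_counts = Counter()     # count(t)

for sentence in tagged_sentences:
    prev = "BOS"
    tag_counts["BOS"] += 1
    for word, tag in sentence:
        emit_counts[(word, tag)] += 1
        trans_counts[(prev, tag)] += 1
        tag_counts[tag] += 1
        prev = tag
    trans_counts[(prev, "EOS")] += 1  # close the sentence

def p_emit(word, tag):
    """P(w | t) = count(w, t) / count(t)"""
    return emit_counts[(word, tag)] / tag_counts[tag]

def p_trans(tag, prev):
    """P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1})"""
    return trans_counts[(prev, tag)] / tag_counts[prev]

print(p_emit("dog", "NN"), p_trans("NN", "DT"), p_trans("DT", "BOS"))
```

Any word/tag or tag/tag pair missing from the corpus gets probability zero here, which is why the counts need smoothing in practice.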

Best Path Search

Search for the best path pervades many Speech and NLP problems.
• ASR: best path through a composition of acoustic, pronunciation and language models
• Tagging: best path through a composition of lexicon and contextual model
• Edit distance: best path through a search space set up by insertion, deletion and substitution operations.
In general:
• Decisions/operations create a weighted search space
• Search for the best sequence of decisions
Dynamic programming solution
• Sometimes only the score is relevant.
• Most often the path (sequence of states; derivation) is relevant.

Multi-stage decision problems

Find the state sequence through this space that maximizes P(w|t)*P(t|t-1)

[Figure: tag lattice for “The dog runs .” from BOS to EOS; The: DT; dog: NN or VB; runs: NNS or VBZ]

cost(BOS, EOS) = 1*cost(DT, EOS)
cost(DT, EOS) = max{P(the|DT)*P(NN|DT)*cost(NN,EOS), P(the|DT)*P(VB|DT)*cost(VB,EOS)}
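The search over this lattice can be sketched as a Viterbi-style dynamic program. All probabilities below are invented for illustration, and the function and table names are assumptions. The sketch works from the source state towards each state, rather than from each state to the goal as in the cost recursion above. It also omits the final transition to EOS.

```python
def viterbi(words, tags_for, p_emit, p_trans):
    """Return (score, tag path) maximising prod P(w_i|t_i) * P(t_i|t_{i-1})."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {"BOS": (1.0, [])}
    for word in words:
        new_best = {}
        for tag in tags_for[word]:
            new_best[tag] = max(
                (
                    (score * p_trans.get((prev, tag), 0.0)
                     * p_emit.get((word, tag), 0.0), path + [tag])
                    for prev, (score, path) in best.items()
                ),
                key=lambda entry: entry[0],
            )
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])

# Hypothetical lattice and probabilities, invented for illustration
tags_for = {"the": ["DT"], "dog": ["NN", "VB"], "runs": ["NNS", "VBZ"], ".": ["."]}
p_emit = {("the", "DT"): 0.5, ("dog", "NN"): 0.01, ("dog", "VB"): 0.001,
          ("runs", "NNS"): 0.001, ("runs", "VBZ"): 0.005, (".", "."): 1.0}
p_trans = {("BOS", "DT"): 0.4, ("DT", "NN"): 0.5, ("DT", "VB"): 0.01,
           ("NN", "NNS"): 0.05, ("NN", "VBZ"): 0.2, ("VB", "NNS"): 0.1,
           ("VB", "VBZ"): 0.01, ("NNS", "."): 0.3, ("VBZ", "."): 0.3}

score, path = viterbi(["the", "dog", "runs", "."], tags_for, p_emit, p_trans)
print(path)  # prints: ['DT', 'NN', 'VBZ', '.']
```

Only the best path into each tag is kept at every word, so the work grows with the number of words times the square of the number of tags per word, not with the number of complete tag sequences.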

Two ways of reasoning

Forward approach (Backward reasoning)
• Compute the best way to get from a state to the goal state.
Backward approach (Forward reasoning)
• Compute the best way from the source state to get to a state.
A combination of these two approaches is used in unsupervised training of HMMs.
• Forward-backward algorithm (Appendix D)
