Today’s Topics
• Conclude the n-gram section – (Chapter 6)
• Start Part-of-Speech (POS) Tagging section – (Chapter 8)
N-grams for Spelling Correction
• Spelling Errors– see Chapter 5
• Three task types (Kukich, 1992):
– Non-word error detection (easiest)
• graffe (giraffe)
– Isolated-word (context-free) error correction
• graffe (giraffe, …)
• graffed (gaffed, …)
• by definition, cannot correct when the error word is itself a valid word
– Context-dependent error detection and correction (hardest)
• your an idiot → you’re an idiot
• (Microsoft Word corrects this by default)
N-grams for Spelling Correction
• Context-sensitive spelling correction
– real-word error detection
• when the mistyped word happens to be a real word
• estimated share of errors: 15% Peterson (1986); 25%–40% Kukich (1992)
– examples (local)
• was conducted mainly be John Black
• leaving in about fifteen minuets
• all with continue smoothly
– examples (global)
• Won’t they heave if next Monday at that time?
• the system has been operating system with all four units on-line
N-grams for Spelling Correction
• Given a word sequence:
– W = w1 … wk … wn
• Suppose wk is misspelled
• Candidate corrections w′k, w″k, etc. can be generated via edit-distance operations
• Find the most likely sequence
– w1 … w′k … wn
– i.e. maximize p(w1 … w′k … wn)
• Chain Rule
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1 ... wn-1)
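The candidate-generation step (the edit-distance operations above) can be sketched as follows; the `edits1` name and the toy lexicon are illustrative assumptions, in the style of Norvig's well-known spelling corrector:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings within one edit (deletion, transposition,
    substitution, insertion) of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

# Candidate corrections are edit neighbours that appear in a dictionary:
dictionary = {"giraffe", "gaffe", "graft"}  # toy lexicon (assumption)
candidates = edits1("graffe") & dictionary  # {"giraffe", "gaffe"}
```

The surviving candidates are then ranked by the chain-rule probability of the full sequence.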
N-grams for Spelling Correction
• Use an n-gram language model for P(W)
• Bigram
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1 ... wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• Trigram
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1 ... wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2wn-1)
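The bigram approximation can be sketched with maximum-likelihood counts; the toy corpus and function names here are assumptions for illustration:

```python
from collections import Counter

corpus = "I walk the dog . I walk to work . the dog can walk".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # MLE estimate: p(w | w_prev) = count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sentence(words):
    # p(w1 ... wn) ~= p(w1) * product of p(wi | wi-1)
    p = unigrams[words[0]] / len(corpus)
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(prev, w)
    return p

p_sentence(["I", "walk", "the", "dog"])
```

A real model would add smoothing for unseen bigrams; this sketch returns probability zero for them.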
• Problem:
– we need to estimate n-grams containing the “misspelled” items
– where to find data?
– acute sparse-data problem
• Possible solution (“class-based n-grams”)
– Use a part-of-speech n-gram model
– more data per class, so estimates are more robust
– p(c1 c2 c3 ... cn) ≈ p(c1) p(c2|c1) p(c3|c2) ... p(cn|cn-1) (bigram)
– ci = category label (N, V, A, P, Adv, Conj, etc.)
Entropy
• Uncertainty measure (Shannon)
– given a random variable x with r possible outcomes
• pi = probability that the outcome is i (for a coin, r = 2)
– Biased coin:
• −0.8 lg 0.8 − 0.2 lg 0.2 = 0.258 + 0.464 = 0.722
– Unbiased coin:
• −2 × 0.5 lg 0.5 = 1
– lg = log2 (log base 2)
– entropy = Shannon uncertainty:
• H(x) = −∑i=1..r pi lg pi
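The entropy formula can be checked directly; this small sketch (function name assumed) reproduces the two coin examples:

```python
from math import log2

def entropy(probs):
    # H(x) = -sum_i p_i lg p_i  (terms with p_i = 0 contribute 0)
    return -sum(p * log2(p) for p in probs if p > 0)

h_biased = entropy([0.8, 0.2])  # ~0.722 bits
h_fair = entropy([0.5, 0.5])    # 1 bit
```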
• Perplexity
– (average) branching factor
– weighted average number of choices a random variable has to make
– Formula: 2^H
• directly related to the entropy value H
• Examples
– Biased coin:
• 2^0.722 ≈ 1.65
– Unbiased coin:
• 2^1 = 2
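Perplexity as 2^H can be sketched the same way (function names are illustrative assumptions):

```python
from math import log2

def entropy(probs):
    # H = -sum_i p_i lg p_i
    return -sum(p * log2(p) for p in probs if p > 0)

def perplexity(probs):
    # perplexity = 2 ** H: the weighted average branching factor
    return 2 ** entropy(probs)

perplexity([0.5, 0.5])  # fair coin: 2 effective choices
perplexity([0.8, 0.2])  # biased coin: fewer effective choices (~1.65)
```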
Entropy and Word Sequences
• Given a word sequence:
– W = w1 … wn
• Entropy for word sequences of length n in language L
– H(w1 … wn) = −∑ p(w1 … wn) log p(w1 … wn)
– summed over all sequences of length n in language L
• Entropy rate for word sequences of length n
– (1/n) H(w1 … wn) = −(1/n) ∑ p(w1 … wn) log p(w1 … wn)
• Entropy rate of a language L
– H(L) = lim n→∞ −(1/n) ∑ p(w1 … wn) log p(w1 … wn)
– n is the number of words in the sequence
• Shannon-McMillan-Breiman theorem
– H(L) = lim n→∞ −(1/n) log p(w1 … wn)
– select a sufficiently large n
– then possible to take a single sequence
• instead of summing over all possible w1 … wn
• a long sequence will contain many shorter sequences
Cross-Entropy
• evaluate competing models
• compare two probabilistic models, i.e. approximations m1 and m2, of the true distribution p
• Compute the cross-entropy of each mi on p:
– H(p, mi) = lim n→∞ −(1/n) ∑ p(w1 … wn) log mi(w1 … wn)
• Shannon-McMillan-Breiman version:
– H(p, mi) = lim n→∞ −(1/n) log mi(w1 … wn)
• H(p) ≤ H(p, mi)
– the true entropy is a lower bound
• the mi with the lowest H(p, mi) is the more accurate model
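The Shannon-McMillan-Breiman estimate of cross-entropy can be sketched as follows; the unigram model and toy data are assumptions made here to keep the example short:

```python
from math import log2

def cross_entropy(model, words):
    # Shannon-McMillan-Breiman estimate: H(p, m) ~= -(1/n) log m(w1 ... wn)
    # `model` maps a word to its probability (a unigram model, for brevity).
    log_prob = sum(log2(model[w]) for w in words)
    return -log_prob / len(words)

m1 = {"a": 0.5, "b": 0.5}    # matches the data below
m2 = {"a": 0.9, "b": 0.1}    # poor fit
test_seq = ["a", "b", "a", "b"]

cross_entropy(m1, test_seq)  # 1.0 bit
cross_entropy(m2, test_seq)  # higher: the less accurate model
```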
Perplexity of Language Models
• [see p. 228: section 6.7]
• n-gram models
• corpus:
– 38 million words from the WSJ (Wall Street Journal)
• Compute perplexity of each model on a test set of 1.5 million words
• Perplexity defined as
– 2^H(p, mi)
• Results:
– unigram: 962
– bigram: 170
– trigram: 109
– the lower the perplexity, the more closely the trained model follows the data
Entropy of English
• Shannon Experiment
– given a (hidden) sequence of characters
– ask a speaker of the language to predict what the next character might be
– record the number of guesses taken to get the right character
– H(English) = −∑ p(guess = character) log p(guess = character)
• guess ranges over all 27 characters (26 letters and space)
Entropy of English
• Shannon Experiment
– 1.3 bits (recorded)
– URL: http://www.math.ucsd.edu/~crypto/java/ENTROPY/
– “We should also mention that in a classroom of about 60 students, with everybody venturing guesses for each next letter, we consistently obtained a value of about 1.6 bits for the estimate of the entropy.”
Entropy of English
• Word-based method
– train a very good stochastic model m of English on a large corpus
– use it to assign a log-probability to a very long sequence
• Shannon-McMillan-Breiman formula:
– H(p, m) = lim n→∞ −(1/n) log m(w1 … wn)
– H(English) ≤ H(p, m)
– Result:
• 1.75 bits per character (Brown et al.)
• 583-million-word corpus to train model m
• test sequence was the Brown Corpus (1 million words)
Parts-of-Speech
• Divide words into classes based on grammatical function
– nouns (open-class: unlimited set)
• referential items (denoting objects, concepts, etc.)
– proper nouns: John
– pronouns: he, him, she, her, it
– anaphors: himself, herself (reflexives)
– common nouns: dog, dogs, water
» number: dog (singular), dogs (plural)» count-mass distinction: many dogs, *many waters
– eventive nouns: dismissal, concert, playback, destruction (deverbal)
• nonreferential items
– it as in it is important to study
– there as in there seems to be a problem
– some languages don’t have these: e.g. Japanese
• open-class: factoid, email, bush-ism
Parts-of-Speech
• Pronouns: it, I, he, you, his, they, this, that, she, her, we, all, which, their, what
figure 8.4
Parts-of-Speech
• Divide words into classes based on grammatical function
– verbs (closed-class: fixed set)
• auxiliaries
– be (passive, progressive)
– have (pluperfect tense)
– do (what did John buy?, Did Mary win?)
– modals: can, could, would, will, may
• Irregular: – is, was, were, does, did
figure 8.5
Parts-of-Speech
• Divide words into classes based on grammatical function
– verbs (open-class: unlimited set)
• Intransitive
– unaccusatives: arrive (achievement)
– unergatives: run, jog (activities)
• Transitive
– actions: hit (semelfactive: hit the ball for an hour)
– actions: eat, destroy (accomplishment)
– psych verbs: frighten (x frightens y), fear (y fears x)
• Ditransitive
– put (x put y on z, *x put y)
– give (x gave y z, *x gave y, x gave z to y)
– load (x loaded y (on z), x loaded z (with y))
– Open-class: • reaganize, email, fax
Parts-of-Speech
• Divide words into classes based on grammatical function
– adjectives (open-class: unlimited set)
• modify nouns
• black, white, open, closed, sick, well
• attributive: black (black car, the car is black), main (main street, *the street is main), atomic
• predicative: afraid (*afraid child, the child is afraid)
• stage-level: drunk (there is a man drunk in the pub)
• individual-level: clever, short, tall (*there is a man tall in the bar)
• object-taking: proud (proud of him, *well of him)
• intersective: red (red car: intersection of the set of red things and the set of cars)
• non-intersective: former (former architect), atomic (atomic scientist)
• comparative, superlative: blacker, blackest, *opener, *openest
– open-class:• hackable, spammable
Parts-of-Speech
• Divide words into classes based on grammatical function
– adverbs (open-class: unlimited set)
• modify verbs (also adjectives and other adverbs)
• manner: slowly (moved slowly)
• degree: slightly, more (more clearly), very (very bad), almost
• sentential: unfortunately, suddenly
• question: how
• temporal: when, soon, yesterday (noun?)
• location: sideways, here (John is here)
– open-class:• spam-wise
Parts-of-Speech
• Divide words into classes based on grammatical function
– prepositions (closed-class: fixed set)
– come before an object and assign it a semantic function (from Mars, *Mars from)
• head-final languages: postpositions (Japanese: amerika-kara)
– location: on, in, by
– temporal: by, until
figure 8.1
Parts-of-Speech
• Divide words into classes based on grammatical function
– particles (closed-class: fixed set)
– resemble a preposition or adverb; often combine with a verb to form a phrasal verb
– went on, finish up
– throw sleep off (throw off sleep)
• single-word particles (Quirk, 1985):
figure 8.2
Parts-of-Speech
• Divide words into classes based on grammatical function
– conjunctions (closed-class: fixed set)
– used to join two phrases, clauses or sentences
– coordinating conjunctions: and, or, but
– subordinating conjunctions: that (complementizer)
figure 8.3
Part-of-Speech (POS) Tagging
• Idea:
– assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word
– useful for shallow parsing
– or as the first stage of a deeper/more sophisticated system
• Question:
– Is it a hard task?
• i.e. can’t we just look the words up in a dictionary?
• Answer:
– Yes: many words are ambiguous between several tags.
– No: POS taggers typically claim 95%+ accuracy
Part-of-Speech (POS) Tagging
• example
– walk: noun, verb
• the walk: noun (I took the walk …)
• I walk: verb (I walk 2 miles every day)
– as a shallow parsing tool:
• can we do this without fully parsing the sentence?
• example
– still: noun, adjective, adverb, verb
• the still of the night, a glass still
• still waters
• stand still
• still struggling
• Still, I didn’t give way
• still your fear of the dark (transitive)
• the bubbling waters stilled (intransitive)
POS Tagging
• Task:
– assign the right part-of-speech tag, e.g. noun, verb, conjunction, to a word in context
• POS taggers
– need to be fast in order to process large corpora
• should take no more than time linear in the size of the corpus
– full parsing is slow
• e.g. context-free parsing is O(n³), where n is the length of the sentence
– POS taggers try to assign the correct tag without actually parsing the sentence
POS Tagging
• Components:
– Dictionary of words
• Exhaustive list of closed-class items
– Examples:
» the, a, an: determiner
» from, to, of, by: preposition
» and, or: coordinating conjunction
• Large set of open-class items (e.g. nouns, verbs, adjectives) with frequency information
POS Tagging
• Components:
– Mechanism to assign tags
• Context-free: by frequency
• Context-dependent: bigram, trigram, HMM, hand-coded rules
– Example:
» the walk …: Det Noun/*Verb
– Mechanism to handle unknown words (extra-dictionary)
• Capitalization
• Morphology: -ed, -tion
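The unknown-word mechanism can be sketched as a few heuristic rules on capitalization and morphology; the specific rule set and tag choices here are illustrative assumptions, not a full system (tag names follow the Penn Treebank):

```python
def guess_tag(word):
    # Heuristic fallback for words absent from the dictionary.
    if word[0].isupper():
        return "NNP"   # capitalized -> guess proper noun
    if word.endswith("ed"):
        return "VBD"   # -ed -> guess past-tense verb
    if word.endswith("tion"):
        return "NN"    # -tion -> guess noun
    return "NN"        # default: noun, the largest open class

guess_tag("Kukich")       # 'NNP'
guess_tag("spammed")      # 'VBD'
guess_tag("destruction")  # 'NN'
```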
How Hard is Tagging?
• Brown Corpus (Francis & Kucera, 1982):
– 1 million words
– 39K distinct words
– 35K words with only 1 tag
– 4K with multiple tags (DeRose, 1988)
figure 8.7
How Hard is Tagging?
• Easy task to do well on:
– naïve algorithm
• assign each word its most frequent (unigram) tag
– 90% accuracy (Charniak et al., 1993)
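The naïve unigram-frequency algorithm can be sketched as follows; the toy tagged corpus is an assumption for illustration (a real system would train on something like the Brown Corpus):

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs.
tagged = [("the", "DT"), ("walk", "NN"), ("I", "PRP"),
          ("walk", "VBP"), ("walk", "VBP"), ("the", "DT")]

counts = defaultdict(Counter)
for word, t in tagged:
    counts[word][t] += 1

def tag(word):
    # Assign each word its single most frequent tag, ignoring context.
    return counts[word].most_common(1)[0][0]

tag("walk")  # 'VBP' (2 of its 3 occurrences)
```

Context-based taggers (bigram, trigram, HMM) improve on this by conditioning on neighbouring tags.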
Penn TreeBank Tagset
• 48-tag simplification of the Brown Corpus tagset
• Examples:
1. CC Coordinating conjunction
3. DT Determiner
7. JJ Adjective
11. MD Modal
12. NN Noun (singular, mass)
13. NNS Noun (plural)
27. VB Verb (base form)
28. VBD Verb (past)
Penn TreeBank Tagset
www.ldc.upenn.edu/doc/treebank2/cl93.html
1 CC Coordinating conjunction
2 CD Cardinal number
3 DT Determiner
4 EX Existential there
5 FW Foreign word
6 IN Preposition/subord. conjunction
7 JJ Adjective
8 JJR Adjective, comparative
9 JJS Adjective, superlative
10 LS List item marker
11 MD Modal
12 NN Noun, singular or mass
13 NNS Noun, plural
14 NNP Proper noun, singular
15 NNPS Proper noun, plural
16 PDT Predeterminer
17 POS Possessive ending
18 PRP Personal pronoun
19 PP$ Possessive pronoun
20 RB Adverb
21 RBR Adverb, comparative
22 RBS Adverb, superlative
23 RP Particle
24 SYM Symbol (mathematical or scientific)
Penn TreeBank Tagset
www.ldc.upenn.edu/doc/treebank2/cl93.html
25 TO to
26 UH Interjection
27 VB Verb, base form
28 VBD Verb, past tense
29 VBG Verb, gerund/present participle
30 VBN Verb, past participle
31 VBP Verb, non-3rd ps. sing. present
32 VBZ Verb, 3rd ps. sing. present
33 WDT wh-determiner
34 WP wh-pronoun
35 WP$ Possessive wh-pronoun
36 WRB wh-adverb
37 # Pound sign
38 $ Dollar sign
39 . Sentence-final punctuation
40 , Comma
41 : Colon, semi-colon
42 ( Left bracket character
43 ) Right bracket character
44 " Straight double quote
45 ` Left open single quote
46 `` Left open double quote
47 ' Right close single quote
48 '' Right close double quote
Penn TreeBank Tagset
• How many tags?
– Tag criterion
• Distinctness with respect to grammatical behavior?
– Or: make tagging easier?
• Punctuation tags
– Penn Treebank numbers 37–48
– Trivial computational task
Penn TreeBank Tagset
• Simplifications:
– Tag TO:
• infinitival marker, preposition
• I want to win
• I went to the store
– Tag IN:
• preposition / subordinating conjunction: that, when, although
• I know that I should have stopped, although …
• I stopped when I saw Bill
Penn TreeBank Tagset
• Simplifications:
– Tag DT:
• determiner: any, some, these, those
• any man
• these *man/men
– Tag VBP:
• verb, present: am, are, walk
• Am I here?
• *Walked I here? / Did I walk here?