LING/C SC 581: Advanced Computational Linguistics
Lecture Notes Jan 22nd
Today's Topics
• Minimum Edit Distance Homework
• Corpora: frequency information
• tregex
Minimum Edit Distance Homework
• Background:
  – … about 20% of the time “Britney Spears” is misspelled when people search for it on Google
• Software for generating misspellings
  – If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/
Minimum Edit Distance Homework
• http://www.google.com/jobs/archive/britney.html
Top six misspellings
• Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
  – e.g. ED(brittany) < ED(britany)
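• For a starting point, here is a minimal Python sketch of a weighted minimum edit distance (the cost values, the helper name edit_distance, and the candidate list are illustrative assumptions, not the intended solution):

# Weighted Levenshtein distance: a possible starting point for ranking misspellings.
# The default costs below are placeholders; the homework is about choosing better ones.

def edit_distance(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=2.0):
    """Dynamic-programming minimum edit distance with adjustable operation costs."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]   # d[i][j]: cost of source[:i] -> target[:j]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0.0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,       # deletion
                          d[i][j - 1] + ins_cost,       # insertion
                          d[i - 1][j - 1] + sub)        # substitution / match
    return d[n][m]

# Candidates: the two misspellings from the slide; add the rest from the Google page.
candidates = ["brittany", "britany"]
for w in sorted(candidates, key=lambda w: edit_distance("britney", w)):
    print(w, edit_distance("britney", w))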
Minimum Edit Distance Homework
• Submit your homework in PDF
  – how many you got right
  – explain your criteria, e.g. the weights chosen
• you should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
• due by email to me before next Thursday's class…
  – put your name and 581 at the top of your submission
Part 2
• Corpora: frequency information
• Unlabeled corpus: just words (easy to find)
• Labeled corpus: various kinds … (progressively harder to create or obtain)
  – POS information
  – Information about phrases
  – Word sense or semantic role labeling
Language Models and N-grams
• given a word sequence
  – w1 w2 w3 ... wn
• chain rule
  – how to compute the probability of a sequence of words
  – p(w1 w2) = p(w1) p(w2|w1)
  – p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1w2)
  – ...
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2 wn-1)
• note
  – it’s not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2...w1) for all possible word sequences
Language Models and N-grams
• Given a word sequence
  – w1 w2 w3 ... wn
• Bigram approximation
  – just look at the previous word only (not all the preceding words)
  – Markov Assumption: finite length history
  – 1st order Markov Model
  – p(w1 w2 w3...wn) = p(w1) p(w2|w1) p(w3|w1w2) ... p(wn|w1...wn-2 wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
• note
  – p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1...wn-2 wn-1)
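• As an illustration, the bigram factorization written directly in Python (the probability values are made-up placeholders, not estimates from a corpus):

# Bigram approximation: p(w1 ... wn) ≈ p(w1) * product of p(wi | wi-1)
# Probabilities here are invented for illustration only.

p_unigram = {"I": 0.05}
p_bigram = {("I", "want"): 0.33, ("want", "to"): 0.66, ("to", "eat"): 0.28}

def bigram_prob(words):
    prob = p_unigram.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= p_bigram.get((prev, cur), 0.0)   # one unseen bigram zeroes the product
    return prob

print(bigram_prob(["I", "want", "to", "eat"]))   # 0.05 * 0.33 * 0.66 * 0.28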
Language Models and N-grams
• Trigram approximation
  – 2nd order Markov Model
  – just look at the preceding two words only
  – p(w1 w2 w3 w4...wn) = p(w1) p(w2|w1) p(w3|w1w2) p(w4|w1w2w3) ... p(wn|w1...wn-2 wn-1)
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w1w2) p(w4|w2w3) ... p(wn|wn-2 wn-1)
• note
  – p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1...wn-2 wn-1), but harder than p(wn|wn-1)
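• The same kind of sketch for the trigram case (two-word histories; again the probabilities are invented placeholders):

# Trigram approximation: p(w1 ... wn) ≈ p(w1) p(w2|w1) * product of p(wi | wi-2 wi-1)
# Probabilities here are invented for illustration only.

p_uni = {"I": 0.05}
p_bi = {("I", "want"): 0.33}
p_tri = {("I", "want", "to"): 0.70, ("want", "to", "eat"): 0.30}

def trigram_prob(words):
    prob = p_uni.get(words[0], 0.0) * p_bi.get((words[0], words[1]), 0.0)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        prob *= p_tri.get((w1, w2, w3), 0.0)
    return prob

print(trigram_prob(["I", "want", "to", "eat"]))  # 0.05 * 0.33 * 0.70 * 0.30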
Language Models and N-grams
• estimating from corpora
  – how to compute bigram probabilities
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1 w)          where w is any word
  – since f(wn-1 w) = f(wn-1), the unigram frequency of wn-1:
  – p(wn|wn-1) = f(wn-1 wn) / f(wn-1)            relative frequency
• Note:
  – the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
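• A minimal Python sketch of the relative-frequency (MLE) estimate over a toy corpus (the corpus and the function name mle_bigram are assumptions for illustration):

from collections import Counter

# Toy corpus; in practice this would be a large unlabeled text.
corpus = "I want to eat Chinese food . I want to eat lunch .".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_bigram(prev, word):
    """Relative frequency estimate: p(word | prev) = f(prev word) / f(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(mle_bigram("want", "to"))    # 2/2 = 1.0 in this toy corpus
print(mle_bigram("eat", "lunch"))  # 1/2 = 0.5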
Motivation for smoothing
• Smoothing: avoid zero probability estimates
• Consider: what happens when any individual probability component is zero?
  – Arithmetic multiplication law: 0 × X = 0
  – very brittle!
• even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
  – particularly so for larger n-grams
  – p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Language Models and N-grams
• Example:
  – unigram frequencies
  – bigram frequencies f(wn-1 wn)    (table rows: wn-1, columns: wn)
  – bigram probabilities
  – sparse matrix: zeros render probabilities unusable
  – (we’ll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
• sparse dataset means zeros are a problem
  – zero probabilities are a problem
    • p(w1 w2 w3...wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)    (bigram model)
    • one zero and the whole product is zero
  – zero frequencies are a problem
    • p(wn|wn-1) = f(wn-1 wn) / f(wn-1)    (relative frequency)
    • bigram f(wn-1 wn) doesn’t exist in the dataset
• smoothing
  – refers to ways of assigning zero-probability n-grams a non-zero value
Smoothing and N-grams
• Add-One Smoothing (4.5.1 Laplace Smoothing)
  – add 1 to all frequency counts
  – simple, and no more zeros (but there are better methods)
• unigram
  – p(w) = f(w)/N                           (before Add-One)
    • N = size of corpus
  – p(w) = (f(w)+1)/(N+V)                   (with Add-One)
  – f*(w) = (f(w)+1)·N/(N+V)                (with Add-One)
    • V = number of distinct words in the corpus
    • N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
• bigram
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)                       (before Add-One)
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)               (after Add-One)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)      (after Add-One)
  – must rescale so that total probability mass stays at 1
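• A minimal Python sketch of Add-One smoothing for bigrams, following the formulas above (the toy corpus and its counts are assumptions, not the counts in the tables on the next slides):

from collections import Counter

corpus = "I want to eat Chinese food . I want to eat lunch .".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)            # number of distinct words (8 in this toy corpus)

def addone_prob(prev, word):
    """p(word | prev) = (f(prev word) + 1) / (f(prev) + V)"""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def addone_count(prev, word):
    """Reconstituted count f* = (f(prev word) + 1) * f(prev) / (f(prev) + V)"""
    return (bigram_counts[(prev, word)] + 1) * unigram_counts[prev] / (unigram_counts[prev] + V)

print(addone_prob("eat", "food"))   # unseen bigram: (0+1)/(2+8) = 0.1, no longer zero
print(addone_count("want", "to"))   # (2+1)*2/(2+8) = 0.6, discounted from the raw count 2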
Smoothing and N-‐grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
• frequencies
  Remarks: perturbation problem
  – Add-One causes large changes in some frequencies due to the relative size of V (1616)
  – e.g. f(want to): 786 ⇒ 338

  Bigram frequencies f(wn-1 wn) (= figure 6.4):
            I      want    to      eat     Chinese  food   lunch
  I         8      1087    0       13      0        0      0
  want      3      0       786     0       6        8      6
  to        3      0       10      860     3        0      12
  eat       0      0       2       0       19       2      52
  Chinese   2      0       0       0       0        120    1
  food      19     0       17      0       0        0      0
  lunch     4      0       0       0       0        1      0

  Add-One smoothed frequencies f*(wn-1 wn) (= figure 6.8):
            I      want    to      eat     Chinese  food   lunch
  I         6.12   740.05  0.68    9.52    0.68     0.68   0.68
  want      1.72   0.43    337.76  0.43    3.00     3.86   3.00
  to        2.67   0.67    7.35    575.41  2.67     0.67   8.69
  eat       0.37   0.37    1.10    0.37    7.35     1.10   19.47
  Chinese   0.35   0.12    0.12    0.12    0.12     14.09  0.23
  food      9.65   0.48    8.68    0.48    0.48     0.48   0.48
  lunch     1.11   0.22    0.22    0.22    0.22     0.44   0.22
Smoothing and N-grams
• Add-One Smoothing
  – add 1 to all frequency counts
• bigram
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
  – f*(wn-1 wn) = (f(wn-1 wn)+1)·f(wn-1)/(f(wn-1)+V)
• probabilities
  Remarks: perturbation problem
  – similar changes in the probabilities

  Unsmoothed (MLE) bigram probabilities p(wn|wn-1) (= figure 6.5):
            I        want     to       eat      Chinese  food     lunch
  I         0.00233  0.31626  0.00000  0.00378  0.00000  0.00000  0.00000
  want      0.00247  0.00000  0.64691  0.00000  0.00494  0.00658  0.00494
  to        0.00092  0.00000  0.00307  0.26413  0.00092  0.00000  0.00369
  eat       0.00000  0.00000  0.00213  0.00000  0.02026  0.00213  0.05544
  Chinese   0.00939  0.00000  0.00000  0.00000  0.00000  0.56338  0.00469
  food      0.01262  0.00000  0.01129  0.00000  0.00000  0.00000  0.00000
  lunch     0.00871  0.00000  0.00000  0.00000  0.00000  0.00218  0.00000

  Add-One smoothed bigram probabilities (= figure 6.7):
            I        want     to       eat      Chinese  food     lunch
  I         0.00178  0.21532  0.00020  0.00277  0.00020  0.00020  0.00020
  want      0.00141  0.00035  0.27799  0.00035  0.00247  0.00318  0.00247
  to        0.00082  0.00021  0.00226  0.17672  0.00082  0.00021  0.00267
  eat       0.00039  0.00039  0.00117  0.00039  0.00783  0.00117  0.02075
  Chinese   0.00164  0.00055  0.00055  0.00055  0.00055  0.06616  0.00109
  food      0.00641  0.00032  0.00577  0.00032  0.00032  0.00032  0.00032
  lunch     0.00241  0.00048  0.00048  0.00048  0.00048  0.00096  0.00048
Smoothing and N-grams
• let’s illustrate the problem
  – take the bigram case: wn-1 wn
  – p(wn|wn-1) = f(wn-1 wn)/f(wn-1)
  – suppose there are cases wn-1 wzero1, ..., wn-1 wzerom that don’t occur in the corpus
  – probability mass diagram: all of f(wn-1) is taken up by the seen bigrams f(wn-1 wn);
    the unseen bigrams f(wn-1 wzero1) = ... = f(wn-1 wzerom) = 0 receive no mass
Smoothing and N-‐grams
• add-one: “give everyone 1”
  – probability mass diagram: each count is incremented, so seen bigrams become f(wn-1 wn)+1
    and the unseen bigrams f(wn-1 w01), ..., f(wn-1 w0m) each get a count of 1
Smoothing and N-grams
• add-one: “give everyone 1”
  – as in the diagram above, but the denominator grows: V = |{wi}| extra counts are added to f(wn-1)
• redistribution of probability mass
  – p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
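• A quick check, under the same toy-corpus assumption as the earlier sketches, that the redistributed Add-One probabilities still sum to 1 for a fixed history word:

from collections import Counter

corpus = "I want to eat Chinese food . I want to eat lunch .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = list(unigrams)
V = len(vocab)

prev = "eat"
smoothed = [(bigrams[(prev, w)] + 1) / (unigrams[prev] + V) for w in vocab]
print(sum(smoothed))   # 1.0 (up to floating point): seen bigrams give up mass to the zeros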
Smoothing and N-grams
• Good-Turing Discounting (4.5.2)
  – Nc = number of things (= n-grams) that occur c times in the corpus
  – N = total number of things seen
  – Formula: smoothed count c* for Nc given by c* = (c+1)Nc+1/Nc
  – Idea: use the frequency of things seen once to estimate the frequency of things we haven’t seen yet
    • estimate N0 in terms of N1, and so on
    • but if Nc = 0, smooth that first using something like log(Nc) = a + b·log(c)
  – Formula: P*(things with zero frequency) = N1/N
  – smaller impact than Add-One
• Textbook example:
  – fishing in a lake with 8 species
    • bass, carp, catfish, eel, perch, salmon, trout, whitefish
  – sample data (6 out of 8 species seen):
    • 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
  – P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 ≈ 0.17
  – P(next fish = trout) = 1/18
    • (but we have reassigned probability mass, so this needs to be recalculated from the smoothing formula…)
  – revised count for trout: c*(trout) = 2·N2/N1 = 2·(1/3) ≈ 0.67 (discounted from 1)
  – revised P(next fish = trout) = 0.67/18 ≈ 0.037
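• A minimal Python sketch of the Good-Turing numbers for this fishing example (only the quantities used on the slide):

from collections import Counter

# Observed sample: 18 fish, 6 of the 8 species seen.
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())            # 18 things seen in total
Nc = Counter(counts.values())       # Nc[c] = number of species seen exactly c times

# Probability mass reserved for unseen species (catfish, bass): N1 / N
print(Nc[1] / N)                    # 3/18 ≈ 0.17

# Good-Turing revised count: c* = (c + 1) * N(c+1) / Nc
def revised_count(c):
    return (c + 1) * Nc[c + 1] / Nc[c]

print(revised_count(1))             # trout: 2 * N2/N1 = 2 * (1/3) ≈ 0.67
print(revised_count(1) / N)         # revised P(next fish = trout) ≈ 0.037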
Language Models and N-grams
• N-gram models + smoothing
  – one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
  – N-gram models can also incorporate word classes, e.g. POS labels, when available
Language Models and N-grams
• N-gram models
  – data is easy to obtain
    • any unlabeled corpus will do
  – they’re technically easy to compute
    • count frequencies and apply the smoothing formula
– but just how good are these n-‐gram language models?
– and what can they show us about language?
Language Models and N-grams
• Approximating Shakespeare
  – generate random sentences using n-grams
  – Corpus: Complete Works of Shakespeare
• Unigram (pick random, unconnected words)
• Bigram
Language Models and N-grams
• Approximating Shakespeare
  – generate random sentences using n-grams
  – Corpus: Complete Works of Shakespeare
• Trigram
• Quadrigram
Remarks: dataset size problem
  – the training set is small: 884,647 words, 29,066 different words
  – 29,066² = 844,832,356 possible bigrams
  – for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
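• A minimal Python sketch of the random sentence generator idea over bigrams (the text variable is a tiny placeholder; with the Complete Works of Shakespeare it would be read from a file):

import random
from collections import defaultdict

# Tiny placeholder training text; substitute the Shakespeare corpus here.
text = "to be or not to be that is the question".split()

# Table of observed continuations for each word (a bigram model in count form).
continuations = defaultdict(list)
for prev, nxt in zip(text, text[1:]):
    continuations[prev].append(nxt)

def generate(start, length=8):
    """Random walk over observed bigram continuations."""
    word, output = start, [start]
    for _ in range(length - 1):
        if not continuations[word]:
            break                          # no observed continuation: stop early
        word = random.choice(continuations[word])
        output.append(word)
    return " ".join(output)

print(generate("to"))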
Language Models and N-grams
• A limitation:
  – produces ungrammatical sequences
• Treebank:
  – potential to be a better language model
  – Structural information:
    • contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English …
Language Models and N-grams
• Aside: http://hemispheresmagazine.com/contests/2004/intro.htm
Part 3
tregex
• I assume everyone has:
  1. Installed Penn Treebank v3
  2. Downloaded and installed tregex
Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/
tregex
• Search Example: << dominates, < immediately dominates
tregex Help
tregex
• Help: tregex expression syntax is non-standard wrt bracketing
  – S < VP S < NP
tregex
• Help: tregex boolean syntax is also non-standard
tregex
• Help
tregex • Help
tregex
• Pattern:
  – (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
• Key:
  – <,   first child
  – $+   immediate left sister
  – <-   last child
  – =comma: both occurrences refer to the same node
tregex
• Help
tregex
tregex
• Different results from:
  – @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
tregex
Example: WHADVP also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides
• Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages):
  – tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides
• Parsing guide 1, prsguid1.pdf (318 pages):
prsguid2.pdf: addendum for the Switchboard corpus