Albert Gatt: Corpora and Statistical Methods – Lecture 7

TRANSCRIPT

  • Slide 1: Albert Gatt – Corpora and Statistical Methods – Lecture 7

  • Slide 2: Smoothing (aka discounting) techniques, Part 2

  • Slide 3: Overview
    - Smoothing methods: simple smoothing; Witten-Bell and Good-Turing estimation; held-out estimation and cross-validation
    - Combining several n-gram models: back-off models

  • Slide 4: Rationale behind smoothing
    - Sample frequencies: seen events have probability P; unseen events (including grammatical zeroes) have probability 0
    - Real population frequencies: seen events (including the events unseen in our sample)
    - Smoothing approximates the real distribution: lower probabilities for seen events (discounting), with the left-over probability mass distributed over the unseens (smoothing)

  • Slide 5: Laplace's law, Lidstone's law and the Jeffreys-Perks law

  • Slide 6: Instances in the training corpus: "inferior to ___" [bar chart of observed frequencies F(w)]

  • Slide 7: Maximum likelihood estimate [bar chart of F(w); unknowns are assigned 0% probability mass]

  • Slide 8: Actual probability distribution [bar chart of F(w); these are non-zero probabilities in the real distribution]

  • Slide 9: Laplace's law (add-one smoothing) [bar chart of F(w)]

  • Slide 10: Laplace's law (add-one smoothing) [bar chart of F(w)]

  • Slide 11: Laplace's law [bar chart of F(w)]
    - NB: this method ends up assigning most of the probability mass to the unseens

  • Slide 12: Generalisation: Lidstone's law
    - P = (C(x) + λ) / (N + λV)
    - P = probability of a specific n-gram
    - C(x) = count of n-gram x in the training data
    - N = total n-grams in the training data
    - V = number of bins (possible n-grams)
    - λ = a small positive number
    - M.L.E.: λ = 0; Laplace's law: λ = 1 (add-one smoothing); Jeffreys-Perks law: λ = 0.5

  • Slide 13: Jeffreys-Perks law [bar chart of F(w)]

  • Slide 14: Objections to Lidstone's law
    - Need an a priori way to determine λ
    - Predicts all unseen events to be equally likely
    - Gives probability estimates linear in the M.L.E. frequency
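The Lidstone family of estimators on Slide 12 is easy to express in code. The following is a minimal Python sketch, not part of the lecture: the function name lidstone_prob and the toy corpus are illustrative only.

from collections import Counter

def lidstone_prob(ngram, counts, N, V, lam=1.0):
    # Lidstone's law: P = (C(x) + lambda) / (N + lambda * V)
    # lam = 0 gives the MLE, lam = 1 Laplace (add-one), lam = 0.5 Jeffreys-Perks
    return (counts.get(ngram, 0) + lam) / (N + lam * V)

# Toy usage on a hypothetical six-word corpus
tokens = "the cat sat on the mat".split()
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)
N = len(bigrams)             # total n-gram tokens in the training data
V = len(set(tokens)) ** 2    # number of bins: all possible bigrams over the vocabulary
print(lidstone_prob(("the", "cat"), counts, N, V))   # seen bigram
print(lidstone_prob(("mat", "cat"), counts, N, V))   # unseen bigram, still non-zero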
  • Slide 15: Witten-Bell discounting

  • Slide 16: Main intuition
    - A zero-frequency event can be thought of as an event which hasn't happened (yet)
    - The probability of it happening can be estimated from the probability of something happening for the first time
    - The count of things which are seen only once can be used to estimate the count of things that are never seen

  • Slide 17: Witten-Bell method
    1. Let T = the no. of times we saw an event for the first time = the no. of different n-gram types actually attested (NB: unlike V, the no. of possible types assumed in add-one smoothing)
    2. Estimate the total probability mass of unseen n-grams as T / (N + T), where N is the no. of actual n-gram tokens and T the no. of actual types. Each token is an event and each new type is an event, so this ratio is the MLE of the probability of a new-type event occurring (being seen for the first time). This is the total probability mass to be distributed among all zero events (unseens).

  • Slide 18: Witten-Bell method (continued)
    3. Divide this probability mass among all the zero n-grams; it can be distributed equally, so each of the Z unseen n-grams gets T / (Z(N + T))
    4. Remove this probability mass from the non-zero n-grams (discounting): a seen n-gram with count c is re-estimated as c / (N + T)
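To make steps 1-4 concrete, here is a minimal Python sketch, not from the lecture, assuming the seen counts are held in a dict and the full set of bins is known; all names are illustrative.

def witten_bell(counts, bins):
    # counts: seen n-grams -> frequency; bins: all possible n-grams
    N = sum(counts.values())       # no. of n-gram tokens seen
    T = len(counts)                # no. of n-gram types seen (first-time events)
    unseen = [x for x in bins if x not in counts]
    Z = len(unseen)                # no. of zero-count n-grams
    probs = {}
    for x in unseen:               # step 3: share T / (N + T) equally among the unseens
        probs[x] = T / (Z * (N + T))
    for x, c in counts.items():    # step 4: discount the seen n-grams
        probs[x] = c / (N + T)
    return probs

# e.g. witten_bell({"a": 3, "b": 1}, bins={"a", "b", "c", "d"})
# gives a = 3/6, b = 1/6, and 1/6 each to the unseen "c" and "d"; the estimates sum to 1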
  • Slide 19: Witten-Bell vs. add-one
    - If we work with unigrams, Witten-Bell and add-one smoothing give very similar results
    - The difference shows up with n-grams for n > 1
    - Main idea: estimate the probability of an unseen bigram from the probability of seeing a bigram starting with w1 for the first time

  • Slide 20: Witten-Bell with bigrams
    - Generalised total probability mass estimate: the estimated total probability of unseen bigrams starting with some word wx is T(wx) / (N(wx) + T(wx)), where T(wx) is the no. of bigram types beginning with wx and N(wx) is the no. of bigram tokens beginning with wx

  • Slide 21: Witten-Bell with bigrams (continued)
    - Non-zero bigrams get discounted as before, but again conditioning on the history: a seen bigram (wx, wi) with count c(wx, wi) is re-estimated as c(wx, wi) / (N(wx) + T(wx))
    - Note: Witten-Bell won't assign the same probability mass to all unseen n-grams; the amount assigned depends on the first word in the bigram (the first n-1 words in the n-gram)
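A sketch of the bigram case, again illustrative rather than part of the lecture: counts are grouped by history, and T and N are computed per history, so different histories leave different amounts of mass for their unseen continuations.

from collections import defaultdict

def witten_bell_bigrams(bigram_counts, vocab):
    # bigram_counts: (w1, w2) -> frequency; vocab: set of word types
    by_history = defaultdict(dict)
    for (w1, w2), c in bigram_counts.items():
        by_history[w1][w2] = c

    probs = {}
    for w1, seen in by_history.items():
        N = sum(seen.values())     # bigram tokens beginning with w1
        T = len(seen)              # bigram types beginning with w1
        unseen = [w for w in vocab if w not in seen]
        Z = len(unseen)
        dist = {}
        for w2 in unseen:          # unseen continuations share T / (N + T) equally
            dist[w2] = T / (Z * (N + T))
        for w2, c in seen.items(): # seen continuations are discounted
            dist[w2] = c / (N + T)
        probs[w1] = dist           # the smoothed distribution P(w2 | w1)
    return probs

Because the mass T / (N + T) is computed separately for each history, an unseen bigram after a productive word receives a different estimate from one after a word with few attested continuations, which is exactly the point made on Slide 21.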