Tutorial 2 (MLE + language models)

Part of the Search Engine course given at the Technion (2011)


Page 1: Tutorial 2 (mle + language models)

Hypothesis testing, MLE, language models

Kira Radinsky

Based on some slides of Ilan Gronau,

Ydo Wexler, Dan Geiger & Nir Friedman

Page 2: Tutorial 2 (mle + language models)

Hypothesis Testing

• Find the best explanation for the observed data

• Helps predict the behavior of similar data sets

Page 3: Tutorial 2 (mle + language models)

An example: Binomial experiments

• Model: a coin with an unknown parameter θ = P(H) (and P(T) = 1-θ)

• Data set: a series of experiment results, e.g. D = H H T H T T T H H …

• Main assumption: each experiment is independent of the others

Page 4: Tutorial 2 (mle + language models)

Parameter Estimation Using Likelihood Functions

• The likelihood of a given value for θ: L_D(θ) = p(D | θ)

• Maximum Likelihood Estimation (MLE): we wish to find the value of θ which maximizes the likelihood

• For example, the likelihood of 'HTTHH' is: L_HTTHH(θ) = p(HTTHH | θ) = θ(1-θ)(1-θ)θθ = θ^3 (1-θ)^2

• We only need to know N(H) (the number of Heads) and N(T) (the number of Tails).

• These are sufficient statistics: L_D(θ) = θ^N(H) (1-θ)^N(T)
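
A minimal Python sketch (not from the original slides) illustrating the point: the likelihood depends on the data only through N(H) and N(T), so any reordering of D gives the same value.

```python
# Likelihood of a coin-flip dataset, computed from its sufficient statistics.
def likelihood(theta, n_heads, n_tails):
    """L_D(theta) = theta^N(H) * (1 - theta)^N(T)."""
    return theta ** n_heads * (1 - theta) ** n_tails

data = "HTTHH"
n_h, n_t = data.count("H"), data.count("T")   # sufficient statistics: N(H)=3, N(T)=2
print(likelihood(0.6, n_h, n_t))              # identical for 'HHHTT', 'THTHH', ...
```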

Page 5: Tutorial 2 (mle + language models)

Sufficient Statistics

• A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.

• s(D) is a sufficient statistic if for any two datasets D and D′:

s(D) = s(D′)  =>  L_D(θ) = L_D′(θ)

• The likelihood can therefore be computed from the sufficient statistic alone.

Page 6: Tutorial 2 (mle + language models)

Maximum Likelihood Estimation

• Goal: Maximize the likelihood (or log-likelihood)

• In our example:

– Likelihood:

• L_D(θ) = θ^N(H) (1-θ)^N(T)

– Log-likelihood:

• l_D(θ) = log(L_D(θ)) = N(H)·log(θ) + N(T)·log(1-θ)

– Maximization of the log-likelihood:

• l_D′(θ) = 0  =>  N(H)/θ - N(T)/(1-θ) = 0  =>  θ̂ = N(H) / (N(H) + N(T))
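
A short Python sketch of the same computation (an illustration added here, not part of the slides): the closed-form maximizer matches a brute-force search over the log-likelihood.

```python
import math

def log_likelihood(theta, n_h, n_t):
    """l_D(theta) = N(H)*log(theta) + N(T)*log(1 - theta)."""
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

n_h, n_t = 3, 2                          # counts for D = 'HTTHH'
theta_mle = n_h / (n_h + n_t)            # closed-form MLE: 0.6

# Sanity check: a grid search over (0, 1) should peak at (roughly) the same value.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n_h, n_t))
print(theta_mle, theta_grid)             # 0.6 0.6
```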

Page 7: Tutorial 2 (mle + language models)

MLE with multiple parameters

• What if we have several parameters θ1, θ2,…, θK that we wish to learn?

• Examples:

– die toss (K=6)

– Grades (K=100)

• Sufficient statistics [assumption: a series of independent experiments]:

– N1, N2, …, NK - the number of times each outcome was observed

• Likelihood: L_D(θ1,…,θK) = θ1^N1 · θ2^N2 · … · θK^NK

• MLE: θ̂k = Nk / (N1 + N2 + … + NK)
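
For concreteness, a small Python sketch (an added illustration using a made-up die-toss sample) of the count-based MLE above:

```python
from collections import Counter

def mle_multinomial(observations):
    """MLE for a K-outcome experiment: theta_k = N_k / (N_1 + ... + N_K)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

rolls = [1, 3, 6, 6, 2, 6, 4, 3, 1, 6]    # hypothetical die-toss data (K = 6)
print(mle_multinomial(rolls))             # {1: 0.2, 3: 0.2, 6: 0.4, 2: 0.1, 4: 0.1}
```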

Page 8: Tutorial 2 (mle + language models)

From MLE to Bayesian Inference

• Likelihood Goal: maximize p(D| θ)

• Our Goal: maximize p(θ|D)

• Following Bayes' rule:

p(θ | D) = p(D | θ) · p(θ) / p(D)

posterior probability = (likelihood × prior probability) / evidence

• Intuitively, the prior probability captures our prior knowledge (prejudice) about the model parameters.
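
The slides do not fix a particular prior; as a hedged illustration, assuming a Beta(α, β) prior on θ for the coin example (a common conjugate choice), the posterior mode pulls the MLE toward the prior's preferred value:

```python
def map_estimate(n_h, n_t, alpha=2.0, beta=2.0):
    """Posterior mode of theta under an assumed Beta(alpha, beta) prior.

    p(theta | D) is proportional to p(D | theta) * p(theta); for a Beta prior
    the mode is (N(H) + alpha - 1) / (N(H) + N(T) + alpha + beta - 2).
    """
    return (n_h + alpha - 1) / (n_h + n_t + alpha + beta - 2)

print(3 / 5)               # MLE for D = 'HTTHH': 0.6
print(map_estimate(3, 2))  # MAP with Beta(2, 2): 4/7 = 0.571..., pulled toward 0.5
```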

Page 9: Tutorial 2 (mle + language models)

MLE in Natural Language Processing (NLP)

• Goal: Evaluate the probability of the next word based on the words prior to it:

P(wi| w1,…,wi-1)

• Importance: speech recognition, handwritten word recognition, part-of-speech tagging, language identification, spam detection, etc.

• Markov Assumption: the probability of a word wi in a sequence of words depends only on the n-1 words preceding it in the sequence, where n is a constant.

Page 10: Tutorial 2 (mle + language models)

N-Gram Model

• P(wi | w1,…,wi-1) = P(wi | wi-n+1,…,wi-1)

• Types of n-grams:

– Uni-gram

• P(wi| w1,…,wi-1) = P(wi)

– Bi-gram

• P(wi| w1,…,wi-1) = P(wi| wi-1)

– Tri-gram

• P(wi| w1,…,wi-1) = P(wi| wi-2 , wi-1)

Page 11: Tutorial 2 (mle + language models)

MLE in NLP

• Problem: how do we evaluate P(wi), P(wi | wi-1), P(wi | wi-2, wi-1)?

• Proposal: MLE
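
The standard MLE answer is to use relative counts from a corpus, e.g. P(wi) = count(wi)/N and P(wi | wi-1) = count(wi-1 wi)/count(wi-1). A minimal Python sketch on a made-up toy corpus:

```python
from collections import Counter

corpus = ["john drank milk", "john drank water", "mary drank milk"]  # toy sentences
tokens = [w for sent in corpus for w in sent.split()]
bigrams = [b for sent in corpus for b in zip(sent.split(), sent.split()[1:])]

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams)

def p_unigram(w):
    """MLE: P(w) = count(w) / N."""
    return unigram_counts[w] / len(tokens)

def p_bigram(w, prev):
    """MLE: P(w | prev) = count(prev w) / count(prev)."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(p_unigram("milk"))           # 2/9
print(p_bigram("milk", "drank"))   # 2/3
```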

Page 12: Tutorial 2 (mle + language models)

Problems with MLE

• Many sequences of length n never appear in the dataset (but do appear in the real world).

• Example:
– Task: speech recognition. We heard a word in a sentence and wish to decide between two words: "Milk" and "Silk"
– P(Milk | John drank) >? P(Silk | John drank)
– The word "John" never appeared in the dataset, therefore we cannot decide

• Church and Gale (1991):
– Dataset: 44 million words from newspapers
– Vocabulary: 400,653 different words
– Therefore, about 1.6 × 10^11 possible bigrams
– Very few of them appeared in the dataset…

• Solutions: most are based on some sort of smoothing:
– Laplace
– Good-Turing
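
As an added sketch of the simplest of these, add-one (Laplace) smoothing for bigrams gives every unseen bigram a small non-zero probability; the vocabulary size V here is taken from a made-up toy corpus:

```python
from collections import Counter

corpus = ["mary drank milk", "mary drank water"]        # toy data
tokens = [w for s in corpus for w in s.split()]
unigram_counts = Counter(tokens)
bigram_counts = Counter(b for s in corpus for b in zip(s.split(), s.split()[1:]))
V = len(set(tokens))                                    # vocabulary size

def p_bigram_laplace(w, prev):
    """Add-one smoothing: (count(prev w) + 1) / (count(prev) + V)."""
    return (bigram_counts[(prev, w)] + 1) / (unigram_counts[prev] + V)

# "john" never appears in the corpus, yet the estimate is defined and non-zero:
print(p_bigram_laplace("milk", "john"))                 # 1/4
print(p_bigram_laplace("milk", "drank"))                # (1+1)/(2+4) = 1/3
```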

Page 13: Tutorial 2 (mle + language models)

Evaluation

• The null hypothesis, denoted by H0

• The alternative hypothesis, denoted by H1.

• Should we reject the null hypothesis in favor of the alternative?

• Input:

– a value from a certain distribution

– we don't know what the parameter of that distribution is

• Test:

– How likely is it that the value we were given could have come from the distribution with the predicted parameter?

– If it's not very likely, we reject the null hypothesis in favor of the alternative.

• Critical Region

– But what exactly is "not very likely"?

– We choose a region known as the critical region. If the result of our test lies in this region, then we reject the null hypothesis in favor of the alternative.

Page 14: Tutorial 2 (mle + language models)

Empirical Evaluation Methods

• Divide the data into train and test sets

– Leave one out

• Cross validation

– 10 fold cross validation

– 5x2 cross validation

• Never (never never!) perform evaluation on the training data

Never!
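
A minimal sketch of a k-fold split in plain Python (an added illustration): each held-out fold is never used for training, so evaluation never touches the training data.

```python
import random

def k_fold_splits(data, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]            # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

data = list(range(25))
for train, test in k_fold_splits(data, k=5):
    assert not set(train) & set(test)                    # test data stays held out
```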