Statistical Learning (From Data to Distributions)

• HW5 deadline extended to Friday

• Learning a probability distribution from data

• Maximum likelihood estimation (MLE)

• Maximum a posteriori (MAP) estimation

• Expectation Maximization (EM)

• Agent has made observations (data)

• Now must make sense of it (hypotheses)
  – Hypotheses alone may be important (e.g., in basic science)
  – For inference (e.g., forecasting)
  – To take sensible actions (decision making)

• A basic component of economics, social and hard sciences, engineering, …

Candy Example

• Candy comes in 2 flavors, cherry and lime, with identical wrappers

• Manufacturer makes 5 (indistinguishable) bags

• Suppose we draw a sequence of candies from one bag

• What bag are we holding? What flavor will we draw next?

h1: C 100%, L 0%
h2: C 75%, L 25%
h3: C 50%, L 50%
h4: C 25%, L 75%
h5: C 0%, L 100%

Machine Learning vs. Statistics

• Machine Learning ≈ automated statistics

• This lecture
  – Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
  – Learning Bayes Nets

Bayesian Learning

• Main idea: Consider the probability of each hypothesis, given the data

• Data d: the candies drawn so far

• Hypotheses h1,…,h5: compute P(h_i|d) for each


Using Bayes’ Rule

• P(h_i|d) = α P(d|h_i) P(h_i) is the posterior
  – (Recall, 1/α = Σ_i P(d|h_i) P(h_i))

• P(d|h_i) is the likelihood

• P(h_i) is the hypothesis prior


Computing the Posterior

• Assume draws are independent

• Let P(h1),…,P(h5) = (0.1,0.2,0.4,0.2,0.1)

• d = {10 × lime}

P(d|h1) = 0
P(d|h2) = 0.25^10
P(d|h3) = 0.5^10
P(d|h4) = 0.75^10
P(d|h5) = 1^10

P(h1|d) = 0
P(h2|d) = 0.00
P(h3|d) = 0.00
P(h4|d) = 0.10
P(h5|d) = 0.90

Sum = 1/α ≈ 0.1117
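The posterior computation can be sketched in a few lines of Python. This is a minimal illustration, assuming the priors and per-bag lime fractions of the candy example and a data set of 10 limes; the function name `posterior` is hypothetical.

```python
# Posterior over the five candy-bag hypotheses after observing only limes.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1..h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def posterior(priors, p_lime, n_limes):
    """Return (P(h_i | d), 1/alpha) for d = n_limes independent lime draws."""
    # Independence: P(d | h_i) is P(lime | h_i) raised to the number of draws.
    unnormalized = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    evidence = sum(unnormalized)        # this is 1/alpha
    return [u / evidence for u in unnormalized], evidence

post, evidence = posterior(priors, p_lime, 10)
print([round(p, 2) for p in post], round(evidence, 4))
```

Dividing by the sum is what makes the five posterior values add up to 1; the likelihoods alone do not.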

Posterior Hypotheses


Predicting the Next Draw

• P(X|d) = Σ_i P(X|h_i,d) P(h_i|d)
         = Σ_i P(X|h_i) P(h_i|d)

X: the next candy drawn is a lime

h_i   P(h_i|d)   P(X|h_i)
h1    0          0
h2    0.00       0.25
h3    0.00       0.5
h4    0.10       0.75
h5    0.90       1

P(X|d) ≈ 0.975 with these rounded posteriors (0.75 × 0.10 + 1 × 0.90); about 0.973 with unrounded posteriors

P(Next Candy is Lime | d)
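A curve of this kind can be recomputed with a short sketch. It assumes the candy-example priors and that every draw so far was a lime; `predict_lime` is a hypothetical helper name.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1..h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def predict_lime(n_limes):
    """P(next candy is lime | d), where d is n_limes consecutive lime draws."""
    # Unnormalized posterior weight of each hypothesis.
    weights = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]
    total = sum(weights)
    # Bayesian prediction: average each hypothesis's prediction by its posterior.
    return sum(p * w for p, w in zip(p_lime, weights)) / total

curve = [predict_lime(n) for n in range(11)]
for n, value in enumerate(curve):
    print(n, round(value, 3))
```

With no data the prediction is the prior average, and it rises toward 1 as limes accumulate.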


Other properties of Bayesian Estimation

• Any learning technique trades off between good fit and hypothesis complexity

• Prior can penalize complex hypotheses
  – Many more complex hypotheses than simple ones
  – Ockham’s razor

Hypothesis Spaces often Intractable

• A hypothesis is a joint probability table over state variables
  – 2^n entries => hypothesis space is [0,1]^(2^n)
  – 2^(2^n) deterministic hypotheses
  – 6 boolean variables => 2^64, over 10^19 hypotheses

• Summing over hypotheses is expensive!

Some Common Simplifications

• Maximum a posteriori estimation (MAP)
  – h_MAP = argmax_{h_i} P(h_i|d)
  – P(X|d) ≈ P(X|h_MAP)

• Maximum likelihood estimation (ML)
  – h_ML = argmax_{h_i} P(d|h_i)
  – P(X|d) ≈ P(X|h_ML)

• Both approximate the true Bayesian predictions as the amount of data grows large

Maximum a Posteriori

• h_MAP = argmax_{h_i} P(h_i|d)

• P(X|d) ≈ P(X|h_MAP)

• In the candy example, h_MAP moves from h3 to h4 to h5 as more limes are observed



Maximum a Posteriori

• For large amounts of data, P(incorrect hypothesis|d) => 0

• For small sample sizes, MAP predictions P(X|h_MAP) are “overconfident”


Maximum Likelihood

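• h_ML = argmax_{h_i} P(d|h_i)

• P(X|d) ≈ P(X|h_ML)

• In the candy example, h_ML is undefined before any candy is drawn (all likelihoods tie), then h5 once a lime is observed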



Maximum Likelihood

• h_ML = h_MAP with uniform prior

• Relevance of prior diminishes with more data

• Preferred by some statisticians
  – Are priors “cheating”?
  – What is a prior anyway?
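The MAP and ML choices for the candy example can be traced with a small sketch. It assumes the candy-example priors and that every observed candy is a lime; `unique_argmax`, `h_map`, and `h_ml` are hypothetical helper names, and a tie is reported as `None`.

```python
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1..h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i)

def unique_argmax(scores):
    """Return the 1-based index of the strict maximum, or None on a tie."""
    best = max(scores)
    winners = [i + 1 for i, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None

def h_map(n_limes, prior=priors):
    # MAP maximizes likelihood times prior.
    return unique_argmax([p ** n_limes * pr for p, pr in zip(p_lime, prior)])

def h_ml(n_limes):
    # ML maximizes the likelihood alone.
    return unique_argmax([p ** n_limes for p in p_lime])

for n in range(5):
    print(n, h_map(n), h_ml(n))
```

Passing a uniform prior to `h_map` gives the same answers as `h_ml`, which is the sense in which ML is MAP with a uniform prior.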

Advantages of MAP and MLE over Bayesian estimation

• Involves an optimization rather than a large summation
  – Local search techniques

• For some types of distributions, there are closed-form solutions that are easily computed

Learning Coin Flips (Bernoulli distribution)

• Let the unknown fraction of cherries be θ

• Suppose draws are independent and identically distributed (i.i.d.)

• Observe that c out of N draws are cherries

Maximum Likelihood

• Likelihood of data d = {d_1,…,d_N} given θ
  – P(d|θ) = Π_j P(d_j|θ) = θ^c (1-θ)^(N-c)
  – First equality: i.i.d. assumption
  – Second equality: gather the c cherries together, then the N-c limes


Maximum Likelihood

• Same as maximizing log likelihood

• L(d|θ) = log P(d|θ) = c log θ + (N-c) log(1-θ)

• max_θ L(d|θ)
  => dL/dθ = 0
  => 0 = c/θ - (N-c)/(1-θ)
  => θ = c/N
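The closed-form result θ = c/N can be checked numerically. The sketch below maximizes the log likelihood over a grid of θ values; the counts (7 cherries in 10 draws) are hypothetical and chosen only for illustration.

```python
import math

def log_likelihood(theta, c, n):
    """L(d|theta) = c log(theta) + (n - c) log(1 - theta)."""
    return c * math.log(theta) + (n - c) * math.log(1.0 - theta)

c, n = 7, 10                                  # hypothetical data
grid = [i / 1000 for i in range(1, 1000)]     # theta in (0, 1), endpoints excluded
theta_grid = max(grid, key=lambda t: log_likelihood(t, c, n))
theta_closed_form = c / n
print(theta_grid, theta_closed_form)
```

The endpoints are excluded because log(0) is undefined; the maximizer lies strictly inside (0, 1) whenever 0 < c < N.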

Maximum Likelihood for BN

• For any BN, the ML parameters of any CPT can be estimated as the fraction of matching observations in the data

Network: Earthquake (E) and Burglar (B) are parents of A

Observed counts: E: 500, B: 200

P(E) = 0.5, P(B) = 0.2

E   B   A count   P(A|E,B)
T   T   19/20     0.95
F   T   188/200   0.94
T   F   170/500   0.34
F   F   1/380     0.003
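Estimating a CPT by counting can be sketched as follows. The sample list is a hypothetical toy data set that reuses two of the count ratios above (19/20 and 1/380); `fit_cpt` is an illustrative name, not part of any library.

```python
from collections import Counter

def fit_cpt(samples):
    """ML estimate of P(A=True | E, B) from (e, b, a) boolean triples."""
    totals, alarms = Counter(), Counter()
    for e, b, a in samples:
        totals[(e, b)] += 1          # how often this parent assignment occurs
        alarms[(e, b)] += int(a)     # how often A is true alongside it
    # The ML parameter is the observed fraction for each parent assignment.
    return {parents: alarms[parents] / totals[parents] for parents in totals}

# Hypothetical toy data: 20 samples with E and B both true, 380 with both false.
samples = (
    [(True, True, True)] * 19 + [(True, True, False)] * 1
    + [(False, False, True)] * 1 + [(False, False, False)] * 379
)
cpt = fit_cpt(samples)
print(cpt)
```

Parent assignments that never occur in the data get no entry at all, which is one reason a prior is often useful in practice.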

Maximum Likelihood for Gaussian Models

• Observe a continuous variable x1,…,xN

• Fit a Gaussian with mean μ, std σ
  – Standard procedure: write log likelihood
    L = N(C - log σ) - Σ_j (x_j-μ)^2/(2σ^2)
  – Set derivatives to zero


Maximum Likelihood for Gaussian Models

• Observe a continuous variable x1,…,xN

• Results:
  – μ = (1/N) Σ_j x_j (sample mean)
  – σ^2 = (1/N) Σ_j (x_j-μ)^2 (sample variance)
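A minimal sketch of the Gaussian ML estimates, with a check that perturbing them lowers the log likelihood. The data values are hypothetical, and the constant C is taken to be -0.5 log(2π).

```python
import math

def fit_gaussian(xs):
    """ML estimates: sample mean and standard deviation with a 1/N factor."""
    n = len(xs)
    mu = sum(xs) / n
    variance = sum((x - mu) ** 2 for x in xs) / n   # 1/N, not 1/(N-1)
    return mu, math.sqrt(variance)

def log_likelihood(xs, mu, sigma):
    """Sum over points of C - log(sigma) - (x - mu)^2 / (2 sigma^2)."""
    c = -0.5 * math.log(2.0 * math.pi)
    return sum(c - math.log(sigma) - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in xs)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]       # hypothetical observations
mu, sigma = fit_gaussian(xs)
print(mu, sigma, log_likelihood(xs, mu, sigma))
```

Note that the ML variance divides by N, so it differs from the unbiased estimator that divides by N-1.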

Maximum Likelihood for Conditional Linear Gaussians

• Y is a child of X

• Data (x_j, y_j)

• X is Gaussian, Y is a linear Gaussian function of X
  – Y(x) ~ N(ax+b, σ)

• ML estimate of a, b is given by least squares regression, σ by the standard error of the residuals
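The least squares fit for a linear Gaussian child can be sketched in plain Python. The data points are hypothetical, `fit_linear_gaussian` is an illustrative name, and σ is estimated here as the root mean squared residual (the 1/N ML form).

```python
import math

def fit_linear_gaussian(xs, ys):
    """Return (a, b, sigma) for the model Y ~ N(a*x + b, sigma)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx                    # least squares slope
    b = mean_y - a * mean_x          # least squares intercept
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)
    return a, b, sigma

xs = [0.0, 1.0, 2.0, 3.0]            # hypothetical data
ys = [1.1, 2.9, 5.1, 6.9]
a, b, sigma = fit_linear_gaussian(xs, ys)
print(a, b, sigma)
```

Least squares appears because, for fixed σ, maximizing the Gaussian log likelihood is the same as minimizing the sum of squared residuals.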



Back to Coin Flips

• What about Bayesian or MAP learning?

• Motivation
  – I pick a coin out of my pocket
  – 1 flip turns up heads
  – What's the MLE?

Back to Coin Flips

• Need some prior distribution P(θ)
  – Define, for all θ, the probability that I believe in θ

• P(θ|d) = α P(d|θ) P(θ) = α θ^c (1-θ)^(N-c) P(θ)




MAP estimate

• Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization

• Turns out for some families of P(θ), the MAP estimate is easy to compute
  – Beta distributions (conjugate prior)

Beta Distribution

• Beta_{a,b}(θ) = α θ^(a-1) (1-θ)^(b-1)
  – a, b hyperparameters
  – α is a normalization constant
  – Mean at a/(a+b)


Posterior with Beta Prior

• Posterior ∝ θ^c (1-θ)^(N-c) P(θ) ∝ θ^(c+a-1) (1-θ)^(N-c+b-1)

• MAP estimate (posterior mode): θ = (c+a-1)/(N+a+b-2); the posterior mean is (c+a)/(N+a+b)

• Posterior is also a Beta distribution!
  – See heads, increment a
  – See tails, increment b
  – Prior specifies a “virtual count” of a heads, b tails
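The conjugate update can be sketched for the one-flip coin scenario. The prior Beta(2, 2) is a hypothetical choice, and the helper names are illustrative. The mode of a Beta(a, b) density is (a-1)/(a+b-2) when a, b > 1, which differs from its mean a/(a+b), so the sketch reports both.

```python
def beta_posterior(a, b, c, n):
    """Posterior hyperparameters after c heads in n flips with a Beta(a, b) prior."""
    return a + c, b + (n - c)

def beta_mode(a, b):
    """Mode of Beta(a, b); valid for a > 1 and b > 1."""
    return (a - 1) / (a + b - 2)

def beta_mean(a, b):
    return a / (a + b)

a, b = 2, 2          # hypothetical prior
c, n = 1, 1          # one flip, one head
post_a, post_b = beta_posterior(a, b, c, n)
theta_map = beta_mode(post_a, post_b)     # maximizes the posterior density
theta_mean = beta_mean(post_a, post_b)
theta_mle = c / n                         # ignores the prior entirely
print(post_a, post_b, theta_map, theta_mean, theta_mle)
```

With a single head the MLE claims the coin always lands heads, while the prior pulls both the MAP estimate and the posterior mean back toward 1/2.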

Does this work in general?

• Only specific distributions have the right type of prior
  – Bernoulli, Poisson, geometric, Gaussian, exponential, …

• Otherwise, MAP needs an (often expensive) numerical optimization

How to deal with missing observations?

• Very difficult statistical problem in general

• E.g., surveys
  – Did the person not fill out political affiliation randomly?
  – Or do independents do this more often than someone with a strong affiliation?

• Better if a variable is completely hidden

Expectation Maximization for Gaussian Mixture models

Each data point has a label saying which Gaussian it belongs to, but the label is a hidden variable

Clustering: N Gaussian distributions

E step: compute the probability that each datapoint belongs to each Gaussian

M step: compute ML estimates of each Gaussian, weighted by the probability that each sample belongs to it
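The E and M steps can be sketched for a one-dimensional mixture. This is a minimal illustration, not a robust implementation: the data are synthetic (two well-separated clusters from a seeded generator), the starting parameters are hand-picked rather than random, and `em_gmm` is a hypothetical name.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm(xs, mus, sigmas, weights, n_iters=50):
    """EM for a 1-D Gaussian mixture; returns (mus, sigmas, weights)."""
    mus, sigmas, weights = list(mus), list(sigmas), list(weights)
    k = len(mus)
    for _ in range(n_iters):
        # E step: resp[j][i] = P(point j belongs to Gaussian i | current parameters).
        resp = []
        for x in xs:
            joint = [weights[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: ML estimates of each Gaussian, weighted by the responsibilities.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / n_i
            sigmas[i] = max(math.sqrt(var), 1e-6)   # guard against collapse
            weights[i] = n_i / len(xs)
    return mus, sigmas, weights

rng = random.Random(0)                              # seeded synthetic data
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
mus, sigmas, weights = em_gmm(xs, mus=[1.0, 4.0], sigmas=[1.0, 1.0], weights=[0.5, 0.5])
print(mus, sigmas, weights)
```

Each point contributes fractionally to every Gaussian, which is what distinguishes this from a hard assignment of points to clusters.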

Learning HMMs

Want to find transition and observation probabilities

Data: many sequences {O_{1:t}^(j) for 1 ≤ j ≤ N}

Problem: we don’t observe the X’s!

[Figure: HMM with hidden states X0, X1, X2, X3 and observations O1, O2, O3]

Learning HMMs

• Assume stationary Markov chain, discrete states x1,…,xm

• Transition parameters θ_ij = P(X_{t+1}=x_j|X_t=x_i)

• Observation parameters φ_i = P(O|X_t=x_i)

• Initial states λ_i = P(X_0=x_i)

[Figure: state-transition diagram with edges labeled by transition parameters such as θ_13 and θ_31]


Expectation Maximization

• Initialize parameters randomly

• E-step: infer expected probabilities of hidden variables over time, given current parameters

• M-step: maximize likelihood of data over parameters

Parameters θ: P(initial state), P(transition i→j), P(emission)

Expectation Maximization

Initialize θ^(0)

E: Compute E[P(Z=z|θ^(0),O)] (this is the hard part…)
  – Z: all combinations of hidden sequences, e.g. x1 x2 x3 x2 x2 x1 or x1 x2 x2 x1 x3 x2
  – Result: probability distribution over hidden state at time t

M: compute θ^(1) = ML estimate of transition / obs. distributions

E-Step on HMMs

• Computing expectations can be done by:
  – Sampling
  – Using the forward/backward algorithm on the unrolled HMM (R&N p. 546)

• The latter gives the classic Baum-Welch algorithm

• Note that EM can still get stuck in local optima or even saddle points
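The forward/backward computation used in the E-step can be sketched for a discrete HMM. This is a minimal illustration under stated assumptions: hidden states X_0..X_T with observations O_1..O_T, parameters passed as plain lists, and a two-state example with umbrella-world style numbers chosen for illustration. It computes only the per-time-step state posteriors, not the pairwise transition expectations that a full Baum-Welch update also needs.

```python
def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def forward_backward(init, trans, emit, observations):
    """Return gamma[t][i] = P(X_t = x_i | O_1..O_T) for t = 0..T.

    init[i]     = P(X_0 = x_i)
    trans[i][j] = P(X_{t+1} = x_j | X_t = x_i)
    emit[i][o]  = P(O = o | X = x_i)
    """
    m = len(init)
    # Forward pass: f[t][i] proportional to P(X_t = x_i | O_1..O_t).
    f = [list(init)]
    for o in observations:
        prev = f[-1]
        f.append(normalize([
            emit[j][o] * sum(prev[i] * trans[i][j] for i in range(m))
            for j in range(m)
        ]))
    # Backward pass: b[t][i] proportional to P(O_{t+1}..O_T | X_t = x_i).
    b = [[1.0] * m]
    for o in reversed(observations):
        nxt = b[0]
        b.insert(0, normalize([
            sum(trans[i][j] * emit[j][o] * nxt[j] for j in range(m))
            for i in range(m)
        ]))
    # Combine and renormalize at each time step.
    return [normalize([fi * bi for fi, bi in zip(ft, bt)]) for ft, bt in zip(f, b)]

# Assumed two-state example: state 0 = rain, state 1 = no rain; observation 1 = umbrella.
init = [0.5, 0.5]
trans = [[0.7, 0.3], [0.3, 0.7]]
emit = [[0.1, 0.9], [0.8, 0.2]]
gamma = forward_backward(init, trans, emit, [1, 1])
print(gamma)
```

Rescaling the forward and backward messages at every step avoids underflow on long sequences and does not change the result, because each `gamma[t]` is renormalized at the end.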

Next Time

• Machine learning