CS B553: Algorithms for Optimization and Learning

Parameter Learning with Hidden Variables & Expectation Maximization


Page 1: CS b553: Algorithms for Optimization and  Learning

CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning with Hidden Variables & Expectation Maximization

Page 2: CS b553: Algorithms for Optimization and  Learning

AGENDA

Learning probability distributions from data when the structure is known but some data are missing

Expectation-maximization (EM) algorithm

Page 3: CS b553: Algorithms for Optimization and  Learning

BASIC PROBLEM

Given a dataset D = {x[1], …, x[M]} and a Bayesian model over observed variables X and hidden (latent) variables Z

Fit the distribution P(X, Z) to the data

Interpretation: each example x[m] is an incomplete view of the "underlying" sample (x[m], z[m])

[Figure: two-node network with Z as the parent of X]

Page 4: CS b553: Algorithms for Optimization and  Learning

APPLICATIONS

• Clustering in data mining
• Dimensionality reduction
• Latent psychological traits (e.g., intelligence, personality)
• Document classification
• Human activity recognition

Page 5: CS b553: Algorithms for Optimization and  Learning

HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS

Hidden variables => conditional independences

[Figure: left, a network with Z as the parent of X1, X2, X3, X4; right, the same four observables fully connected]

Without Z, the observables become fully dependent

Page 6: CS b553: Algorithms for Optimization and  Learning

HIDDEN VARIABLES CAN YIELD MORE PARSIMONIOUS MODELS

Hidden variables => conditional independences

[Figure: same two networks as on the previous slide]

Without Z, the observables become fully dependent

With Z: 1 + 4*2 = 9 parameters
Without Z: 1 + 2 + 4 + 8 = 15 parameters
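A quick check of the two parameter counts, assuming a binary Z and four binary observables as the figure suggests (a minimal sketch, not course code):

```python
# Checking the counts quoted above: binary Z, four binary observables X1..X4.
n_obs = 4
with_hidden = 1 + n_obs * 2        # P(Z): 1 free parameter, each P(Xi | Z): 2  -> 9
without_hidden = 2 ** n_obs - 1    # full joint over 4 binary variables          -> 15
print(with_hidden, without_hidden)
```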

Page 7: CS b553: Algorithms for Optimization and  Learning

GENERATING MODEL

[Figure: unrolled network — θz is the parent of every z[m], and θx|z together with z[m] are the parents of x[m], for m = 1, …, M]

The CPTs P(z[m] | θz) are identical for all m and given by the shared parameter θz; likewise, the CPTs P(x[m] | z[m], θx|z) are identical for all m and given by θx|z.

Page 8: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: DISCRETE VARIABLES

[Figure: same unrolled network with parameter nodes θz and θx|z]

The hidden variables follow a categorical distribution with parameters θz:
P(Z[m] | θz) = Categorical(θz)

Given z[m], the observation follows the categorical distribution selected by z[m]:
P(X[m] | z[m], θx|z) = Categorical(θx|z[m])

(In other words, z[m] multiplexes between categorical distributions.)
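A minimal sketch of sampling from this generating model; the function and variable names are illustrative, and the concrete parameter values in the usage line are made up:

```python
import numpy as np

def sample_dataset(theta_z, theta_x_given_z, M, rng=None):
    """Draw M i.i.d. pairs (z[m], x[m]): z ~ Categorical(theta_z), then
    x ~ Categorical(theta_x_given_z[z]). Only the x's are visible to the
    learner; the z's are hidden."""
    rng = np.random.default_rng() if rng is None else rng
    zs = rng.choice(len(theta_z), size=M, p=theta_z)
    xs = np.array([rng.choice(len(theta_x_given_z[z]), p=theta_x_given_z[z]) for z in zs])
    return zs, xs

# Example with 2 hidden types and 3 observable symbols (values made up)
zs, xs = sample_dataset([0.75, 0.25], [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]], M=1000)
```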

Page 9: CS b553: Algorithms for Optimization and  Learning

MAXIMUM LIKELIHOOD ESTIMATION

Approach: find values of θ = (θz, θx|z) and DZ = (z[1], …, z[M]) that maximize the likelihood of the data:

L(θ, DZ; D) = P(D | θ, DZ)

Find arg max L(θ, DZ; D) over θ, DZ

Page 10: CS b553: Algorithms for Optimization and  Learning

MARGINAL LIKELIHOOD ESTIMATION

Approach: find values of θ = (θz, θx|z) that maximize the likelihood of the data without assuming values of DZ = (z[1], …, z[M]):

L(θ; D) = Σ_{DZ} P(D, DZ | θ)

Find arg max L(θ; D) over θ

(A partially Bayesian approach)

Page 11: CS b553: Algorithms for Optimization and  Learning

COMPUTATIONAL CHALLENGES

P(D | θ, DZ) and P(D, DZ | θ) are easy to evaluate, but…

Maximum likelihood, arg max L(θ, DZ; D): optimizing over M assignments to Z (|Val(Z)|^M possible joint assignments) as well as the continuous parameters

Maximum marginal likelihood, arg max L(θ; D): optimizing locally over the continuous parameters, but the objective requires summing over the M assignments to Z

Page 12: CS b553: Algorithms for Optimization and  Learning

EXPECTATION MAXIMIZATION FOR ML

Idea: use a coordinate ascent approach:

arg max_{θ, DZ} L(θ, DZ; D) = arg max_θ max_{DZ} L(θ, DZ; D)

Step 1: Finding DZ* = arg max_{DZ} L(θ, DZ; D) is easy given a fixed θ (fully observed, ML parameter estimation)

Step 2: Set Q(θ) = L(θ, DZ*; D). Finding θ* = arg max_θ Q(θ) is easy given that DZ is fixed (fully observed, ML parameter estimation)

Repeat steps 1 and 2 until convergence (a code sketch of this loop follows below)
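A minimal sketch of this coordinate-ascent ("hard EM") loop for the binary-observation mixture used in the examples that follow; the function name, initialization, and clipping constants are my own choices, not from the slides:

```python
import numpy as np

def hard_em(X, K=2, n_iters=50, rng=None):
    """Hard (coordinate-ascent) EM sketch for a mixture of independent
    Bernoullis: X is an (M, n) array of 0/1 observations, K the number of
    hidden types. Illustrative only -- not the exact code behind the slides."""
    rng = np.random.default_rng() if rng is None else rng
    M, n = X.shape
    theta_z = np.full(K, 1.0 / K)                 # P(Z = k)
    theta_x = rng.uniform(0.2, 0.8, (K, n))       # P(Xi = 1 | Z = k)

    for _ in range(n_iters):
        # Step 1: best assignment z[m] for every example, parameters fixed
        log_p = (np.log(theta_z)
                 + X @ np.log(theta_x).T
                 + (1 - X) @ np.log(1 - theta_x).T)
        z = np.argmax(log_p, axis=1)
        # Step 2: fully observed ML estimation, assignments fixed
        theta_z = np.clip(np.bincount(z, minlength=K) / M, 1e-6, 1.0)
        for k in range(K):
            if np.any(z == k):
                theta_x[k] = np.clip(X[z == k].mean(axis=0), 1e-3, 1 - 1e-3)
    return theta_z, theta_x, z
```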

Page 13: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: left, the unrolled network — for each example m, hidden z[m] is the parent of observed x1[m] and x2[m], with shared parameter nodes θz, θx1|z, θx2|z; right, the equivalent plate notation with a plate of size M over z, x1, x2]

Page 14: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: plate notation — hidden Z with observed children X1, X2; parameters θz, θx1|z, θx2|z; plate of size M]

Suppose 2 types:
1. X1 != X2, chosen uniformly at random
2. (X1, X2) = (1, 1) with 90% chance, (0, 0) otherwise
Type 1 is drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

(A sampling sketch of this generating process follows below.)
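For reference, a small sketch that samples data from the two-type process described above; the function name and use of numpy are assumptions, but the observed counts on the slide are consistent with M = 1000 draws from this process:

```python
from collections import Counter
import numpy as np

def sample_two_type_data(M=1000, rng=None):
    """Sample (x1, x2) pairs from the two-type process described above:
    type 1 (prob 0.75) gives X1 != X2 uniformly at random; type 2 gives
    (1, 1) with prob 0.9 and (0, 0) otherwise. A reconstruction, not the
    original dataset."""
    rng = np.random.default_rng() if rng is None else rng
    data = []
    for _ in range(M):
        if rng.random() < 0.75:                                  # type 1
            x1 = int(rng.integers(0, 2))
            data.append((x1, 1 - x1))
        else:                                                    # type 2
            data.append((1, 1) if rng.random() < 0.9 else (0, 0))
    return data

print(Counter(sample_two_type_data()))
# expected counts are roughly (1,0): 375, (0,1): 375, (1,1): 225, (0,0): 25
```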

Page 15: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: same plate-notation model as above]

Suppose 2 types:
1. X1 != X2, chosen uniformly at random
2. (X1, X2) = (1, 1) with 90% chance, (0, 0) otherwise
Type 1 is drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

Parameter Estimates (initial guess)
θZ = 0.5
θX1|Z=1 = 0.4, θX1|Z=2 = 0.3
θX2|Z=1 = 0.7, θX2|Z=2 = 0.6

Page 16: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: same plate-notation model as above]

Suppose 2 types:
1. X1 != X2, chosen uniformly at random
2. (X1, X2) = (1, 1) with 90% chance, (0, 0) otherwise
Type 1 is drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

Parameter Estimates
θZ = 0.5
θX1|Z=1 = 0.4, θX1|Z=2 = 0.3
θX2|Z=1 = 0.7, θX2|Z=2 = 0.6

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Page 17: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: same plate-notation model as above]

Suppose 2 types:
1. X1 != X2, chosen uniformly at random
2. (X1, X2) = (1, 1) with 90% chance, (0, 0) otherwise
Type 1 is drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

Parameter Estimates
θZ = 0.604
θX1|Z=1 = 1, θX1|Z=2 = 0
θX2|Z=1 = 0.368, θX2|Z=2 = 0.919

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Page 18: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: same plate-notation model as above]

Suppose 2 types:
1. X1 != X2, chosen uniformly at random
2. (X1, X2) = (1, 1) with 90% chance, (0, 0) otherwise
Type 1 is drawn 75% of the time

X Dataset
• (1,1): 222
• (1,0): 382
• (0,1): 364
• (0,0): 32

Parameter Estimates
θZ = 0.604
θX1|Z=1 = 1, θX1|Z=2 = 0
θX2|Z=1 = 0.368, θX2|Z=2 = 0.919

Estimated Z's
• (1,1): type 1
• (1,0): type 1
• (0,1): type 2
• (0,0): type 2

Converged (true ML estimate)
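A quick arithmetic check of the converged estimates, recomputing them from the counts and the hard assignments above (assuming "type 1" corresponds to Z = 1):

```python
# Re-deriving the converged estimates from the counts and the hard assignments
# (1,1)->type 1, (1,0)->type 1, (0,1)->type 2, (0,0)->type 2.
counts = {(1, 1): 222, (1, 0): 382, (0, 1): 364, (0, 0): 32}
M = sum(counts.values())                              # 1000 examples
n1 = counts[(1, 1)] + counts[(1, 0)]                  # 604 examples assigned Z = 1
n2 = counts[(0, 1)] + counts[(0, 0)]                  # 396 examples assigned Z = 2

theta_z     = n1 / M                                  # 0.604
theta_x1_z1 = (counts[(1, 1)] + counts[(1, 0)]) / n1  # 1.0  (every Z=1 example has X1=1)
theta_x1_z2 = 0 / n2                                  # 0.0  (no Z=2 example has X1=1)
theta_x2_z1 = counts[(1, 1)] / n1                     # 222/604 = 0.368
theta_x2_z2 = counts[(0, 1)] / n2                     # 364/396 = 0.919
```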

Page 19: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: CORRELATED VARIABLES

[Figure: plate notation — hidden Z with observed children X1, X2, X3, X4; parameters θZ, θX1|Z, θX2|Z, θX3|Z, θX4|Z; plate of size M]

Random initial guess:
θZ = 0.44
θX1|Z=1 = 0.97, θX2|Z=1 = 0.21, θX3|Z=1 = 0.87, θX4|Z=1 = 0.57
θX1|Z=2 = 0.07, θX2|Z=2 = 0.97, θX3|Z=2 = 0.71, θX4|Z=2 = 0.03

Log likelihood: -5176

X Dataset (counts; rows give x1,x2, columns give x3,x4):

x1,x2 \ x3,x4   0,0  0,1  1,0  1,1
0,0             115  142   20   47
0,1              32   16   37   75
1,0              12  117   39   58
1,1             133   92   45   20

Page 20: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: E STEP

[Figure: same plate-notation model as above]

X Dataset: same counts as on the previous slide

Z Assignments (rows x1,x2; columns x3,x4):

x1,x2 \ x3,x4   0,0  0,1  1,0  1,1
0,0               2    1    2    1
0,1               2    2    2    2
1,0               1    1    1    1
1,1               2    1    1    1

Random initial guess:
θZ = 0.44
θX1|Z=1 = 0.97, θX2|Z=1 = 0.21, θX3|Z=1 = 0.87, θX4|Z=1 = 0.57
θX1|Z=2 = 0.07, θX2|Z=2 = 0.97, θX3|Z=2 = 0.71, θX4|Z=2 = 0.03

Log likelihood: -4401

Page 21: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: M STEP

[Figure: same plate-notation model as above]

X Dataset: same counts as on the previous slides

Current estimates:
θZ = 0.43
θX1|Z=1 = 0.67, θX2|Z=1 = 0.27, θX3|Z=1 = 0.37, θX4|Z=1 = 0.83
θX1|Z=2 = 0.31, θX2|Z=2 = 0.68, θX3|Z=2 = 0.31, θX4|Z=2 = 0.21

Log likelihood: -3033

Z Assignments: unchanged from the previous slide

Page 22: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: E STEP

[Figure: same plate-notation model as above]

X Dataset: same counts as on the previous slides

Current estimates:
θZ = 0.43
θX1|Z=1 = 0.67, θX2|Z=1 = 0.27, θX3|Z=1 = 0.37, θX4|Z=1 = 0.83
θX1|Z=2 = 0.31, θX2|Z=2 = 0.68, θX3|Z=2 = 0.31, θX4|Z=2 = 0.21

Log likelihood: -2965

Z Assignments (rows x1,x2; columns x3,x4):

x1,x2 \ x3,x4   0,0  0,1  1,0  1,1
0,0               2    1    2    1
0,1               2    2    2    1
1,0               1    1    1    1
1,1               2    1    2    1

Page 23: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: E STEP

[Figure: same plate-notation model as above]

X Dataset: same counts as on the previous slides

Current estimates:
θZ = 0.40
θX1|Z=1 = 0.56, θX2|Z=1 = 0.31, θX3|Z=1 = 0.40, θX4|Z=1 = 0.92
θX1|Z=2 = 0.45, θX2|Z=2 = 0.66, θX3|Z=2 = 0.26, θX4|Z=2 = 0.04

Log likelihood: -2859

Z Assignments (rows x1,x2; columns x3,x4):

x1,x2 \ x3,x4   0,0  0,1  1,0  1,1
0,0               2    1    2    1
0,1               2    2    2    2
1,0               1    1    1    1
1,1               2    1    1    1

Page 24: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: LAST E-M STEP

[Figure: same plate-notation model as above]

X Dataset: same counts as on the previous slides

Current estimates:
θZ = 0.43
θX1|Z=1 = 0.51, θX2|Z=1 = 0.36, θX3|Z=1 = 0.35, θX4|Z=1 = 1
θX1|Z=2 = 0.53, θX2|Z=2 = 0.57, θX3|Z=2 = 0.33, θX4|Z=2 = 0

Log likelihood: -2683

Z Assignments (rows x1,x2; columns x3,x4):

x1,x2 \ x3,x4   0,0  0,1  1,0  1,1
0,0               2    1    2    1
0,1               2    1    2    1
1,0               2    1    2    1
1,1               2    1    2    1

Page 25: CS b553: Algorithms for Optimization and  Learning

PROBLEM: MANY LOCAL MINIMA

Flipping Z assignments causes large shifts in likelihood, leading to a poorly behaved energy landscape!

Solution: EM using the marginal likelihood formulation — "soft" EM (this is the typical form of the EM algorithm)

Page 26: CS b553: Algorithms for Optimization and  Learning

EXPECTATION MAXIMIZATION FOR MML

arg max_θ L(θ; D) = arg max_θ E_{DZ|D,θ} [ L(θ; DZ, D) ]

Do arg max_θ E_{DZ|D,θ} [ log L(θ; DZ, D) ] instead (justified later)

Step 1: Given the current fixed θt, find P(DZ | θt, D), i.e., compute a distribution over each Z[m]

Step 2: Use these probabilities in the expectation E_{DZ|D,θt} [ log L(θ, DZ; D) ] = Q(θ). Now find max_θ Q(θ): fully observed, weighted ML parameter estimation

Repeat steps 1 (expectation) and 2 (maximization) until convergence

Page 27: CS b553: Algorithms for Optimization and  Learning

E STEP IN DETAIL

Ultimately, we want to maximize over θ:

Q(θ | θt) = E_{DZ|D,θt} [ log L(θ; DZ, D) ]
          = Σ_m Σ_{z[m]} P(z[m] | x[m], θt) log P(x[m], z[m] | θ)

The E step computes the terms

w_{m,z}(θt) = P(Z[m] = z | D, θt)

over all examples m and all z ∈ Val(Z)
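A sketch of how the weights w_{m,z}(θt) could be computed for the Bernoulli naive-Bayes mixture from the running example; it assumes X is an (M, n) 0/1 array, and the names and the max-subtraction trick are implementation choices, not from the slides:

```python
import numpy as np

def e_step(X, theta_z, theta_x):
    """Compute w[m, k] = P(Z[m] = k | x[m], theta_t) for the Bernoulli
    naive-Bayes mixture, matching the definition of w_{m,z} above.
    theta_z has shape (K,); theta_x[k, i] = P(Xi = 1 | Z = k)."""
    log_joint = (np.log(theta_z)
                 + X @ np.log(theta_x).T
                 + (1 - X) @ np.log(1 - theta_x).T)       # log P(x[m], Z=k | theta_t)
    log_joint -= log_joint.max(axis=1, keepdims=True)     # subtract max for numerical stability
    w = np.exp(log_joint)
    return w / w.sum(axis=1, keepdims=True)               # normalize to posterior weights
```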

Page 28: CS b553: Algorithms for Optimization and  Learning

M STEP IN DETAIL

arg max_θ Q(θ | θt) = arg max_θ Σ_m Σ_z w_{m,z}(θt) log P(x[m] | θ, z[m] = z)
                    = arg max_θ Π_m Π_z P(x[m] | θ, z[m] = z)^(w_{m,z}(θt))

This is weighted ML: each z[m] is interpreted as being observed w_{m,z}(θt) times

Most closed-form ML expressions (Bernoulli, categorical, Gaussian) can easily be adapted to the weighted case

Page 29: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: BERNOULLI PARAMETER FOR Z

θZ* = arg max_{θZ} Σ_m Σ_z w_{m,z} log P(x[m], z[m] = z | θZ)
    = arg max_{θZ} Σ_m Σ_z w_{m,z} log ( I[z=1] θZ + I[z=0] (1 − θZ) )
    = arg max_{θZ} [ log(θZ) Σ_m w_{m,z=1} + log(1 − θZ) Σ_m w_{m,z=0} ]

=> θZ* = ( Σ_m w_{m,z=1} ) / Σ_m ( w_{m,z=1} + w_{m,z=0} )

"Expected counts": M_{θt}[z] = Σ_m w_{m,z}(θt)

Expressed with them: θZ* = M_{θt}[z=1] / M_{θt}[·]

Page 30: CS b553: Algorithms for Optimization and  Learning

EXAMPLE: BERNOULLI PARAMETERS FOR Xi | Z

θ_{Xi|z=k}* = arg max_{θ_{Xi|z=k}} Σ_m w_{m,z=k} log P(x[m], z[m] = k | θ_{Xi|z=k})
            = arg max_{θ_{Xi|z=k}} Σ_m Σ_z w_{m,z} log ( I[xi[m]=1, z=k] θ_{Xi|z=k} + I[xi[m]=0, z=k] (1 − θ_{Xi|z=k}) )
            = … (similar derivation)

=> θ_{Xi|z=k}* = M_{θt}[xi=1, z=k] / M_{θt}[z=k]
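Putting the two closed forms together, a sketch of the weighted-ML M step written with expected counts; it composes with the e_step sketch above, and the clipping is an added safeguard rather than part of the slides:

```python
import numpy as np

def m_step(X, w):
    """Weighted ML update via expected counts, matching the closed forms above:
    theta_Z*[k] = M[z=k] / M and theta_Xi|z=k* = M[xi=1, z=k] / M[z=k].
    w is the (M, K) weight matrix produced by the E step sketched earlier."""
    M = X.shape[0]
    Mz = w.sum(axis=0)                                   # expected counts M_theta_t[z = k]
    theta_z = Mz / M
    theta_x = (w.T @ X) / Mz[:, None]                    # M[xi=1, z=k] / M[z=k]
    return theta_z, np.clip(theta_x, 1e-6, 1 - 1e-6)     # clip so logs stay finite in the next E step
```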

Page 31: CS b553: Algorithms for Optimization and  Learning

EM ON PRIOR EXAMPLE (100 ITERATIONS)

[Figure: same plate-notation model with hidden Z and observed X1, …, X4]

X Dataset: same counts as on the earlier slides

Final estimates:
θZ = 0.49
θX1|Z=1 = 0.64, θX2|Z=1 = 0.88, θX3|Z=1 = 0.41, θX4|Z=1 = 0.46
θX1|Z=2 = 0.38, θX2|Z=2 = 0.00, θX3|Z=2 = 0.27, θX4|Z=2 = 0.68

Log likelihood: -2833

P(Z=2 | x) (rows x1,x2; columns x3,x4):

x1,x2 \ x3,x4   0,0   0,1   1,0   1,1
0,0            0.90  0.95  0.84  0.93
0,1            0.00  0.00  0.00  0.00
1,0            0.76  0.89  0.64  0.82
1,1            0.00  0.00  0.00  0.00

Page 32: CS b553: Algorithms for Optimization and  Learning

CONVERGENCE

In general, there is no way to tell a priori how fast EM will converge

Soft EM is usually slower than hard EM

It still runs into local minima, but has more opportunities to coordinate parameter adjustments

[Figure: plot of log likelihood versus iteration count]

Page 33: CS b553: Algorithms for Optimization and  Learning

WHY DOES IT WORK?

Why are we optimizing Q(θ | θt) = Σ_m Σ_{z[m]} P(z[m] | x[m], θt) log P(x[m], z[m] | θ)

rather than the true marginalized likelihood L(θ; D) = Π_m Σ_{z[m]} P(x[m], z[m] | θ)?

Page 34: CS b553: Algorithms for Optimization and  Learning

WHY DOES IT WORK?

Why are we optimizing Q(θ | θt) = Σ_m Σ_{z[m]} P(z[m] | x[m], θt) log P(x[m], z[m] | θ)

rather than the true marginalized likelihood L(θ; D) = Π_m Σ_{z[m]} P(x[m], z[m] | θ)?

We can prove that:
• The log likelihood increases at every step
• A stationary point of arg max_θ E_{DZ|D,θ} [ L(θ; DZ, D) ] is a stationary point of log L(θ; D)

See Koller & Friedman, pp. 882–884
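The standard justification (consistent with the Koller & Friedman reference above, though the derivation below is a reconstruction rather than a quote from the slides) lower-bounds the log marginal likelihood with Jensen's inequality:

```latex
\log L(\theta; D)
  = \sum_m \log \sum_{z[m]} P(x[m], z[m] \mid \theta)
  = \sum_m \log \sum_{z[m]} P(z[m] \mid x[m], \theta^t)\,
        \frac{P(x[m], z[m] \mid \theta)}{P(z[m] \mid x[m], \theta^t)}
  \;\ge\; \sum_m \sum_{z[m]} P(z[m] \mid x[m], \theta^t)
        \log \frac{P(x[m], z[m] \mid \theta)}{P(z[m] \mid x[m], \theta^t)}
```

The right-hand side equals Q(θ | θt) plus an entropy term that does not depend on θ, and the bound is tight at θ = θt. Hence any θ that increases Q(θ | θt) also increases log L(θ; D), which is why maximizing Q at each M step cannot decrease the log likelihood.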

Page 35: CS b553: Algorithms for Optimization and  Learning

GAUSSIAN CLUSTERING USING EM

• One of the first uses of EM
• A widely used approach
• Finding good starting points: the k-means algorithm (hard assignment)
• Handling degeneracies: regularization (see the sketch below)
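As a concrete illustration of these bullets, a from-scratch sketch of soft EM for a diagonal-covariance Gaussian mixture with a k-means-style initialization and a variance floor; the constants, function name, and initialization scheme are my own choices, not course code:

```python
import numpy as np

def gaussian_em(X, K, n_iters=100, reg=1e-6, rng=None):
    """Soft EM for a diagonal-covariance Gaussian mixture: k-means-style hard
    assignments find a starting point, and a variance floor (`reg`) guards
    against degenerate clusters. Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    M, d = X.shape
    mu = X[rng.choice(M, K, replace=False)].astype(float)   # init means at random data points

    # A few k-means (hard assignment) rounds for a sensible starting point
    for _ in range(10):
        z = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)

    pi = np.full(K, 1.0 / K)
    var = np.tile(X.var(axis=0) + reg, (K, 1))

    for _ in range(n_iters):
        # E step: responsibilities under diagonal Gaussians
        log_p = np.log(pi) - 0.5 * (
            ((X[:, None, :] - mu[None]) ** 2) / var[None]
            + np.log(2 * np.pi * var[None])).sum(-1)
        w = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M step: weighted means and variances; adding `reg` keeps every
        # component's variance away from zero (the degeneracy mentioned above)
        Nk = w.sum(axis=0)
        pi = Nk / M
        mu = (w.T @ X) / Nk[:, None]
        var = (w.T @ X ** 2) / Nk[:, None] - mu ** 2 + reg
    return pi, mu, var
```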

Page 36: CS b553: Algorithms for Optimization and  Learning

RECAP

Learning with hidden variables (typically categorical)