
Disentangling Gaussians

Ankur Moitra, MIT

November 6th, 2014 — Dean’s Breakfast


Algorithmic Aspects of Machine Learning © 2015 by Ankur Moitra. Note: These are unpolished, incomplete course notes. Developed for educational use at MIT and for publication through MIT OpenCourseWare.

The Gaussian Distribution

The Gaussian distribution is defined as (µ = mean, σ² = variance):

N(µ, σ², x) = (1/√(2πσ²)) · e^(−(x−µ)²/(2σ²))

Central Limit Theorem: The sum of independent random variables X₁, X₂, ..., Xₛ converges in distribution to a Gaussian:

(1/√s) ∑_{i=1}^{s} Xᵢ →_d N(µ, σ²)

This distribution is ubiquitous — e.g. used to model height, velocities in an ideal gas, annual rainfall, ...
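As a quick aside (my own illustration, not part of the slides), a minimal numpy sketch of the CLT in action: centered, scaled sums of uniform random variables already look approximately Gaussian.

```python
import numpy as np

# Empirical CLT check: centered, scaled sums of i.i.d. Uniform(0,1) draws
# behave approximately like a Gaussian with variance 1/12.
rng = np.random.default_rng(0)
s, n_samples = 1000, 20000

x = rng.uniform(0.0, 1.0, size=(n_samples, s))
z = (x.sum(axis=1) - s * 0.5) / np.sqrt(s)   # (1/sqrt(s)) * sum of centered X_i

print("empirical mean:", z.mean())        # close to 0
print("empirical variance:", z.var())     # close to 1/12 ≈ 0.0833
```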

Karl Pearson (1894) and the Naples Crabs

(figure due to Peter Macdonald)

[Figure: histogram of the Naples crab measurements, x-axis roughly 0.58 to 0.70, y-axis counts 0 to 20.]

Image courtesy of Peter D. M. Macdonald. Used with permission.

Gaussian Mixture Models

F(x) = w₁F₁(x) + (1 − w₁)F₂(x), where Fᵢ(x) = N(µᵢ, σᵢ², x)

In particular, with probability w₁ output a sample from F₁, otherwise output a sample from F₂

Five unknowns: w₁, µ₁, σ₁², µ₂, σ₂²

Question

Given enough random samples from F, can we learn these parameters (approximately)?

Pearson invented the method of moments to attack this problem...
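To make the sampling process concrete, here is a small Python helper (my own sketch; sample_mixture is a hypothetical name, not from the slides) that draws from such a mixture:

```python
import numpy as np

def sample_mixture(w1, mu1, var1, mu2, var2, n, rng=None):
    """Draw n samples from F = w1*N(mu1, var1) + (1 - w1)*N(mu2, var2):
    pick component 1 with probability w1, otherwise component 2."""
    rng = rng or np.random.default_rng()
    from_first = rng.random(n) < w1
    return np.where(
        from_first,
        rng.normal(mu1, np.sqrt(var1), n),
        rng.normal(mu2, np.sqrt(var2), n),
    )

# Example: two heavily overlapping components.
x = sample_mixture(0.4, 0.0, 1.0, 1.0, 2.0, n=100_000, rng=np.random.default_rng(1))
print(x.mean())   # close to w1*mu1 + (1 - w1)*mu2 = 0.6
```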

Pearson’s Sixth Moment Test

Claim

E_{x←F(x)}[x^r] is a polynomial in θ = (w₁, µ₁, σ₁², µ₂, σ₂²)

In particular:

E_{x←F(x)}[x] = w₁µ₁ + (1 − w₁)µ₂

E_{x←F(x)}[x²] = w₁(µ₁² + σ₁²) + (1 − w₁)(µ₂² + σ₂²)

E_{x←F(x)}[x³] = w₁(µ₁³ + 3µ₁σ₁²) + (1 − w₁)(µ₂³ + 3µ₂σ₂²)

...
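For concreteness, a small numerical sketch (my addition) that evaluates these moment polynomials, using the standard single-Gaussian recurrence E[X^r] = µ·E[X^(r−1)] + (r−1)σ²·E[X^(r−2)]:

```python
import numpy as np

def mixture_moments(w1, mu1, var1, mu2, var2, r_max=6):
    """Raw moments E[x^r], r = 1..r_max, of w1*N(mu1, var1) + (1-w1)*N(mu2, var2).
    Gaussian raw moments follow m_r = mu*m_{r-1} + (r-1)*var*m_{r-2}."""
    def gaussian_raw_moments(mu, var):
        m = [1.0, mu]
        for r in range(2, r_max + 1):
            m.append(mu * m[r - 1] + (r - 1) * var * m[r - 2])
        return m
    m1, m2 = gaussian_raw_moments(mu1, var1), gaussian_raw_moments(mu2, var2)
    return [w1 * m1[r] + (1 - w1) * m2[r] for r in range(1, r_max + 1)]

print(mixture_moments(0.4, 0.0, 1.0, 1.0, 2.0))
# first entry: 0.4*0 + 0.6*1 = 0.6, matching E[x] = w1*mu1 + (1 - w1)*mu2
```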

Pearson’s Sixth Moment Test

Claim

E_{x←F(x)}[x^r] is a polynomial in θ = (w₁, µ₁, σ₁², µ₂, σ₂²)

Let E_{x←F(x)}[x^r] = M_r(θ)

Gather samples S

Set M̃_r = (1/|S|) ∑_{i∈S} x_i^r for r = 1, 2, ..., 6

Compute the simultaneous roots of {M_r(θ) = M̃_r} for r = 1, 2, ..., 5, and select the root θ̂ that is closest in the sixth moment
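A brute-force sketch of this recipe (my own illustration, reusing the sample_mixture and mixture_moments helpers sketched above): instead of solving the polynomial system exactly, scan a grid of candidate parameters, match the first five empirical moments, and break ties with the sixth.

```python
import numpy as np
from itertools import product

def empirical_moments(samples, r_max=6):
    # \tilde{M}_r = (1/|S|) * sum_{i in S} x_i^r for r = 1..r_max
    return np.array([np.mean(samples ** r) for r in range(1, r_max + 1)])

def six_moment_fit(samples, grids):
    """Grid-search stand-in for 'compute simultaneous roots': score candidates by
    mismatch in the first five moments, then by the sixth."""
    target = empirical_moments(samples)
    best, best_score = None, np.inf
    for theta in product(*grids):
        m = np.array(mixture_moments(*theta))
        score = np.abs(m[:5] - target[:5]).sum() + 0.01 * abs(m[5] - target[5])
        if score < best_score:
            best, best_score = theta, score
    return best

x = sample_mixture(0.4, 0.0, 1.0, 1.0, 2.0, n=200_000, rng=np.random.default_rng(2))
grids = [np.linspace(0.2, 0.8, 7),    # w1
         np.linspace(-0.5, 0.5, 5),   # mu1
         np.linspace(0.5, 1.5, 5),    # sigma1^2
         np.linspace(0.5, 1.5, 5),    # mu2
         np.linspace(1.5, 2.5, 5)]    # sigma2^2
print(six_moment_fit(x, grids))       # should land near (0.4, 0.0, 1.0, 1.0, 2.0)
```

Pearson solved the resulting polynomial system directly rather than searching a grid; the sketch only illustrates the moment-matching idea.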

Provable Guarantees?

In Contributions to the Mathematical Theory of Evolution (attributed to George Darwin):

“Given the probable error of every ordinate of a frequency curve, what are the probable errors of the elements of the two normal curves into which it may be dissected?”

Are the parameters of a mixture of two Gaussians uniquely determined by its moments?

Are these polynomial equations robust to errors?

A View from Theoretical Computer Science

Suppose our goal is to provably learn the parameters of each component within an additive ε:

Goal

Output a mixture F̂ = ŵ₁F̂₁ + ŵ₂F̂₂ so that there is a permutation π : {1, 2} → {1, 2} and for i ∈ {1, 2}:

|wᵢ − ŵ_{π(i)}|, |µᵢ − µ̂_{π(i)}|, |σᵢ² − σ̂²_{π(i)}| ≤ ε

Is there an algorithm that takes poly(1/ε) samples and runs in time poly(1/ε)?
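A literal reading of this goal as a checker (my own sketch; params_close is a hypothetical helper) makes the role of the permutation explicit:

```python
from itertools import permutations

def params_close(true_params, est_params, eps):
    """Each argument is a pair of (w, mu, var) triples. Returns True if some
    relabeling of the estimated components matches the true ones within eps."""
    for perm in permutations(range(2)):
        if all(abs(t - e) <= eps
               for i in range(2)
               for t, e in zip(true_params[i], est_params[perm[i]])):
            return True
    return False

print(params_close(((0.4, 0.0, 1.0), (0.6, 1.0, 2.0)),
                   ((0.62, 1.03, 1.95), (0.41, -0.02, 1.04)), eps=0.05))  # True
```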

A Conceptual History

Pearson (1894): Method of Moments (no guarantees)

Fisher (1912-1922): Maximum Likelihood Estimator (MLE)

consistent and efficient, usually computationally hard

Teicher (1961): Identifiability through tails

requires many samples

Dempster, Laird, Rubin (1977): Expectation-Maximization (EM)

gets stuck in local maxima

Dasgupta (1999) and many others: Clustering

assumes almost non-overlapping components

Identifiability through the Tails

Approach: Find the parameters of the component with largest variance (it dominates the behavior of F(x) at infinity); subtract it off and continue
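A tiny numeric illustration of why the tails identify the larger-variance component (my own check, not from the slides):

```python
import numpy as np

def npdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Far out in the tail, the larger-variance component dominates the mixture density.
w1, mu1, var1 = 0.5, 0.0, 1.0
w2, mu2, var2 = 0.5, 0.0, 2.25   # larger variance

for x in (2.0, 5.0, 10.0):
    f1, f2 = w1 * npdf(x, mu1, var1), w2 * npdf(x, mu2, var2)
    print(x, f2 / (f1 + f2))      # tends to 1 as x grows
```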

Clustering Well-separated Mixtures

Approach: Cluster samples based on which component generated them; output the empirical mean and variance of each cluster
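When the components barely overlap, even a crude split recovers the parameters. A minimal sketch (my own, reusing sample_mixture from above; real clustering algorithms are more careful):

```python
import numpy as np

# Well-separated means: split at the overall mean and read off each cluster's
# empirical weight, mean, and variance.
x = sample_mixture(0.5, -5.0, 1.0, 5.0, 1.0, n=50_000, rng=np.random.default_rng(3))

threshold = x.mean()
for cluster in (x[x < threshold], x[x >= threshold]):
    print(len(cluster) / len(x), cluster.mean(), cluster.var())
```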

In summary, these approaches are heuristic, computationally intractable, or make a separation assumption about the mixture

Question

What if the components overlap almost entirely?

[Kalai, Moitra, Valiant] (studies n-dimensional version too):

Reduce to the one-dimensional case

Analyze Pearson’s sixth moment test (with noisy estimates)

Our Results

Suppose w₁ ∈ [ε^10, 1 − ε^10] and ∫ |F₁(x) − F₂(x)| dx ≥ ε^10

Theorem (Kalai, Moitra, Valiant)

There is an algorithm that (with probability at least 1 − δ) learns the parameters of F within an additive ε, and the running time and number of samples needed are poly(1/ε, log(1/δ)).

Previously, the best known bounds on the running time/sample complexity were exponential

See also [Moitra, Valiant] and [Belkin, Sinha] for mixtures of k Gaussians

Analyzing the Method of Moments

Let’s start with an easier question:

Question

What if we are given the first six moments of the mixture, exactly?

Does this uniquely determine the parameters of the mixture?

(up to a relabeling of the components)

Question

Do any two different mixtures F and F̂ differ on at least one of the first six moments?

Method of Moments

Claim

One of the first six moments of F and F̂ is different!

Proof: Let f(x) = F(x) − F̂(x) and choose a degree-six polynomial p(x) = ∑_{r=0}^{6} p_r x^r that agrees in sign with f(x) everywhere (possible because f has at most six zero crossings). Since f integrates to zero, the constant term contributes nothing, and

0 < |∫ p(x) f(x) dx| = |∫ ∑_{r=1}^{6} p_r x^r f(x) dx| ≤ ∑_{r=1}^{6} |p_r| · |∫ x^r f(x) dx| = ∑_{r=1}^{6} |p_r| · |M_r(F) − M_r(F̂)|

So ∃ r ∈ {1, 2, ..., 6} such that |M_r(F) − M_r(F̂)| > 0

Our goal is to prove the following:

Proposition

If f(x) = ∑_{i=1}^{k} αᵢ N(µᵢ, σᵢ², x) is not identically zero, then f(x) has at most 2k − 2 zero crossings (the αᵢ's can be negative).

We will do it through properties of the heat equation
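Before the proof, a quick numeric sanity check of the proposition (my own sketch; counting sign changes on a fine grid is an illustration, not a proof):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def count_sign_changes(values):
    signs = np.sign(values[np.abs(values) > 1e-12])   # drop near-zero grid points
    return int(np.sum(signs[1:] != signs[:-1]))

# A linear combination of k = 3 Gaussians (coefficients may be negative).
alphas = [1.0, -1.5, 0.7]
mus    = [-2.0, 0.0, 2.5]
vars_  = [1.0, 0.5, 2.0]

xs = np.linspace(-15, 15, 200_001)
f = sum(a * gaussian_pdf(xs, m, v) for a, m, v in zip(alphas, mus, vars_))
print(count_sign_changes(f))   # the proposition promises at most 2k - 2 = 4
```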

The Heat Equation

Question

If the initial heat distribution on a one-dimensional infinite rod (κ) is f(x) = f(x, 0), what is the heat distribution f(x, t) at time t?

There is a probabilistic interpretation (σ² = 2κt):

f(x, t) = E_{z←N(0,σ²)}[f(x + z, 0)]

Alternatively, this is called a convolution:

f(x, t) = ∫_{z=−∞}^{∞} f(x + z) N(0, σ², z) dz = f(x) ∗ N(0, σ²)
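A Monte Carlo rendering of the probabilistic interpretation (my own sketch; heat_smooth is a hypothetical helper):

```python
import numpy as np

def heat_smooth(f, x, t, kappa=1.0, n_mc=200_000, rng=None):
    """f(x, t) = E_{z ~ N(0, 2*kappa*t)}[ f(x + z) ], estimated by sampling."""
    rng = rng or np.random.default_rng(0)
    z = rng.normal(0.0, np.sqrt(2 * kappa * t), size=n_mc)
    return np.mean(f(x + z))

# Initial heat profile: a unit block of heat on [-1, 1].
step = lambda x: (np.abs(x) <= 1).astype(float)
for t in (0.01, 0.1, 1.0):
    print(t, heat_smooth(step, 0.0, t))   # the value at x = 0 decays as heat spreads
```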

The Key Facts

Theorem (Hummel, Gidas)

Suppose f(x) : R → R is analytic and has N zeros. Then

f(x) ∗ N(0, σ², x)

has at most N zeros (for any σ² > 0).

Convolving with a Gaussian does not increase the # of zero crossings

Fact

N(µ₁, σ₁², x) ∗ N(µ₂, σ₂², x) = N(µ₁ + µ₂, σ₁² + σ₂², x)
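The Fact can be checked by sampling, since convolving the densities corresponds to adding independent random variables (my own check):

```python
import numpy as np

# Sum of independent Gaussians: means add and variances add.
rng = np.random.default_rng(4)
a = rng.normal(1.0, np.sqrt(2.0), 500_000)    # N(1, 2)
b = rng.normal(-0.5, np.sqrt(3.0), 500_000)   # N(-0.5, 3)
s = a + b
print(s.mean(), s.var())   # approximately 0.5 and 5.0
```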

Recall, our goal is to prove the following:

Proposition

If f(x) = ∑_{i=1}^{k} αᵢ N(µᵢ, σᵢ², x) is not identically zero, then f(x) has at most 2k − 2 zero crossings (the αᵢ's can be negative).

We will prove it by induction:

Start with k = 3 (at most 4 zero crossings),

Let’s prove it for k = 4 (at most 6 zero crossings)

Bounding the Number of Zero Crossings

[Six figure slides illustrating the inductive argument; the images are not reproduced in this transcript.]

Hence, the exact values of the first six moments determine the mixture parameters!

Let Θ = {valid parameters} (in particular wᵢ ∈ [0, 1], σᵢ ≥ 0)

Claim

Let θ be the true parameters; then the only solutions to

{θ̂ ∈ Θ | M_r(θ̂) = M_r(θ) for r = 1, 2, ..., 6}

are (w₁, µ₁, σ₁, µ₂, σ₂) and the relabeling (1 − w₁, µ₂, σ₂, µ₁, σ₁)

Are these equations stable when we are given noisy estimates?

A Univariate Learning Algorithm

Our algorithm:

Take enough samples S so that M̃_r = (1/|S|) ∑_{i∈S} x_i^r is w.h.p. close to M_r(θ) for r = 1, 2, ..., 6 (within an additive ε^C/2)

Compute θ̂ such that M_r(θ̂) is close to M̃_r for r = 1, 2, ..., 6 (within an additive ε^C/2)

And θ̂ must be close to θ, because solutions to this system of polynomial equations are stable

Summary and Discussion

Here we gave the first efficient algorithms for learning mixtures of Gaussians with provably minimal assumptions

Key words: method of moments, polynomials, heat equation

Computational intractability is everywhere in machine learning/statistics

Currently, most approaches are heuristic and have no provable guarantees

Can we design new algorithms for some of the fundamental problems in these fields?

My Work

Natural Algorithms

Is Learning Computationally Easy?

MIT OpenCourseWare
http://ocw.mit.edu

18.409 Algorithmic Aspects of Machine Learning
Spring 2015

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.