Unsupervised Learning With Non-ignorable Missing Data

Page 1: Unsupervised Learning With  Non-ignorable Missing Data


Unsupervised Learning With Non-ignorable Missing Data

Machine Learning Group Talk

University of Toronto

Monday Oct 4, 2004

Ben Marlin

Sam Roweis

Rich Zemel

Page 2: Unsupervised Learning With  Non-ignorable Missing Data


Outline

Introduction

Missing Data Theory and EM

Synthetic Data Experiments

Extensions and Future Work

Conclusions

Models for Non-Ignorable Missing Data

Real Data Experiments

Page 3: Unsupervised Learning With  Non-ignorable Missing Data


Introduction The Problem of Missing Data

Missing data is a pervasive problem in machine learning and statistical data analysis.

Most large, complex data sets will contain a certain amount of missing data.

A fundamental question in the analysis of missing data is why the data is missing, and what we have to do about it.

There are extreme examples of data sets in machine learning with upwards of 95% missing data (EachMovie).

Page 4: Unsupervised Learning With  Non-ignorable Missing Data


Introduction A Theory of Missing Data

Little and Rubin laid out a theory of missing data several decades ago that provides answers to these questions.

They describe a classification of missing data in terms of the mechanism, or process, that causes the data to be missing, i.e., the generative model for the missing data.

They also derive the exact conditions under which missing data must be treated specially to obtain correct likelihood-based inferences.

Page 5: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Types of Missing Data: MCAR

If the missing data can be explained by a simple random process like flipping a single biased coin, the missing data is missing completely at random.
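Formally, writing R for the response indicators and Y for the data matrix (notation made precise later in the talk), MCAR means the missingness is completely independent of the data:

    P(R \mid Y) = P(R)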

[Figure: an animated 6 x 5 data matrix (6 data cases, 5 attributes) whose entries are removed one at a time, each with the same fixed probability, illustrating the MCAR mechanism.]

Page 6: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Types of Missing Data: MAR

If the probability that a data entry is missing depends only on the data entries that are observed, then the data is missing at random.
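In the same notation, MAR says the response indicators depend only on the observed part of the data:

    P(R \mid Y^{obs}, Y^{mis}) = P(R \mid Y^{obs})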

[Figure: the same 6 x 5 data matrix; here entries are removed with a probability that depends on the observed values of other entries in the same case, illustrating the MAR mechanism.]

Page 7: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Types of Missing Data: Non-Ignorable

If the probability that a data entry is missing depends on the value of that data entry, then the missing data is non-ignorable.
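In the same notation, the non-ignorable case is the remainder: P(R \mid Y^{obs}, Y^{mis}) genuinely depends on Y^{mis}, so the selection model cannot be dropped from the likelihood.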

[Figure: the same 6 x 5 data matrix; here the probability that an entry is removed depends on the value of that entry itself, illustrating a non-ignorable mechanism.]

Page 8: Unsupervised Learning With  Non-ignorable Missing Data


Introduction The Effect of Missing Data

If missing data is MCAR or MAR, then inference based on the observed data likelihood will not be biased.

If missing data is non-ignorable, then inference based on the observed data likelihood is provably biased.

[Example: a row of complete data values with mean 4.90. Under MCAR deletion the mean of the remaining values is still 4.90; under non-ignorable deletion it drops to 4.10.]

Page 9: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Unsupervised Learning and Missing Data

This simple mean estimation problem can be interpreted as fitting a normal distribution to the data, a simple unsupervised learning problem.

Just like the mean estimation example, any unsupervised learning algorithm that treats non-ignorable missing data as missing at random will learn biased estimates of model parameters.
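As a quick illustration, here is a minimal simulation of this bias in Python; the rating scale, deletion probabilities, and sample size are assumptions for illustration, not the values from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.integers(1, 7, size=100_000).astype(float)  # ratings on an assumed 1..6 scale

    # MCAR: every entry is deleted with the same probability.
    mcar_deleted = rng.random(data.size) < 0.5
    print("true mean:     ", data.mean())
    print("MCAR observed: ", data[~mcar_deleted].mean())  # essentially unbiased

    # Non-ignorable: an entry's deletion probability grows with its own value.
    ni_deleted = rng.random(data.size) < data / 8.0
    print("NI observed:   ", data[~ni_deleted].mean())    # biased low

The MCAR estimate stays near the true mean, while the non-ignorable rule, which deletes high values more often, pulls the observed mean down.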

Page 10: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Research Overview

The goals of this research project are:

1. Apply the theory developed by Little and Rubin to extend the standard unsupervised learning framework to correctly handle non-ignorable missing data.

2. Apply this extended framework to augment a variety of existing models, and show that tractable learning algorithms can be obtained.

3. Demonstrate that these augmented models outperform standard models on tasks where missing data is believed to be non-ignorable.

Page 11: Unsupervised Learning With  Non-ignorable Missing Data


Introduction Research Overview

The current status of the project:

1. We have been able to augment mixture models to account for non-ignorable missing data.

2. We have derived efficient learning and exact inference algorithms for the augmented models.

3. We have obtained empirical results on synthetic data sets showing the augmented models learn accurately.

4. Preliminary results were recently submitted to AISTATS.

Page 12: Unsupervised Learning With  Non-ignorable Missing Data


Missing Data Theory and EM Notation

Complete data matrix.

Observed elements of the data matrix.

Missing elements of the data matrix.

Matrix of response indicators.

Data model.

Selection or observation model.
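In Little and Rubin's standard notation, which is assumed in the equations below: Y is the complete data matrix, Y^{obs} and Y^{mis} are its observed and missing elements, R is the matrix of response indicators (with R_{mn} = 1 when entry Y_{mn} is observed), P(Y \mid \theta) is the data model, and P(R \mid Y, \mu) is the selection or observation model.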

Page 13: Unsupervised Learning With  Non-ignorable Missing Data


Missing Data Theory and EM The MAR Assumption

Under this notation the MAR assumption can be expressed as follows:
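    P(R \mid Y^{obs}, Y^{mis}, \mu) = P(R \mid Y^{obs}, \mu) \quad \text{for all } Y^{mis}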

Basically this says that the distribution over the response indicators is conditionally independent of the missing data, given the observed data.

Page 14: Unsupervised Learning With  Non-ignorable Missing Data


Missing Data Theory and EM Observed and Full Likelihood Functions

The standard procedure for unsupervised learning is to maximize the observed data likelihood. The correct procedure in general is to maximize the full data likelihood.
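In the notation above (with sums over the discrete missing values), the two objectives are:

    L_{obs}(\theta) = P(Y^{obs} \mid \theta) = \sum_{Y^{mis}} P(Y^{obs}, Y^{mis} \mid \theta)

    L_{full}(\theta, \mu) = P(Y^{obs}, R \mid \theta, \mu) = \sum_{Y^{mis}} P(R \mid Y^{obs}, Y^{mis}, \mu) \, P(Y^{obs}, Y^{mis} \mid \theta)

Under MAR the selection term is constant in Y^{mis} and factors out of the sum, so maximizing the full likelihood over \theta reduces to maximizing the observed data likelihood; this is exactly why MAR missing data is ignorable.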

Page 15: Unsupervised Learning With  Non-ignorable Missing Data


Missing Data Theory and EM Expectation Maximization Algorithm

In an unsupervised learning setting with non-ignorable missing data, the correct learning procedure is to maximize the expected full log likelihood.
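Concretely, each EM iteration computes the posterior over the missing values given the observed data and response indicators, then maximizes

    Q(\theta, \mu) = E_{P(Y^{mis} \mid Y^{obs}, R, \theta^t, \mu^t)} \left[ \log P(Y^{obs}, Y^{mis}, R \mid \theta, \mu) \right]

with respect to \theta and \mu.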

Page 16: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Review: Standard Mixture Model

In the work that follows we assume a multinomial mixture model as the data model. It is a simple baseline model that is quite effective in many discrete domains.

[Graphical model: latent variable Zn for case n with children Y1n ... YMn, the data variables; plate over cases n=1:N.]
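For reference, the multinomial mixture model assigns each case the probability (symbol names assumed here: K components, mixing proportions \theta, per-component multinomials \beta):

    P(Y_{1n}, \ldots, Y_{Mn}) = \sum_{z=1}^{K} \theta_z \prod_{m=1}^{M} \beta_{Y_{mn} m z}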

Page 17: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Mixture/Fully Connected Model

If we fully connect the response indicators to the data variables, we get the most general selection model, but it is not tractable.

[Graphical model: latent variable Zn, data variables Ymn, and response indicators Rmn, with each Rmn connected to all of the data variables; plates over m=1:M and n=1:N.]

Page 18: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Mixture/CPT-v Model

To derive tractable learning and inference algorithms we need to assert further independence relations.

[Graphical model: latent variable Zn, data variables Ymn, and response indicators Rmn, with each Rmn depending only on its own Ymn; plates over m=1:M and n=1:N.]
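In the CPT-v model, each response indicator depends only on the value of its own data entry, through one shared conditional probability table (writing \mu for its parameters, an assumed symbol):

    P(R_{mn} = 1 \mid Y_{mn} = v) = \mu_v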

Page 19: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Mixture/CPT-v Model

Exact inference and learning for the Mixture/CPT-v model is only slightly more complex than in a standard mixture model.
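A sketch of why, under the symbols assumed above: each missing entry contributes only a one-dimensional sum over its possible values, so the posterior over the latent class retains the same product form as in the standard mixture model:

    P(Z_n = z \mid Y^{obs}_n, R_n) \propto \theta_z \prod_{m : R_{mn}=1} \mu_{Y_{mn}} \beta_{Y_{mn} m z} \prod_{m : R_{mn}=0} \sum_{v} (1 - \mu_v) \beta_{v m z}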

Page 20: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Mixture/LOGIT-v,mz Model

The LOGIT-v,mz model assumes a functional form for the missing data parameters. It is able to model a wider range of effects.

[Graphical model: latent variable Zn, data variables Ymn, and response indicators Rmn, with each Rmn depending on both its own Ymn and on Zn; plates over m=1:M and n=1:N.]
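One parameterization consistent with the model's name and with the value based and item/latent variable effects plotted in Experiment 2 (the symbols \alpha and \gamma, and this exact form, are assumptions) is a logistic function combining a per-value effect with a per-item, per-latent-class effect:

    P(R_{mn} = 1 \mid Y_{mn} = v, Z_n = z) = \sigma(\alpha_v + \gamma_{mz})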

Page 21: Unsupervised Learning With  Non-ignorable Missing Data


Models for Non-Ignorable Missing Data Mixture/LOGIT-v,mz Model

Exact inference is still possible, but learning requires gradient-based techniques for the selection model parameters.

Page 22: Unsupervised Learning With  Non-ignorable Missing Data


Synthetic Data Experiments Experimental Procedure

1. Sample mixture model parameters from Dirichlet priors.

2. Sample 5000 complete data cases from the mixture model.

3. Apply each missing data effect and resample complete data to obtain observed data.

4. Train each model on observed data only.

5. Measure prediction error on complete data set.
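A compressed sketch of steps 1 through 3 in Python (the dimensions, the Dirichlet priors, and the CPT-v style deletion rule are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    K, M, V, N = 4, 10, 5, 5000  # components, attributes, values, data cases (assumed sizes)

    # 1. Sample mixture model parameters from Dirichlet priors.
    theta = rng.dirichlet(np.ones(K))              # mixing proportions
    beta = rng.dirichlet(np.ones(V), size=(K, M))  # beta[k, m] is a distribution over the V values

    # 2. Sample N complete data cases from the mixture model.
    z = rng.choice(K, size=N, p=theta)
    Y = np.array([[rng.choice(V, p=beta[z[n], m]) for m in range(M)] for n in range(N)])

    # 3. Apply a CPT-v style missing data effect: entry kept with probability mu[value].
    mu = np.linspace(0.9, 0.1, V)    # higher values are observed less often
    R = rng.random((N, M)) < mu[Y]   # response indicators: True where observed
    Y_obs = np.where(R, Y, -1)       # -1 marks a missing entry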

Page 23: Unsupervised Learning With  Non-ignorable Missing Data


Synthetic Data Experiments Experiment 1: CPT-v Missing Data

Page 24: Unsupervised Learning With  Non-ignorable Missing Data


Synthetic Data Experiments Experiment 1: Results

Page 25: Unsupervised Learning With  Non-ignorable Missing Data


Synthetic Data Experiments Experiment 2: LOGIT-v,mz Missing Data

[Figure: the missing data effects used in Experiment 2, plotted as a value based effect and an item/latent variable effect.]

Page 26: Unsupervised Learning With  Non-ignorable Missing Data


Synthetic Data Experiments Experiment 2: Results

Page 27: Unsupervised Learning With  Non-ignorable Missing Data


Real Data Experiments Experimental Procedure

1. Train LOGIT-v,mz model on observed data.

2. Look at parameters and full likelihood values after training.

Page 28: Unsupervised Learning With  Non-ignorable Missing Data


Real Data Experiments Data Sets

EachMovie Collaborative Filtering Data Set:
• Base: 2.8M ratings, 73K users, 1.6K movies, 97.6% missing
• Filtering: minimum of 20 ratings per user
• Train: 2.1M ratings, 30K users, 95.6% missing

Jester Collaborative Filtering Data Set:
• Base: 900K ratings, 17K users, 100 jokes, 50.4% missing
• Filtering: continuous -10 to +10 scale mapped to a discrete 5 point scale

Page 29: Unsupervised Learning With  Non-ignorable Missing Data


Real Data Experiments Results – Marginal Selection Probabilities

Page 30: Unsupervised Learning With  Non-ignorable Missing Data


Real Data Experiments Results – Full Data Log Likelihood

Model         Jester            EachMovie
LOGIT-v,mz    -1.83036 × 10^6   -8.75037 × 10^6
MCAR MM       -2.48498 × 10^6   -1.16489 × 10^7

Page 31: Unsupervised Learning With  Non-ignorable Missing Data


Conclusions Summary and Future Work

We have proposed a framework for dealing with non-ignorable missing data by augmenting existing models with a general selection model.

We have shown positive preliminary results on synthetic data with both the CPT-v and the LOGIT-v,mz models, and we have shown that the LOGIT-v,mz model does something reasonable on real data.

To show convincing results on real data, we need to look at new procedures for collecting data, and possibly new experimental procedures for validating models under this framework.

Page 32: Unsupervised Learning With  Non-ignorable Missing Data


The End