overview: missing data problems in data theory of missing ...marlin/research/phd... · benjamin...
TRANSCRIPT
Missing Data Problems in Machine Learning
Benjamin Marlin
1Department of Computer Science, University of Toronto
Missing Data Problems in Missing Data Problems in Missing Data Problems in Missing Data Problems in
Machine LearningMachine LearningMachine LearningMachine Learning
Senate Thesis DefenseSenate Thesis DefenseSenate Thesis DefenseSenate Thesis Defense
Ben MarlinMachine Learning Group
Department of Computer ScienceUniversity of Toronto
April 8, 2008
Missing Data Problems in Machine Learning
Benjamin Marlin
2Department of Computer Science, University of Toronto
Overview: Overview: Overview: Overview:
Theory of Missing Data
Unsupervised Learning - MAR
Classification with Missing Features
Classification with Complete Data
Unsupervised Learning - NMAR
Collaborative Prediction
Missing Data Problems in Machine Learning
Benjamin Marlin
3Department of Computer Science, University of Toronto
Missing Data
Observed Data
Missing
Dimensions
Observed
Dimensions
Response Vector
Data Vector
Introduction: Introduction: Introduction: Introduction: Notation for Missing Data
0.30.70.20.90.1
11001
541
32
0.30.70.1
0.20.9
Missing Data Problems in Machine Learning
Benjamin Marlin
4Department of Computer Science, University of Toronto
Theory of Missing Data: Factorizations
Data/Selection Model Factorization:
• The probability of selection depends on the true values
of the data variables and latent variables.
MAR:
MCAR:
NMAR: No simplification in general.
Classification of Missing Data:
X
Missing Data Problems in Machine Learning
Benjamin Marlin
5Department of Computer Science, University of Toronto
Theory of Missing Data: Inference
MCAR/MAR Posterior:
NMAR Posterior:
Missing Data Problems in Machine Learning
Benjamin Marlin
6Department of Computer Science, University of Toronto
Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- MAR:MAR:MAR:MAR: Models
Finite Bayesian
Mixture Model
Dirichlet Process
Mixture Model
Factor Analysis
Mixture Model
Missing Data Problems in Machine Learning
Benjamin Marlin
7Department of Computer Science, University of Toronto
Restricted Boltzmann
Machine for Missing Data
Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- MAR:MAR:MAR:MAR: Models
Restricted Boltzmann
Machine for Missing Data
Missing Data Problems in Machine Learning
Benjamin Marlin
8Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
Data Sets: Yahoo!
Collected ratings for randomly selected songs and combined them with existing ratings for user selected songs to form a novel collaborative filtering data set.
User Selected Randomly Selected
Missing Data Problems in Machine Learning
Benjamin Marlin
9Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
Data Sets: JesterJester gauge set of 10 jokes used as complete data. Synthetic missing data was added.
• 15,000 users randomly selected
• Missing data model: µv(s) = s(v-3)+0.5
Missing Data Problems in Machine Learning
Benjamin Marlin
10Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
Basic Experimental Protocol• We train on ratings for user selected items, and test on ratings for both user selected items, and randomly select items.
User Selected Ratings
Randomly Selected
Ratings
23455
134
4132
23455
134
4132
23455
134
4132
23455
134
4132
Observed Ratings
Test Ratings for User Selected Items
Test Ratings
for Randomly Selected items23455
134
4132
Missing Data Problems in Machine Learning
Benjamin Marlin
11Department of Computer Science, University of Toronto
Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- NMAR:NMAR:NMAR:NMAR: Models
Finite Bayesian
Mixture Model/CPT-v
Dirichlet Process
Mixture Model/CPT-v
Missing Data Problems in Machine Learning
Benjamin Marlin
12Department of Computer Science, University of Toronto
Unsupervised Learning Unsupervised Learning Unsupervised Learning Unsupervised Learning ---- NMAR:NMAR:NMAR:NMAR: Models
Conditional Restricted
Boltzmann Machine/E-v
Finite Bayesian
Mixture Model/LOGIT-vd
Missing Data Problems in Machine Learning
Benjamin Marlin
14Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
Comparison of Results on Yahoo! Data
Missing Data Problems in Machine Learning
Benjamin Marlin
15Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
NEW: Ranking Results
• : mean of posterior predictive distribution for test item i.
• : rank of test item i according to .
• : rank of test item i according to .
Missing Data Problems in Machine Learning
Benjamin Marlin
16Department of Computer Science, University of Toronto
Unsupervised Learning – NMAR:
NEW: Comparison of Yahoo! Ranking Results
Strong Generalization:
Missing Data Problems in Machine Learning
Benjamin Marlin
17Department of Computer Science, University of Toronto
Classification: Imputation
Multiple Imputation: Replace missing feature values with samples of xm given xo drawn from several imputation models.
1 5
Mixture of Factor Analyzers
K=3, Q=1
Missing Data Problems in Machine Learning
Benjamin Marlin
18Department of Computer Science, University of Toronto
Classification: Reduced Models
Reduced Models: Each observed data subspace defined by a pattern of missing data gives a separate classification problem.
R=[0,1] R=[1,1]
R=[1,0]
Missing Data Problems in Machine Learning
Benjamin Marlin
19Department of Computer Science, University of Toronto
Classification: Response Augmentation
Response Augmentation: Set missing features to zero and augment feature representation with response indicators.
X=[0.0, 1.1, 0, 1]~
X=[2.0, 1.1, 1, 1]~
X=[-2.8, 0, 1, 0]~
Missing Data Problems in Machine Learning
Benjamin Marlin
20Department of Computer Science, University of Toronto
Classification: Generative Models
Generative Model (LDA-FA):
Predictive Distribution with Missing Data:
Missing Data Problems in Machine Learning
Benjamin Marlin
21Department of Computer Science, University of Toronto
Results: Linear Discriminant Analysis
Training Data Generative
Learning
Discriminative
Learning
Missing Data Problems in Machine Learning
Benjamin Marlin
22Department of Computer Science, University of Toronto
Classification: Logistic RegressionTraining Data Zero Imputation Mean Imputation
Mix FA Imputation Augmented Reduced
Missing Data Problems in Machine Learning
Benjamin Marlin
23Department of Computer Science, University of Toronto
Classification: : Gaussian KLRTraining Data Zero Imputation Mean Imputation
Augmented Reduced Mix FA Imputation
Missing Data Problems in Machine Learning
Benjamin Marlin
24Department of Computer Science, University of Toronto
Classification: Neural NetworksTraining Data Zero Imputation Mean Imputation
Augmented Reduced Mix FA Imputation
Missing Data Problems in Machine Learning
Benjamin Marlin
25Department of Computer Science, University of Toronto
Classification:
UCI Thyroid-Sick
Missing Data Problems in Machine Learning
Benjamin Marlin
26Department of Computer Science, University of Toronto
Classification:
MNIST Digit Classification with Missing Data
Missing Data Problems in Machine Learning
Benjamin Marlin
27Department of Computer Science, University of Toronto
Classification:
MNIST Digit Classification with Missing Data
Missing Data Problems in Machine Learning
Benjamin Marlin
28Department of Computer Science, University of Toronto
The End