Learning with Missing Data
Eran Segal, Weizmann Institute
TRANSCRIPT
Incomplete Data
- Hidden variables
- Missing values
Challenges
- Foundational – is the learning task well defined?
- Computational – how can we learn with missing data?
Treating Missing Data
How should we treat missing data?
- Case I: A coin is tossed on a table; occasionally it drops and measurements are not taken
  Sample sequence: H,T,?,?,T,?,H
  Treat missing data by ignoring it
- Case II: A coin is tossed, but only heads are reported
  Sample sequence: H,?,?,?,H,?,H
  Treat missing data by filling it in with Tails
We need to consider the mechanism by which data goes missing
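The two cases can be made concrete with a quick simulation (a sketch with made-up numbers: a coin with P(H) = 0.3 and a 50% drop rate in Case I):

```python
import random

random.seed(0)
P_HEADS = 0.3          # assumed true coin bias (illustrative)
N = 100_000

tosses = [random.random() < P_HEADS for _ in range(N)]

# Case I: each toss drops off the table with probability 0.5,
# independently of its outcome -- ignoring missing data is safe.
case1 = [t for t in tosses if random.random() < 0.5]
est1 = sum(case1) / len(case1)

# Case II: only heads are reported -- missingness depends on the value,
# so ignoring missing data is badly biased.
case2 = [t for t in tosses if t]
est2 = sum(case2) / len(case2)

print(est1, est2)  # est1 close to 0.3; est2 is exactly 1.0
```

In Case I the observed sample is an unbiased subsample, so the naive estimate recovers the true bias; in Case II the same naive estimate is forced to 1.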
Modeling Data Missing Mechanism
- X = {X1,...,Xn} are random variables
- OX = {OX1,...,OXn} are observability variables (always observed)
- Y = {Y1,...,Yn} are new random variables with Val(Yi) = Val(Xi) ∪ {?}
- Yi is a deterministic function of Xi and OXi:

    Yi = Xi  if OXi = o¹ (observed)
    Yi = ?   if OXi = o⁰ (missing)
Modeling Missing Data Mechanism
- Case I (random missing values): OX is independent of X
- Case II (deliberate missing values): OX depends on X

Case I, with P(X = H) = θ and P(OX = o¹) = ψ:
  P(Y = H) = θψ
  P(Y = T) = (1 − θ)ψ
  P(Y = ?) = 1 − ψ

  L(θ, ψ : D) = (θψ)^M[H] · ((1 − θ)ψ)^M[T] · (1 − ψ)^M[?]

Case II, where observability depends on the value, with ψ_{O|H} = P(OX = o¹ | X = H) and ψ_{O|T} = P(OX = o¹ | X = T):
  P(Y = H) = θ ψ_{O|H}
  P(Y = T) = (1 − θ) ψ_{O|T}
  P(Y = ?) = θ(1 − ψ_{O|H}) + (1 − θ)(1 − ψ_{O|T})

  L(θ, ψ : D) = (θ ψ_{O|H})^M[H] · ((1 − θ) ψ_{O|T})^M[T] · (θ(1 − ψ_{O|H}) + (1 − θ)(1 − ψ_{O|T}))^M[?]
Treating Missing Data
When can we ignore the missing data mechanism and focus only on the likelihood?
- For every Xi, Ind(Xi ; OXi)
- Missing at Random (MAR) is sufficient: the probability that the value of Xi is missing is independent of its actual value given the other observed values
- In both cases, the likelihood decomposes
Hidden (Latent) Variables
- Attempt to learn a model with hidden variables
- In this case, MAR always holds (the variable is always missing)
- Why should we care about unobserved variables?

[Figure: two networks over X1,X2,X3 and Y1,Y2,Y3; with a hidden variable H between the layers the model has 17 parameters, while marginalizing H out yields 59 parameters]
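The two parameter counts can be verified directly, assuming all variables are binary, since a binary node with k binary parents contributes 2^k free parameters (one Bernoulli parameter per parent configuration); the structures below follow the figure:

```python
# One count per node: 2**k free parameters for a binary node with k binary parents.
def n_params(parent_counts):
    return sum(2 ** k for k in parent_counts)

# With hidden H: X1,X2,X3 are roots (k=0), H has parents X1,X2,X3 (k=3),
# and each Yi has the single parent H (k=1).
with_hidden = n_params([0, 0, 0, 3, 1, 1, 1])

# Marginalizing H out connects everything downstream: each Yi depends on
# X1,X2,X3 and on the Y's before it (k = 3, 4, 5).
without_hidden = n_params([0, 0, 0, 3, 4, 5])

print(with_hidden, without_hidden)  # 17 59
```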
Hidden (Latent) Variables
- Hidden variables also appear in clustering
- Naïve Bayes model: the class (cluster) variable is hidden; the observed attributes X1,...,Xn are independent given the class, and may themselves have missing values
Likelihood for Complete Data
Network: X → Y

P(X):                 P(Y|X):
  x0: θ_{x0}            X     y0          y1
  x1: θ_{x1}            x0    θ_{y0|x0}   θ_{y1|x0}
                        x1    θ_{y0|x1}   θ_{y1|x1}

Input data:
  X    Y
  x0   y0
  x0   y1
  x1   y0

Likelihood:
  L(θ : D) = P(x[1], y[1]) · P(x[2], y[2]) · P(x[3], y[3])
           = P(x0, y0) · P(x0, y1) · P(x1, y0)
           = θ_{x0} θ_{y0|x0} · θ_{x0} θ_{y1|x0} · θ_{x1} θ_{y0|x1}

- Likelihood decomposes by variables
- Likelihood decomposes within CPDs
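The decomposition can be checked numerically (a sketch with made-up CPD values for the X → Y network above):

```python
# Illustrative parameter values (not estimated from anything).
theta_x = {'x0': 0.6, 'x1': 0.4}
theta_y_given_x = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
                   ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

data = [('x0', 'y0'), ('x0', 'y1'), ('x1', 'y0')]

# Complete data: each instance contributes theta_x * theta_{y|x}.
L = 1.0
for x, y in data:
    L *= theta_x[x] * theta_y_given_x[(y, x)]

# Same value, grouped per CPD: decomposition by variables and within CPDs.
L_x = theta_x['x0'] ** 2 * theta_x['x1']
L_y = (theta_y_given_x[('y0', 'x0')] * theta_y_given_x[('y1', 'x0')]
       * theta_y_given_x[('y0', 'x1')])

print(L, L_x * L_y)  # identical products
```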
Likelihood for Incomplete Data
Network: X → Y, with the same CPDs P(X) and P(Y|X) as before

Input data:
  X    Y
  ?    y0
  x0   y1
  ?    y0

Likelihood:
  L(θ : D) = P(y0) · P(x0, y1) · P(y0)
           = (Σ_x P(x, y0)) · P(x0, y1) · (Σ_x P(x, y0))
           = θ_{x0} θ_{y1|x0} · (θ_{x0} θ_{y0|x0} + θ_{x1} θ_{y0|x1})²

- Likelihood does not decompose by variables
- Likelihood does not decompose within CPDs
- Computing the likelihood per instance requires inference!
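In code, the missing X forces a sum over its values for every affected instance; this is the per-instance inference step (same illustrative CPD values as before):

```python
# Illustrative parameter values (not estimated from anything).
theta_x = {'x0': 0.6, 'x1': 0.4}
theta_y_given_x = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
                   ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

def p_instance(x, y):
    if x is None:  # x missing: marginalize over its values (inference)
        return sum(theta_x[xv] * theta_y_given_x[(y, xv)] for xv in theta_x)
    return theta_x[x] * theta_y_given_x[(y, x)]

data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
L = 1.0
for x, y in data:
    L *= p_instance(x, y)

print(L)  # a product of marginals, no longer a product of CPD entries
```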
Bayesian Estimation

[Figure: Bayesian network for parameter estimation, with parameter nodes θ_X, θ_{Y|X=0}, θ_{Y|X=1} connected to the instance nodes X[1], X[2], ..., X[M] and Y[1], Y[2], ..., Y[M]]

With missing data, the posteriors over the parameters are not independent
Identifiability
- The likelihood can have multiple global maxima
- Example: we can rename the values of the hidden variable H; if H has two values, the likelihood has two global maxima
- With many hidden variables, there can be an exponential number of global maxima
- Multiple local and global maxima can also occur with missing data (not only hidden variables)
MLE from Incomplete Data
Nonlinear optimization problem: maximize L(D|θ)

Gradient Ascent:
- Follow the gradient of the likelihood w.r.t. the parameters
- Add line search and conjugate gradient methods to get fast convergence

Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice")
- Guarantee: the maximum of the new function has a better score than the current point

Both Gradient Ascent and EM:
- Find local maxima
- Require multiple restarts to find an approximation to the global maximum
- Require inference computations in each iteration
Gradient Ascent
Theorem:
  ∂log P(D|θ) / ∂θ_{xi,pai} = (1/θ_{xi,pai}) Σ_m P(xi, pai | o[m], θ)

Proof:
  ∂log P(D|θ) / ∂θ_{xi,pai} = Σ_m ∂log P(o[m]|θ) / ∂θ_{xi,pai}
                            = Σ_m (1 / P(o[m]|θ)) · ∂P(o[m]|θ) / ∂θ_{xi,pai}

How do we compute ∂P(o[m]|θ) / ∂θ_{xi,pai}?
Gradient Ascent
  P(o[m]|θ) = Σ_{x'i,pa'i} P(x'i, pa'i, o[m] | θ)
            = Σ_{x'i,pa'i} P(pa'i | θ) P(x'i | pa'i, θ) P(o[m] | x'i, pa'i, θ)
            = Σ_{x'i,pa'i} P(pa'i | θ) P(o[m] | x'i, pa'i, θ) θ_{x'i|pa'i}

Only the summand with x'i = xi, pa'i = pai involves θ_{xi|pai}, so:
  ∂P(o[m]|θ) / ∂θ_{xi,pai} = P(pai | θ) P(o[m] | xi, pai, θ) = P(xi, pai, o[m] | θ) / θ_{xi,pai}
Gradient Ascent
Putting it together:
  ∂log P(D|θ) / ∂θ_{xi,pai} = Σ_m (1 / P(o[m]|θ)) · P(xi, pai, o[m] | θ) / θ_{xi,pai}
                            = (1/θ_{xi,pai}) Σ_m P(xi, pai | o[m], θ)

- Requires computation of P(xi, pai | o[m], θ) for all i, m
- Can be done with the clique-tree algorithm, since Xi, Pai are in the same clique
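The gradient theorem can be sanity-checked by finite differences on the toy X → Y network with data (?, y0), (x0, y1), (?, y0); the CPD values are made up for illustration, and the table entry θ_{y0|x0} is treated as a free parameter:

```python
import math

def loglik(tx, ty):
    def p(x, y):
        if x is None:  # x missing: marginalize it out
            return sum(tx[xv] * ty[(y, xv)] for xv in tx)
        return tx[x] * ty[(y, x)]
    data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
    return sum(math.log(p(x, y)) for x, y in data)

tx = {'x0': 0.6, 'x1': 0.4}
ty = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
      ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

# Analytic gradient w.r.t. theta_{y0|x0}: (1/theta) * sum_m P(x0, y0 | o[m])
theta = ty[('y0', 'x0')]
posterior_sum = 0.0
for x, y in [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]:
    if x is None and y == 'y0':
        joint = tx['x0'] * ty[('y0', 'x0')]
        marginal = sum(tx[xv] * ty[('y0', xv)] for xv in tx)
        posterior_sum += joint / marginal
    # other instances are inconsistent with (x0, y0): posterior is 0
grad_analytic = posterior_sum / theta

# Finite-difference gradient on the same table entry
eps = 1e-6
ty_hi = dict(ty); ty_hi[('y0', 'x0')] = theta + eps
ty_lo = dict(ty); ty_lo[('y0', 'x0')] = theta - eps
grad_numeric = (loglik(tx, ty_hi) - loglik(tx, ty_lo)) / (2 * eps)

print(grad_analytic, grad_numeric)  # the two gradients agree
```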
Gradient Ascent Summary
Pros:
- Flexible, can be extended to non-table CPDs
Cons:
- Need to project the gradient onto the space of legal parameters
- For reasonable convergence, need to combine with advanced methods (conjugate gradient, line search)
Expectation Maximization (EM)
A tailored algorithm for optimizing likelihood functions

Intuition:
- Parameter estimation is easy given complete data
- Computing the probability of missing data is "easy" (= inference) given parameters

Strategy:
- Pick a starting point for the parameters
- "Complete" the data using the current parameters
- Estimate parameters relative to the data completion
- Iterate
- The procedure is guaranteed to improve at each iteration
Expectation Maximization (EM)
- Initialize parameters to θ⁰
- Expectation (E-step): for each data case o[m] and each family X,U, compute P(X,U | o[m], θⁱ), and accumulate the expected sufficient statistics for each x,u:

    M̄[x,u] = Σ_m P(x, u | o[m], θⁱ)

- Maximization (M-step): treat the expected sufficient statistics as observed, and set the parameters to the MLE with respect to them:

    θⁱ⁺¹_{x|u} = M̄[x,u] / M̄[u]
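The E- and M-steps above can be run on the toy X → Y network from the earlier likelihood slides (a minimal sketch; the uniform starting parameters and the three-instance dataset are illustrative):

```python
import math

data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
tx = {'x0': 0.5, 'x1': 0.5}                      # theta^0: arbitrary start
ty = {('y0', 'x0'): 0.5, ('y1', 'x0'): 0.5,
      ('y0', 'x1'): 0.5, ('y1', 'x1'): 0.5}

def loglik():
    total = 0.0
    for x, y in data:
        xs = tx if x is None else [x]
        total += math.log(sum(tx[xv] * ty[(y, xv)] for xv in xs))
    return total

prev = loglik()
for _ in range(20):
    # E-step: expected sufficient statistics M[x,y]
    M = {k: 0.0 for k in ty}
    for x, y in data:
        if x is None:
            z = sum(tx[xv] * ty[(y, xv)] for xv in tx)
            for xv in tx:
                M[(y, xv)] += tx[xv] * ty[(y, xv)] / z
        else:
            M[(y, x)] += 1.0
    # M-step: MLE relative to the expected counts
    for xv in tx:
        n = M[('y0', xv)] + M[('y1', xv)]
        tx[xv] = n / len(data)
        ty[('y0', xv)] = M[('y0', xv)] / n
        ty[('y1', xv)] = M[('y1', xv)] / n
    cur = loglik()
    assert cur >= prev - 1e-9      # EM never decreases the likelihood
    prev = cur

print(tx, math.exp(prev))
```

Each iteration "completes" the hidden X with a posterior (E-step) and then renormalizes the completed counts (M-step); the monotonicity assertion is exactly the formal guarantee stated on the next slide.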
Expectation Maximization (EM)

[Figure: one EM iteration. The training data (X,Y) = (?, y0), (x0, y1), (?, y0) and the initial X → Y network feed the E-step (inference), which produces expected counts N(X) and N(X,Y); the M-step (reparameterize) produces an updated network; iterate.]
Expectation Maximization (EM)
Formal guarantees:
- L(D : θⁱ⁺¹) ≥ L(D : θⁱ): each iteration improves the likelihood
- If θⁱ⁺¹ = θⁱ, then θⁱ is a stationary point of L(D : θ); usually this means a local maximum

Main cost: computing the expected counts in the E-step
- Requires inference for each instance in the training set
- Exactly the same as in gradient ascent!
EM – Practical Considerations
Initial parameters:
- Highly sensitive to starting parameters
- Choose randomly, or by guessing from another source
Stopping criteria:
- Small change in data likelihood
- Small change in parameters
Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising starting points
EM in Practice – Alarm Network

[Figure: the ALARM network (37 nodes, including MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, CATECHOL, HRBP, CVP, BP, ...)]

- Alarm network; data sampled from the true network
- 20% of data randomly deleted
EM in Practice – Alarm Network

[Figures: learning curves showing training error and test error]
Partial Data: Parameter Estimation
- Nonlinear optimization problem
- Methods for learning: EM and Gradient Ascent
- Exploit inference for learning

Challenges:
- Exploration of a complex likelihood/posterior
  - More missing data → many more local maxima
  - Cannot represent the posterior → must resort to approximations
- Inference is the main computational bottleneck for learning
  - For learning large networks, exact inference is infeasible → resort to approximate inference
Structure Learning w. Missing Data
Distinguish two learning problems:
- Learning structure for a given set of random variables
- Introducing new hidden variables
  - How do we recognize the need for a new variable?
  - Where do we introduce a newly added hidden variable within G?
  - Open ended and less understood…
Structure Learning w. Missing Data
Theoretically, there is no problem:
- Define a score, and search for the structure that maximizes it
- The likelihood term will require gradient ascent or EM

    Score_BIC(G : D) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G)

Practically infeasible:
- Typically we have O(n²) candidates at each search step
- Requires EM for evaluating each candidate
- Requires inference for each data instance of each candidate
- Total running time per search step: O(n² · M · #EM iterations · cost of BN inference)
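The BIC score formula is simple to evaluate once the MLE log-likelihood and the parameter dimension are known; as a sketch, take the earlier toy X → Y network with the three complete instances (x0,y0), (x0,y1), (x1,y0), whose MLE (θ_{x0} = 2/3, θ_{y0|x0} = 1/2, θ_{y0|x1} = 1) gives each instance probability 1/3:

```python
import math

M = 3                             # number of training instances
loglik_mle = math.log(1 / 27)     # l(theta_hat : D): three instances at prob 1/3 each
dim_g = 3                         # free parameters: theta_x plus two rows of theta_{y|x}

score_bic = loglik_mle - (math.log(M) / 2) * dim_g
print(score_bic)
```

The penalty term grows with both the network dimension and (logarithmically) the dataset size, which is what makes denser candidate structures pay for their extra parameters.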
Typical Search

[Figure: one search step over a 4-node network A,B,C,D, with candidate moves Add B→D, Delete B→C, and Reverse C→B; evaluating each candidate requires a full EM run]
Structural EM
Basic idea: use expected sufficient statistics to learn structure, not just parameters
- Use the current network to complete the data using EM
- Treat the completed data as "real" to score candidates
- Pick the candidate network with the best score
- Use the previous completed counts to evaluate networks in the next step
- After several steps, compute a new data completion from the current network
Structural EM
Conceptually:
- The algorithm maintains an actual distribution Q over completed datasets, as well as the current structure G and parameters θ_G
- At each step we do one of the following:
  - Use ⟨G, θ_G⟩ to compute a new completion Q, and redefine θ_G as the MLE relative to Q
  - Evaluate candidate successors G' relative to Q and pick the best
In practice:
- Maintain Q implicitly as a model ⟨G, θ_G⟩
- Use the model to compute sufficient statistics M_Q[x,u] when these are needed to evaluate new structures
- Use the sufficient statistics to compute MLE estimates of candidate structures
Structural EM
Benefits:
- Many fewer EM runs
- The score relative to the completed data is decomposable!
  - Same benefits as structure learning with complete data
  - Each candidate network requires few recomputations
  - The savings are large, since each sufficient statistics computation requires inference
- As in EM, we optimize a simpler score
- Can show improvement and convergence: an SEM step that improves in D⁺ space improves the real score:

    E_Q[Score_MDL(G', θ_{G'} : D⁺)] ≥ E_Q[Score_MDL(G, θ_G : D⁺)]
      ⟹ Score_MDL(G', θ_{G'} : D) ≥ Score_MDL(G, θ_G : D)