Learning with Missing Data
Eran Segal, Weizmann Institute
TRANSCRIPT
Incomplete Data
- Hidden variables
- Missing values
Challenges
- Foundational – is the learning task well defined?
- Computational – how can we learn with missing data?
Treating Missing Data
How should we treat missing data?
- Case I: A coin is tossed on a table; occasionally it drops and measurements are not taken
  Sample sequence: H,T,?,?,T,?,H
  Treat missing data by ignoring it
- Case II: A coin is tossed, but only heads are reported
  Sample sequence: H,?,?,?,H,?,H
  Treat missing data by filling it in with Tails
We need to consider the mechanism by which data goes missing
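The two cases can be made concrete with a quick simulation (a sketch with made-up numbers: a coin with P(H) = 0.3 and a 50% drop rate in Case I):

```python
import random

random.seed(0)
P_HEADS = 0.3          # assumed true coin bias (illustrative)
N = 100_000

tosses = [random.random() < P_HEADS for _ in range(N)]

# Case I: each toss drops off the table with probability 0.5,
# independently of its outcome -- ignoring missing data is safe.
case1 = [t for t in tosses if random.random() < 0.5]
est1 = sum(case1) / len(case1)

# Case II: only heads are reported -- missingness depends on the value,
# so ignoring missing data is badly biased.
case2 = [t for t in tosses if t]
est2 = sum(case2) / len(case2)

print(est1, est2)  # est1 close to 0.3; est2 is exactly 1.0
```

In Case I the observed sample is an unbiased subsample, so the naive estimate recovers the true bias; in Case II the same naive estimate is forced to 1.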
Modeling Data Missing Mechanism
- X = {X1,...,Xn} are random variables
- OX = {OX1,...,OXn} are observability variables (always observed)
- Y = {Y1,...,Yn} are new random variables with Val(Yi) = Val(Xi) ∪ {?}
- Yi is a deterministic function of Xi and OXi:

    Yi = Xi  if OXi = o¹ (observed)
    Yi = ?   if OXi = o⁰ (missing)
Modeling Missing Data Mechanism
- Case I (random missing values): OX is independent of X
- Case II (deliberate missing values): OX depends on X

Case I, with P(X = H) = θ and P(OX = o¹) = ψ:
  P(Y = H) = θψ
  P(Y = T) = (1 − θ)ψ
  P(Y = ?) = 1 − ψ

  L(θ, ψ : D) = (θψ)^M[H] · ((1 − θ)ψ)^M[T] · (1 − ψ)^M[?]

Case II, where observability depends on the value, with ψ_{O|H} = P(OX = o¹ | X = H) and ψ_{O|T} = P(OX = o¹ | X = T):
  P(Y = H) = θ ψ_{O|H}
  P(Y = T) = (1 − θ) ψ_{O|T}
  P(Y = ?) = θ(1 − ψ_{O|H}) + (1 − θ)(1 − ψ_{O|T})

  L(θ, ψ : D) = (θ ψ_{O|H})^M[H] · ((1 − θ) ψ_{O|T})^M[T] · (θ(1 − ψ_{O|H}) + (1 − θ)(1 − ψ_{O|T}))^M[?]
Treating Missing Data
When can we ignore the missing data mechanism and focus only on the likelihood?
- For every Xi, Ind(Xi ; OXi)
- Missing at Random (MAR) is sufficient: the probability that the value of Xi is missing is independent of its actual value given the other observed values
- In both cases, the likelihood decomposes
Hidden (Latent) Variables
- Attempt to learn a model with hidden variables
- In this case, MAR always holds (the variable is always missing)
- Why should we care about unobserved variables?

[Figure: two networks over X1,X2,X3 and Y1,Y2,Y3; with a hidden variable H between the layers the model has 17 parameters, while marginalizing H out yields 59 parameters]
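The two parameter counts can be verified directly, assuming all variables are binary, since a binary node with k binary parents contributes 2^k free parameters (one Bernoulli parameter per parent configuration); the structures below follow the figure:

```python
# One count per node: 2**k free parameters for a binary node with k binary parents.
def n_params(parent_counts):
    return sum(2 ** k for k in parent_counts)

# With hidden H: X1,X2,X3 are roots (k=0), H has parents X1,X2,X3 (k=3),
# and each Yi has the single parent H (k=1).
with_hidden = n_params([0, 0, 0, 3, 1, 1, 1])

# Marginalizing H out connects everything downstream: each Yi depends on
# X1,X2,X3 and on the Y's before it (k = 3, 4, 5).
without_hidden = n_params([0, 0, 0, 3, 4, 5])

print(with_hidden, without_hidden)  # 17 59
```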
Hidden (Latent) Variables
- Hidden variables also appear in clustering
- Naïve Bayes model: the class (cluster) variable is hidden; the observed attributes X1,...,Xn are independent given the class, and may themselves have missing values
Likelihood for Complete Data
Network: X → Y

P(X):                 P(Y|X):
  x0: θ_{x0}            X     y0          y1
  x1: θ_{x1}            x0    θ_{y0|x0}   θ_{y1|x0}
                        x1    θ_{y0|x1}   θ_{y1|x1}

Input data:
  X    Y
  x0   y0
  x0   y1
  x1   y0

Likelihood:
  L(θ : D) = P(x[1], y[1]) · P(x[2], y[2]) · P(x[3], y[3])
           = P(x0, y0) · P(x0, y1) · P(x1, y0)
           = θ_{x0} θ_{y0|x0} · θ_{x0} θ_{y1|x0} · θ_{x1} θ_{y0|x1}

- Likelihood decomposes by variables
- Likelihood decomposes within CPDs
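The decomposition can be checked numerically (a sketch with made-up CPD values for the X → Y network above):

```python
# Illustrative parameter values (not estimated from anything).
theta_x = {'x0': 0.6, 'x1': 0.4}
theta_y_given_x = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
                   ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

data = [('x0', 'y0'), ('x0', 'y1'), ('x1', 'y0')]

# Complete data: each instance contributes theta_x * theta_{y|x}.
L = 1.0
for x, y in data:
    L *= theta_x[x] * theta_y_given_x[(y, x)]

# Same value, grouped per CPD: decomposition by variables and within CPDs.
L_x = theta_x['x0'] ** 2 * theta_x['x1']
L_y = (theta_y_given_x[('y0', 'x0')] * theta_y_given_x[('y1', 'x0')]
       * theta_y_given_x[('y0', 'x1')])

print(L, L_x * L_y)  # identical products
```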
Likelihood for Incomplete Data
Network: X → Y, with the same CPDs P(X) and P(Y|X) as before

Input data:
  X    Y
  ?    y0
  x0   y1
  ?    y0

Likelihood:
  L(θ : D) = P(y0) · P(x0, y1) · P(y0)
           = (Σ_x P(x, y0)) · P(x0, y1) · (Σ_x P(x, y0))
           = θ_{x0} θ_{y1|x0} · (θ_{x0} θ_{y0|x0} + θ_{x1} θ_{y0|x1})²

- Likelihood does not decompose by variables
- Likelihood does not decompose within CPDs
- Computing the likelihood per instance requires inference!
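In code, the missing X forces a sum over its values for every affected instance; this is the per-instance inference step (same illustrative CPD values as before):

```python
# Illustrative parameter values (not estimated from anything).
theta_x = {'x0': 0.6, 'x1': 0.4}
theta_y_given_x = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
                   ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

def p_instance(x, y):
    if x is None:  # x missing: marginalize over its values (inference)
        return sum(theta_x[xv] * theta_y_given_x[(y, xv)] for xv in theta_x)
    return theta_x[x] * theta_y_given_x[(y, x)]

data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
L = 1.0
for x, y in data:
    L *= p_instance(x, y)

print(L)  # a product of marginals, no longer a product of CPD entries
```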
Bayesian Estimation

[Figure: Bayesian network for parameter estimation, with parameter nodes θ_X, θ_{Y|X=0}, θ_{Y|X=1} connected to the instance nodes X[1], X[2], ..., X[M] and Y[1], Y[2], ..., Y[M]]

With missing data, the posteriors over the parameters are not independent
Identifiability
- The likelihood can have multiple global maxima
- Example: we can rename the values of the hidden variable H; if H has two values, the likelihood has two global maxima
- With many hidden variables, there can be an exponential number of global maxima
- Multiple local and global maxima can also occur with missing data (not only hidden variables)
MLE from Incomplete Data
Nonlinear optimization problem: maximize L(D|θ)

Gradient Ascent:
- Follow the gradient of the likelihood w.r.t. the parameters
- Add line search and conjugate gradient methods to get fast convergence

Expectation Maximization (EM):
- Use the "current point" to construct an alternative function (which is "nice")
- Guarantee: the maximum of the new function has a better score than the current point

Both Gradient Ascent and EM:
- Find local maxima
- Require multiple restarts to find an approximation to the global maximum
- Require inference computations in each iteration
Gradient Ascent
Theorem:
  ∂log P(D|θ) / ∂θ_{xi,pai} = (1/θ_{xi,pai}) Σ_m P(xi, pai | o[m], θ)

Proof:
  ∂log P(D|θ) / ∂θ_{xi,pai} = Σ_m ∂log P(o[m]|θ) / ∂θ_{xi,pai}
                            = Σ_m (1 / P(o[m]|θ)) · ∂P(o[m]|θ) / ∂θ_{xi,pai}

How do we compute ∂P(o[m]|θ) / ∂θ_{xi,pai}?
Gradient Ascent
  P(o[m]|θ) = Σ_{x'i,pa'i} P(x'i, pa'i, o[m] | θ)
            = Σ_{x'i,pa'i} P(pa'i | θ) P(x'i | pa'i, θ) P(o[m] | x'i, pa'i, θ)
            = Σ_{x'i,pa'i} P(pa'i | θ) P(o[m] | x'i, pa'i, θ) θ_{x'i|pa'i}

Only the summand with x'i = xi, pa'i = pai involves θ_{xi|pai}, so:
  ∂P(o[m]|θ) / ∂θ_{xi,pai} = P(pai | θ) P(o[m] | xi, pai, θ) = P(xi, pai, o[m] | θ) / θ_{xi,pai}
Gradient Ascent
Putting it together:
  ∂log P(D|θ) / ∂θ_{xi,pai} = Σ_m (1 / P(o[m]|θ)) · P(xi, pai, o[m] | θ) / θ_{xi,pai}
                            = (1/θ_{xi,pai}) Σ_m P(xi, pai | o[m], θ)

- Requires computation of P(xi, pai | o[m], θ) for all i, m
- Can be done with the clique-tree algorithm, since Xi, Pai are in the same clique
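The gradient theorem can be sanity-checked by finite differences on the toy X → Y network with data (?, y0), (x0, y1), (?, y0); the CPD values are made up for illustration, and the table entry θ_{y0|x0} is treated as a free parameter:

```python
import math

def loglik(tx, ty):
    def p(x, y):
        if x is None:  # x missing: marginalize it out
            return sum(tx[xv] * ty[(y, xv)] for xv in tx)
        return tx[x] * ty[(y, x)]
    data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
    return sum(math.log(p(x, y)) for x, y in data)

tx = {'x0': 0.6, 'x1': 0.4}
ty = {('y0', 'x0'): 0.7, ('y1', 'x0'): 0.3,
      ('y0', 'x1'): 0.2, ('y1', 'x1'): 0.8}

# Analytic gradient w.r.t. theta_{y0|x0}: (1/theta) * sum_m P(x0, y0 | o[m])
theta = ty[('y0', 'x0')]
posterior_sum = 0.0
for x, y in [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]:
    if x is None and y == 'y0':
        joint = tx['x0'] * ty[('y0', 'x0')]
        marginal = sum(tx[xv] * ty[('y0', xv)] for xv in tx)
        posterior_sum += joint / marginal
    # other instances are inconsistent with (x0, y0): posterior is 0
grad_analytic = posterior_sum / theta

# Finite-difference gradient on the same table entry
eps = 1e-6
ty_hi = dict(ty); ty_hi[('y0', 'x0')] = theta + eps
ty_lo = dict(ty); ty_lo[('y0', 'x0')] = theta - eps
grad_numeric = (loglik(tx, ty_hi) - loglik(tx, ty_lo)) / (2 * eps)

print(grad_analytic, grad_numeric)  # the two gradients agree
```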
Gradient Ascent Summary
Pros:
- Flexible, can be extended to non-table CPDs
Cons:
- Need to project the gradient onto the space of legal parameters
- For reasonable convergence, need to combine with advanced methods (conjugate gradient, line search)
Expectation Maximization (EM)
A tailored algorithm for optimizing likelihood functions

Intuition:
- Parameter estimation is easy given complete data
- Computing the probability of missing data is "easy" (= inference) given parameters

Strategy:
- Pick a starting point for the parameters
- "Complete" the data using the current parameters
- Estimate parameters relative to the data completion
- Iterate
- The procedure is guaranteed to improve at each iteration
Expectation Maximization (EM)
- Initialize parameters to θ⁰
- Expectation (E-step): for each data case o[m] and each family X,U, compute P(X,U | o[m], θⁱ), and accumulate the expected sufficient statistics for each x,u:

    M̄[x,u] = Σ_m P(x, u | o[m], θⁱ)

- Maximization (M-step): treat the expected sufficient statistics as observed, and set the parameters to the MLE with respect to them:

    θⁱ⁺¹_{x|u} = M̄[x,u] / M̄[u]
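The E- and M-steps above can be run on the toy X → Y network from the earlier likelihood slides (a minimal sketch; the uniform starting parameters and the three-instance dataset are illustrative):

```python
import math

data = [(None, 'y0'), ('x0', 'y1'), (None, 'y0')]
tx = {'x0': 0.5, 'x1': 0.5}                      # theta^0: arbitrary start
ty = {('y0', 'x0'): 0.5, ('y1', 'x0'): 0.5,
      ('y0', 'x1'): 0.5, ('y1', 'x1'): 0.5}

def loglik():
    total = 0.0
    for x, y in data:
        xs = tx if x is None else [x]
        total += math.log(sum(tx[xv] * ty[(y, xv)] for xv in xs))
    return total

prev = loglik()
for _ in range(20):
    # E-step: expected sufficient statistics M[x,y]
    M = {k: 0.0 for k in ty}
    for x, y in data:
        if x is None:
            z = sum(tx[xv] * ty[(y, xv)] for xv in tx)
            for xv in tx:
                M[(y, xv)] += tx[xv] * ty[(y, xv)] / z
        else:
            M[(y, x)] += 1.0
    # M-step: MLE relative to the expected counts
    for xv in tx:
        n = M[('y0', xv)] + M[('y1', xv)]
        tx[xv] = n / len(data)
        ty[('y0', xv)] = M[('y0', xv)] / n
        ty[('y1', xv)] = M[('y1', xv)] / n
    cur = loglik()
    assert cur >= prev - 1e-9      # EM never decreases the likelihood
    prev = cur

print(tx, math.exp(prev))
```

Each iteration "completes" the hidden X with a posterior (E-step) and then renormalizes the completed counts (M-step); the monotonicity assertion is exactly the formal guarantee stated on the next slide.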
Expectation Maximization (EM)

[Figure: one EM iteration. The training data (X,Y) = (?, y0), (x0, y1), (?, y0) and the initial X → Y network feed the E-step (inference), which produces expected counts N(X) and N(X,Y); the M-step (reparameterize) produces an updated network; iterate.]
Expectation Maximization (EM)
Formal guarantees:
- L(D : θⁱ⁺¹) ≥ L(D : θⁱ): each iteration improves the likelihood
- If θⁱ⁺¹ = θⁱ, then θⁱ is a stationary point of L(D : θ); usually this means a local maximum

Main cost: computing the expected counts in the E-step
- Requires inference for each instance in the training set
- Exactly the same as in gradient ascent!
EM – Practical Considerations
Initial parameters:
- Highly sensitive to starting parameters
- Choose randomly, or by guessing from another source
Stopping criteria:
- Small change in data likelihood
- Small change in parameters
Avoiding bad local maxima:
- Multiple restarts
- Early pruning of unpromising starting points
EM in Practice – Alarm Network

[Figure: the ALARM network (37 nodes, including MINVOLSET, VENTMACH, INTUBATION, PULMEMBOLUS, CATECHOL, HRBP, CVP, BP, ...)]

- Alarm network; data sampled from the true network
- 20% of data randomly deleted
EM in Practice – Alarm Network

[Figures: learning curves showing training error and test error]
Partial Data: Parameter Estimation
- Nonlinear optimization problem
- Methods for learning: EM and Gradient Ascent
- Exploit inference for learning

Challenges:
- Exploration of a complex likelihood/posterior
  - More missing data → many more local maxima
  - Cannot represent the posterior → must resort to approximations
- Inference is the main computational bottleneck for learning
  - For learning large networks, exact inference is infeasible → resort to approximate inference
Structure Learning w. Missing Data
Distinguish two learning problems:
- Learning structure for a given set of random variables
- Introducing new hidden variables
  - How do we recognize the need for a new variable?
  - Where do we introduce a newly added hidden variable within G?
  - Open ended and less understood…
Structure Learning w. Missing Data
Theoretically, there is no problem:
- Define a score, and search for the structure that maximizes it
- The likelihood term will require gradient ascent or EM

    Score_BIC(G : D) = ℓ(θ̂_G : D) − (log M / 2) · Dim(G)

Practically infeasible:
- Typically we have O(n²) candidates at each search step
- Requires EM for evaluating each candidate
- Requires inference for each data instance of each candidate
- Total running time per search step: O(n² · M · #EM iterations · cost of BN inference)
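The BIC score formula is simple to evaluate once the MLE log-likelihood and the parameter dimension are known; as a sketch, take the earlier toy X → Y network with the three complete instances (x0,y0), (x0,y1), (x1,y0), whose MLE (θ_{x0} = 2/3, θ_{y0|x0} = 1/2, θ_{y0|x1} = 1) gives each instance probability 1/3:

```python
import math

M = 3                             # number of training instances
loglik_mle = math.log(1 / 27)     # l(theta_hat : D): three instances at prob 1/3 each
dim_g = 3                         # free parameters: theta_x plus two rows of theta_{y|x}

score_bic = loglik_mle - (math.log(M) / 2) * dim_g
print(score_bic)
```

The penalty term grows with both the network dimension and (logarithmically) the dataset size, which is what makes denser candidate structures pay for their extra parameters.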
Typical Search

[Figure: one search step over a 4-node network A,B,C,D, with candidate moves Add B→D, Delete B→C, and Reverse C→B; evaluating each candidate requires a full EM run]
Structural EM
Basic idea: use expected sufficient statistics to learn structure, not just parameters
- Use the current network to complete the data using EM
- Treat the completed data as "real" to score candidates
- Pick the candidate network with the best score
- Use the previous completed counts to evaluate networks in the next step
- After several steps, compute a new data completion from the current network
Structural EM
Conceptually:
- The algorithm maintains an actual distribution Q over completed datasets, as well as the current structure G and parameters θ_G
- At each step we do one of the following:
  - Use ⟨G, θ_G⟩ to compute a new completion Q, and redefine θ_G as the MLE relative to Q
  - Evaluate candidate successors G' relative to Q and pick the best
In practice:
- Maintain Q implicitly as a model ⟨G, θ_G⟩
- Use the model to compute sufficient statistics M_Q[x,u] when these are needed to evaluate new structures
- Use the sufficient statistics to compute MLE estimates of candidate structures
Structural EM
Benefits:
- Many fewer EM runs
- The score relative to the completed data is decomposable!
  - Same benefits as structure learning with complete data
  - Each candidate network requires few recomputations
  - The savings are large, since each sufficient statistics computation requires inference
- As in EM, we optimize a simpler score
- Can show improvement and convergence: an SEM step that improves in D⁺ space improves the real score:

    E_Q[Score_MDL(G', θ_{G'} : D⁺)] ≥ E_Q[Score_MDL(G, θ_G : D⁺)]
      ⟹ Score_MDL(G', θ_{G'} : D) ≥ Score_MDL(G, θ_G : D)