Causal Inference
Jonas Peters, Mateo Rojas-Carulla, Amar Shah
Machine Learning Group, University of Cambridge
23rd April 2015
This talk is based on work by ...
UCLA: Judea Pearl
CMU: Peter Spirtes, Clark Glymour, Richard Scheines
Harvard University: Donald Rubin, Jamie Robins
ETH Zurich: Peter Bühlmann, Nicolai Meinshausen
Max Planck Institute Tübingen: Dominik Janzing, Bernhard Schölkopf
University of Amsterdam: Joris Mooij
Patrik Hoyer
... many others
Charig et al.: “Comparison of treatment of renal calculi by open surgery, (...) ”, British Medical Journal, 1986
Simpson’s Paradox
Events C and E; F and ¬F describe subpopulations.
P(E | C) > P(E | ¬C)
P(E | C, F) < P(E | ¬C, F)
P(E | C, ¬F) < P(E | ¬C, ¬F)
Key distinction between seeing and doing.
E: recovery
C: treatment
F: size of stones
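For concreteness, a small Python sketch using the kidney-stone counts widely quoted from the Charig et al. study cited above; the exact figures are an assumption here, but they reproduce the reversal:

```python
# Kidney-stone counts as commonly quoted from Charig et al. (1986).
# C: percutaneous nephrolithotomy, notC: open surgery,
# E: recovery, F: small stones, notF: large stones.
# counts[(treatment, size)] = (recoveries, patients)
counts = {
    ("C", "F"): (234, 270), ("C", "notF"): (55, 80),
    ("notC", "F"): (81, 87), ("notC", "notF"): (192, 263),
}

def p_recovery(treatment, size=None):
    """P(E | treatment) or P(E | treatment, size) from the counts."""
    sizes = [size] if size else ["F", "notF"]
    rec = sum(counts[(treatment, s)][0] for s in sizes)
    tot = sum(counts[(treatment, s)][1] for s in sizes)
    return rec / tot

# Marginally, C looks better ...
assert p_recovery("C") > p_recovery("notC")                  # 0.83 > 0.78
# ... but within every stratum of stone size it is worse.
assert p_recovery("C", "F") < p_recovery("notC", "F")        # 0.87 < 0.93
assert p_recovery("C", "notF") < p_recovery("notC", "notF")  # 0.69 < 0.73
```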
Underlying ground truth (causal graph): size of stone → treatment, size of stone → recovery, treatment → recovery.
Causal graph: Season → Rain, Season → Sprinkler, Rain → Wet, Sprinkler → Wet, Wet → Slippery.
S1: "Turning the sprinkler on would not affect the rain."
S2: "The state of the sprinkler is independent of the state of the rain."
Pearl: "Causality: Models, Reasoning, and Inference"
We often work with i.i.d. samples from a joint distribution P. Claim:
Many times, we are interested in a different distribution P̃ ≠ P.
So to speak . . . CAUSALITY!
Pearl's do-calculus
Consider a Markovian model (a directed acyclic graph) in which all noise terms are jointly independent.

Theorem (Causal Markov Condition)
Any distribution generated by a Markovian model M can be factorized as
P(v1, . . . , vn) = ∏_i P(vi | pai),
where V1, . . . , Vn are the endogenous variables in M and pai are the values of the parents of Vi in the causal diagram associated with M.
Pearl: "Causality: Models, Reasoning, and Inference"
Corollary
For any Markovian model, the distribution generated by an intervention do(X = x0) on a set X of endogenous variables is given by the truncated factorization
P(v1, . . . , vk | do(x0)) = ∏_{i : Vi ∉ X} P(vi | pai) |_{X = x0},
where P(vi | pai) are the pre-intervention conditional probabilities.
Example: chain X → Y → Z.
Factorization: P(X, Y, Z) = P(Z | Y) P(Y | X) P(X)
Truncate: P(Y, Z | do(x0)) = P(Z | Y) P(Y | x0)
Marginalize: P(Y | do(x0)) = ∑_z P(Y, z | do(x0)) = P(Y | x0)
Coincides with the conditional probability!
Example: X → Y with a common cause Z (Z → X, Z → Y).
Factorization: P(X, Y, Z) = P(Y | X, Z) P(X | Z) P(Z)
Truncate: P(Y, Z | do(x0)) = P(Z) P(Y | x0, Z)
Marginalize: P(Y | do(x0)) = ∑_z P(Y, z | do(x0)) = ∑_z P(z) P(Y | x0, z)
This does not (in general) equal P(Y | x0) = ∑_z P(Y | x0, z) P(z | x0).
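A minimal numeric sketch of the two computations, with made-up probability tables for the confounded graph; it shows the truncated factorization and plain conditioning disagreeing:

```python
# Binary toy version of the confounded graph Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.7, 1: 0.3}
P_X_given_Z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}              # P(X=x | Z=z)
P_Y1_given_XZ = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(Y=1 | x, z)

def p_joint(x, y, z):
    """Factorization: P(x, y, z) = P(y | x, z) P(x | z) P(z)."""
    py = P_Y1_given_XZ[(x, z)] if y == 1 else 1 - P_Y1_given_XZ[(x, z)]
    return py * P_X_given_Z[z][x] * P_Z[z]

x0 = 1
# Truncated factorization: drop P(x | z), keep P(z), marginalize over z.
p_do = sum(P_Z[z] * P_Y1_given_XZ[(x0, z)] for z in (0, 1))
# Plain conditioning weights z by P(z | x0) instead of P(z).
p_x0 = sum(p_joint(x0, y, z) for y in (0, 1) for z in (0, 1))
p_cond = sum(p_joint(x0, 1, z) for z in (0, 1)) / p_x0
print(f"P(Y=1 | do(X={x0})) = {p_do:.3f}")    # 0.550
print(f"P(Y=1 |    X={x0})  = {p_cond:.3f}")  # 0.700: Z confounds X and Y
```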
P(X1, X2, X3) has been generated by a structural equation model if
X1 = f1(X3, N1)
X2 = f2(X1, X3, N2)
X3 = f3(N3)
• Ni jointly independent
• G0 has no cycles
Graph G0: X3 → X1, X3 → X2, X1 → X2. (In the kidney-stone example: size of stone → treatment, size of stone → recovery, treatment → recovery.)

Given: graph and P. We (Pearl et al.) can then compute the intervention distribution P̃: intervening on X1 replaces its equation, e.g. by X1 = f1(N1), cutting the dependence on X3.

IMPORTANT: modularity, autonomy: Aldrich 1989, Pearl 2009, Schölkopf et al. 2012, . . .
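A sketch of how modularity plays out computationally: an intervention re-uses every structural equation except the one it replaces. The functional forms and noise scales below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(intervene_x1=False):
    """Sample from the SEM above; do() replaces only X1's equation."""
    N1, N2, N3 = rng.normal(size=(3, n))
    X3 = N3                                  # X3 = f3(N3)
    if intervene_x1:
        X1 = N1                              # X1 = f1(N1): arrow X3 -> X1 cut
    else:
        X1 = 0.8 * X3 + N1                   # X1 = f1(X3, N1)
    X2 = X1 - X3 + N2                        # X2 = f2(X1, X3, N2), untouched
    return X1, X2, X3

X1, X2, X3 = sample()
X1i, X2i, X3i = sample(intervene_x1=True)
# Modularity: the mechanism P(X2 | X1, X3) is the same in both runs,
# while the dependence between X1 and X3 disappears under the intervention.
print("corr(X1, X3) before/after do:",
      np.corrcoef(X1, X3)[0, 1], np.corrcoef(X1i, X3i)[0, 1])
```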
Back to Simpson's paradox. Remember the inequalities:
P(E | C) > P(E | ¬C)
P(E | C, F) < P(E | ¬C, F)
P(E | C, ¬F) < P(E | ¬C, ¬F)
What does the do-calculus tell us?
P(E | do(C), F) < P(E | do(¬C), F)
P(E | do(C), ¬F) < P(E | do(¬C), ¬F)
Also: P(F | do(C)) = P(F | do(¬C)) = P(F). Thus:
P(E | do(C)) = P(E | do(C), F) P(F) + P(E | do(C), ¬F) P(¬F)
P(E | do(¬C)) = P(E | do(¬C), F) P(F) + P(E | do(¬C), ¬F) P(¬F)
This implies P(E | do(C)) < P(E | do(¬C)).
And the paradox is gone!
Pearl: "Causality: Models, Reasoning, and Inference"
Bottou et al.: "Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising", JMLR 2013
Given: graph and P. We (RL, IPW [Horvitz-Thompson], . . . ) can then compute P̃.
X1 = f1(X3, N1)
X2 = f2(X1, N2)
X3 = f3(N3)
X4 = f4(X2, X3, N4)
• Ni jointly independent
• G0 has no cycles
Graph G0: X1 = user data, X2 = main line, X3 = user intention, X4 = click; edges X3 → X1, X1 → X2, X2 → X4, X3 → X4. An intervention replaces the mechanism of X2, e.g. a new placement rule X2 = f̃2(X1, Ñ2).

Ok, but what if there are hiddens?
Ok, but how do we find the graph?
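In the advertising setting, a counterfactual policy can be evaluated from logged data by reweighting, as in Bottou et al. (inverse propensity weighting / Horvitz-Thompson). A toy sketch with a made-up logging policy and click model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Logged data: context x1, ad placement x2 drawn from the logging policy,
# click probability depending on both (all numbers made up).
x1 = rng.normal(size=n)
p_log = 1.0 / (1.0 + np.exp(-x1))          # logging policy: P(X2=1 | X1)
x2 = rng.random(n) < p_log
click = rng.random(n) < (0.1 + 0.2 * x2 + 0.05 * (x1 > 0))

# Counterfactual policy to evaluate offline: always place the ad (X2=1).
# IPW: keep logged samples whose action matches, reweighted by 1/propensity.
w = np.where(x2, 1.0 / p_log, 0.0)
print("IPW estimate:", np.mean(w * click))  # ground truth here is 0.325
```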
Independence-based Methods
Reichenbach's common cause principle: Assume that X and Y are dependent (X ⊥̸⊥ Y). Then
X causes Y,
Y causes X,
there is a hidden common cause, or
a combination of the above.
CM ⊥̸⊥ NL (childhood myopia and night-time ambient light are dependent)
Quinn et al.: "the strength of the association . . . does suggest that the absence of a daily period of darkness during childhood is a potential precipitating factor in the development of myopia"
Quinn, Shin, Maguire, Stone: "Myopia and ambient lighting at night", Nature 1999
Independence-based Methods
It turned out that NL ⊥⊥ CM | PM (parental myopia), and therefore the graph is
PM → NL, PM → CM.
Zadnik, Jones, Irvin, Kleinstein, Manny, Shin, Mutti: "Vision: Myopia and ambient night-time lighting", Nature 2000
Gwiazda, Ong, Held, Thorn: "Vision: Myopia and ambient night-time lighting", Nature 2000
Independence-based Methods
Example graph: W → X ← Y → Z.
Some definitions:
path: a sequence of edges, regardless of direction, e.g. W, X, Y, Z
collider: a node with more than one arrow pointing into it, e.g. X
unblocked path: a path not traversing a collider, e.g. X, Y, Z
Independence-based Methods
Nodes a and b are d-connected if there is an unblocked path between them, e.g. X and Z are; W and Y are not. W and Y are d-separated.
Nodes a and b are d-connected given a set of nodes C if there exists an unblocked path between a and b containing no node in C.
Conditioning on a collider, or one of its descendants, unblocks the paths that the collider previously blocked.
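A quick simulation of the last point: X and Z are independent, but conditioning on the collider Y (here crudely, by selecting a slice of Y) induces strong dependence. The toy model below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x = rng.normal(size=n)
z = rng.normal(size=n)                 # independent of x
y = x + z + 0.1 * rng.normal(size=n)   # collider: x -> y <- z

print("corr(X, Z)       :", np.corrcoef(x, z)[0, 1])           # ~ 0
sel = np.abs(y) < 0.1                  # condition on a slice of the collider Y
print("corr(X, Z | Y~0) :", np.corrcoef(x[sel], z[sel])[0, 1])  # strongly negative
```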
Independence-based Methods
A and B are d-separated by nodes C ⟹ A ⊥⊥ B | C.
But... we want to learn the graph from conditional independence tests. So what can we say about the converse?
Faithfulness: a distribution is faithful to the graph if, whenever A and B are not d-separated by C, then A ⊥̸⊥ B | C.
Independence-based Methods
From P(X1, . . . , X4), independence tests yield:
X1 ⊥⊥ X2
X2 ⊥⊥ X3
X1 ⊥⊥ X4 | {X3}
X1 ⊥⊥ X2 | {X3}
X2 ⊥⊥ X3 | {X1}

Generating model (graph G):
X1 = f1(N1)
X2 = f2(N2)
X3 = f3(X1, N3)
X4 = f4(X3, N4)
Ni jointly independent

Reading independences off the graph (Markov) is trivial; going back requires faithfulness, and even then several graphs G, G′, G′′ may encode the same independences, so uniqueness is not guaranteed.

Method: IC (Pearl 2009); PC, FCI (Spirtes et al., 2000)
1. Find all (conditional) independences from the data. Be smart.
2. Select the DAG(s) that correspond to these independences.
Independence-based Methods
X → Y → Z
X ← Y ← Z
X ← Y → Z
Conditional independence is a symmetric quantity, so several DAGs may be valid. All of the above have X ⊥⊥ Z | Y.
The set of admissible DAGs based on the conditional independence tests is called a Markov equivalence class.
Independence-based Methods
X → Y → Z
X ← Y ← Z
X ← Y → Z
All of the above have X ⊥⊥ Z | Y.
The only X, Y, Z configuration where X ⊥̸⊥ Z | Y is when Y is a collider:
X → Y ← Z
Here X, Y, Z form a v-structure (a collider with independent causes).
Independence-based Methods
The PC algorithm: Step 1
Form the skeleton of the graph:
Start with a fully connected graph.
Remove edges X-Y where X ⊥⊥ Y.
Remove edges X-Y for which there is a neighbour Z ≠ Y of X such that X ⊥⊥ Y | Z.
Remove edges X-Y for which there are 2 neighbours Z1, Z2 ≠ Y of X such that X ⊥⊥ Y | Z1, Z2.
...
Independence-based Methods
The PC algorithm: Step 2
Find v-structures:
Search over triples X-Y-Z where X and Z are not connected.
Since X and Z are not connected, there is a separating set S s.t. X ⊥⊥ Z | S.
Is Y in the set S? If not, then X-Y-Z must form a v-structure.
Independence-based Methods
The PC algorithm: Step 3
Direct further edges:
We cannot have cycles and have found all v-structures. Use this knowledge to fill in other arrows.
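A compact sketch of Step 1 in Python; `indep(x, y, S)` stands for whichever conditional-independence test one plugs in (partial correlation, a kernel test, ...) and is assumed rather than implemented here:

```python
from itertools import combinations

def pc_skeleton(nodes, indep):
    """Step 1 of PC: prune a complete graph using conditional independence.

    `indep(x, y, S)` must return True iff X is independent of Y given S
    according to some statistical test (assumed, not implemented here).
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # fully connected graph
    level = 0                                    # size of conditioning set
    while any(len(adj[x]) - 1 >= level for x in nodes):
        for x in list(nodes):
            for y in list(adj[x]):
                # condition only on current neighbours of x (excluding y)
                for S in combinations(adj[x] - {y}, level):
                    if indep(x, y, set(S)):
                        adj[x].discard(y)        # remove edge x-y
                        adj[y].discard(x)
                        break
        level += 1                               # |S| = 0, 1, 2, ...
    return adj
```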
Restricted Structural Equation Models
Mooij, JP, Janzing, Zscheischler, Schölkopf: "Distinguishing cause from effect using observational data: methods and benchmarks", submitted 2014
Restricted Structural Equation Models
Assume P(X1, . . . , X4) has been generated by
X1 = f1(X3, N1)
X2 = N2
X3 = f3(X2, N3)
X4 = f4(X2, X3, N4)
• Ni jointly independent
• G0 has no cycles
Graph G0: X2 → X3, X3 → X1, X2 → X4, X3 → X4.

Structural equation model. Can the DAG be recovered from P(X1, . . . , X4)? No.
JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 2014
P. Bühlmann, JP, J. Ernest: CAM: Causal additive models, high-dimensional order search and penalized regression, Annals of Statistics 2014
Intuition: Y = f(X, N) with discrete N taking d values N1, . . . , Nd.
Graph: X → Y, N → Y (N unobserved).
The noise values could switch between d different mechanisms f1, . . . , fd!
Restricted Structural Equation Models
Assume P(X1, . . . , X4) has been generated by
X1 = f1(X3) + N1
X2 = N2
X3 = f3(X2) + N3
X4 = f4(X2, X3) + N4
• Ni ∼ N(0, σi²) jointly independent
• G0 has no cycles
Graph G0: X2 → X3, X3 → X1, X2 → X4, X3 → X4.

Additive noise model with Gaussian noise. Can the DAG be recovered from P(X1, . . . , X4)? Yes, iff the fi are nonlinear.
JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 2014
P. Bühlmann, JP, J. Ernest: CAM: Causal additive models, high-dimensional order search and penalized regression, Annals of Statistics 2014
Restricted Structural Equation Models
Consider a distribution generated by
Y = f(X) + NY,   with NY ⊥⊥ X and NY, X Gaussian   (graph: X → Y).
Then, if f is nonlinear, there is no backward model
X = g(Y) + MX,   with MX ⊥⊥ Y and MX Gaussian   (graph: X ← Y).
JP, J. Mooij, D. Janzing and B. Schölkopf: Causal Discovery with Continuous Additive Noise Models, JMLR 2014
Restricted Structural Equation Models
Consider a distribution corresponding to
Y = X³ + NY,   with NY ⊥⊥ X   (graph: X → Y),
where X ∼ N(1, 0.5²) and NY ∼ N(0, 0.4²).
Restricted Structural Equation Models
[Figure: scatter plots of Y against X for data simulated from this model, together with the nonlinear forward fit; the forward residuals look independent of X.]
[Figure: Y plotted against the backward residuals, gam(X ~ s(Y))$residuals; the backward residuals are visibly dependent on Y.]
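A sketch of the resulting bivariate procedure: regress in both directions with a flexible fit and compare how dependent the residuals are on the input. The polynomial smoother, the bandwidth heuristic, and the small biased HSIC estimator below are ad-hoc stand-ins for the tools used in the papers.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(1, 0.5, n)
y = x**3 + rng.normal(0, 0.4, n)          # ground truth: X -> Y

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels (median heuristic)."""
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * np.median(d2) + 1e-12))
    m = len(a)
    H = np.eye(m) - 1.0 / m               # centering matrix
    K, L = gram(a), gram(b)
    return np.trace(K @ H @ L @ H) / m**2

def residual_dependence(inp, out, deg=7):
    """Flexible fit out = f(inp) + resid, then measure dependence of resid on inp."""
    resid = out - np.polyval(np.polyfit(inp, out, deg), inp)
    return hsic(inp, resid)

fwd = residual_dependence(x, y)   # fit Y = f(X) + N, test N vs X
bwd = residual_dependence(y, x)   # fit X = g(Y) + M, test M vs Y
print(f"HSIC forward {fwd:.2e}, backward {bwd:.2e}")
print("inferred direction:", "X -> Y" if fwd < bwd else "Y -> X")
```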
Restricted Structural Equation Models
Let P(X1, . . . , Xp) be generated by a . . .

model                       equations                             conditions       identifiable
structural equation model   Xi = fi(X_PAi, Ni)                    -                no
additive noise model        Xi = fi(X_PAi) + Ni                   nonlinear fct.   yes
causal additive model       Xi = ∑_{k∈PAi} fik(Xk) + Ni           nonlinear fct.   yes
linear Gaussian model       Xi = ∑_{k∈PAi} βik Xk + Ni            linear fct.      no

(results hold for Gaussian noise)
Restricted Structural Equation Models
Find causal order + variable selection: the RESIT algorithm.
But finding the best causal order is intractable for many nodes!
# DAGs with 13 nodes: 18676600744432035186664816926721
# orderings with 13 nodes: 6227020800
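Both counts can be checked in a few lines; the DAG count follows the standard inclusion-exclusion recursion for labeled DAGs (Robinson's recursion, OEIS A003024), which reproduces the numbers above:

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recursion)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k source nodes: each of the k sources may
    # point to any of the other n-k nodes, giving the 2^(k(n-k)) factor.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(13))   # 18676600744432035186664816926721
print(factorial(13))  # 6227020800 orderings
```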
Real Data: cause-effect pairs
[Figure: accuracy (%) against decision rate (%) on the cause-effect-pairs benchmark for IGCI, LiNGaM, Additive Noise, and PNL, with significant and not-significant decisions marked.]
Real Data: Time Series
Instrumental Variables
Back to basics... linear regression:
y = βx + u
Graph: X → Y, U → Y.
Assumption: no dependence between x and u.
Instrumental Variables
Consider regressing GDP (y) on years of schooling (x). The error term embodies all factors other than schooling, e.g. ability.
Suppose a person has a high u from high (unobserved) ability. This directly increases y, and possibly indirectly via x too!
Graph: X → Y, U → X, U → Y.
OLS regression will give a β which includes the dependence of y on x directly AND via u. This does not tell us what happens when we intervene on x.
Instrumental Variables
The problem is the endogeneity of x. We need to generate only exogenous variation in x.
Ideally, we would carry out experiments to control the variations. Often infeasible: e.g. it is hard to control for ability and many other factors.
This is where instrumental variables come in.
Instrumental Variables
z is an instrumental variable if (i) it is uncorrelated with u and (ii) it is correlated with x.
Graph: Z → X → Y, U → X, U → Y.
Condition (i) excludes z from being a regressor.
Instrumental Variables
Two stage least squares:
(1) Isolate the part of x uncorrelated with u by regressing x on z:
xi = π0 + π1 zi + εi
Compute x̂i = π̂0 + π̂1 zi.
(2) Now regress y on x̂:
yi = β x̂i + τi
β̂_TSLS is a consistent estimator of β.
Can be extended to multiple dimensions and to non-linear models.
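A self-contained sketch with simulated endogenous data; the data-generating coefficients are arbitrary, and β = 2 is the target:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
beta = 2.0

u = rng.normal(size=n)                              # unobserved "ability"
z = rng.normal(size=n)                              # instrument, independent of u
x = 0.5 + 1.0 * z + 0.8 * u + rng.normal(size=n)    # x is endogenous
y = beta * x + 1.5 * u + rng.normal(size=n)

def ols(X, y):
    """Least squares with intercept; returns [intercept, slope]."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

# Naive OLS of y on x is biased: it picks up the path through u.
print("OLS  beta_hat:", ols(x, y)[1])

# Stage 1: regress x on z, keep the fitted values (exogenous part of x).
pi = ols(z, x)
x_hat = pi[0] + pi[1] * z
# Stage 2: regress y on x_hat.
print("2SLS beta_hat:", ols(x_hat, y)[1])   # consistent for beta = 2
```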
Instrumental Variables
Example 1:
Estimate the change in market demand (y) due to exogenous changes to market price (x).
Demand clearly depends on price. But price is not exogenously available. Suitable instrument?
z needs to directly affect price, but not demand... how about market supply?
Instrumental Variables
Example 2:
Estimate the change in GDP (y) due to an exogenous change in years of schooling (x).
A popular choice for z is proximity to a university. People living far from a university are likely to stay in school for a shorter time (condition (ii) satisfied).
Condition (i) is likely satisfied, but one can argue that people far from a university are more likely to be in lower-wage areas, which would affect y.
Instrumental Variables
Summary:
Instrumental variables are useful for measuring how y changes when you intervene on x.
Valid IVs are difficult to find.
β̂_TSLS can be very inefficient.
Observing data from different environments
Key idea: P(Y | PA_Y) remains invariant if the structural equation for Y does not change.
X1 = f1(X3, N1)
Y = f2(X1, N2)
X3 = f3(N3)
X4 = f4(Y, X3, N4)
• Ni jointly independent
• G0 has no cycles
Graph G0: X3 → X1, X1 → Y, Y → X4, X3 → X4.
Different environments may change the mechanisms of the covariates, e.g. X1 = f1(N1), X4 = f4(Y, X3, Ñ4), or X3 = f3(X1, X4, Ñ3) with X4 = f4(Y, Ñ4); as long as the equation for Y is untouched, P(Y | X1) stays the same.
Observing data from different environments
Observe a target Y and p covariates X in different "environments" e ∈ E:
(X^e, Y^e) ∼ P^e.
Idea:
1. Look for sets of predictors that lead to invariant prediction.
2. The variables that are in all of those sets are called S(E).
Observing data from different environments
Example 1: S∗ = {X1, X2}, E = {1}
Variables: X1 = chocolate, X2 = mountains, X3 = sleep, Y = mood; the causal parents of Y are X1 and X2.
e = 1: observational.
S(E) = ∅
Observing data from different environments
Example 2: S∗ = {X1, X2}, E = {1, 2}
e = 1: observational; e = 2: intervention on X1.
S(E) = {X1}
Observing data from different environments
Example 3: S∗ = {X1, X2}, E = {1, 2}
e = 1: observational; e = 2: intervention on X1.
S(E) = {X1, X2}
Observing data from different environments
Theorem
1. No mistakes: S(E) ⊆ S∗.
2. No chances with one environment: #E = 1 ⟹ S(E) = ∅.
3. Seeing more environments helps: S(E1) ⊆ S(E2) if E1 ⊆ E2.
4. Sufficient conditions for S(E) = S∗:
a) many "generic" interventions: on each node except Y, OR
b) a single "generic" intervention: on a "youngest" parent of Y that is directly connected to all other parents of Y.
JP, P. Bühlmann, N. Meinshausen: Causal inference using invariant prediction: identification and confidence intervals, arXiv:1501.01332
Observing data from different environments
Method for finite samples:
1. Test whether S leads to invariant prediction.
2. Ŝ(E) contains the variables that are in all those S.

Theorem
Assume that the probability of falsely rejecting S∗ is less than α. Then
P(Ŝ(E) ⊆ S∗) ≥ 1 − α.

Possible test 1: test whether the regression models are the same for group e and −e (Chow, 1960) (inversion of the covariance matrix of Res_e).
Possible test 2: linear regression on the pooled data; test whether the residuals Res_e of group e have the same mean and variance as Res_{−e}.
(Easy) tricks for computations and high dimensions.
JP, P. Bühlmann, N. Meinshausen: Causal inference using invariant prediction: identification and confidence intervals, arXiv:1501.01332
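A sketch of the finite-sample procedure using "possible test 2": pooled regression per candidate set S, then a mean and a variance comparison of the residuals across the two environments. scipy's t-test and Levene test stand in for the tests used in the paper, and the simulated system is made up.

```python
import numpy as np
from itertools import chain, combinations
from scipy import stats

rng = np.random.default_rng(5)

def sample_env(n, shift):
    x1 = rng.normal(shift, 1, n)          # intervened between environments
    y = 2.0 * x1 + rng.normal(0, 1, n)    # invariant mechanism for Y
    x2 = y + rng.normal(0, 1, n)          # effect of Y: not invariant
    return np.column_stack([x1, x2]), y

envs = [sample_env(500, 0.0), sample_env(500, 2.0)]
Xp = np.vstack([X for X, _ in envs])
yp = np.concatenate([y for _, y in envs])
idx = np.repeat([0, 1], 500)              # environment labels

def residuals(S):
    """Residuals of a pooled regression of Y on the candidate set S."""
    A = np.column_stack([np.ones(len(yp))] + [Xp[:, j] for j in S])
    coef, *_ = np.linalg.lstsq(A, yp, rcond=None)
    return yp - A @ coef

alpha, accepted = 0.05, []
for S in chain.from_iterable(combinations([0, 1], k) for k in range(3)):
    r = residuals(S)
    p_mean = stats.ttest_ind(r[idx == 0], r[idx == 1]).pvalue
    p_var = stats.levene(r[idx == 0], r[idx == 1]).pvalue
    if min(p_mean, p_var) > alpha / 2:    # crude Bonferroni over the two tests
        accepted.append(set(S))

S_hat = set.intersection(*accepted) if accepted else set()
print("accepted sets:", accepted, "->", S_hat)  # expect S_hat = {0}, i.e. {X1}
```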
Observing data from different environments
Real data: genetic perturbation experiments for yeast (Kemmeren et al., 2014)
p = 6170 genes
n_obs = 160 wild-types
n_int = 1479 gene deletions (targets known)
true hits: ≈ 9% of pairs
our method: E = {obs, int}
Observing data from different environments
[Figure, most significant pair: activity of gene 4710 against activity of gene 5954, shown for the observational training data, the interventional training data (interventions on genes other than 5954 and 4710), and the interventional test data point (intervention on gene 5954).]
[Figure, 2nd most significant pair: activity of gene 3730 against activity of gene 3729, same three panels; test point from an intervention on gene 3729.]
[Figure, 3rd most significant pair: activity of gene 1475 against activity of gene 3672, same three panels; test point from an intervention on gene 3672.]
Observing data from different environments
Number of true positives (out of 8) among the most significant predictions:
proposed method:            6
GIES:                       2
IDA:                        2
marginal corr. (observ.):   1
marginal corr. (pooled):    2
random guessing:            2 (95% quantile), 3 (99% quantile), 4 (99.9% quantile)
Summary:
1. additive noise (single environment)
2. independence-based methods (single environment)
3. invariant prediction; control of the family-wise error rate

Future directions (theory):
extend to estimation of graphs
combine with 1 or 2
nonlinear models
finite sample guarantees

Future work (methodology):
domain adaptation
instrumental variables (graph: I → X → Y, with hidden W → X and W → Y)

Thank you!