arXiv:2007.08189v1 [stat.AP] 16 Jul 2020

Do-search – a tool for causal inference and study design with multiple data sources

Juha Karvanen
Department of Mathematics and Statistics
University of Jyvaskyla, Finland

Santtu Tikka
Department of Mathematics and Statistics
University of Jyvaskyla, Finland

Antti Hyttinen
HIIT, Department of Computer Science
University of Helsinki, Finland

Abstract

Epidemiological evidence is based on multiple data sources including clinical trials, cohort studies, surveys, registries and expert opinions. Merging information from different sources opens up new possibilities for the estimation of causal effects. We show how causal effects can be identified and estimated by combining experiments and observations in real and realistic scenarios. As a new tool, we present do-search, a recently developed algorithmic approach that can determine the identifiability of a causal effect. The approach is based on do-calculus, and it can utilize data with non-trivial missing data and selection bias mechanisms. When the effect is identifiable, do-search outputs an identifying formula on which numerical estimation can be based. When the effect is not identifiable, we can use do-search to recognize additional data sources and assumptions that would make the effect identifiable. Throughout the paper, we consider the effect of salt-adding behavior on blood pressure mediated by the salt intake as an example. The identifiability of this effect is resolved in various scenarios with different assumptions on confounding. There are scenarios where the causal effect is identifiable from a chain of experiments but not from survey data, as well as scenarios where the opposite is true. As an illustration, we use survey data from NHANES 2013–2016 and the results from a meta-analysis of randomized controlled trials and estimate the reduction in average systolic blood pressure under an intervention where the use of table salt is discontinued.

Keywords: Artificial Intelligence, Causality, Clinical Trial, Research Design, Selection Bias, Software, Surveys and Questionnaires

This is a preprint of the article that will appear in Epidemiology.

http://arxiv.org/abs/2007.08189v1

1 Introduction

Epidemiological knowledge consists of cumulative evidence on associations and causal relations between background variables, risk factors and disease events. Traditional meta-analysis is commonly used to merge information from studies that are sufficiently similar according to predefined inclusion criteria (Higgins and Green, 2011). In a wider perspective, we may have a heterogeneous collection of studies available and the question is to decide whether these studies together allow for the causal effect of interest to be identified.

For instance, consider identification of the causal effect of X on Y, defined here as the post-interventional (Pearl, 2009) distribution P(Y | do(X)), in a setting where the effect is mediated through Z. If there is an unobserved confounder between X and Z, the causal effect is not identifiable from survey data on X, Z and Y. Now assume that we carry out a new experiment where X is intervened on and Z (but not Y) is measured. By applying do-calculus (Pearl, 1995) we can show that the survey and the experiment together make it possible to identify the causal effect of X on Y (see Section 3.2 for details).

More generally, combining different data sources in a systematic way may be a challenging task. One has to perceive which variables are shared between the data sources, be aware of context-specific differences between the sources, understand the study design and missing data pattern, and recognize potential confounders. Graphical models can help to describe this information in an organized manner (Textor et al., 2011; Karvanen, 2015; Textor et al., 2016; Matthay and Glymour, 2020). It then remains to conclude whether the effect of interest is identifiable from the available data sources under the specified causal assumptions and, if the answer is positive, to estimate the effect.

A theoretical overview of recent developments in this kind of data fusion is presented by Bareinboim and Pearl (2016). The practical examples include propensity score methods for merging observational and experimental data (Tipton, 2013; O’Muircheartaigh and Hedges, 2014; Rosenman et al., 2018), methods for causal inference in randomized controlled trials (RCTs) nested within cohorts of trial eligible individuals (Dahabreh et al., 2018), and average treatment effect estimation for pulmonary artery catheterization combining experimental with observational studies (Hartman et al., 2015). However, these examples do not address the general problem of deciding whether a causal effect can be identified from the available collection of experiments and observational studies in the presence of selection bias and missing data.

In this paper, we show how a recently developed algorithmic approach based on do-calculus, called do-search (Tikka et al., 2020, 2019a), can be applied in epidemiological practice. Do-search can determine the identifiability of a causal effect when multiple data sources are available. The data sources may be observational or experimental and they may suffer from missing data and selection bias. Do-search can be used to derive expressions for causal effects and can be utilized in epidemiological research in guiding top-level study design, evaluating consequences of missing data and merging information. We present examples that illustrate the use of the do-search R package (Tikka et al., 2019a) as a part of the process of causal effect estimation.

As an illustration, we consider the causal effect of salt-adding behavior (X) on blood pressure (Y). Salt-adding behavior consists of the habits of adding salt in cooking and at the table. The salt intake (Z) is the mediator for the effect. In many countries, public health recommendations advise reducing the salt intake from the typical 9–12 g/day to 5–6 g/day (WHO, 2003; He et al., 2013). While the salt intake may be difficult to measure in daily life, not adding salt in cooking and at the table is a simple way to reduce it. The effect of salt intake on blood pressure has been established in many studies, and several reviews and meta-analyses have been carried out (He and MacGregor, 2002, 2009; Graudal et al., 2012; Aburto et al., 2013; He et al., 2013). For instance, on the basis of a meta-analysis of 34 RCTs (3230 participants), a 100 mmol (6 g) reduction in 24-hour urinary sodium was associated with a fall in systolic blood pressure of 5.8 mm Hg after adjustment for age, ethnic group, and blood pressure status (He et al., 2013). In addition to salt intake, other dietary and lifestyle factors as well as genetic factors are known to affect blood pressure (Poulter et al., 2015). Many of these factors may also be associated with low salt preference and salt-adding behavior.

The question of interest is to find when the causal effect of salt-adding behavior (X) on blood pressure (Y) can be estimated without a direct experiment where X is intervened on and Y is measured. We will consider different causal structures and different combinations of data sources and show how do-search can be applied to determine the identifiability of the causal effect in these scenarios. As a real data illustration we combine the meta-analytical results mentioned above (He et al., 2013) and observational data from the National Health and Nutrition Examination Survey (NHANES) 2013–2016.

2 Methods

2.1 Concepts and Notation

Expert knowledge on the causal mechanisms is an essential element of causal inference, and a causal model is a way to formalize this knowledge. A structural causal model (Pearl, 2009) specifies the known or hypothesized causal relations between a set of variables. These relations are often represented by directed acyclic graphs (DAGs) or their semi-Markovian extensions where latent common causes are marked by bidirected arcs between two observed variables (Shpitser and Pearl, 2006) (see Figure 1 for an example).

An intervened variable X is marked with the do-operator as do(X) and the causal effect P(Y | do(X)) denotes the distribution of Y when X is forced to a value by the intervention. A causal effect is identifiable if it can be uniquely determined from the known distributions. When an identifiable causal effect is estimated, these distributions are usually replaced by their parametric or nonparametric estimates. In simple cases, identifiability can be checked by manually applying standard probability calculus and do-calculus (Pearl, 1995). Do-calculus consists of rules for inserting and deleting observations, exchanging observations and interventions, and inserting and deleting interventions. There exist efficient algorithms for determining identifiability for settings where data from a single observational source (Shpitser and Pearl, 2006), from multiple domains (Bareinboim and Pearl, 2013), or from surrogate experiments (Bareinboim and Pearl, 2012; Tikka and Karvanen, 2019; Lee et al., 2019) are available. An open-source software implementation of many of these algorithms is available as well (Tikka and Karvanen, 2017).

2.2 Data Sources

In the general setup, the available data sources (inputs) include multiple observational and experimental studies whose respective distributions can be described in a symbolic form, i.e., as an expression such as “P(X, Z, Y)” or “P(Z | do(X))”. Now a causal effect is identifiable if it can be uniquely expressed as a formula using only the inputs and quantities derivable from them. For instance, P(X) is directly derivable from P(X, Z, Y) and P(Z | do(X), W) is derivable from P(Z | do(X)) if the conditional independence of Z and W given X is implied in the graph where the incoming edges to X are removed. The rules of do-calculus are valid under the general setup but the algorithms (Shpitser and Pearl, 2006; Bareinboim and Pearl, 2013, 2012; Tikka and Karvanen, 2019; Lee et al., 2019) mentioned in Section 2.1 only work in special cases.

Two sources of data are used for a numeric illustration. The meta-analysis of 34 RCTs (He et al., 2013) provides information on the causal effect of salt intake (Z) on blood pressure (Y). We use the summary of the original trials as our input data. The summary reports the mean change in urinary sodium during the study (in mmol/24h) and the mean change in systolic blood pressure (in mm Hg) for each study. The available study-level background variables (W) include the mean age, the proportion of males, the proportion of white people and hypertension status (hypertensive or normotensive). The data source is written symbolically as P(Y | do(Z), W).

NHANES 2013–2016 questionnaire data provide information on the salt intake (Z) and salt-adding behavior (X) in the United States. The participants have recorded the dietary items they have consumed on two days and the daily sodium intake has been derived from these items. We use the mean of the two sodium measurements. The different units are transformed using the fact that 100 mmol of salt (NaCl) weighs 5.844 g and contains 2.299 g of sodium (Na), i.e., 39.34% of the mass. In order to measure the salt-adding behavior we derive a salt score that consists of three questions:

1. How often do you add ordinary salt to your food at the table? (Rarely 0, Occasionally 1, Very often 2)

2. Did you add any salt to your food at the table yesterday? (No 0, Yes 1), and

3. How often is ordinary salt or seasoned salt added in cooking or preparing foods in your household? (Never 0, Rarely 1, Occasionally 2, Very often 3).

The salt score is the sum of the values of these three questions and takes values from 0 to 6. The common background variables, age, gender, ethnicity (white or non-white) and hypertension status (hypertensive or normotensive), are the same variables W as in the meta-analysis. The additional background variables (H) include education and the eating out frequency (times per month). This data source is written symbolically as P(X, Z, W, H). The basic demographic variables have been collected for 11488 individuals and the analysis data set contains 9957 individuals who have the salt score, the eating out frequency and at least one measurement of sodium intake available. The sampling weights provided with the data are used in all analyses.
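The salt score and the unit conversions described above can be sketched in a few lines of R. This is only an illustration: the function names are hypothetical, the respondents are made up, and the molar masses (22.99 g/mol for Na, 58.44 g/mol for NaCl) are standard values assumed here rather than taken from the analysis code.

```r
# Molar masses in g/mol (standard values, assumed for this sketch)
M_NA   <- 22.99   # sodium
M_NACL <- 58.44   # sodium chloride

# Convert a sodium mass (mg) to an amount of sodium (mmol)
sodium_mg_to_mmol <- function(mg) mg / M_NA

# Convert a sodium mass (mg) to the equivalent mass of salt (g)
sodium_mg_to_salt_g <- function(mg) (mg / 1000) * M_NACL / M_NA

# Salt score: sum of the three questionnaire items
# table_freq in 0..2, table_yesterday in 0..1, cooking_freq in 0..3
salt_score <- function(table_freq, table_yesterday, cooking_freq) {
  stopifnot(table_freq %in% 0:2,
            table_yesterday %in% 0:1,
            cooking_freq %in% 0:3)
  table_freq + table_yesterday + cooking_freq
}

# Hypothetical respondents (illustrative values, not NHANES records)
scores <- salt_score(c(0, 1, 2), c(0, 1, 1), c(0, 2, 3))
print(scores)

# 2299 mg of sodium corresponds to 100 mmol of sodium and 5.844 g of salt
print(sodium_mg_to_mmol(2299))
print(sodium_mg_to_salt_g(2299))
```

The three hypothetical respondents obtain the scores 0, 4 and 6, which span the full range of the score.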

2.3 Examining Identifiability from Multiple Data Sources with Do-search

Do-search is open-source software designed for non-parametric identification problems when multiple data sources are available (Tikka et al., 2020, 2019a). The algorithm aims to derive the causal effect of interest from the inputs by carrying out a systematic search over the rules of do-calculus and the marginalization, conditioning and chain rule multiplication permitted by probability calculus. The algorithm derives new identifiable distributions by applying these rules to the distributions that have been given as the input or have been identified in the previous steps. This process is repeated until the algorithm encounters the target distribution or cannot identify any new distributions. Note that do-search is not related to causal search algorithms in causal discovery (Glymour et al., 2019). Since the approach is search-based, the computational load increases rapidly when the number of variables grows. This can often be mitigated by grouping similar variables in the graph. Do-search uses heuristics and search-space reduction techniques that speed up the algorithm in the vast majority of cases.

The formulas returned by do-search are fully non-parametric. The representation of the formulas assumes that the variables are discrete but the summations can be changed to integrals if the corresponding variables are continuous. Given an identifying formula, the estimation of the causal effect is a statistical problem for which the full repertoire of statistical and machine learning methods is available.

Do-search can also cope with missing data problems. The graph is augmented by adding nodes for measurements and response indicators that specify whether the value of the variable is measured or not (Mohan et al., 2013; Karvanen, 2015). A measurement X* is linked to the true value X and the response indicator R_X as follows:

$$
X^{*} =
\begin{cases}
X, & \text{if } R_X = 1,\\
\mathrm{NA}, & \text{if } R_X = 0,
\end{cases}
\tag{1}
$$

where NA denotes a missing value. For instance, the input P(X*, Z*, Y*, R_X, R_Z, R_Y) refers to an observational study where variables X, Z and Y suffer from missing data. When missing data are present, do-search uses additional inference rules to take response indicators into account (Tikka et al., 2020). These rules are not directly related to recent theoretical work on identification under missing data (Mohan et al., 2013; Shpitser et al., 2015; Bhattacharya et al., 2019).
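As a small illustration of equation (1), the relation between a true value, its response indicator and the recorded measurement can be written in R as follows. The vectors are made-up values, and the names `x`, `r_x`, `x_star`, `r_z` and `r_xz` are chosen only for this sketch.

```r
# Hypothetical true values and response indicators (illustrative only)
x   <- c(3, 0, 6, 2, 5)
r_x <- c(1, 0, 1, 1, 0)

# Equation (1): the measurement equals the true value when R_X = 1
# and is missing (NA) when R_X = 0
x_star <- ifelse(r_x == 1, x, NA)
print(x_star)

# With several variables, a joint indicator equals 1 only when all the
# involved response indicators equal 1 (complete cases)
r_z  <- c(1, 1, 0, 1, 1)
r_xz <- as.integer(r_x == 1 & r_z == 1)
print(r_xz)
```

Only the first and fourth units are complete cases for both variables in this toy example.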

Do-search is sound, meaning that formulas produced for queries found to be identifiable by the algorithm are always correct. Although the rules of the search have been shown to be complete in several restricted problem settings (Tikka et al., 2020; Shpitser and Pearl, 2006; Bareinboim and Pearl, 2013, 2012; Lee et al., 2019), they have not been shown to completely characterize identifiability when the data come from multiple sources. In practice this means that if we wish to confirm that a causal effect is not identifiable we need to resort to further study of the specific problem to rule out the possibility that an identifying formula could be derived by some other means.

3 Results

3.1 The Front-Door Setting with Multiple Data Sources

We study graphs where the effect of X on Y is mediated through Z because this structure leads to many interesting scenarios. This is not a restriction of the approach: do-search is fully applicable also when the graph contains an edge from X to Y or some other graphical structure. The well-known front-door setting (Pearl, 1995) is shown in Figure 1(a). First, consider a scenario where salt-adding behavior, salt intake and blood pressure have been measured in a population-based survey. If the sample does not suffer from selection bias, we have data on the joint distribution P(X, Z, Y). The causal effect can be identified by the front-door adjustment formula

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z} P(Z \mid X) \sum_{X'} P(X')\, P(Y \mid X', Z),
$$

where all marginal and conditional distributions can be estimated from the survey data.

There are also other ways to identify P(Y | do(X)). Instead of data on P(X, Z, Y), the available data sources could include an experiment that provides information on P(Y | do(Z)) and a survey that provides information on P(X, Z). Applying do-search we obtain

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z} P(Z \mid X)\, P(Y \mid \mathrm{do}(Z)),
\tag{2}
$$

where the first term can be estimated from the survey and the second term from the experiment. The R code for deriving this result with do-search is presented in Figure 2.
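Equation (2) can also be derived by hand in the graph of Figure 1(a), which has the edges X → Z and Z → Y and a bidirected arc X ↔ Y. The steps below are one possible sequence of do-calculus rules, written out for illustration; they are not necessarily the steps that do-search itself takes.

```latex
\begin{align*}
P(Y \mid \mathrm{do}(X))
  &= \sum_{Z} P(Z \mid \mathrm{do}(X))\, P(Y \mid \mathrm{do}(X), Z)
  && \text{(marginalization and chain rule)} \\
  &= \sum_{Z} P(Z \mid X)\, P(Y \mid \mathrm{do}(X), Z)
  && \text{(rule 2: the path } X \leftrightarrow Y \leftarrow Z
     \text{ is blocked by the collider } Y\text{)} \\
  &= \sum_{Z} P(Z \mid X)\, P(Y \mid \mathrm{do}(X), \mathrm{do}(Z))
  && \text{(rule 2: } Y \perp Z \mid X \text{ once } X \leftrightarrow Y
     \text{ and } Z \to Y \text{ are removed)} \\
  &= \sum_{Z} P(Z \mid X)\, P(Y \mid \mathrm{do}(Z))
  && \text{(rule 3: } Y \perp X \mid Z \text{ once edges into } X
     \text{ and } Z \text{ are removed)}
\end{align*}
```

Each independence is checked in the mutilated graph required by the corresponding rule: incoming edges of intervened variables are removed, and for rule 2 the outgoing edges of the variable being exchanged are removed as well.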

In contrast, some other combinations of studies do not lead to identification in the graph of Figure 1(a). For instance, the collection of three surveys providing information on P(X, Z), P(X, Y) and P(Z, Y) and an experiment providing information on P(Z | do(X)) is not sufficient to identify P(Y | do(X)). Formal proofs for non-identifiability are given in Appendix B.

Figure 1 panels (each graph has vertices X, Z and Y; panels (f) and (g) also include W):

(a) Basic front-door.
(b) Unobserved pre-mediator confounding.
(c) Front-door with unobserved pre-mediator confounding.
(d) Unobserved post-mediator confounding.
(e) Front-door with unobserved post-mediator confounding.
(f) Front-door with an observed confounder.
(g) Front-door with an observed confounder and unobserved pre-mediator confounding.

Figure 1: Causal models where the causal effect of X on Y is mediated by Z. In the example, X stands for salt-adding behavior, Z for salt intake, Y for blood pressure and W for common confounders.

    library(dosearch)

    # Graph of Figure 1(a): X -> Z -> Y with a latent confounder between X and Y
    graph <- "
      X -> Z
      Z -> Y
      X <-> Y
    "

    # Available data sources: an experiment on Z and a survey on (X, Z)
    data <- "
      P(Y | do(Z))
      P(X,Z)
    "

    # Target causal effect
    query <- "P(Y | do(X))"

    dosearch(data, query, graph)

    ------------- Output --------------
    $identifiable
    [1] TRUE

    $formula
    [1] "[sum_{Z} [p(Z|X)*p(Y|do(Z))]]"

Figure 2: Example R code on the use of do-search to determine identifiability of P(Y | do(X)) from P(Y | do(Z)) and P(X, Z) under the assumptions encoded in the graph of Figure 1(a). The R code for the other do-search examples of this paper is available in Appendix A.
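The formula returned in Figure 2 and the front-door adjustment formula can be checked numerically on a small example. The sketch below assumes a hypothetical binary structural model compatible with Figure 1(a), in which a latent variable U plays the role of the bidirected arc between X and Y; all probabilities are invented for illustration. It uses only base R and does not call do-search.

```r
# Hypothetical binary model: U -> X, U -> Y (latent), X -> Z, Z -> Y
p_u  <- c(0.5, 0.5)                    # P(U = 0), P(U = 1)
p_x1 <- c(0.2, 0.8)                    # P(X = 1 | U = 0), P(X = 1 | U = 1)
p_z1 <- c(0.3, 0.7)                    # P(Z = 1 | X = 0), P(Z = 1 | X = 1)
# P(Y = 1 | Z, U): rows index Z = 0, 1 and columns index U = 0, 1
p_y1 <- matrix(c(0.1, 0.5, 0.4, 0.9), nrow = 2)

# Probability that a binary variable equals v when P(variable = 1) is p1
bern <- function(p1, v) if (v == 1) p1 else 1 - p1

# Observational joint distribution P(X, Z, Y), with U summed out
joint <- array(0, dim = c(2, 2, 2))    # indices: x + 1, z + 1, y + 1
for (u in 0:1) for (x in 0:1) for (z in 0:1) for (y in 0:1) {
  joint[x + 1, z + 1, y + 1] <- joint[x + 1, z + 1, y + 1] +
    p_u[u + 1] * bern(p_x1[u + 1], x) * bern(p_z1[x + 1], z) *
    bern(p_y1[z + 1, u + 1], y)
}

# Ground truth from the structural model
p_y1_do_z <- as.vector(p_y1 %*% p_u)   # P(Y = 1 | do(Z = z))
truth <- sapply(0:1, function(x)
  sum(sapply(0:1, function(z) bern(p_z1[x + 1], z) * p_y1_do_z[z + 1])))

# Quantities available from survey data
p_x  <- apply(joint, 1, sum)           # P(X)
p_xz <- apply(joint, c(1, 2), sum)     # P(X, Z)

# Front-door adjustment using only P(X, Z, Y)
front_door <- sapply(0:1, function(x) {
  sum(sapply(0:1, function(z) {
    inner <- sum(sapply(0:1, function(x2)
      p_x[x2 + 1] * joint[x2 + 1, z + 1, 2] / p_xz[x2 + 1, z + 1]))
    p_xz[x + 1, z + 1] / p_x[x + 1] * inner
  }))
})

# Equation (2): survey P(X, Z) combined with experiment P(Y | do(Z))
formula_2 <- sapply(0:1, function(x)
  sum(p_xz[x + 1, ] / p_x[x + 1] * p_y1_do_z))

# Naive conditional P(Y = 1 | X = x), biased by the latent confounder
naive <- sapply(0:1, function(x) sum(joint[x + 1, , 2]) / p_x[x + 1])

print(rbind(truth, front_door, formula_2, naive))
```

In this toy model both identifying formulas reproduce the true values of P(Y = 1 | do(X = x)), whereas the naive conditional probability differs because of the latent confounder.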

Table 1: Identifiability of P(Y | do(X)) from different data sources in the graphs of Figure 1. The data sources are characterized by the underlying theoretical distributions; e.g., a survey may provide information on P(X, Y, Z) and an experiment may provide information on P(Y | do(Z)). Symbol Y denotes identifiability and symbol – non-identifiability (indicated by do-search and proven in Appendix B). For graphs a–e of Figure 1 that do not include W, we assume that W is not connected to any other vertices of the graph.

                                                      Graph of Figure 1
Data sources                                          a   b   c   d   e   f   g
1. P(X, Y, Z)                                         Y   –   –   Y   –   –   –
2. P(X, Z), P(Y | do(Z))                              Y   –   –   –   –   –   –
3. P(Z | do(X)), P(Y | do(Z))                         Y   Y   Y   –   –   –   –
4. P(Z, Y), P(Z | do(X))                              –   Y   –   –   –   –   –
5. P(X, Z), P(X, Y), P(Z, Y), P(Z | do(X))            –   Y   –   Y   –   –   –
6. P(X, Y, Z, W)                                      Y   –   –   Y   –   Y   –
7. P(X, Z, W), P(Y | do(Z), W)                        Y   –   –   –   –   Y   –
8. P(Z | do(X), W), P(Y | do(Z), W)                   Y   Y   Y   –   –   –   –
9. P(Z | do(X), W), P(Y | do(Z), W), P(W)             Y   Y   Y   –   –   Y   Y

3.2 Variants of the Front-Door Setting

Next we will study identifiability in variants of the basic front-door setting by utilizing do-search. The graphs for these settings are shown in Figure 1 and the identifiability results for different data sources are summarized in Table 1.

Figure 1(b) shows the scenario described in Section 1. The causal effect P(Y | do(X)) is not identifiable from P(X, Y, Z) (line 1 and column b in Table 1). When an experiment providing information on P(Z | do(X)) and a survey providing information on P(Z, Y) are available (line 4 of Table 1) the causal effect can be identified:

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z} P(Z \mid \mathrm{do}(X))\, P(Y \mid Z).
$$

If there is unobserved pre-mediator confounding in the front-door setting (Figure 1(c)), neither P(X, Z, Y) nor the combination of P(Y | do(Z)) and P(X, Z) is sufficient to identify P(Y | do(X)) (lines 1–2). In this situation, a chain of experiments providing information on P(Z | do(X)) and P(Y | do(Z)) (line 3) makes the causal effect identifiable:

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z} P(Z \mid \mathrm{do}(X))\, P(Y \mid \mathrm{do}(Z)).
\tag{3}
$$

If, instead, we have post-mediator confounding as in Figure 1(d), the situation changes and the chain of experiments (line 3) does not produce identifiability. An intuitive explanation for this can be given by looking at the structural equations under the intervention do(X = x):

$$
\begin{aligned}
X &= x,\\
Z &= f_Z(x, U),\\
Y &= f_Y(Z, U).
\end{aligned}
$$

Here U is the unobserved confounder that affects both Z and Y. In two separate experiments, P(Z | do(X)) and P(Y | do(Z)), the confounder U is not shared between the experiments. For this reason, equation (3) does not specify a correct formula for P(Y | do(X)) in this case. As an extreme example, let X, Z, Y and U be binary and specify P(U = 1) = 0.5, Z = X ⊕ U and Y = Z ⊕ U, where ⊕ stands for the exclusive disjunction (XOR). Now, since P(Z | do(X)) = 0.5 and P(Y | do(Z)) = 0.5, equation (3) suggests P(Y | do(X)) = 0.5 for any value of X and Y. However, the intervened value of X perfectly determines Y, because Y = (X ⊕ U) ⊕ U = X. Naturally, the causal effect of X on Y can be identified from P(X, Y) (line 1) directly as P(Y | do(X)) = P(Y | X).
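The exclusive-or example can be verified by direct enumeration. The following base R sketch evaluates equation (3) and the true interventional distribution for the binary model described above; the helper `xor01` and the other names are introduced only for this illustration.

```r
# XOR example: U ~ Bernoulli(0.5), Z = X xor U, Y = Z xor U
p_u <- c(0.5, 0.5)                      # P(U = 0), P(U = 1)
u   <- 0:1

# Exclusive disjunction for 0/1 coded values (vectorized)
xor01 <- function(a, b) as.integer(xor(a == 1, b == 1))

# P(Z = 1 | do(X = x)) from the experiment on X
p_z1_do_x <- sapply(0:1, function(x) sum(p_u * (xor01(x, u) == 1)))

# P(Y = 1 | do(Z = z)) from the separate experiment on Z
p_y1_do_z <- sapply(0:1, function(z) sum(p_u * (xor01(z, u) == 1)))

# Equation (3) applied in the graph of Figure 1(d), where it is not valid
chain <- sapply(0:1, function(x) {
  p_z <- c(1 - p_z1_do_x[x + 1], p_z1_do_x[x + 1])
  sum(p_z * p_y1_do_z)
})

# True P(Y = 1 | do(X = x)): the same U enters both structural equations
truth <- sapply(0:1, function(x) {
  z <- xor01(x, u)                      # value of Z for U = 0 and U = 1
  y <- xor01(z, u)                      # value of Y for U = 0 and U = 1
  sum(p_u * (y == 1))
})

print(chain)   # 0.5 0.5
print(truth)   # 0 1
```

The chain of experiments suggests a probability of 0.5 regardless of x, whereas the true interventional probability of Y = 1 is 0 for x = 0 and 1 for x = 1.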

If post-mediator confounding occurs in the front-door setting (Figure 1(e)), additional data sources such as P(X, Z, Y) or P(Y | do(Z), X) do not help (lines 1–9) and P(Y | do(X)) is identifiable only from an experiment where X is intervened on and Y is measured.

Figures 1(f) and 1(g) present variants where covariate W is observed. In Figure 1(f), the causal effect P(Y | do(X)) can be identified from P(X, Z, Y, W) (line 6) as

$$
\sum_{Z,W} P(W)\, P(Z \mid X, W) \sum_{X'} P(X' \mid W)\, P(Y \mid X', Z, W)
$$

or from the combination of an experiment providing information on P(Y | do(Z), W) and a survey providing information on P(X, Z, W) (line 7) as

$$
\sum_{Z,W} P(W)\, P(Z \mid X, W)\, P(Y \mid \mathrm{do}(Z), W).
$$

However, the causal effect is not identifiable from a chain of experiments providing information on P(Z | do(X), W) and P(Y | do(Z), W) (line 8) unless the marginal distribution P(W) is also known (line 9). In Figure 1(g), this combination of two experiments and a survey (line 9) allows the causal effect P(Y | do(X)) to be identified by

$$
\sum_{Z,W} P(W)\, P(Z \mid \mathrm{do}(X), W)\, P(Y \mid \mathrm{do}(Z), W).
$$

Neither a survey providing information on P(X, Z, Y, W) (line 6) nor the combination of an experiment providing information on P(Y | do(Z), W) and a survey providing information on P(X, Z, W) (line 7) is sufficient for identification in this case.

3.3 Illustration with Real Data

We aim to estimate the mean change in systolic blood pressure in the US population under an intervention that makes everyone avoid adding salt to their food (in preparation or at the table). More technically, the intervention is defined as setting the salt score to zero for everyone in the NHANES 2013–2016 surveys. Figure 3 presents a causal model for the situation. Recall that the NHANES data provide information on the observational distribution P(X, Z, H, W) and the meta-analysis provides information on the experimental distribution P(Y | do(Z), W). The target to be estimated is the causal effect P(Y | do(X)). Applying do-search we obtain

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z,H,W} P(H, W)\, P(Z \mid X, H, W)\, P(Y \mid \mathrm{do}(Z), W).
\tag{4}
$$

As we fit a statistical model for the expected value of Y, we write equation (4) in the form where the distribution of Y is replaced by its expectation:

$$
E(Y \mid \mathrm{do}(X)) = \sum_{Z,H,W} P(H, W)\, P(Z \mid X, H, W)\, E(Y \mid \mathrm{do}(Z), W).
\tag{5}
$$

Formula (5) shows that three models are needed: a model for the joint distribution P(H, W), a model that explains the salt intake Z by X, H and W, and a model that explains the systolic blood pressure Y by Z and W. The first model can be replaced by the empirical joint distribution of H and W, i.e., calculating the average over the values in the data. The second model is estimated from the NHANES 2013–2016 data. We fit a linear model for sodium intake with covariates salt score, gender, age, education and ethnicity (white or non-white). As formula (5) is non-parametric, the second model could be a non-linear model as well. The estimated regression coefficients and their confidence intervals are shown in Table 2. According to the model, the difference in the salt intake between salt score values 6 and 0 equals 0.46 g (20.1 mmol) of sodium (1.2 g of salt).

The third model is a meta-regression model where the change in systolic blood pressure is explained by the change in urinary sodium, hypertension status (hypertensive or normotensive), mean age, the proportion of males and the proportion of participants classified as “white”, and the remaining heterogeneity between the studies is modeled by a random effect. The estimated regression coefficients and their confidence intervals estimated with the R package metafor (Viechtbauer, 2010) are shown in Table 2.

The models are combined according to formula (5) using the NHANES sampling weights in the averaging. It is assumed here that the urinary sodium and the sodium intake measured in NHANES correspond to each other. The estimated average changes in the systolic blood pressure in the whole population and some subgroups are given in Table 3. The confidence intervals for the combined results are calculated by applying non-parametric bootstrap (DiCiccio and Efron, 1996) simultaneously for both data sources. According to the results, a regular salt user (salt score 6) with hypertension could reduce his or her sodium intake by 7.9 mmol (0.46 g, equal to 1.2 g of salt) and systolic blood pressure by 3.0 mm Hg on average by discontinuing the use of salt in preparation and at the table.

Table 2: The estimated regression coefficients for the linear model explaining the sodium intake (in mg) and for the meta-regression model explaining the change in systolic blood pressure (in mm Hg). CI stands for confidence interval and UNa for urinary sodium.

The model for the sodium intake E(Z | X, H, W)

Parameter                                   Estimate (95% CI)
Intercept                                   2927.7 (2770.9, 3084.6)
Salt score (0–6)                            77.4 (58.2, 96.5)
Eating out (times/month)                    54.4 (42.7, 66.2)
Age (years)                                 −12.8 (−14.7, −10.9)
Gender: male                                941.1 (883.5, 998.7)
Ethnicity: white                            17.7 (−42.4, 77.7)
Hypertensive                                −2.0 (−67.5, 63.5)
Education: Less than 9th grade              0 (reference)
Education: 9-11th grade                     272.5 (145.9, 399.1)
Education: High school graduate             282.5 (166.7, 398.3)
Education: Some college or AA degree        320.8 (207.8, 433.7)
Education: College graduate                 371.1 (255.6, 486.7)

The model for the change in systolic blood pressure ∆E(Y | do(Z), W)

Parameter                                   Estimate (95% CI)
Mean age (years)                            −0.076 (−0.126, −0.025)
Gender: male (%)                            0.008 (−0.027, 0.043)
Ethnicity: white (%)                        0.039 (0.017, 0.061)
Normotensive: Change in UNa (mmol/24h)      0.046 (0.017, 0.075)
Hypertensive: Change in UNa (mmol/24h)      0.069 (0.040, 0.098)

Figure 3 vertices: Salt-adding (X), Salt intake (Z), Blood pressure (Y), Common confounders (W), Pre-mediator confounders (H).

Figure 3: Causal model for the real data illustration.

Table 3: The estimated average changes in the systolic blood pressure (in mm Hg) if the use of salt in preparation and at the table is discontinued. The weighted average treatment effect (WATE) with NHANES weights is estimated for the whole population as well as for subgroups that prefer adding salt or are hypertensive. The results are obtained by combining a meta-analysis of RCTs and NHANES 2013–2016 survey data.

Quantity                                      Estimate (95% CI)
WATE                                          −1.2 (−4.3, 0.5)
WATE for salt score 4–6                       −1.5 (−4.0, 0.1)
WATE for salt score 6                         −1.9 (−4.2, −0.4)
WATE for hypertensive                         −2.0 (−5.8, 0.3)
WATE for hypertensive with salt score 4–6     −2.2 (−5.5, −0.3)
WATE for hypertensive with salt score 6       −3.0 (−6.2, −0.9)
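The plug-in computation behind formula (5) can be sketched with synthetic data. The example below is not the analysis code of this paper: the data are simulated rather than taken from NHANES, the slopes of the blood pressure model are hypothetical stand-ins for a fitted meta-regression, and the contrast between do(X = 0) and the observed salt scores is only one way such an average effect might be organized when both models are linear.

```r
set.seed(1)
n <- 500

# Synthetic stand-in for survey data (NOT NHANES): x = salt score,
# h = eating out frequency, w = hypertension indicator,
# sw = sampling weight, z = sodium intake in mg
survey <- data.frame(
  x  = sample(0:6, n, replace = TRUE),
  h  = rpois(n, 8),
  w  = rbinom(n, 1, 0.3),
  sw = runif(n, 0.5, 2)
)
survey$z <- 3000 + 80 * survey$x + 50 * survey$h + rnorm(n, sd = 800)

# Second model of formula (5): E(Z | X, H, W) as a weighted linear regression
fit_z <- lm(z ~ x + h + w, data = survey, weights = sw)

# Third model of formula (5): hypothetical slopes (mm Hg per mmol of sodium)
# for normotensive (w = 0) and hypertensive (w = 1) individuals
slope_y <- c(0.05, 0.07)

# Predicted sodium intake under the observed X and under do(X = 0)
z_obs <- predict(fit_z, newdata = survey)
z_do0 <- predict(fit_z, newdata = transform(survey, x = 0))

# Change in expected blood pressure for each individual's (H, W) pattern
delta_z_mmol <- (z_do0 - z_obs) / 22.99     # mg of sodium to mmol
delta_y <- slope_y[survey$w + 1] * delta_z_mmol

# First model of formula (5): average over the weighted empirical P(H, W)
wate <- weighted.mean(delta_y, survey$sw)
print(wate)
```

Because both models are linear in this sketch, plugging the predicted sodium intake into the blood pressure model is equivalent to summing over Z in formula (5). With a non-linear second or third model the sum (or integral) over Z would have to be evaluated explicitly.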

3.4 Scenarios with Selection Bias and Missing Data

Scenarios where some data are missing by design or unintentionally can be analyzed with do-search as well. For instance, the decision to measure Z may depend on the measurements for X and Y. In our example, this could mean that salt intake is measured only for a subgroup where individuals with exceptionally low or high blood pressure are overrepresented. In addition, variables X and Y may suffer from occasional missing values. The graph for this scenario is presented in Figure 4(a). Variables R_X, R_Z and R_Y are response indicators for X, Z and Y, respectively (Section 2.3). The observed data contain variables X*, Z* and Y*, which are defined as in equation (1). A shortcut notation R_XZY = 1 is used to denote R_X = 1, R_Z = 1, R_Y = 1.

Figure 4 panels (each graph has vertices X, Z, Y, R_X, R_Z and R_Y):

(a) Selective sampling (depending on X and Y) for mediator Z.
(b) Case-control design: selective sampling (depending on Y) for X, Z and Y.

Figure 4: Causal models for the front-door setting with missing data.

The causal effect P(Y | do(X)) is identifiable under the assumptions encoded in the graph of Figure 4(a) if information on P(X*, Y*, Z*, R_X, R_Y, R_Z) is available. If Z is missing by design, we also know P(R_Z | do(X, Y)), although this is not necessary for identification. The formula obtained by do-search for the causal effect in Figure 4(a) can be presented as follows:

$$
P(Y \mid \mathrm{do}(X)) = \sum_{Z} \Biggl[ \sum_{Y'} P(Z \mid X, Y', R_{XZY} = 1)\, P(Y' \mid X, R_{XY} = 1) \times
\sum_{X'} \frac{P(X' \mid R_{XY} = 1)\, P(Z \mid X', Y, R_{XZY} = 1)\, P(X', Y, R_{XY} = 1)}{\sum_{Y'} P(Z \mid X', Y', R_{XZY} = 1)\, P(X', Y', R_{XY} = 1)} \Biggr].
$$

The graph in Figure 4(b) represents a case-control design in the front-door setting (Karvanen, 2015; Tikka et al., 2020). The selection to the study depends on Y. The causal effect P(Y | do(Z)) is not directly identifiable from P(X*, Y*, Z*, R_X, R_Y, R_Z). Additional knowledge on the population distribution P(Y) or on the selection mechanism P(R_Y | Y) enables do-search to identify the effect. In both cases, a formula for the causal effect P(Y | do(X)) can be presented as

$$
\sum_{Z} \Biggl[ \frac{\sum_{Y'} P(Y')\, P(X, Z \mid Y', R_{XZY} = 1)}{\sum_{Z', Y'} P(Y')\, P(X, Z' \mid Y', R_{XZY} = 1)} \times
\sum_{X'} \Biggl( \Bigl( \sum_{Y'} P(Y')\, P(X' \mid Y', R_{XZY} = 1) \Bigr)
\frac{P(Y)\, P(X', Z \mid Y, R_{XZY} = 1)}{\sum_{Y'} P(Y')\, P(X', Z \mid Y', R_{XZY} = 1)} \Biggr) \Biggr].
$$

4 Discussion

Do-search offers an effortless way to check identifiability of causal effects. Automated processing frees researchers' time for tasks where expert knowledge is necessary. We are not aware of other tools that can, for instance, determine identifiability from an arbitrary chain of experiments (Sections 3.1 and 3.2) or solve complicated missing data problems (Section 3.4). Do-search is useful in both planning and analysis of studies. In research design, do-search can be used to check whether the new data to be collected will enable the effect of interest to be identified, or to determine whether an existing dataset will be beneficial in a secondary analysis when combined with other data sources. Although not considered in this paper, do-search can also handle selection diagrams (Bareinboim and Tian, 2015; Tikka et al., 2020) and solve transportability problems (Bareinboim and Pearl, 2014; Tikka et al., 2020) where the data sources may originate from heterogeneous domains.

The key structural assumptions used by do-search are encoded as a causal graph, which can be freely specified by the researcher. This specification requires epidemiological expertise on the subject of interest. In addition to causal relations, the graph has to specify the researcher's understanding of the selection and missing data mechanisms. The input data sources are given in an easily interpretable symbolic non-parametric form.

There are important issues that are outside the scope of do-search. The researcher has to evaluate the quality of the data in each study considered. In particular, the researcher has to decide how to treat variables that aim to measure the same underlying phenomenon but have different definitions. For example, survey questions “Did you add salt at table yesterday?” and “Do you usually add salt at table?” both aim to measure salt-adding behavior but may have differing answers for the same individual.

A theoretical limitation of do-search is that the current set of rules is not yet complete for all missing data problems. Another limitation related to incomplete data is that the missing data mechanisms are not allowed to differ between the data sources. The scalability of do-search may be an issue for some applications, because the computational complexity increases exponentially with the number of variables. However, in many cases, it is possible to reduce the computational burden by grouping together variables that have a similar role in the causal model, i.e., by letting Z represent a group of variables instead of a single variable.

Do-search operates fully non-parametrically and cannot, as such, utilize parametric assumptions that could extend identifiability. For instance, if the functional relations are linear, instrumental variables may render non-parametrically non-identifiable effects identifiable (Angrist et al., 1996; Chen et al., 2017). Recently, do-search has been extended to make use of a particular type of restriction, known as context-specific independence relations (Tikka et al., 2019b). However, non-parametric non-identifiability combined with parametric identifiability indicates that the estimation results may be highly sensitive to the parametric assumptions made.
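The role of the linearity assumption can be illustrated with a small simulation. The sketch below is not part of do-search: it simulates a linear model with an unobserved confounder and compares the ordinary regression slope with the standard instrumental-variable ratio estimator. All variable names and coefficient values are assumptions chosen for illustration and are unrelated to the variables of the salt example.

```r
set.seed(1)
n <- 100000

confounder <- rnorm(n)            # unobserved common cause of exposure and outcome
instrument <- rnorm(n)            # affects the outcome only through the exposure
exposure   <- 0.8 * instrument + confounder + rnorm(n)

beta    <- 0.5                    # assumed true causal effect of exposure on outcome
outcome <- beta * exposure + confounder + rnorm(n)

# Regression slope: biased, because the confounder is not adjusted for.
ols <- unname(coef(lm(outcome ~ exposure))["exposure"])

# Ratio estimator: valid here only because the relations are linear and the
# instrument is independent of the confounder.
iv <- cov(instrument, outcome) / cov(instrument, exposure)

print(c(truth = beta, ols = ols, iv = iv))
```

Without the linearity assumption, the same graph does not identify the effect non-parametrically, which is why the result depends entirely on the parametric model being correct.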

The causal effect of salt-adding behavior on blood pressure was estimated by combining observational and experimental data. The obtained estimates are not directly comparable with earlier results but are well aligned with some related studies when the uncertainty is taken into account (Takahashi et al., 2006; He and MacGregor, 2009; Kelly et al., 2016). It is known that the sodium intake measured by a dietary questionnaire is prone to recall bias and thus less reliable than the sodium intake measured from urine (Karppanen and Mervaala, 1998; He and MacGregor, 2009). In our results, the relatively wide confidence intervals of the point estimates may reflect the measurement error of salt intake in NHANES. The analysis could be continued further, and questions such as the causal effect of salt-adding behavior on the risk of cardiovascular diseases and total mortality (Aburto et al., 2013; Alderman and Cohen, 2012) could be studied in a similar manner.

Do-search is a versatile addition to the epidemiological toolbox. Combined with other tools, it paves the way towards a holistic approach that goes beyond the traditional meta-analysis and allows for a systematic analysis of cumulative evidence from heterogeneous data sources.

Acknowledgements

ST was supported by Academy of Finland grant 311877 (Decision analytics utilizing causal models and multiobjective optimization). AH was supported by Academy of Finland grant 295673. The authors thank Jukka Nyblom for useful comments.

References

Aburto, N. J., Ziolkovska, A., Hooper, L., Elliott, P., Cappuccio, F. P., and Meerpohl, J. J. (2013). Effect of lower sodium intake on health: systematic review and meta-analyses. BMJ, 346:f1326.

Alderman, M. H. and Cohen, H. W. (2012). Dietary sodium intake and cardiovascular mortality: controversy resolved? American Journal of Hypertension, 25(7):727–734.

Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.

Bareinboim, E. and Pearl, J. (2012). Causal inference by surrogate experiments: z-identifiability. In de Freitas, N. and Murphy, K., editors, Proceedings of the 28th Conference on Uncertainty in Artificial Intelligence, pages 113–120. AUAI Press.

Bareinboim, E. and Pearl, J. (2013). A general algorithm for deciding transportability of experimental results. Journal of Causal Inference, 1:107–134.

Bareinboim, E. and Pearl, J. (2014). Transportability from multiple environments with limited experiments: Completeness results. In Advances in Neural Information Processing Systems, volume 27, pages 280–288.

Bareinboim, E. and Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352.

Bareinboim, E. and Tian, J. (2015). Recovering causal effects from selection bias. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 3475–3481.

Bhattacharya, R., Nabi, R., Shpitser, I., and Robins, J. M. (2019). Identification in missing data models represented by directed acyclic graphs. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence.

Chen, B., Kumor, D., and Bareinboim, E. (2017). Identification and model testing in linear structural equation models using auxiliary variables. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 757–766.

Dahabreh, I. J., Robertson, S. E., Tchetgen, E. J. T., Stuart, E. A., and Hernan, M. A. (2018). Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics.

DiCiccio, T. J. and Efron, B. (1996). Bootstrap confidence intervals. Statistical Science, 11(3):189–212.

Glymour, C., Zhang, K., and Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10:524.

Graudal, N. A., Hubeck-Graudal, T., and Jurgens, G. (2012). Effects of low-sodium diet vs. high-sodium diet on blood pressure, renin, aldosterone, catecholamines, cholesterol, and triglyceride (Cochrane Review). American Journal of Hypertension, 25(1):1–15.

Hartman, E., Grieve, R., Ramsahai, R., and Sekhon, J. S. (2015). From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. Journal of the Royal Statistical Society: Series A (Statistics in Society), 178(3):757–778.

He, F. J., Li, J., and MacGregor, G. A. (2013). Effect of longer term modest salt reduction on blood pressure: Cochrane systematic review and meta-analysis of randomised trials. BMJ, 346:f1325.

He, F. J. and MacGregor, G. A. (2002). Effect of modest salt reduction on blood pressure: a meta-analysis of randomized trials. Implications for public health. Journal of Human Hypertension, 16(11):761.

He, F. J. and MacGregor, G. A. (2009). A comprehensive review on salt and health and current experience of worldwide salt reduction programmes. Journal of Human Hypertension, 23(6):363.

Higgins, J. and Green, S. (2011). Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0. The Cochrane Collaboration.

Karppanen, H. and Mervaala, E. (1998). Sodium intake and mortality. The Lancet, 351(9114):1509.

Karvanen, J. (2015). Study design in causal models. Scandinavian Journal of Statistics, 42(2):361–377.

Kelly, J., Khalesi, S., Dickinson, K., Hines, S., Coombes, J. S., and Todd, A. S. (2016). The effect of dietary sodium modification on blood pressure in adults with systolic blood pressure less than 140 mmHg: a systematic review. JBI Database of Systematic Reviews and Implementation Reports, 14(6):196–237.

Lee, S., Correa, J. D., and Bareinboim, E. (2019). General identifiability with arbitrary surrogate experiments. In Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence.

Matthay, E. C. and Glymour, M. M. (2020). A graphical catalog of threats to validity: Linking social science with epidemiology. Epidemiology, 31(3):376–384.

Mohan, K., Pearl, J., and Tian, J. (2013). Graphical models for inference with missing data. In Advances in Neural Information Processing Systems, volume 26, pages 1277–1285.

O’Muircheartaigh, C. and Hedges, L. V. (2014). Generalizing from unrepresentative experiments: a stratified propensity score approach. Journal of the Royal Statistical Society: Series C (Applied Statistics), 63(2):195–210.

Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.

Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press, second edition.

Poulter, N. R., Prabhakaran, D., and Caulfield, M. (2015). Hypertension. The Lancet, 386(9995):801–812.

Rosenman, E., Owen, A. B., and Baiocchi, M. (2018). Propensity score methods for merging observational and experimental datasets. arXiv preprint arXiv:1804.07863.

Shpitser, I., Mohan, K., and Pearl, J. (2015). Missing data as a causal and probabilistic problem. In Meila, M. and Heskes, T., editors, Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, pages 802–811. AUAI Press.

Shpitser, I. and Pearl, J. (2006). Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence – Volume 2, pages 1219–1226. AAAI Press.

Takahashi, Y., Sasaki, S., Okubo, S., Hayashi, M., and Tsugane, S. (2006). Blood pressure change in a free-living population-based dietary modification study in Japan. Journal of Hypertension, 24(3):451–458.

Textor, J., Hardt, J., and Knuppel, S. (2011). DAGitty: a graphical tool for analyzing causal diagrams. Epidemiology, 22(5):745.

Textor, J., van der Zander, B., Gilthorpe, M. S., Liskiewicz, M., and Ellison, G. T. (2016). Robust causal inference using directed acyclic graphs: the R package dagitty. International Journal of Epidemiology, 45(6):1887–1894.

Tikka, S., Hyttinen, A., and Karvanen, J. (2019a). dosearch: Causal effect identification from multiple incomplete data sources. R package version 1.0.2.

Tikka, S., Hyttinen, A., and Karvanen, J. (2019b). Identifying causal effects via context-specific independence relations. In Advances in Neural Information Processing Systems, volume 32, pages 2800–2810.

Tikka, S., Hyttinen, A., and Karvanen, J. (2020). Causal effect identification from multiple incomplete data sources: a general search-based approach. Journal of Statistical Software, accepted for publication on the condition of a minor revision (https://arxiv.org/abs/1902.01073).

Tikka, S. and Karvanen, J. (2017). Identifying causal effects with the R package causaleffect. Journal of Statistical Software, 76(12):1–30.

Tikka, S. and Karvanen, J. (2019). Surrogate outcomes and transportability. International Journal of Approximate Reasoning, 108:21–37.

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38(3):239–266.

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3):1–48.

WHO (2003). Joint WHO/FAO expert consultation on diet, nutrition and the prevention of chronic diseases, volume 916 of WHO Technical Report Series. WHO, Geneva.

Appendix A: R code for the examples

library(dosearch)

#####################
# Graphs of Figure 1
# "->" is a directed edge, "<->" a bidirected edge (unobserved confounding).

graph1a <- "
X -> Z
Z -> Y
X <-> Y
"

graph1b <- "
X -> Z
Z -> Y
X <-> Z
"

graph1c <- "
X -> Z
Z -> Y
X <-> Y
X <-> Z
"

graph1d <- "
X -> Z
Z -> Y
Z <-> Y
"

graph1e <- "
X -> Z
Z -> Y
Z <-> Y
X <-> Y
"

graph1f <- "
X -> Z
Z -> Y
W -> X
W -> Z
W -> Y
X <-> Y
"

graph1g <- "
X -> Z
Z -> Y
W -> X
W -> Z
W -> Y
X <-> Y
X <-> Z
"

graphs <- c(graph1a, graph1b, graph1c, graph1d, graph1e, graph1f, graph1g)

# Each element is one scenario; distributions within a scenario are
# separated by line breaks.
datasources <- c(
  "P(X,Y,Z)",
  "P(X,Z)
P(Y|do(Z))",
  "P(Z|do(X))
P(Y|do(Z))",
  "P(Z,Y)
P(Z|do(X))",
  "P(X,Z)
P(X,Y)
P(Z,Y)
P(Z|do(X))",
  "P(X,Y,Z,W)",
  "P(X,Z,W)
P(Y|do(Z),W)",
  "P(Z|do(X),W)
P(Y|do(Z),W)",
  "P(Z|do(X),W)
P(Y|do(Z),W)
P(W)"
)

query <- "P(Y|do(X))"

n <- length(datasources)
m <- length(graphs)

# id[i, j] is "Y" or "N" for data source scenario i and graph j;
# formula[i, j] holds the identifying formula when one exists.
id <- matrix("NA", n, m)
formula <- matrix("", n, m)

for (i in seq_len(n)) {
  for (j in seq_len(m)) {
    result <- dosearch(datasources[i], query, graphs[j])
    id[i, j] <- ifelse(result$identifiable, "Y", "N")
    if (result$identifiable) formula[i, j] <- result$formula
  }
}

#####################
# Graph of Figure 3

graph3 <- "
X -> Z
Z -> Y
W -> X
W -> Z
W -> Y
H -> X
H -> Z
X <-> Y
H <-> C
"

datasources3 <- c(
  "P(X,Z,H,W)
P(Y|do(Z),W)"
)

query3 <- "P(Y|do(X))"

result3 <- dosearch(datasources3, query3, graph3)

#####################
# Graphs of Figure 4
# R_X, R_Y and R_Z are the response indicators of X, Y and Z.

graph4a <- "
X -> Z
Z -> Y
X <-> Y
X -> R_Z
Y -> R_Z
R_X <-> R_Z
R_X <-> R_Y
R_Z <-> R_Y
"

graph4b <- "
X -> Z
Z -> Y
X <-> Y
Y -> R_Y
R_Y -> R_X
R_Y -> R_Z
R_X <-> R_Z
R_X <-> R_Y
R_Z <-> R_Y
"

graphs4 <- c(graph4a, graph4b)

# X*, Y* and Z* are the observed versions of X, Y and Z that may be missing.
datasources4 <- c(
  "P(X*,Y*,Z*,R_X,R_Y,R_Z)",
  "P(X*,Y*,Z*,R_X,R_Y,R_Z)
P(Y)",
  "P(X*,Y*,Z*,R_X,R_Y,R_Z)
P(R_Y|Y)"
)

# Missing data mechanisms: "R_X : X" declares R_X as the response indicator of X.
mdxyz <- "R_X : X, R_Y : Y, R_Z : Z"
mdxz <- "R_X : X, R_Z : Z"

# Only the first n4 entries of md are used in the loop below.
md <- c(mdxyz, mdxz, mdxyz, mdxyz)

n4 <- length(datasources4)
m4 <- length(graphs4)

id4 <- matrix("NA", n4, m4)
formula4 <- matrix("", n4, m4)

for (i in seq_len(n4)) {
  for (j in seq_len(m4)) {
    cat(i, j, "\n")  # progress indicator
    result <- dosearch(datasources4[i], query, graphs4[j], missing_data = md[i])
    id4[i, j] <- ifelse(result$identifiable, "Y", "N")
    if (result$identifiable) formula4[i, j] <- result$formula
  }
}


Appendix B: Proofs

Here we provide proofs that the scenarios marked as non-identifiable in Table 1 really are non-identifiable. According to the definition, a causal effect is non-identifiable if there exist two models, $M_1$ and $M_2$, that share the input distributions but differ in the causal effect of interest.

(1) Data sources: $P(X,Z)$, $P(Y \mid \mathrm{do}(Z))$. For Figures 1(b), 1(c), 1(f) and 1(g), we define:

\[
M_1 = \begin{cases}
P^1(U = 1) = \frac{1}{2} \\
P^1(X = 1 \mid U = 1) = p \\
P^1(X = 1 \mid U = 0) = 1 - p \\
P^1(Z = 1 \mid X = 0, U) = \frac{1}{2} \\
P^1(Z = 1 \mid X = 1, U = 1) = q \\
P^1(Z = 1 \mid X = 1, U = 0) = 1 - q \\
P^1(Y = 1 \mid Z = 1) = a \\
P^1(Y = 1 \mid Z = 0) = b
\end{cases}
\qquad
M_2 = \begin{cases}
P^2(U = 1) = p \\
P^2(X = 1 \mid U) = \frac{1}{2} \\
P^2(Z = 1 \mid X = 0, U) = \frac{1}{2} \\
P^2(Z = 1 \mid X = 1, U = 1) = q \\
P^2(Z = 1 \mid X = 1, U = 0) = 1 - q \\
P^2(Y = 1 \mid Z = 1) = a \\
P^2(Y = 1 \mid Z = 0) = b.
\end{cases}
\]

It follows that

\[
\begin{aligned}
P^1(X = 0, Z) &= P^2(X = 0, Z) = \frac{1}{4} \\
P^1(X = 1, Z = 1) &= P^2(X = 1, Z = 1) = \frac{qp + (1 - q)(1 - p)}{2} \\
P^1(X = 1, Z = 0) &= P^2(X = 1, Z = 0) = \frac{(1 - q)p + q(1 - p)}{2} \\
P^1(Y = 1 \mid \mathrm{do}(Z = 1)) &= P^2(Y = 1 \mid \mathrm{do}(Z = 1)) = a \\
P^1(Y = 1 \mid \mathrm{do}(Z = 0)) &= P^2(Y = 1 \mid \mathrm{do}(Z = 0)) = b,
\end{aligned}
\]

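but

\[
\begin{aligned}
P^1(Y = 1 \mid \mathrm{do}(X = 1)) &= \sum_{Z,U} P^1(Y = 1 \mid Z)\, P^1(Z \mid X = 1, U)\, P^1(U) \\
&= aq\tfrac{1}{2} + a(1 - q)\tfrac{1}{2} + b(1 - q)\tfrac{1}{2} + bq\tfrac{1}{2} \\
&= \frac{a}{2} + \frac{b}{2} \\
P^2(Y = 1 \mid \mathrm{do}(X = 1)) &= aqp + a(1 - q)(1 - p) + b(1 - q)p + bq(1 - p).
\end{aligned}
\]

Thus $P^1(Y = 1 \mid \mathrm{do}(X = 1)) \neq P^2(Y = 1 \mid \mathrm{do}(X = 1))$ when $a \neq b$, $p \neq \frac{1}{2}$ and $q \neq \frac{1}{2}$. For Figures 1(d) and 1(e), we define: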

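\[
M_1 = \begin{cases}
P^1(U = 1) = \frac{1}{2} \\
P^1(X = 1) = \frac{1}{2} \\
P^1(Z = 1 \mid X, U = 1) = p \\
P^1(Z = 1 \mid X, U = 0) = 1 - p \\
P^1(Y = 1 \mid Z = 0, U) = \frac{1}{2} \\
P^1(Y = 1 \mid Z = 1, U = 1) = a \\
P^1(Y = 1 \mid Z = 1, U = 0) = b
\end{cases}
\qquad
M_2 = \begin{cases}
P^2(U = 1) = p \\
P^2(X = 1) = \frac{1}{2} \\
P^2(Z = 1 \mid X, U) = \frac{1}{2} \\
P^2(Y = 1 \mid Z = 0, U) = \frac{1}{2} \\
P^2(Y = 1 \mid Z = 1, U) = \frac{a + b}{2}.
\end{cases}
\]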

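It follows that

\[
P^1(X, Z) = P^2(X, Z) = \frac{1}{4},
\]

\[
\begin{aligned}
P^1(Y = 1 \mid \mathrm{do}(Z = 1)) &= \sum_{X,U} P^1(Y = 1 \mid Z = 1, U)\, P^1(X)\, P^1(U) \\
&= \sum_{U} P^1(Y = 1 \mid Z = 1, U)\, P^1(U) \\
&= a\tfrac{1}{2} + b\tfrac{1}{2} \\
&= \frac{a + b}{2}\, p + \frac{a + b}{2}\, (1 - p) \\
&= P^2(Y = 1 \mid \mathrm{do}(Z = 1))
\end{aligned}
\]

and

\[
P^1(Y = 1 \mid \mathrm{do}(Z = 0)) = P^2(Y = 1 \mid \mathrm{do}(Z = 0)) = \frac{1}{2},
\]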

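but

\[
\begin{aligned}
P^1(Y = 1 \mid \mathrm{do}(X = 1)) &= \sum_{Z,U} P^1(Y = 1 \mid Z, U)\, P^1(Z \mid X = 1, U)\, P^1(U) \\
&= ap\tfrac{1}{2} + b(1 - p)\tfrac{1}{2} + \tfrac{1}{2}(1 - p)\tfrac{1}{2} + \tfrac{1}{2}p\tfrac{1}{2} \\
&= \frac{ap}{2} + \frac{b(1 - p)}{2} + \frac{1}{4} \\
P^2(Y = 1 \mid \mathrm{do}(X = 1)) &= \frac{a + b}{2}\,\tfrac{1}{2}\,p + \frac{a + b}{2}\,\tfrac{1}{2}\,(1 - p) + \tfrac{1}{2}\,\tfrac{1}{2}\,p + \tfrac{1}{2}\,\tfrac{1}{2}\,(1 - p) \\
&= \frac{a + b}{4} + \frac{1}{4}.
\end{aligned}
\]

Thus $P^1(Y = 1 \mid \mathrm{do}(X = 1)) \neq P^2(Y = 1 \mid \mathrm{do}(X = 1))$ when $a \neq b$ and $p \neq \frac{1}{2}$.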
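The two constructions given for the data sources in (1) can be checked numerically. The following base R sketch evaluates the shared input distributions and the causal effect for one arbitrary choice of parameters. The function names and the values of p, q, a and b are illustrative assumptions and not part of the proof.

```r
p <- 0.7; q <- 0.8; a <- 0.9; b <- 0.2   # arbitrary values with a != b, p != 1/2, q != 1/2

# First construction: U is a common cause of X and Z.
# pu1 = P(U = 1); px1_u = P(X = 1 | U = 0), P(X = 1 | U = 1).
first_construction <- function(pu1, px1_u) {
  pu  <- c(1 - pu1, pu1)      # P(U = 0), P(U = 1)
  pz1 <- c(1 - q, q)          # P(Z = 1 | X = 1, U = 0), P(Z = 1 | X = 1, U = 1)
  list(
    xz = c(x0z0 = sum(pu * (1 - px1_u)) / 2,
           x0z1 = sum(pu * (1 - px1_u)) / 2,
           x1z0 = sum(pu * px1_u * (1 - pz1)),
           x1z1 = sum(pu * px1_u * pz1)),
    effect = sum(pu * (a * pz1 + b * (1 - pz1)))   # P(Y = 1 | do(X = 1))
  )
}
m1 <- first_construction(1 / 2, c(1 - p, p))
m2 <- first_construction(p, c(1 / 2, 1 / 2))

# Second construction: U is a common cause of Z and Y.
# pz1_u = P(Z = 1 | X, U = 0/1); py1_z1_u = P(Y = 1 | Z = 1, U = 0/1).
second_construction <- function(pu1, pz1_u, py1_z1_u) {
  pu <- c(1 - pu1, pu1)
  list(
    xz1    = sum(pu * pz1_u) / 2,       # P(X = x, Z = 1), using P(X = 1) = 1/2
    do_z1  = sum(pu * py1_z1_u),        # P(Y = 1 | do(Z = 1))
    effect = sum(pu * (pz1_u * py1_z1_u + (1 - pz1_u) / 2))
  )
}
n1 <- second_construction(1 / 2, c(1 - p, p), c(b, a))
n2 <- second_construction(p, c(1 / 2, 1 / 2), rep((a + b) / 2, 2))

print(rbind(M1 = m1$xz, M2 = m2$xz))            # identical rows
print(c(M1 = m1$effect, M2 = m2$effect))        # different effects
print(c(M1 = n1$effect, M2 = n2$effect))        # different effects
```

In both constructions, P(Y = 1 | do(Z)) agrees between the models by definition, so equal rows of P(X, Z) together with unequal effects confirm non-identifiability for the chosen parameter values.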


For Figure 1(d), we define:

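\[
M_1 = \begin{cases}
P^1(U = 1) = \frac{1}{2} \\
P^1(X = 1) = \frac{2}{5} \\
P^1(Z = 1 \mid X = 1, U = 1) = \frac{2}{5} \\
P^1(Z = 1 \mid X = 1, U = 0) = \frac{7}{20} \\
P^1(Z = 1 \mid X = 0, U = 1) = \frac{3}{10} \\
P^1(Z = 1 \mid X = 0, U = 0) = \frac{2}{5} \\
P^1(Y = 1 \mid Z = 1, U = 1) = \frac{1}{5} \\
P^1(Y = 1 \mid Z = 1, U = 0) = \frac{3}{10} \\
P^1(Y = 1 \mid Z = 0, U = 1) = \frac{3}{10} \\
P^1(Y = 1 \mid Z = 0, U = 0) = \frac{13}{20}
\end{cases}
\qquad
M_2 = \begin{cases}
P^2(U = 1) = \frac{1}{2} \\
P^2(X = 1) = \frac{2}{5} \\
P^2(Z = 1 \mid X = 1, U = 1) = \frac{1}{10} \\
P^2(Z = 1 \mid X = 1, U = 0) = \frac{13}{20} \\
P^2(Z = 1 \mid X = 0, U = 1) = \frac{1}{2} \\
P^2(Z = 1 \mid X = 0, U = 0) = \frac{1}{5} \\
P^2(Y = 1 \mid Z = 1, U = 1) = \frac{1}{5} \\
P^2(Y = 1 \mid Z = 1, U = 0) = \frac{3}{10} \\
P^2(Y = 1 \mid Z = 0, U = 1) = \frac{3}{10} \\
P^2(Y = 1 \mid Z = 0, U = 0) = \frac{13}{20}
\end{cases}
\]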

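It follows that

\[
\begin{aligned}
P^1(Y = 1, Z = 1) &= P^2(Y = 1, Z = 1) = \frac{91}{1000} \\
P^1(Y = 1, Z = 0) &= P^2(Y = 1, Z = 0) = \frac{601}{2000} \\
P^1(Y = 0, Z = 1) &= P^2(Y = 0, Z = 1) = \frac{269}{1000} \\
P^1(Y = 0, Z = 0) &= P^2(Y = 0, Z = 0) = \frac{679}{2000}
\end{aligned}
\]

and

\[
\begin{aligned}
P^1(Z = 1 \mid \mathrm{do}(X = 1)) &= P^2(Z = 1 \mid \mathrm{do}(X = 1)) = \frac{3}{8} \\
P^1(Z = 1 \mid \mathrm{do}(X = 0)) &= P^2(Z = 1 \mid \mathrm{do}(X = 0)) = \frac{7}{20},
\end{aligned}
\]

but

\[
P^1(Y = 1 \mid \mathrm{do}(X = 1)) = \frac{63}{160} \neq \frac{57}{160} = P^2(Y = 1 \mid \mathrm{do}(X = 1)).
\]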
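The numerical construction for Figure 1(d) can be verified with a few lines of base R. The sketch below assumes the value 13/20 for $P^1(Y = 1 \mid Z = 0, U = 0)$, $P^2(Y = 1 \mid Z = 0, U = 0)$ and $P^2(Z = 1 \mid X = 1, U = 0)$, since these are the values that reproduce the marginals and effects stated above. The helper name summaries and the matrix layout are illustrative.

```r
px1 <- 2 / 5                       # P(X = 1), the same in both models
pu  <- c(u0 = 1 / 2, u1 = 1 / 2)   # P(U = 0), P(U = 1)

# Rows: value of X; columns: value of U; entries: P(Z = 1 | X, U).
pz_m1 <- rbind(x0 = c(u0 = 2 / 5, u1 = 3 / 10), x1 = c(u0 = 7 / 20,  u1 = 2 / 5))
pz_m2 <- rbind(x0 = c(u0 = 1 / 5, u1 = 1 / 2),  x1 = c(u0 = 13 / 20, u1 = 1 / 10))

# Rows: value of Z; columns: value of U; entries: P(Y = 1 | Z, U) (both models).
py <- rbind(z0 = c(u0 = 13 / 20, u1 = 3 / 10), z1 = c(u0 = 3 / 10, u1 = 1 / 5))

summaries <- function(pz, py) {
  pz1_u <- pu * ((1 - px1) * pz["x0", ] + px1 * pz["x1", ])   # P(Z = 1, U = u)
  pz0_u <- pu - pz1_u                                         # P(Z = 0, U = u)
  c(y1z1 = sum(pz1_u * py["z1", ]),
    y1z0 = sum(pz0_u * py["z0", ]),
    y0z1 = sum(pz1_u * (1 - py["z1", ])),
    y0z0 = sum(pz0_u * (1 - py["z0", ])),
    z1_do_x1 = sum(pu * pz["x1", ]),
    z1_do_x0 = sum(pu * pz["x0", ]),
    # P(Y = 1 | do(X = 1)) = sum over z, u of P(Y = 1 | z, u) P(z | X = 1, u) P(u)
    y1_do_x1 = sum(pu * (pz["x1", ] * py["z1", ] + (1 - pz["x1", ]) * py["z0", ])))
}

s1 <- summaries(pz_m1, py)
s2 <- summaries(pz_m2, py)
print(rbind(M1 = s1, M2 = s2))
```

The first six columns, which correspond to the inputs $P(Z, Y)$ and $P(Z \mid \mathrm{do}(X))$, coincide for the two models, whereas the last column differs.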

(4) Data sources: $P(X,Z)$, $P(X,Y)$, $P(Z,Y)$, $P(Z \mid \mathrm{do}(X))$. For Figures 1(a), 1(c), 1(e), 1(f) and 1(g), we use the first construction of (3). It remains to show that $P^1(X,Z) = P^2(X,Z)$ and $P^1(X,Y) = P^2(X,Y)$ also hold. A simple computation gives:

\[
\begin{aligned}
P^1(X = 1, Y = 1) &= P^2(X = 1, Y = 1) = \frac{33}{128} \\
P^1(X = 1, Y = 0) &= P^2(X = 1, Y = 0) = \frac{15}{128} \\
P^1(X = 0, Y = 1) &= P^2(X = 0, Y = 1) = \frac{161}{640} \\
P^1(X = 0, Y = 0) &= P^2(X = 0, Y = 0) = \frac{239}{640}.
\end{aligned}
\]

We know that $P^1(U) = P^2(U)$, $P^1(X \mid U) = P^2(X \mid U)$ and $P^1(Z \mid X) = P^2(Z \mid X)$, which means that $P^1(X,Z) = P^2(X,Z)$ as well.

(5) Data sources: $P(X,Z,W)$, $P(Y \mid \mathrm{do}(Z), W)$. For Figures 1(b), 1(c), 1(d) and 1(e), the data sources are essentially the same as in (1) since $W$ is unconnected. Thus the constructions of (1) are also applicable here. For Figure 1(g) we define:

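\[
M_1 = \begin{cases}
P^1(U_1 = 1) = \frac{1}{2} \\
P^1(U_2 = 1) = \frac{1}{2} \\
P^1(W = 1) = \frac{1}{2} \\
P^1(X = 1 \mid W, U_1 = 1, U_2) = p \\
P^1(X = 1 \mid W, U_1 = 0, U_2) = 1 - p \\
P^1(Z = 1 \mid W, X = 0, U_1) = \frac{1}{2} \\
P^1(Z = 1 \mid W, X = 1, U_1 = 1) = q \\
P^1(Z = 1 \mid W, X = 1, U_1 = 0) = 1 - q \\
P^1(Y = 1 \mid W, Z = 1, U_2) = a \\
P^1(Y = 1 \mid W, Z = 0, U_2) = b
\end{cases}
\qquad
M_2 = \begin{cases}
P^2(U_1 = 1) = p \\
P^2(U_2 = 1) = \frac{1}{2} \\
P^2(W = 1) = \frac{1}{2} \\
P^2(X = 1 \mid W, U_1, U_2) = \frac{1}{2} \\
P^2(Z = 1 \mid W, X = 0, U_1) = \frac{1}{2} \\
P^2(Z = 1 \mid W, X = 1, U_1 = 1) = q \\
P^2(Z = 1 \mid W, X = 1, U_1 = 0) = 1 - q \\
P^2(Y = 1 \mid W, Z = 1, U_2) = a \\
P^2(Y = 1 \mid W, Z = 0, U_2) = b.
\end{cases}
\]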

The parametrization is very similar to (1) and it follows that

\[
\begin{aligned}
P^1(X = 0, Z, W) &= P^2(X = 0, Z, W) = \frac{1}{8} \\
P^1(X = 1, Z = 1, W) &= P^2(X = 1, Z = 1, W) = \frac{qp + (1 - q)(1 - p)}{4} \\
P^1(X = 1, Z = 0, W) &= P^2(X = 1, Z = 0, W) = \frac{(1 - q)p + q(1 - p)}{4} \\
P^1(Y = 1 \mid \mathrm{do}(Z = 1), W) &= P^2(Y = 1 \mid \mathrm{do}(Z = 1), W) = a \\
P^1(Y = 1 \mid \mathrm{do}(Z = 0), W) &= P^2(Y = 1 \mid \mathrm{do}(Z = 0), W) = b.
\end{aligned}
\]

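However

\[
\begin{aligned}
P^1(Y = 1 \mid \mathrm{do}(X = 1)) &= \sum_{Z,W,U_1,U_2} P^1(Y = 1 \mid W, Z, U_2)\, P^1(Z \mid X = 1, W, U_1)\, P^1(W)\, P^1(U_1)\, P^1(U_2) \\
&= \frac{1}{8} \sum_{Z,W,U_1,U_2} P^1(Y = 1 \mid Z)\, P^1(Z \mid X = 1, W, U_1) \\
&= \frac{1}{2}\left(aq + a(1 - q) + b(1 - q) + bq\right) \\
&= \frac{a}{2} + \frac{b}{2}
\end{aligned}
\]

[...] and $q \neq \frac{1}{2}$.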

(7) Data sources: $P(Z \mid \mathrm{do}(X), W)$, $P(Y \mid \mathrm{do}(Z), W)$, $P(W)$. For Figures 1(d) and 1(e), the data sources are essentially the same as in (6) and (2) since $W$ is unconnected. Thus the construction of (2) is also applicable here.
