Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineering, University of Texas at Austin, at MLconf SF 2016
TRANSCRIPT
![Page 1: Alex Dimakis, Associate Professor, Dept. of Electrical and Computer Engineering, University of Texas at Austin at MLconf SF 2016](https://reader035.vdocument.in/reader035/viewer/2022070603/58727c231a28abc7068b5635/html5/thumbnails/1.jpg)
Causal Inference: A Friendly Introduction

Alex Dimakis, UT Austin

Based on joint work with Murat Kocaoglu, Karthik Shanmugam, Sriram Vishwanath, and Babak Hassibi
Overview

• What is causal inference
• Interventions and how to design them
• What to do if you cannot intervene
Disclaimer

• There are many frameworks of causality:
• For time series: Granger causality
• Potential Outcomes / Counterfactuals framework (Imbens & Rubin)
• Pearl's structural equation models, aka causal graph models
• Additive models, Dawid's decision-oriented approach, information geometry, and many others…
Overview

• What is causal inference
  • Directed graphical models and conditional independence
  • That's not it.
• Interventions and how to design them
• What to do if you cannot intervene
Independence of random variables

S: Heavy smoker
C: Lung cancer before 60

Observational data:

| S | C |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 0 | 1 |
| 1 | … |

How to check if S is independent of C?
Joint pdf and independence

S: Heavy smoker
C: Lung cancer before 60

Joint pdf estimated from the observational data:

|     | S=0    | S=1    |
|-----|--------|--------|
| C=0 | 30/100 | 10/100 |
| C=1 | 20/100 | 40/100 |

How to check if S is independent of C?
Joint pdf and independence

S: Heavy smoker
C: Lung cancer before 60

Joint pdf estimated from the observational data:

|     | S=0    | S=1    |
|-----|--------|--------|
| C=0 | 30/100 | 10/100 |
| C=1 | 20/100 | 40/100 |

How to check if S is independent of C? Compare P(S,C) with P(S)P(C).

Marginals: P(S=0) = P(S=1) = 0.5; P(C=0) = 0.4, P(C=1) = 0.6. Since P(S=0, C=0) = 0.3 ≠ 0.5 × 0.4 = 0.2, S and C are dependent.
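The check above can be sketched in a few lines of Python; the joint pdf is the table from the slide, and NumPy is assumed to be available:

```python
import numpy as np

# Joint pdf from the slide (rows: C=0,1; columns: S=0,1).
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_s = joint.sum(axis=0)   # marginal of S: [0.5, 0.5]
p_c = joint.sum(axis=1)   # marginal of C: [0.4, 0.6]

# Independence would mean P(S,C) == P(S)P(C) in every cell.
product = np.outer(p_c, p_s)
independent = np.allclose(joint, product)
print(independent)  # False: e.g. P(S=0,C=0)=0.30 but P(S=0)P(C=0)=0.20
```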
Directed graphical models

A → B → C

Given data on A, B, C we can estimate the joint pdf p(A, B, C) and see if it factorizes as P(A,B,C) = P(A) P(B|A) P(C|B), i.e. has some conditional independencies.

A directed graphical model describes all distributions that have a given set of conditional independencies. This one encodes A ⫫ C | B:

P(C|A,B) = P(C|B)
P(A,C|B) = P(A|B) P(C|B)

| A | B | C |
|---|---|---|
| 0 | 1 | 0 |
| 1 | 1 | 1 |
| … | … | … |
Directed graphical models

• Learning a directed graphical model = learning all conditional independencies in the data.
• Learning a causal graph is not the same as learning a directed graphical model.
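As a sketch of what "factorizes" means operationally, here is a toy check on simulated binary data from a chain A → B → C. The noise levels are made up for illustration, and the crude table comparison stands in for a proper conditional-independence test:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Sample a Markov chain A -> B -> C, so A ⫫ C | B holds by construction.
a = rng.integers(0, 2, n)
b = a ^ (rng.random(n) < 0.3).astype(int)   # noisy copy of A
c = b ^ (rng.random(n) < 0.3).astype(int)   # noisy copy of B

def cond_indep(a, b, c, tol=0.02):
    """Crude empirical check that P(C|A,B) ≈ P(C|B) for binary data."""
    for bv in (0, 1):
        mask_b = b == bv
        p_c_given_b = c[mask_b].mean()
        for av in (0, 1):
            mask_ab = mask_b & (a == av)
            if abs(c[mask_ab].mean() - p_c_given_b) > tol:
                return False
    return True

print(cond_indep(a, b, c))  # True: the data factorizes as P(A) P(B|A) P(C|B)
```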
Smoking causes cancer

S: Heavy smoker
C: Lung cancer before 60

Joint pdf estimated from the observational data:

|     | S=0    | S=1    |
|-----|--------|--------|
| C=0 | 30/100 | 10/100 |
| C=1 | 20/100 | 40/100 |
Causality = mechanism

[Diagram: S → C, generating the joint Pr(S,C)]
Causality = mechanism

S → C: the joint Pr(S,C) factorizes as Pr(S) Pr(C|S).

Pr(S): Pr(S=0) = 0.5, Pr(S=1) = 0.5

Pr(C|S):

|     | S=0   | S=1   |
|-----|-------|-------|
| C=0 | 30/50 | 10/50 |
| C=1 | 20/50 | 40/50 |
Universe 1

S → C, with mechanism C = F(S, E), E ⫫ S.

Pr(S): Pr(S=0) = 0.5, Pr(S=1) = 0.5

Pr(C|S):

|     | S=0   | S=1   |
|-----|-------|-------|
| C=0 | 30/50 | 10/50 |
| C=1 | 20/50 | 40/50 |

Pr(S,C) = Pr(S) Pr(C|S).
Universe 2

[Diagram: C → S]
Universe 2

C → S, with mechanism S = F(C, E), E ⫫ C.

Pr(C): Pr(C=0) = 0.4, Pr(C=1) = 0.6

Pr(S|C):

|     | C=0                 | C=1                 |
|-----|---------------------|---------------------|
| S=0 | 30/(100×0.4) = 0.75 | 20/(100×0.6) ≈ 0.33 |
| S=1 | 10/(100×0.4) = 0.25 | 40/(100×0.6) ≈ 0.67 |

Pr(S,C) = Pr(C) Pr(S|C).
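Both universes reproduce exactly the same observational joint, which is the crux of the problem. A quick numerical check, using the slide's joint pdf:

```python
import numpy as np

joint = np.array([[0.30, 0.10],   # rows: C=0,1; columns: S=0,1
                  [0.20, 0.40]])

p_s = joint.sum(axis=0)              # [0.5, 0.5]
p_c = joint.sum(axis=1)              # [0.4, 0.6]
p_c_given_s = joint / p_s            # each column sums to 1
p_s_given_c = joint / p_c[:, None]   # each row sums to 1

# Universe 1 (S -> C) and Universe 2 (C -> S) rebuild the identical joint:
u1 = p_c_given_s * p_s
u2 = p_s_given_c * p_c[:, None]
print(np.allclose(u1, joint), np.allclose(u2, joint))  # True True
```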
How to find the causal direction?

Candidate 1 (S → C): Pr(S,C) = Pr(S) Pr(C|S), mechanism C = F(S, E) with E ⫫ S.
How to find the causal direction?

Candidate 1 (S → C): Pr(S,C) = Pr(S) Pr(C|S), mechanism C = F(S, E) with E ⫫ S.
Candidate 2 (C → S): Pr(S,C) = Pr(C) Pr(S|C), mechanism S = F′(C, E′) with E′ ⫫ C.
How to find the causal direction?

Candidate 1 (S → C): Pr(S,C) = Pr(S) Pr(C|S), mechanism C = F(S, E) with E ⫫ S.
Candidate 2 (C → S): Pr(S,C) = Pr(C) Pr(S|C), mechanism S = F′(C, E′) with E′ ⫫ C.

• It is impossible to find the true causal direction from observational data for two random variables (unless we make more assumptions).
• You need interventions, i.e. messing with the mechanism.
• For more than two random variables there is a rich theory, and some directions can be learned without interventions (Spirtes et al.).
Overview

• What is causal inference
  • Directed graphical models and conditional independence
  • That's not it.
• Interventions and how to design them
• What to do if you cannot intervene
Intervention: force people to smoke

Pr(S): Pr(S=0) = 0.5, Pr(S=1) = 0.5

Pr(C|S):

|     | S=0   | S=1   |
|-----|-------|-------|
| C=0 | 30/50 | 10/50 |
| C=1 | 20/50 | 40/50 |

• Flip a coin and force each person to smoke or not, with probability ½.
• In Universe 1 (i.e. under S → C), the new joint pdf stays the same as before the intervention.
Intervention: force people to smoke

• Flip a coin and force each person to smoke or not, with probability ½.
• In Universe 2 (under C → S), S and C will become independent after the intervention.

Pr(C): Pr(C=0) = 0.4, Pr(C=1) = 0.6

Pr(S|C):

|     | C=0                 | C=1                 |
|-----|---------------------|---------------------|
| S=0 | 30/(100×0.4) = 0.75 | 20/(100×0.6) ≈ 0.33 |
| S=1 | 10/(100×0.4) = 0.25 | 40/(100×0.6) ≈ 0.67 |
Intervention: force people to smoke

• Flip a coin and force each person to smoke or not, with probability ½.
• In Universe 2 (under C → S), S and C will become independent after the intervention.
• So check correlation on data after the intervention and find the true direction!

Pr(C): Pr(C=0) = 0.4, Pr(C=1) = 0.6

Pr(S|C):

|     | C=0                 | C=1                 |
|-----|---------------------|---------------------|
| S=0 | 30/(100×0.4) = 0.75 | 20/(100×0.6) ≈ 0.33 |
| S=1 | 10/(100×0.4) = 0.25 | 40/(100×0.6) ≈ 0.67 |
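A small simulation of this argument, with the Universe 2 mechanism probabilities taken from the slide's tables. The independence check is a crude covariance threshold, not a formal statistical test:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

def universe2(intervene=False):
    """C -> S mechanism from the slide: Pr(C=1)=0.6, Pr(S=1|C) from the table."""
    c = (rng.random(n) < 0.6).astype(int)
    if intervene:
        s = (rng.random(n) < 0.5).astype(int)   # coin flip overrides the mechanism
    else:
        p_s1 = np.where(c == 1, 40 / 60, 10 / 40)   # Pr(S=1|C)
        s = (rng.random(n) < p_s1).astype(int)
    return s, c

def dependent(s, c, tol=0.01):
    # Empirical covariance E[SC] - E[S]E[C], compared to a small threshold.
    return abs(np.mean(s * c) - s.mean() * c.mean()) > tol

s_obs, c_obs = universe2(intervene=False)
print(dependent(s_obs, c_obs))   # True: observationally S and C are correlated
s_int, c_int = universe2(intervene=True)
print(dependent(s_int, c_int))   # False: intervening on S breaks the dependence
```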
Who does interventions like that?

— You're giving dying people sugar pills?
More variables

[Figure: true causal DAG on variables S1–S7]
More variables

[Figure: true causal DAG on variables S1–S7]

From observational data we can learn conditional independencies and obtain the skeleton (we lose the directions).
More variables

[Figure: true causal DAG on S1–S7, and its skeleton]

From observational data we can learn conditional independencies and obtain the skeleton (we lose the directions).
PC Algorithm (Spirtes et al., Meek)

[Figure: skeleton on S1–S7]

There are a few directions we can learn from observational data (immoralities, Meek rules).

Spirtes, Glymour, Scheines 2001 (PC algorithm); C. Meek, 1995; Andersson, Madigan, Perlman, 1997.
How interventions reveal directions

[Figure: skeleton on S1–S7 with intervened set S = {S1, S2, S4} highlighted]

We choose a subset S of the variables and intervene (i.e. force random values).
How interventions reveal directions

[Figure: skeleton on S1–S7 with intervened set S = {S1, S2, S4} highlighted]

We choose a subset S of the variables and intervene (i.e. force random values).

The directions of the edges between S and Sᶜ are revealed.
How interventions reveal directions

[Figure: skeleton on S1–S7 with intervened set S = {S1, S2, S4} highlighted]

We choose a subset S of the variables and intervene (i.e. force random values).

The directions of the edges between S and Sᶜ are revealed.

Re-apply the PC algorithm + Meek rules to possibly learn a few more edges.
Learning Causal DAGs

[Figure: skeleton on S1–S7]

Given a skeleton graph, how many interventions are needed to learn all directions?

• A priori fixed set of interventions (non-adaptive)
Learning Causal DAGs

[Figure: skeleton on S1–S7]

Given a skeleton graph, how many interventions are needed to learn all directions?

• A priori fixed set of interventions (non-adaptive)
• Adaptive
• Randomized adaptive
Learning Causal DAGs

[Figure: skeleton on S1–S7]

Given a skeleton graph, how many interventions are needed to learn all directions?

• A priori fixed set of interventions (non-adaptive). Theorem (Hauser & Buhlmann 2014): log(χ) interventions suffice (χ = chromatic number of the skeleton).
• Adaptive? (NIPS 2015): adaptivity does not help (in the worst case).
• Randomized adaptive (Li, Vetta, NIPS 2014): log log(n) interventions suffice with high probability for a complete skeleton.
A good algorithm for general graphs
Overview

• What is causal inference
• Interventions and how to design them
• What to do if you cannot intervene
  • Make more assumptions
  • Compare on a standard benchmark
Data-driven causality

• How to find the causal direction without interventions?
• Impossible for two variables. Possible under assumptions.
• Popular assumption (additive models): Y = F(X) + E, with E ⫫ X (Shimizu et al., Hoyer et al., Peters et al., Chen et al., Mooij et al.)
• Entropic causality: use information theory for general data-driven causality: Y = F(X, E), with E ⫫ X.
• (Related work: Janzing, Mooij, Zhang, Lemeire: no additivity assumption, but no noise: Y = F(X).)
Conclusions

• Learning causal graphs with interventions is an ongoing field of research.
• Tetrad project (CMU): http://www.phil.cmu.edu/projects/tetrad/
• When time is present, more things can be done (difference-in-differences method, Granger, potential outcomes, etc.)
• Additive models and entropic causality can be used for data-driven causal inference.
Pointers

• Tuebingen benchmark: https://webdav.tuebingen.mpg.de/cause-effect/
• http://www.phil.cmu.edu/projects/tetrad/
• https://github.com/mkocaoglu/Entropic-Causality
• P. Spirtes, C. Glymour and R. Scheines. Causation, Prediction, and Search. Bradford Books, 2001.
• J. Pearl. Causality. Cambridge University Press, 2009.
• G. Imbens and D. Rubin. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction.
• https://www.youtube.com/watch?v=9yEYZURoE3Y&feature=youtu.be — CCD Summer Short Course 2016, CMU Center for Causal Discovery.
• Jonas Peters, Peter Buehlmann and Nicolai Meinshausen (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B.
• K. Shanmugam, M. Kocaoglu, A. G. Dimakis, S. Vishwanath. Learning causal graphs with small interventions. NIPS 2015.
• Frederick Eberhardt, Clark Glymour, and Richard Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among n variables.
• Alain Hauser and Peter Buhlmann. Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939, 2014.
• Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems, 2009.
• Janzing, Dominik, et al. Information-geometric approach to inferring causal directions. Artificial Intelligence 182 (2012).
• Peters, Jonas, Dominik Janzing, and Bernhard Scholkopf. Causal inference on discrete data using additive noise models. IEEE Transactions on Pattern Analysis and Machine Intelligence 33.12 (2011).
fin
Learning Causal DAGs

Theorem: log(χ) interventions suffice.
Proof: 1. Color the vertices (a legal coloring).

[Figure: skeleton on S1–S7, properly colored]
Learning Causal DAGs

Theorem: log(χ) interventions suffice.
Proof: 1. Color the vertices.
2. Form a table with the binary representations of the colors (Red: 00, Green: 01, Blue: 10):

| Vertex | Bit 1 | Bit 2 |
|--------|-------|-------|
| S1     | 0     | 0     |
| S2     | 0     | 1     |
| S3     | 1     | 0     |
| S4     | 0     | 1     |
| S5     | 1     | 0     |
| S6     | 0     | 1     |
| S7     | 1     | 0     |

[Figure: skeleton on S1–S7]
Learning Causal DAGs

Theorem: log(χ) interventions suffice.
Proof: 1. Color the vertices.
2. Form a table with the binary representations of the colors (Red: 00, Green: 01, Blue: 10), as above.
3. Each intervention is indexed by a column of this table: intervention i is the set of vertices whose i-th bit is 1.
Learning Causal DAGs

Theorem: log(χ) interventions suffice.
Proof: 1. Color the vertices.
2. Form a table with the binary representations of the colors, as above.
3. Each intervention is indexed by a column of this table.
4. For any edge, its two vertices have different colors, so their binary representations differ in at least one bit. Hence for some intervention, one endpoint is in the intervened set and the other is not, and that edge's direction is learned. QED.
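The construction in this proof can be sketched as code. The edge list below is made up for illustration (the slides show the actual skeleton only as a figure), and a simple greedy proper coloring stands in for whatever coloring one prefers:

```python
from math import ceil, log2

# Hypothetical skeleton on S1..S7 (illustrative edge set, not the slides' figure).
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 7), (6, 7)]
nodes = range(1, 8)

# 1. Greedy proper coloring of the skeleton.
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v); adj[v].add(u)
color = {}
for v in nodes:
    taken = {color[u] for u in adj[v] if u in color}
    color[v] = min(c for c in range(len(adj) + 1) if c not in taken)

# 2. One intervention per bit of the colors' binary representation:
#    intervention i contains the vertices whose color has bit i set.
bits = max(1, ceil(log2(max(color.values()) + 1)))
interventions = [{v for v in nodes if (color[v] >> i) & 1} for i in range(bits)]

# 3. Endpoints of an edge have different colors, so some bit differs, so some
#    intervention contains exactly one endpoint: the direction is revealed.
def separated(u, v):
    return any((u in s) != (v in s) for s in interventions)

print(all(separated(u, v) for u, v in edges))  # True
```

With this particular edge set the greedy coloring is 2-colorable, so a single intervention already separates every edge; denser skeletons need up to ⌈log₂(χ)⌉.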
Learning Causal DAGs

[Figure: skeleton on S1–S7]

Ongoing research on several problems:

• What if the size of the intervention sets is limited? (NIPS 2015)
• What if some variables cannot be intervened on?
Major problem: size of interventions

[Figure: skeleton on S1–S7 with intervened set S = {S1, S2, S4}]

We choose a subset S of the variables and intervene (i.e. force random values).

Question: If each intervention has size up to k, how many interventions do we need?

Eberhardt: A separating system on χ elements with weight k is sufficient to produce a non-adaptive causal inference algorithm.

A separating system on n elements with weight k is a {0,1} matrix with n distinct columns, each row having weight at most k. Rényi, Katona, and Wegener characterize the size of (n, k) separating systems.
Major problem: size of interventions

[Figure: skeleton on S1–S7 with intervened set S = {S1, S2, S4}]

Open problem: Is a separating system necessary, or can adaptive algorithms do better?

(NIPS 2015): For complete graph skeletons, separating systems are necessary, even for adaptive algorithms. We can use lower bounds on the size of separating systems to get lower bounds on the number of interventions.

Randomized adaptive: log log n interventions.

Our result: (n/k) log log k interventions suffice, each of size up to k.
Entropic Causality

• Extra slides
Entropic Causality

• Given data Xi, Yi.
• Search over explanations assuming X → Y: Y = F(X, E), with E ⫫ X. Simplest explanation: the one that minimizes H(E).
• Search in the other direction, assuming Y → X: X = F′(Y, E′), with E′ ⫫ Y.
• If H(E′) << H(E), decide Y → X.
• If H(E) << H(E′), decide X → Y.
• If H(E) and H(E′) are close, say "don't know".
Entropic Causality in pictures

S → C: C = F(S, E), with E ⫫ S and H(E) small.
C → S: S = F′(C, E′), with E′ ⫫ C and H(E′) big.
Entropic Causality in pictures

S → C: C = F(S, E), with E ⫫ S and H(E) small.
C → S: S = F′(C, E′), with E′ ⫫ C and H(E′) big.

• You may be thinking that minimizing H(E) is like minimizing H(C|S).
• But it is fundamentally different.
• (We'll prove it's NP-hard to compute.)
Question 1: Identifiability?

• If data is generated from X → Y, i.e. Y = f(X, E), with E ⫫ X and H(E) small:
• Is it true that all possible reverse explanations X = f′(Y, E′), with E′ ⫫ Y, must have H(E′) big, for all f′, E′?

• Theorem 1: If X, E, f are generic, then identifiability holds for H0 (the support of the distribution of E′ must be large).
• Conjecture 1: The same result holds for H1 (Shannon entropy).
Question 2: How to find the simplest explanation?

• Minimum entropy coupling problem: Given marginal distributions U1, U2, …, Un, find the joint distribution that has these as marginals and has minimal entropy. (NP-hard; Kovacevic et al. 2012.)
• Theorem 2: Finding the simplest data explanation f, E is equivalent to solving the minimum entropy coupling problem.
• How to use: We propose a greedy algorithm that empirically performs reasonably well.
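A minimal sketch of such a greedy heuristic: repeatedly match the largest remaining mass in every marginal, give their minimum to one joint outcome, and subtract it. This illustrates the idea and is not necessarily the exact algorithm proposed in the talk:

```python
from math import log2

def greedy_min_entropy_coupling(marginals):
    """Greedy heuristic for a low-entropy coupling of the given marginals."""
    rem = [sorted(m, reverse=True) for m in marginals]
    joint = []
    while max(r[0] for r in rem) > 1e-12:
        m = min(r[0] for r in rem)   # mass assignable to a single joint cell
        joint.append(m)
        for r in rem:                # subtract it from every marginal's top mass
            r[0] -= m
            r.sort(reverse=True)
    return joint

def H(p):
    return -sum(x * log2(x) for x in p if x > 1e-12)

marginals = [[0.5, 0.5], [0.4, 0.6]]
coupling = greedy_min_entropy_coupling(marginals)
print(round(H(coupling), 3))  # 1.361 bits; any coupling has at least max_i H(U_i)
```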
Proof idea

• Consider Y = f(X, E), with X, Y over an alphabet of size n.
• p_{i,j} = P(Y = i | X = j) = P(f(X, E) = i | X = j) = P(f_j(E) = i), since E ⫫ X.

[Figure: the distribution of E (masses e1, …, em) mapped by f1 to the distribution of Y conditioned on X = 1 (masses p_{1,1}, …, p_{n,1})]

• Each conditional probability is a subset sum of the distribution of E.
• S_{i,j}: the index set for p_{i,j}.
Performance on the Tübingen dataset

• Decision rate: the fraction of pairs on which the algorithm makes a decision.
• A decision is made when |H(X, E) − H(Y, E′)| > t (t determines the decision rate).
• Confidence intervals based on the number of datapoints.
• Slightly better than ANMs.
Conclusions 2

• Introduced a new framework for data-driven causality for two variables.
• Established identifiability for generic distributions for H0 entropy. Conjectured it holds for Shannon entropy.
• Inspired by Occam's razor. Natural and different from prior works.
• Natural for categorical variables (additive models do not work there).
• Proposed a practical greedy algorithm using Shannon entropy.
• Empirically performs very well on artificial and real causal datasets.
Existing theory: Additive Noise Models

• Assume Y = f(X) + E, with X ⫫ E.
• Identifiability:
  • If f is nonlinear, then there exist no g and N ⫫ Y such that X = g(Y) + N (almost surely).
  • If E is non-Gaussian, there exist no g and N ⫫ Y such that X = g(Y) + N.
• Performs 63% on real data.*
• Drawback: additivity is a restrictive functional assumption.

* Cause Effect Pairs dataset: https://webdav.tuebingen.mpg.de/cause-effect/
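A toy sketch of the additive-noise direction test. Here a polynomial fit and a crude residual-dependence score stand in for the Gaussian-process regression and HSIC independence tests used in the literature, and the data-generating mechanism is made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Ground truth X -> Y with a nonlinear mechanism and additive noise.
x = rng.uniform(-2, 2, n)
y = x ** 3 + rng.normal(0, 1, n)

def residual_dependence(cause, effect, deg=5):
    """Fit effect ~ poly(cause) and score how much the residuals still depend
    on the cause (correlation of squared residuals with the squared cause is
    a crude heteroscedasticity proxy, not a real independence test)."""
    coeffs = np.polyfit(cause, effect, deg)
    resid = effect - np.polyval(coeffs, cause)
    return abs(np.corrcoef(resid ** 2, cause ** 2)[0, 1])

forward = residual_dependence(x, y)    # X -> Y: residuals ~ independent noise
backward = residual_dependence(y, x)   # Y -> X: residuals stay dependent
print(forward < backward)  # True: the additive model fits only the true direction
```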
Existing theory: Independence of Cause and Mechanism

• The function f is chosen "independently" from the distribution of X by nature.
• Notion of independence: assign a variable to f, check a log-slope integral.
• Boils down to: X causes Y if h(Y) < h(X) [h: differential entropy].
• Drawbacks:
  • No exogenous variable assumption (deterministic X–Y relation).
  • Continuous variables only.
Our approach

• Consider discrete variables X, Y, E.
• Use total input (Rényi) entropy as a measure of complexity; choose the simpler model.
• Assumption: the (Rényi) entropy of the exogenous variable E is small.
• Theoretical guarantees for H0 Rényi entropy (cardinality): the causal direction is (almost surely) identifiable if E has small cardinality.
Performance of greedy joint entropy minimization

• For each n, n marginal distributions, each with n states, are randomly generated.
• The minimum joint entropy obtained by the greedy algorithm is at most 1 bit away from the largest marginal entropy, max_i H(X_i).
Results: Shannon-entropy-based identifiability

• Generate distributions of X, Y by randomly selecting f, X, E.
• Probability of success is the fraction of trials where H(X, E) < H(Y, N).
• Larger n drives the probability of success to 1 when H(E) < log(n), supporting the conjecture.
Characterization of conditionals

• Define the conditional distribution vectors p_j = [p_{1,j}, …, p_{n,j}]ᵀ and stack them: p = [p_1ᵀ, p_2ᵀ, …, p_nᵀ]ᵀ.
• Then p = M e, where M is a block partition matrix: each block of length n is a partitioning of the columns.
General position argument

• Suppose Y|X = j is uniform over the simplex (not realistic; a toy example).
• Note: let x_i ∼ exp(1). Then x / Σ_i x_i is a uniform random vector over the simplex.
• Drop n rows of p to make it (almost) i.i.d.
• Claim: there does not exist an E with H0 < n(n−1).
• Proof: Assume otherwise. Then the rows of M are linearly dependent, so ∃ a such that aᵀM = 0, and then aᵀp = 0. This implies a random hyperplane is orthogonal to a fixed vector, which has probability 0.
Our contribution

• Nature chooses X, E, f; a joint distribution over X, Y is implied.
• Choose X, E randomly over the simplex.
• Derive X|Y from the induced joint.
• Any N ⫫ Y for which X = g(Y, N) corresponds to a non-zero polynomial being zero, which has probability 0.
Formal result

• X, Y discrete random variables with cardinality n.
• Y = f(X, E), where E ⫫ X is also discrete.
• f is generic (a technical condition to avoid edge cases, true in real data).
• The distribution vectors of X, E are uniformly randomly sampled from the simplex.
• Then, with probability 1, there does not exist N ⫫ Y and g satisfying X = g(Y, N).
Working with Shannon entropy

• Given Y|X, finding E with minimum Shannon entropy such that some f satisfies Y = f(X, E) is equivalent to: given the marginal distributions of n variables X_i, find the joint distribution with minimum entropy.
• This is an NP-hard problem.
• We propose a greedy algorithm (that produces a local optimum).