Making inferences using random forests and its application to precision medicine
Hunyong Cho
April 2018
Background
What Machine Learning techniques cannot do
▶ Machine learning techniques achieve high predictive performance, but
▶ they often lack valid statistical inference. "Can we use neural networks to support the effectiveness of a new drug?"
▶ Statisticians are working to close this gap.
Literature
Some efforts toward inference with random forest models
▶ Consistent random forests: Lin and Jeon (2006), Meinshausen (2006), Scornet et al. (2015)
▶ Simplified forests: Biau (2012), Biau et al. (2008)
▶ Variance estimators: Mentch and Hooker (2016), Sexton and Laake (2009)
▶ Wager and Athey (2017) proved consistency and asymptotic normality of random forests of a certain kind.
How can random forests be used for inference?
Consistency and asymptotic normality

Consistency. Under certain conditions,
\[
|\mathbb{E}[\hat\mu(x)] - \mu(x)| = O\!\left( s^{-\frac{\pi \log((1-\alpha)^{-1})}{2d \log(\alpha^{-1})}} \right),
\]
where $s$ is the subsample size of the member trees and $\alpha \le 0.2$.

Asymptotic normality. Under certain conditions,
\[
\frac{\hat\mu_n(x) - \mu(x)}{\sigma_n(x)} \to_d \mathcal{N}(0,1) \ \text{ for a sequence } \sigma_n(x) \to 0,
\qquad
\hat V_{IJ}(x)/\sigma_n^2(x) \to_p 1,
\]
where $s_n \asymp n^\beta$ for some $\beta_{\min} < \beta < 1$, and $\hat V_{IJ}(x)$ is the infinitesimal jackknife variance estimator.
Preliminaries - Trees

Tree
▶ Partition the feature space into two at each node.
▶ Exhaustively search for the best feature and best cut point: "best" in terms of minimizing MSE (regression) or node impurity (classification).
▶ Recursion and a stopping rule (e.g., node size, deviance reduction): $T(x; \xi, Z_1, \dots, Z_n)$.

Random forest (Breiman, 2001)
1. Bootstrap or randomly subsample the data.
2. Build a tree by finding the best split among a random set of features at each node.
3. Repeat Steps 1 and 2 to obtain B trees.
4. Aggregate the B trees (a code sketch follows below):
\[
RF(x; Z_1, \dots, Z_n) = \frac{1}{B} \sum_{b=1}^{B} T(x; \xi^*_b, Z^*_{b1}, \dots, Z^*_{bs}).
\]
Biau and Scornet (2016) give a good review of the theoretical and methodological development of random forests.
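As a rough illustration of the subsample-and-aggregate scheme above (a minimal sketch, not the implementation studied in the paper; the function names, the subsample rate, and mtry are illustrative choices), using scikit-learn's DecisionTreeRegressor as the base learner:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def build_forest(X, y, B=500, s=None, mtry=None, min_leaf=5, seed=0):
    """Sketch of a subsampled random forest: B trees, each grown on a random
    subsample of size s, with mtry candidate features considered per split."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    s = s or int(n ** 0.7)           # subsample size s_n ~ n^beta (illustrative beta)
    mtry = mtry or max(1, d // 3)    # random feature subset at each node
    trees, subsamples = [], []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)    # subsampling (not bootstrap)
        tree = DecisionTreeRegressor(max_features=mtry, min_samples_leaf=min_leaf,
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
        subsamples.append(idx)
    return trees, subsamples

def predict_forest(trees, X_new):
    """RF(x) = (1/B) sum_b T_b(x): average the member trees' predictions."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)
```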
Preliminaries - Potential Nearest Neighbors
▶ Conventional k-NN
[Figure 1: Illustration of a Euclidean 6-NN]
▶ k-PNN (potential nearest neighbor; Lin and Jeon, 2006)
[Figure 2: Illustration of a 1-PNN]
▶ Trees are k-PNN predictors. Thus, they can be written as $T(x) = \sum_{i=1}^{n} S_i Y_i$.
Preliminaries - Potential Nearest Neighbors
RF - Ensemble of potential nearest neighbor predictors
▶ Random forests consist of trees trained on random subsamples.
▶ Each tree is a PNN predictor, $T_b(x) = \sum_{i=1}^{n} S_{bi} Y_i$, so averaging the trees averages their weights.
▶ ⇒ $\widehat{RF}_n(x) = \sum_{i=1}^{n} w_i(x) Y_i$ (weights sketched in code below).
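To make the weighted-average view concrete, here is a minimal sketch that recovers the forest weights $w_i(x)$ from fitted trees (it assumes the trees, subsamples, X, y objects from the earlier build_forest sketch; the helper name is hypothetical): within each tree, training points that share the query point's leaf get equal weight, and the per-tree weights are averaged over trees.

```python
import numpy as np

def forest_weights(trees, subsamples, X_train, x0, n):
    """Return w_i(x0), i = 1..n, so that sum_i w_i(x0) * y_i equals the forest prediction."""
    w = np.zeros(n)
    for tree, idx in zip(trees, subsamples):
        leaf_x = tree.apply(x0.reshape(1, -1))[0]   # leaf containing the query point
        leaf_train = tree.apply(X_train[idx])       # leaves of this tree's subsample
        members = idx[leaf_train == leaf_x]         # subsample points sharing x0's leaf
        if members.size:
            w[members] += 1.0 / members.size        # equal weight within the leaf
    return w / len(trees)                           # average the weights over the B trees

# Sanity check (illustrative): np.dot(forest_weights(trees, subsamples, X, x0, len(y)), y)
# should match predict_forest(trees, x0.reshape(1, -1))[0].
```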
Preliminaries - Honesty

Conventional trees
▶ Construct the partition using the whole data.
▶ Estimate the leaf means using the same data again.
▶ May be biased.

Honest trees (Athey and Imbens, 2016)
▶ Split the data into two halves.
▶ Construct the partition (i.e., determine the splits) using one half.
▶ Estimate the leaf means using the other half.

Honest trees - Properties
▶ Guarantees that $S_i$ and $Y_i$ are independent. ⇒ Enables quantifying $\mathrm{var}(T)$ (needed for consistency and normality).
▶ May lose some efficiency (a single tree does not use all of the information for estimation).
▶ But can be better in terms of MSE (because of the reduced bias). A code sketch of honest estimation follows below.
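A rough sketch of honesty (a hypothetical helper built on the earlier DecisionTreeRegressor-based code, not the authors' implementation): fit the tree structure on one half of the subsample, then replace each leaf's prediction with the mean response of the other half falling in that leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_honest_tree(X, y, idx, rng, min_leaf=5):
    """Honest tree: half of the subsample idx determines the splits,
    the other half supplies the leaf means."""
    idx = rng.permutation(idx)
    split_idx, est_idx = idx[: len(idx) // 2], idx[len(idx) // 2:]
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf,
                                 random_state=int(rng.integers(1 << 31)))
    tree.fit(X[split_idx], y[split_idx])        # structure from the splitting half
    leaves = tree.apply(X[est_idx])             # route the estimation half down the tree
    leaf_mean = {leaf: y[est_idx][leaves == leaf].mean() for leaf in np.unique(leaves)}
    return tree, leaf_mean

def predict_honest(tree, leaf_mean, X_new, fallback=np.nan):
    """Predict with the honest leaf means (fallback if a leaf saw no estimation data)."""
    return np.array([leaf_mean.get(leaf, fallback) for leaf in tree.apply(X_new)])
```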
Preliminaries - Random split
Conventional trees
▶ Not every feature is guaranteed a positive chance of being chosen for a split.

Random-split trees
▶ Random-split trees give every feature a guaranteed positive chance of being split on: "At every step of the tree-growing procedure, the probability that the next split occurs along the j-th feature is bounded below by $\pi/d$ for some $0 < \pi \le 1$, for all $j = 1, \dots, d$."

Random-split trees - Properties
▶ The partition cell containing x becomes arbitrarily small in every coordinate as n grows. (⇒ Key for consistency; a simple construction is sketched below.)
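One simple way to satisfy the $\pi/d$ lower bound (my own illustrative sketch, not the paper's construction) is to flip a $\pi$-coin at each node: with probability $\pi$ choose the split feature uniformly at random, otherwise let the usual greedy search decide.

```python
import numpy as np

def choose_split_feature(d, greedy_best, pi=0.1, rng=None):
    """With probability pi, pick a split feature uniformly at random (prob. pi/d each);
    otherwise use the greedy (e.g., MSE-minimizing) choice. Every feature therefore
    has probability at least pi/d of being the next split variable."""
    rng = rng or np.random.default_rng()
    if rng.random() < pi:
        return int(rng.integers(d))   # uniform over the d features
    return greedy_best                # index of the greedily chosen feature
```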
Preliminaries - Other concepts
α-regular trees
▶ A tree is called α-regular if
  ▶ after each split, both daughter nodes contain at least a fraction α of the observations in the parent node, and
  ▶ the tree is fully grown to depth k for some $k \in \mathbb{N}$ (i.e., each terminal node contains between $k$ and $2k - 1$ observations).

Symmetric trees
▶ A predictor is symmetric if
  ▶ its output does not depend on the order in which the sample is indexed (i.e., $\hat\mu(x; X_1, X_2, \dots, X_n) = \hat\mu(x; X_{h_1}, X_{h_2}, \dots, X_{h_n})$ for any permutation $h \equiv (h_1, \dots, h_n)$ of $(1, \dots, n)$).
Consistency
Theorem 1 (Infinitesimal leaves). Suppose that a tree T is regular and random-split, and that $X_1, \dots, X_s \sim_{ind} U([0,1]^d)$. Then for any $0 < \eta < 1$ and for large enough s,
\[
P\!\left[ \mathrm{diam}_j(L(x)) \ge \left( \frac{s}{2k-1} \right)^{-\frac{0.99\,\pi (1-\eta) \log((1-\alpha)^{-1})}{d \log(\alpha^{-1})}} \right]
\le \left( \frac{s}{2k-1} \right)^{-\frac{\eta^2 \pi}{2d \log(\alpha^{-1})}},
\]
where $L(x)$ is the leaf containing x.

Theorem 2 (Consistency). Assume that the conditions of Theorem 1 hold, that $\mu(x)$ is Lipschitz continuous, and that the random forest is made up of honest trees with subsample size s. Then for $\alpha \le 0.2$,
\[
|\mathbb{E}[\hat\mu(x)] - \mu(x)| = O\!\left( s^{-\frac{\pi \log((1-\alpha)^{-1})}{2d \log(\alpha^{-1})}} \right).
\]
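As a quick numeric illustration of the rate (my own arithmetic, not from the slides): with $\alpha = 0.2$, $\pi = 1$, and $d = 2$ features, the bias exponent is
\[
\frac{\pi \log((1-\alpha)^{-1})}{2d \log(\alpha^{-1})}
= \frac{\log(1.25)}{4 \log(5)}
\approx \frac{0.223}{6.44}
\approx 0.035,
\]
so the bias decays only like $s^{-0.035}$; the slow rate is one reason the normality result requires the subsample size $s_n$ to grow nearly as fast as n.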
Asymptotic normality
Classical CLT?
▶ Can we use the classical CLT?
▶ Note that
\[
\widehat{RF}_n(x) = \frac{1}{B} \sum_{b=1}^{B} T_{\hat\theta_b}(x), \quad x \in \mathcal{X},
\qquad \text{where } T_{\hat\theta_b}(x) = \sum_{i=1}^{s} S_i Y_i,
\quad \text{so } \widehat{RF}_n(x) = \sum_{i=1}^{n} w_i(x) Y_i.
\]
▶ If the $Y_i$'s were i.i.d. and $Y_i \perp S_i$, a weighted CLT could be used. But the terms are neither identically distributed nor independent.
▶ Honesty delivers the independence condition. ⇒ Still not enough.
Asymptotic normality - Hoeffding decomposition
Hájek projection
▶ Decompose a tree:
\[
T = \underbrace{\mathbb{E}[T] + \sum_{i=1}^{n} \big(\mathbb{E}[T \mid Z_i] - \mathbb{E}[T]\big)}_{=: \mathring{T},\ \text{the "Hájek projection"}}
+ \sum_{i<j} \big(\mathbb{E}[T \mid Z_i, Z_j] - \mathring{T}\big) + \cdots
\]
Note that $\mathring{T}$ is a sum of i.i.d. random variables with finite variance, so the CLT applies. Thus, if we can show that $\mathring{T}$ and T are very similar, then $T(x) \to_d \mathcal{N}$.
▶ But how similar is $\mathring{T}$ to T? Hájek's theorem: if $\lim_{n\to\infty} \mathrm{Var}[\mathring{T}]/\mathrm{Var}[T] = 1$, then $\lim_{n\to\infty} \mathbb{E}[\|\mathring{T} - T\|_2^2]/\mathrm{Var}[T] = 0$.
▶ So it suffices to check whether $\mathring{T}$ captures almost all of the variability of T.
Asymptotic normality - incrementality
Projected trees are not that close
▶ For an individual tree, the Hájek projection $\mathring{T}$ does not capture most of the variability of T.
▶ But we do know that trees are ν(s)-incremental, i.e.,
\[
\frac{\mathrm{Var}[\mathring{T}(x; Z_1, \dots, Z_s)]}{\mathrm{Var}[T(x; Z_1, \dots, Z_s)]} \gtrsim \nu(s),
\]
where $\nu(s) = C_{f,d}/\log(s)^d$ (Theorem 3) and $\gtrsim$ denotes asymptotic inequality in the lim inf sense.

But the projected random forest is close
▶ Wager and Athey showed that the projected random forest is close to the random forest (i.e., as an ensemble learner it is 1-incremental):
\[
\mathbb{E}\big[(\hat\mu(x) - \mathring{\hat\mu}(x))^2\big] \le \Big(\frac{s}{n}\Big)^2 \mathrm{Var}[T(x; \xi, Z_1, \dots, Z_s)].
\]
Asymptotic normality - theoretical results

Theorem 3 (T is ν-incremental). Suppose that $X_1, X_2, \dots \sim_{iid} D([0,1]^d)$ with some density f, that $\mu(x)$ and $\mu_2(x) \equiv \mathbb{E}[Y^2 \mid X = x]$ are Lipschitz continuous at x, that $\mathrm{Var}[Y \mid X = x] > 0$, and that T is an honest, k-regular, symmetric tree. Then

T is ν(s)-incremental at x,

where $\nu(s) = C_{f,d}/\log(s)^d$ and $C_{f,d} = 2^{-2(d+1)}(d-1)!$.

Theorem 4 ($\hat\mu(x)$ is 1-incremental). Let $\hat\mu(x)$ be a random forest with base learner T satisfying the conditions of Theorem 3. Then
\[
\mathbb{E}\big[(\hat\mu(x) - \mathring{\hat\mu}(x))^2\big] \le \Big(\frac{s}{n}\Big)^2 \mathrm{Var}[T(x; \xi, Z_1, \dots, Z_s)]
\]
whenever $\mathrm{Var}[T(x; \xi, Z_1, \dots, Z_s)]$ is finite.
Asymptotic normality - theoretical results, continued

Theorem 5 ($\hat\mu(x)$ is asymptotically normal). Assume the conditions of Theorem 4. Moreover, assume that $\lim_{n\to\infty} s_n = \infty$, $\lim_{n\to\infty} s_n \log(n)^d / n = 0$, and that $\mathbb{E}[|Y - \mathbb{E}[Y \mid X = x]|^{2+\delta} \mid X = x] \le M$ for some constants $\delta, M > 0$, uniformly over all $x \in [0,1]^d$. Then there exists a sequence $\sigma_n(x) \to 0$ such that
\[
\frac{\hat\mu_n(x) - \mathbb{E}[\hat\mu_n(x)]}{\sigma_n(x)} \to_d \mathcal{N}(0, 1).
\]
Asymptotic normality - theoretical results, continued

Infinitesimal jackknife variance estimator. Wager, Hastie, and Tibshirani (2014) proposed the "infinitesimal jackknife" variance estimator for random forests:
\[
\hat V_{IJ}(x; Z_1, \dots, Z_n)
= \sum_{i=1}^{n} \mathrm{Cov}_*\big[N^*_i,\ T(x; Z^*_1, \dots, Z^*_s)\big]^2
\approx \sum_{i=1}^{n} C_i^2 - \frac{s(n-s)}{n} \frac{\hat v}{B},
\]
where $(Z^*_1, \dots, Z^*_s)$ is a subsample of $(Z_1, \dots, Z_n)$, $N^*_i$ is the number of times $Z_i$ appears in the subsample, $C_i = \frac{1}{B}\sum_{b=1}^{B} (N^*_{bi} - s/n)(T^*_b - \bar T^*)$, and $\hat v = \frac{1}{B}\sum_{b=1}^{B} (T^*_b - \bar T^*)^2$. (A code sketch of this estimator follows below.)

Theorem 6 (Consistency of the IJ variance estimator). Let $\hat V_{IJ}(x; Z_1, \dots, Z_n)$ be the infinitesimal jackknife for random forests defined above. Then under the conditions of Theorem 5,
\[
\hat V_{IJ}(x; Z_1, \dots, Z_n) / \sigma_n^2(x) \to_p 1.
\]
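A rough sketch of the finite-B formula above (it assumes the trees, subsamples, and predict_forest objects from the earlier forest sketch; variable names are illustrative), followed by a normal-approximation confidence interval:

```python
import numpy as np

def infinitesimal_jackknife(tree_preds, N, s):
    """tree_preds: length-B array of per-tree predictions T*_b at a point x.
    N: (B, n) matrix with N[b, i] = number of times Z_i appears in tree b's subsample.
    Returns the bias-corrected infinitesimal jackknife variance estimate at x."""
    B, n = N.shape
    T_bar = tree_preds.mean()
    C = ((N - s / n) * (tree_preds - T_bar)[:, None]).mean(axis=0)   # C_i, length n
    v_hat = ((tree_preds - T_bar) ** 2).mean()
    return np.sum(C ** 2) - s * (n - s) / n * v_hat / B

# Illustrative use at a query point x0 (trees, subsamples, s from build_forest):
# tree_preds = np.array([t.predict(x0.reshape(1, -1))[0] for t in trees])
# N = np.zeros((len(trees), len(y)))
# for b, idx in enumerate(subsamples):
#     N[b, idx] = 1
# var_hat = infinitesimal_jackknife(tree_preds, N, s)
# point = predict_forest(trees, x0.reshape(1, -1))[0]
# ci = point + 1.96 * np.sqrt(var_hat) * np.array([-1.0, 1.0])   # 95% CI
```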
Application to precision medicine
Estimating heterogeneous treatment effects
▶ The parameter of interest shifts to
\[
\tau(x) := \mathbb{E}[Y^{A=1} - Y^{A=0} \mid X = x].
\]
▶ Combine the random forest theory with causal inference theory.
  Key assumptions: conditional exchangeability, positivity.
  Other assumptions: causal consistency, SUTVA.
Normality of Causal Random Forests
▶ The b-th causal tree (Wager and Athey, 2017) is
\[
\hat\tau_b(x) = \frac{\sum_{\{i: A_i = 1,\ X_i \in L_b(x)\}} Y_i}{|\{i: A_i = 1,\ X_i \in L_b(x)\}|}
- \frac{\sum_{\{i: A_i = 0,\ X_i \in L_b(x)\}} Y_i}{|\{i: A_i = 0,\ X_i \in L_b(x)\}|}.
\]
▶ The causal random forest (Wager and Athey, 2017) is then
\[
\hat\tau(x) = \frac{1}{B} \sum_{b=1}^{B} \hat\tau_b(x),
\qquad
\frac{\hat\tau(x) - \tau(x)}{\sqrt{\mathrm{Var}[\hat\tau(x)]}} \to_d \mathcal{N}(0, 1).
\]
A simple difference-in-means sketch of these estimators follows below.
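As a minimal illustration (a plain difference-in-means sketch built on the earlier forest code, not the grf/Wager-Athey implementation; helper names are hypothetical), each tree contributes the treated-minus-control mean difference within its leaf around x, and the forest averages these:

```python
import numpy as np

def causal_tree_effect(tree, X, y, a, idx, x0):
    """Treated-minus-control mean difference within the leaf of `tree` containing x0,
    computed over this tree's subsample idx; nan if either arm is empty in the leaf."""
    leaf = tree.apply(x0.reshape(1, -1))[0]
    in_leaf = idx[tree.apply(X[idx]) == leaf]
    y1 = y[in_leaf][a[in_leaf] == 1]
    y0 = y[in_leaf][a[in_leaf] == 0]
    if y1.size == 0 or y0.size == 0:
        return np.nan
    return y1.mean() - y0.mean()

def causal_forest_effect(trees, subsamples, X, y, a, x0):
    """tau_hat(x0): average of the per-tree leaf-wise mean differences."""
    taus = [causal_tree_effect(tree, X, y, a, idx, x0)
            for tree, idx in zip(trees, subsamples)]
    return np.nanmean(taus)
```

In practice the trees would be grown honestly (as sketched earlier) and paired with the infinitesimal jackknife variance estimate to form confidence intervals for τ(x).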
References - 1/2
Athey, S., & Imbens, G. (2016). Recursive partitioning forheterogeneous causal effects. Proceedings of the National Academyof Sciences, 113(27).
Biau, G. (2012). Analysis of a random forests model. The Journal ofMachine Learning Research, 13. 1063–1095.
Biau, G., Devroye, L., & Lugosi, G. (2008). Consistency of randomforests and other averaging classifiers. The Journal of MachineLearning Research, 9. 2015–2033.
Biau, G., & Scornet, E. (2016). A random forest guided tour. Test,25(2), 197-227.
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5-32.
Lin, Y., & Jeon, Y. (2006). Random forests and adaptive nearestneighbors. Journal of the American Statistical Association, 101(474),578-590.
Making inferences using Random Forests 20/21 Hunyong Cho
References - 2/2
Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7, 983–999.
Mentch, L., & Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26), 1–41.
Sexton, J., & Laake, P. (2009). Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3), 801–811.
Scornet, E., Biau, G., & Vert, J.-P. (2015). Consistency of random forests. The Annals of Statistics, 43(4), 1716–1741.
Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.