
Making inferences using random forests and its application to precision medicine

Hunyong Cho

April 2018


Background

What Machine Learning techniques cannot do

- Machine learning methods give high predictive performance, but

- valid statistical inference is lacking. "Can we use neural networks to support the effectiveness of a new drug?"

- Statisticians have been working to close this gap.


Literature

Some efforts on making inference with random forest models

- Consistent random forests: Lin and Jeon (2006), Meinshausen (2006), Scornet et al. (2015)

- Simplified forests: Biau (2012), Biau et al. (2008)

- Variance estimators: Mentch and Hooker (2016), Sexton and Laake (2009)

- Wager and Athey (2017) proved consistency and asymptotic normality of random forests of a certain kind.


How can random forests be used for inference?

Consistency and Asymptotic normality

Consistency
Under certain conditions,

|E[µ̂(x)] − µ(x)| = O( s^{−π log((1−α)^{−1}) / (2d log(α^{−1}))} ),

where s is the subsample size of the member trees and α ≤ 0.2.

Asymptotic Normality
Under certain conditions,

(µ̂_n(x) − µ(x)) / σ_n(x) →_d N(0, 1) for a sequence σ_n(x) → 0,

V̂_IJ(x) / σ_n²(x) →_p 1,

where s_n ≍ n^β for some β_min < β < 1 and V̂_IJ(x) is the infinitesimal jackknife variance estimator.


Preliminaries - Trees

Tree

- Partition the feature space in two at each node.

- Exhaustively search for the best feature and best cut point, where "best" means minimizing the MSE (regression) or the node impurity (classification).

- Recurse, with a stopping rule (e.g., node size, deviance reduction), to obtain T(x; ξ, Z_1, ..., Z_n).

Random forest (Breiman, 2001)

1. Bootstrap or randomly subsample the data.

2. Build a tree by finding the best split among a random subset of features at each node.

3. Repeat Steps 1 and 2 to obtain B trees.

4. Aggregate the B trees: RF(x; Z_1, ..., Z_n) = (1/B) ∑_{b=1}^{B} T(x; ξ*_b, Z*_{b1}, ..., Z*_{bs}).

Biau and Scornet (2016) give a good review of the theoretical and methodological development of random forests. A minimal coding sketch of the subsample-and-aggregate construction follows.
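The sketch below is illustrative only: it uses NumPy and scikit-learn's DecisionTreeRegressor as the base tree (an assumption of this writeup, not something prescribed by the slides), and the defaults for B, the subsample rate, max_features, and min_samples_leaf are arbitrary example choices.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_subsampled_forest(X, y, B=500, s=None, max_features="sqrt",
                          min_samples_leaf=5, seed=0):
    """Fit B trees, each on a random subsample of size s drawn without replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    s = s or int(n ** 0.7)                    # subsample size s_n ~ n^beta (beta = 0.7, illustrative)
    forest = []
    for _ in range(B):
        idx = rng.choice(n, size=s, replace=False)     # random subsample Z*_{b1}, ..., Z*_{bs}
        tree = DecisionTreeRegressor(max_features=max_features,       # random feature subset per split
                                     min_samples_leaf=min_samples_leaf,
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        forest.append((tree, idx))
    return forest

def predict_forest(forest, x):
    """Aggregate: RF(x) = (1/B) * sum_b T_b(x)."""
    x = np.atleast_2d(x)
    return np.mean([tree.predict(x) for tree, _ in forest], axis=0)

Here max_features plays the role of the random feature subset in Step 2, and the (tree, idx) pairs are kept so that later sketches can recover subsample membership.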


Preliminaries - Potential Nearest Neighbors

- Conventional k-NN

  [Figure 1: Illustration of a Euclidean 6-NN]

- k-PNN (potential nearest neighbor; Lin and Jeon, 2006)

  [Figure 2: Illustration of a 1-PNN]

- Trees are k-PNN predictors. Thus they can be written as T(x) = ∑_{i=1}^{n} S_i Y_i.


Preliminaries - Potential Nearest Neighbors

RF - an ensemble of potential nearest neighbor predictors

- Random forests consist of trees trained on random subsamples.

...

- ⇒ R̂F_n(x) = ∑_{i=1}^{n} w_i(x) Y_i, i.e., a weighted average of the responses (see the weight-extraction sketch below).
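To make the weighted-average form concrete, the hedged sketch below recovers the weights w_i(x) from a forest produced by the earlier fit_subsampled_forest sketch (both helpers are hypothetical illustrations; leaf membership is read off with scikit-learn's apply method).

import numpy as np

def forest_weights(forest, X_train, x, n):
    """Recover w_i(x) such that sum_i w_i(x) * y_i reproduces the forest prediction at x."""
    x = np.atleast_2d(x)
    w = np.zeros(n)
    for tree, idx in forest:
        leaf_x = tree.apply(x)[0]                 # leaf of this tree containing the query point
        leaf_train = tree.apply(X_train[idx])     # leaves of the tree's own subsample
        members = idx[leaf_train == leaf_x]       # subsample points sharing x's leaf
        if members.size:
            w[members] += 1.0 / members.size      # each tree acts as a weighted PNN average
    return w / len(forest)

# Sanity check (illustrative): np.dot(forest_weights(forest, X, x0, len(y)), y)
# should match predict_forest(forest, x0) up to floating-point error.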


Preliminaries - Honesty

Conventional trees

- Construct the partition using the whole data set.

- Estimate the leaf means using the same data again.

- May tend to be biased.

Honest trees (Athey and Imbens, 2016)

- Split the data into two halves.

- Construct the partition (i.e., determine the splits) using one half.

- Estimate the leaf means using the other half (see the sketch below).

Honest trees - Properties

- Guarantee that S_i and Y_i are independent ⇒ makes var(T) quantifiable (needed for consistency and normality).

- May lose some efficiency (a single tree does not use all of the information for estimation).

- But better in terms of MSE (because of the reduced bias).
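A hedged sketch of honest estimation, again leaning on scikit-learn's DecisionTreeRegressor as the partitioning engine (an illustrative choice, not the slides' prescription): one half of the sample determines the splits, the other half supplies the leaf means.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_honest_tree(X, y, min_samples_leaf=5, seed=0):
    """Honest tree: the I-sample determines the partition, the J-sample estimates the leaf means."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    I_idx, J_idx = perm[: n // 2], perm[n // 2:]            # half-and-half split
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=seed)
    tree.fit(X[I_idx], y[I_idx])                            # splits chosen from the I-sample only
    leaf_J = tree.apply(X[J_idx])                           # drop the J-sample down the fitted partition
    leaf_means = {leaf: y[J_idx[leaf_J == leaf]].mean()     # honest leaf estimates
                  for leaf in np.unique(leaf_J)}
    return tree, leaf_means

def predict_honest(tree, leaf_means, x):
    leaf = tree.apply(np.atleast_2d(x))[0]
    # fall back to the overall J-sample mean if no J-sample point reached this leaf
    return leaf_means.get(leaf, float(np.mean(list(leaf_means.values()))))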


Preliminaries - Random split

Conventional trees

- Not every feature is guaranteed a positive probability of being split on.

Random-split trees

- Random-split trees give every feature a minimum positive probability of being split on: "At every step of the tree-growing procedure, the probability that the next split occurs along the j-th feature is bounded below by π/d for some 0 < π ≤ 1, for all j = 1, ..., d."

Random-split trees - properties

- The partition becomes arbitrarily small along every feature direction as n grows large (⇒ key for consistency). A hedged sketch of one way to enforce the π/d bound follows.
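One simple device that satisfies the π/d bound (an illustrative construction, not necessarily the mechanism used in the cited papers): at each node, with probability π choose the splitting feature uniformly at random, and otherwise choose it greedily; every feature then has probability at least π/d of being the next split.

import numpy as np

def choose_split_feature(greedy_best_feature, d, pi=0.3, rng=None):
    """Pick the feature to split on at the current node.

    With probability pi the feature is drawn uniformly from {0, ..., d-1};
    otherwise the greedy (criterion-minimizing) choice is used, so every
    feature has probability >= pi / d of being selected."""
    rng = rng or np.random.default_rng()
    if rng.random() < pi:
        return int(rng.integers(d))      # random-split step
    return greedy_best_feature           # greedy step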


Preliminaries - Other concepts

α-regular tree

- A tree is called α-regular if

- after each split, both daughter nodes contain at least a fraction α of the training sample, and

- the tree is fully grown to leaves of size k for some k ∈ ℕ (i.e., each terminal node contains between k and 2k − 1 observations).

Symmetric tree

- A predictor is symmetric if

- its output does not depend on the ordering of the sample indices (i.e., µ̂(x; X_1, X_2, ..., X_n) = µ̂(x; X_{h_1}, X_{h_2}, ..., X_{h_n}) for any permutation h = (h_1, ..., h_n) of (1, ..., n)).


Consistency

Theorem 1. Infinitesimal leaves
Suppose that a tree T is regular and random-split, and that X_1, ..., X_s ∼_ind U([0, 1]^d). Then for any 0 < η < 1 and for s large enough,

P[ diam_j(L(x)) ≥ (s/(2k − 1))^{−0.99 π (1−η) log((1−α)^{−1}) / (d log(α^{−1}))} ] ≤ (s/(2k − 1))^{−η² π / (2d log(α^{−1}))},

where L(x) is the leaf containing x.

Theorem 2. Consistency
Assume that the conditions of Theorem 1 hold, that µ(x) is Lipschitz continuous, and that the random forest is made up of honest trees with subsample size s. Then for α ≤ 0.2,

|E[µ̂(x)] − µ(x)| = O( s^{−π log((1−α)^{−1}) / (2d log(α^{−1}))} ).
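For a concrete sense of the rate (numbers chosen purely for illustration): with π = 1, d = 5, and α = 0.2, the exponent is log(1.25) / (2 · 5 · log 5) ≈ 0.014, so the bias bound is O(s^{−0.014}). The forest is consistent, but the bound shrinks very slowly in the subsample size and degrades further as the dimension d grows.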


Asymptotic normality

Classical CLT?

- Can we use the classical CLT?

- Note that R̂F_n(x) = (1/B) ∑_{b=1}^{B} T_{θ̂_b}(x), x ∈ X, where T_{θ̂_b}(x) = ∑_{i=1}^{s} S_i Y_i, so that R̂F_n(x) = ∑_{i=1}^{n} w_i(x) Y_i.

- If the Y_i's were i.i.d. and independent of the S_i's, we could apply a weighted CLT. But the terms are neither identically distributed nor independent.

- Honesty takes care of the independence condition ⇒ still not enough.


Asymptotic normality - Hoeffding decomposition

Hajek projection

- Decompose a tree:

  T = E[T] + ∑_{i=1}^{n} (E[T | Z_i] − E[T]) + ∑_{i<j} (E[T | Z_i, Z_j] − E[T | Z_i] − E[T | Z_j] + E[T]) + ···,

  where T̊ := E[T] + ∑_{i=1}^{n} (E[T | Z_i] − E[T]) is the "Hájek projection".

  Note that T̊ − E[T] is a sum of i.i.d. terms with finite variance, so the CLT applies. Thus, if we can show that T̊ and T are very close, then T(x) →_d N.

- But how close is T̊ to T?
  Hájek's theorem: if lim_{n→∞} Var[T̊] / Var[T] = 1, then lim_{n→∞} E[‖T̊ − T‖²₂] / Var[T] = 0.

- So it suffices to check whether T̊ captures almost all of the variability of T (a small sanity check follows).
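A small sanity check of the decomposition (not from the slides): for the sample mean T = (1/n) ∑_i Z_i with E[Z_i] = µ, one has E[T | Z_i] = Z_i/n + (n − 1)µ/n, so T̊ = E[T] + ∑_i (E[T | Z_i] − E[T]) = µ + ∑_i (Z_i − µ)/n = T. A linear statistic is its own Hájek projection, so Var[T̊]/Var[T] = 1 and normality is immediate; trees are highly nonlinear, which is why they are only ν(s)-incremental (next slides).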


Asymptotic normality - incrementality

Projected trees are not so similar

- For an individual tree, the Hájek projection T̊ is not necessarily close to T itself.

- We only know that trees are ν(s)-incremental, i.e.,

  Var[T̊(x; Z_1, ..., Z_s)] / Var[T(x; Z_1, ..., Z_s)] ≳ ν(s),

  where ν(s) = C_{f,d} / log(s)^d (Theorem 3) and ≳ denotes asymptotic inequality in the lim inf sense.

But projected random forests are similar

- Wager and Athey showed that the projected RF is close to the RF (as an ensemble learner it is 1-incremental):

  E[(µ̂(x) − ˚µ̂(x))²] ≤ (s/n)² Var[T(x; ξ, Z_1, ..., Z_s)].


Asymptotic normality - theoretic results

Theorem 3. T is ν-incremental
Suppose that X_1, X_2, ... ∼_iid D([0, 1]^d) with some density f, that µ(x) and µ_2(x) ≡ E[Y² | X = x] are Lipschitz continuous at x, that Var[Y | X = x] > 0, and that T is an honest, k-regular, symmetric tree. Then

T is ν(s)-incremental at x,

where ν(s) = C_{f,d} / log(s)^d and C_{f,d} = 2^{−2(d+1)} (d − 1)!.

Theorem 4. µ̂(x) is 1-incremental
Let µ̂(x) be a random forest with base learner T satisfying the conditions of Theorem 3. Then

E[(µ̂(x) − ˚µ̂(x))²] ≤ (s/n)² Var[T(x; ξ, Z_1, ..., Z_s)]

whenever Var[T(x; ξ, Z_1, ..., Z_s)] is finite.


Asymptotic normality - theoretic results, continued

Theorem 5. µ̂(x) is asymptotically normal
Assume the conditions of Theorem 4. Moreover, assume that lim_{n→∞} s_n = ∞, that lim_{n→∞} s_n log(n)^d / n = 0, and that E[|Y − E[Y | X = x]|^{2+δ} | X = x] ≤ M for some constants δ, M > 0 uniformly over all x ∈ [0, 1]^d. Then there exists a sequence σ_n(x) → 0 such that

(µ̂_n(x) − E[µ̂_n(x)]) / σ_n(x) →_d N(0, 1).


Asymptotic normality - theoretic results, continued

Infinitesimal jackknife variance estimator
Wager, Hastie, and Tibshirani (2014) proposed the "infinitesimal jackknife" variance estimator for random forests:

V̂_IJ(x; Z_1, ..., Z_n) = ∑_{i=1}^{n} Cov_*[N*_i, T(x; Z*_1, ..., Z*_n)]² ≈ ∑_{i=1}^{n} C_i² − (s(n − s)/n) · (v̂/B),

where (Z*_1, ..., Z*_n) is a subsample of (Z_1, ..., Z_n), N*_i is the number of times Z_i appears in the subsample, C_i = (1/B) ∑_{b=1}^{B} (N*_{bi} − s/n)(T*_b − T̄*), and v̂ = (1/B) ∑_{b=1}^{B} (T*_b − T̄*)².

Theorem 6. Consistency of the IJ variance estimator
Let V̂_IJ(x; Z_1, ..., Z_n) be the infinitesimal jackknife for random forests defined above. Then under the conditions of Theorem 5,

V̂_IJ(x; Z_1, ..., Z_n) / σ_n²(x) →_p 1.
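The following hedged sketch computes the Monte Carlo version of V̂_IJ shown above from the subsample counts and uses it for an approximate pointwise confidence interval; fit_subsampled_forest and predict_forest refer to the earlier illustrative helpers, and the 95% level is just an example.

import numpy as np

def ij_variance(forest, x, n):
    """Bias-corrected infinitesimal jackknife estimate of Var[RF(x)]."""
    B = len(forest)
    x = np.atleast_2d(x)
    preds = np.array([tree.predict(x)[0] for tree, _ in forest])    # T*_b
    counts = np.zeros((B, n))
    for b, (_, idx) in enumerate(forest):
        counts[b, idx] = 1.0                    # N*_{bi} for subsampling without replacement
    s = counts[0].sum()
    centered = preds - preds.mean()
    C = ((counts - s / n) * centered[:, None]).mean(axis=0)         # C_i
    v_hat = np.mean(centered ** 2)                                   # v-hat
    return np.sum(C ** 2) - s * (n - s) / n * v_hat / B              # V_IJ with Monte Carlo correction

# Illustrative 95% pointwise interval for mu(x):
#   mu_hat = predict_forest(forest, x0)
#   half   = 1.96 * np.sqrt(ij_variance(forest, x0, len(y)))
#   ci     = (mu_hat - half, mu_hat + half)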


Application to precision medicine

Estimating heterogeneous treatment effects

- The parameter of interest shifts to

  τ(x) := E[Y^{A=1} − Y^{A=0} | X = x].

- Combine the random forest theory with causal inference theory.
  Key assumptions: conditional exchangeability, positivity.
  Other assumptions: causal consistency, SUTVA.


Normality of Causal Random Forests

- The b-th causal tree (Wager and Athey, 2017) is

  τ̂_b(x) = ( ∑_{i: A_i = 1, X_i ∈ L_b(x)} Y_i ) / |{i : A_i = 1, X_i ∈ L_b(x)}| − ( ∑_{i: A_i = 0, X_i ∈ L_b(x)} Y_i ) / |{i : A_i = 0, X_i ∈ L_b(x)}|,

  i.e., the difference between the treated and control mean outcomes within the leaf L_b(x) containing x.

- The causal random forest (Wager and Athey, 2017) is then

  τ̂(x) = (1/B) ∑_{b=1}^{B} τ̂_b(x),   (τ̂(x) − τ(x)) / √(Var[τ̂(x)]) →_d N(0, 1).

  A hedged sketch of the leaf-wise difference appears below.
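To make the leaf-wise construction concrete, here is a hedged sketch of a single causal tree estimate τ̂_b(x). The partition is grown with scikit-learn on one half of the data as an honest, illustrative stand-in; it does not implement the treatment-effect splitting criterion of Athey and Imbens (2016).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def causal_tree_estimate(X, y, a, x, min_samples_leaf=20, seed=0):
    """tau_hat_b(x): treated minus control mean outcome within the leaf L_b(x)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)
    I_idx, J_idx = perm[: n // 2], perm[n // 2:]           # honest split of the sample
    tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, random_state=seed)
    tree.fit(X[I_idx], y[I_idx])                           # placeholder splitting rule
    leaf_x = tree.apply(np.atleast_2d(x))[0]
    in_leaf = tree.apply(X[J_idx]) == leaf_x               # J-sample points falling in L_b(x)
    treated = in_leaf & (a[J_idx] == 1)
    control = in_leaf & (a[J_idx] == 0)
    if treated.sum() == 0 or control.sum() == 0:
        return np.nan                                      # leaf lacks one arm; a forest averages many trees
    return y[J_idx][treated].mean() - y[J_idx][control].mean()

# The causal random forest estimate tau_hat(x) averages such tau_hat_b(x) over B subsampled trees.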


References - 1/2

Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27).

Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 13, 1063-1095.

Biau, G., Devroye, L., & Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9, 2015-2033.

Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

Lin, Y., & Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474), 578-590.


References - 2/2

Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7, 983-999.

Mentch, L., & Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. Journal of Machine Learning Research, 17(26), 1-41.

Sexton, J., & Laake, P. (2009). Standard errors for bagged and random forest estimators. Computational Statistics & Data Analysis, 53(3), 801-811.

Scornet, E., Biau, G., & Vert, J.-P. (2015). Consistency of random forests. The Annals of Statistics, 43(4), 1716-1741.

Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association.
