
Page 1:

Discovering Regression Structure with a Bayesian Ensemble

Ed George, University of Pennsylvania (joint work with H. Chipman and R. McCulloch)

The Loeb Research Lecture in Mathematics Washington University in St Louis

November 3, 2011

1

Page 2:

Hugh Chipman and Rob McCulloch

2

Page 3:

We've been working on this for a while!

CHIPMAN, GEORGE & MCCULLOCH (1998). Bayesian CART model search. JASA.
DENISON, MALLICK & SMITH (1998). A Bayesian CART algorithm. Biometrika.
CHIPMAN, GEORGE & MCCULLOCH (2002). Bayesian treed models. Machine Learning.
CHIPMAN, GEORGE & MCCULLOCH (2007). Bayesian ensemble learning. In Neural Information Processing Systems.
CHIPMAN, GEORGE & MCCULLOCH (2010). BART: Bayesian Additive Regression Trees. Annals of Applied Statistics.

3

Page 4:

Example: The Boston Housing Data Regression Problem

Each observation corresponds to a geographic district.
y = log median house price in the district
13 x variables describing the district, e.g., crime rate, % poor, riverfront, size, air quality, etc.
n = 506 observations
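For orientation, here is a hedged sketch (not code from the talk) of fitting BART to these data with the BayesTree package mentioned later in the slides; the MASS package's Boston data frame is assumed to be the same dataset.

```r
library(MASS)       # Boston housing data
library(BayesTree)  # bart()

y <- log(Boston$medv)                          # log median house price
x <- Boston[, setdiff(names(Boston), "medv")]  # the 13 district-level predictors

set.seed(99)
fit <- bart(x.train = x, y.train = y)   # BART with the default settings

head(fit$yhat.train.mean)    # posterior mean of f(x) at the training points
plot(fit$sigma, type = "l")  # draws of sigma
```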

4

Page 5:

Each boxplot depicts 20 out-of-sample RMSEs for one version of a method, e.g., a neural net with a given number of nodes and decay value.

Smaller is better. BART wins!

5

Page 6:

Performance Comparisons across 42 Datasets

We compared the out-of-sample performance of 6 methods:

•  Neural networks (single layer)
•  Random forests
•  Boosting (Friedman's gradient boosting machine)
•  Linear regression with the lasso
•  BART-cv
•  BART-default (*NO* tuning of parameters)

Data from Kim, Loh, Shih and Chaudhuri (2006) (thanks, Wei-Yin!), with up to 65 predictors and 2953 observations.

For each dataset:
•  Created 20 random splits into 5/6 train and 1/6 test
•  Used 5-fold cross-validation on the training set to pick tuning constants
•  This gives 20 × 42 = 840 out-of-sample RMSE values for each method
•  For comparisons, we divided each RMSE by the best RMSE for each split (sketched below)
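The ratio-to-best computation can be sketched as follows (illustrative only; rmse is an assumed matrix with one row per train/test split and one column per method):

```r
relative_rmse <- function(rmse) {
  best <- apply(rmse, 1, min)   # best RMSE on each split
  sweep(rmse, 1, best, "/")     # ratio to best: 1.2 means 20% worse than the best
}

## e.g., boxplot(relative_rmse(rmse), log = "y")
```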

6

Page 7:

Relative RMSE: Performance Ratios to Best

A ratio of 1.2 means 20% worse than the best performer.
BART-cv is best (but not by too much).
BART-default does amazingly well.

7

Page 8:

A General Model-Free Regression Setup
(Trying to assume as little as possible)

•  Data: n observations of y and x = (x1,...,xp)
•  Suppose: Y = f(x) + ε, ε symmetric with mean 0
•  Unknowns: f and the distribution of ε

Goal: to make inference about f(x) = E(Y | x) and/or a future y.

8

Page 9:

A General Bayesian Ensemble Idea

For Y = f(x) + ε, approximate f(x) by something of the form

f(x) = g(x; θ1) + g(x; θ2) + ... + g(x; θm),   θ1, θ2, …, θm iid ~ π

This can work well when the prior π is calibrated so that, roughly, every one of the g(x; θj) components is small, but not too small!

Then use the posterior distribution of f given Y = y to draw inferences about f.

9

Page 10:

BART is a Bayesian Ensemble Approach

BART estimates f(x) = E(Y | x) with a (very flexible) sum of trees. BART provides:

•  posterior intervals for f(x)
•  estimates of the marginal effect of each xj
•  model-free importance rankings of x1,...,xp
•  model-free interaction detection

Note that it is wiser to build a model from selected variables than to select variables based on an assumed model! BART can be regarded as a pre-model-building tool.

10

Page 11:

A Single Regression Tree Model

[Figure: a small binary tree with decision rules x2 < d vs x2 ≥ d and x5 < c vs x5 ≥ c, and bottom-node values µ1 = -2, µ2 = 5, µ3 = 7.]

Let g(x;θ), θ = (T, M) be a regression tree function that assigns a µ value to x

Let T denote the tree structure, including the decision rules. Let M = {µ1, µ2, …, µb} denote the set of bottom-node µ's.

A Single Tree Model: Y = g(x;θ) + ε

11

Page 12:

The Coordinate View of g(x;θ)

[The same example tree as on the previous slide.]

Easy to see that g(x;θ) is just a step function

[Figure: the corresponding partition of the (x2, x5) space by the cutpoints d and c into three rectangular regions with levels µ1 = -2, µ2 = 5, and µ3 = 7.]
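To make the step-function view concrete, here is a small illustrative sketch (not from the talk): a hypothetical tree using the rules and µ values above, evaluated as a function of x. The particular nesting of the two splits is assumed for illustration and need not match the slide's figure exactly.

```r
## Illustrative only: one possible tree with rules on x5 and x2 and
## bottom-node values mu1 = -2, mu2 = 5, mu3 = 7 (topology assumed).
g <- function(x, c = 0.5, d = 0.5) {
  if (x["x5"] >= c) return(7)                  # mu3
  if (x["x2"] < d) return(-2) else return(5)   # mu1 or mu2
}

## g() is a step function: its value depends only on which rectangle
## of the (x2, x5) space the point x falls into.
g(c(x2 = 0.2, x5 = 0.3))   # -2
g(c(x2 = 0.8, x5 = 0.3))   #  5
g(c(x2 = 0.9, x5 = 0.9))   #  7
```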

12

Page 13:

The BART Ensemble Model

Let θ = ((T1,M1), (T2,M2), …, (Tm,Mm)) identify a set of m trees and their µ's.

Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z,   z ~ N(0,1)

E(Y | x, θ) is the sum of the corresponding bottom-node µ's, one from each tree. Such a model combines additive and interaction effects.

[Figure: a sum of small trees, with bottom-node values labeled µ1, ..., µ4.]

Remark: We assume z ~ N(0,1) here for simplicity, but will later see a successful extension to a general Dirichlet process (DP) error model.

13

Page 14:

Completing the Model with a Regularization Prior

g(x;T1,M1), g(x;T2,M2), ... , g(x;Tm,Mm) is a highly redundant “over-complete basis” with many many parameters

Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z, z ~ N(0,1)

π((T1,M1), ..., (Tm,Mm), σ)

We refer to π as a regularization prior because it is deliberately chosen to keep each g(x;Ti,Mi) effect small.

14

Page 15:

BART Implementation

BART uses a fully Bayesian specification. Information about all the parameters, namely θ = ((T1,M1), ..., (Tm,Mm), σ), is captured by the posterior

π(θ | y) ∝ p(y | θ) π(θ)

The two main steps to implement BART:

1.  Construct the prior π(θ)
•  Independent tree-generating process on T1,...,Tm
•  Use the observed y to properly scale π(M | T)
•  An extremely effective default choice is available

2.  Calculate the posterior π(θ | y)
•  Bayesian backfitting MCMC
•  Interweaving marginalization and regeneration of θ

The R package BayesTree is available on CRAN.

15

Page 16:

Connections to Other Modeling Ideas

Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z   plus   π((T1,M1), ..., (Tm,Mm), σ)

Bayesian nonparametrics: lots of parameters (to make the model flexible) with a strong prior to shrink towards simple structure (regularization). BART shrinks towards additive models with some interaction.

Dynamic random basis: g(x;T1,M1), ..., g(x;Tm,Mm) are dimensionally adaptive.

Gradient boosting: the overall fit becomes the cumulative effort of many "weak learners".

16

Page 17:

Some Distinguishing Features of BART

Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z   plus   π((T1,M1), ..., (Tm,Mm), σ)

BART is NOT Bayesian model averaging of a single tree model. Unlike boosting and random forests, the BART algorithm updates a fixed set of m trees, over and over.

Choose m large for best estimation of E[Y | x] and prediction: more trees yield more approximation flexibility.
Choose m small for variable selection: fewer trees force the x's to compete for entry.

17

Page 18:

Example: Friedman's Simulated Data

Y = f(x) + σ z, z ~ N(0,1), where f(x) = 10 sin(πx1x2) + 20(x3 - .5)² + 10x4 + 5x5 + 0x6 + … + 0x10

10 x's, but only the first 5 matter!

Friedman (1991) used n = 100 observations from this model with σ = 1 to illustrate the potential of MARS.
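For concreteness, a hedged sketch of simulating from this model; the Uniform(0,1) design for the x's follows Friedman's standard setup, which the slide does not spell out.

```r
## Friedman's test function; x is an n-by-10 matrix of predictors.
friedman_f <- function(x) {
  10 * sin(pi * x[, 1] * x[, 2]) + 20 * (x[, 3] - 0.5)^2 +
    10 * x[, 4] + 5 * x[, 5]               # x6, ..., x10 have no effect
}

set.seed(1)
n <- 100; p <- 10; sigma <- 1
x <- matrix(runif(n * p), n, p)            # conventional Uniform(0,1) design
y <- friedman_f(x) + sigma * rnorm(n)
```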

18

Page 19:

Comparison of BART with Other Methods

Performance is measured on 1000 out-of-sample x's by

RMSE = √[ (1/1000) ∑i=1..1000 ( f̂(xi) − f(xi) )² ]
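A hedged sketch of this computation with a BayesTree fit, assuming x.test holds the 1000 out-of-sample design points and f.test their true f values (both hypothetical names):

```r
fit  <- bart(x.train = x, y.train = y, x.test = x.test)
rmse <- sqrt(mean((fit$yhat.test.mean - f.test)^2))   # compare fitted f to the true f
```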

19

Page 20:

Cross Validation Domain for Comparisons

20

Page 21:

Applying BART to the Friedman Example

We applied BART with m = 100 trees to n = 100 observations of the Friedman example.

[Figure: 95% posterior intervals for f̂(x) vs the true f(x), in-sample and out-of-sample, and σ draws by MCMC iteration. Red: the m = 1 model; blue: the m = 100 model.]

21

Page 22:

Detecting Low Dimensional Structure in High Dimensional Data

We added many useless x's to Friedman's example.

[Figure rows: 20 x's, 100 x's, 1000 x's. Columns: in-sample posterior intervals vs f(x), out-of-sample posterior intervals vs f(x), σ draws.]

With only 100 observations on y and 1000 x's, BART yielded "reasonable" results !!!!

22

Page 23:

High Dimensional Out-of-Sample RMSE Performance Comparisons

[Figure panels: p = 10, p = 100, p = 1000. n = 100 throughout!]

23

Page 24:

Increasing the number of Trees in BART

Rapid decrease in RMSE followed by slow increase (Friedman data again)

24

Page 25:

The Effect of a Low Signal-to-Noise Ratio

Look what happens when σ is increased with the Friedman model.

25

Page 26:

BART is Robust to Prior Settings

BART's robust RMSE performance is shown below (again on the Friedman data), where the prior hyperparameter choices (ν, q, k, m) are varied.

26

Page 27:

Measuring Variable Importance in BART

[The same example tree as before, with one decision rule on x2 and one on x5.]

When Y = g(x;T1,M1) + ... + g(x;Tm,Mm) + σ z is fit to data, we can count how many times a predictor is used in the trees.

For example, in the tree here, x2 and x5 are each used once. The importance of each xk can thus be measured by its overall usage frequency. This approach is most effective when the number of trees m is small.
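A hedged sketch of these counts with BayesTree, whose bart() fit returns a varcount matrix of per-draw splitting-rule counts for each predictor (x and y as before):

```r
## Use a small ensemble so the predictors must compete for entry.
fit.small <- bart(x.train = x, y.train = y, ntree = 20)

## varcount: one row per posterior draw, one column per predictor.
## Convert counts to per-draw usage proportions and average over the draws.
prop <- fit.small$varcount / rowSums(fit.small$varcount)
sort(colMeans(prop), decreasing = TRUE)
```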

27

Page 28:

Creating a Competitive Bottleneck

28

[Diagram: the predictors x1, ..., xp feeding into a BART box that outputs Y.]

BART is an automatic attractor of x's to explain Y. Those x's which most explain Y are attracted most. A small number of trees m creates a bottleneck which excludes useless x's.

Page 29:

Variable Selection via BART

Variable usage frequencies as the number of trees m is reduced

29

Page 30:

Variable Selection via BART

Using m = 10 trees, it remains effective with n = 100 and p = 90

30

Page 31:

BART Offers Estimates of Individual Predictor Effects

Partial Dependence Plot of Crime Effect in Boston Housing Data

These are estimates of f3(x3) = (1/n) ∑i f(x3, xic), where xc = x \ x3.

31

Page 32:

Partial Dependence Plots for the Friedman Example

The Marginal Effects of x1 – x10

32

Page 33:

Example: HIV Data Analysis

Y = LDHL (log of HDL level)
x's = CD4, Age, Sex, Race, Study, PI1, PI2, NNRTI2, NRTI1, NRTI2, ABI_349, CRC_71, CRC_72, CRC_55, CRC_73, CRC_10, ABI_383, ABI_387, ABI_391, ABI_395, ABI_400, ABI_401, CRC_66, CRC_67, CRC_68, CRC_69
n = 458 patients

For these data, least squares yields R² = 26% while BART yields R² = 42%.

33

Page 34:

The BART Fit for the HIV Data

BART suggests there is not a strong signal in x for this y.

34

Page 35:

Variable Selection via BART with m = 5, 10, 20, 50, 200 trees

35

Page 36:

Partial Dependence Plots Suggest Genotype Effects

For example, the average predictive effect of ABI_383:

36

Page 37:

Predictive Inference about Interaction of NNRTI2 Treatment and ABI_383 Genotype

There appears to be no interaction effect.

37

Page 38:

The Football Data

Each observation (n = 245) corresponds to a football game.

y = Team A points − Team B points
29 x's, each the difference between the two teams on some measure, e.g., x10 is average points against (defense) per game for Team A minus that for Team B.

38

Page 39:

Variable Selection for the Football Data

For each draw, for each variable, calculate the percentage of time that variable is used in a tree. Then average over trees.

Subtle point: you can't have too many trees; variables come in without really doing anything.

39

Page 40:

Marginal Effects of the Variables

We just used variables 2, 7, 10, and 14.

Here are the four univariate partial-dependence plots.

40

Page 41:

Example: Drug Discovery

Goal: to predict the "activity" of a compound against a biological target.
Data: n = 2604 compounds, of which 604 are active.
Y = 0 or 1 according to whether the compound is active.
x's: 266 variables that characterize the compound's molecular structure.

BART is easily extended to estimate P[Y = 1 | x] by Albert-Chib augmentation with a probit link. We applied BART to rank compounds by their P[Y = 1 | x] estimates.
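A hedged sketch with BayesTree: when y.train is a 0/1 vector, bart() fits the probit model via the Albert-Chib augmentation automatically, and the yhat draws are on the latent probit scale. x (the 266 descriptors) and y (0/1 activity) are assumed to be defined.

```r
cls    <- bart(x.train = x, y.train = y, ntree = 200)   # binary y => probit BART
p.hat  <- colMeans(pnorm(cls$yhat.train))    # posterior mean of P(Y = 1 | x)
ranked <- order(p.hat, decreasing = TRUE)    # rank compounds by estimated activity
head(ranked, 20)                             # the 20 highest-ranked compounds
```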

41

Page 42:

BART Posterior Intervals for 20 Compounds with Highest Predicted Activity

In-sample Out-of-Sample

42

Page 43:

Variable Importance in Drug Discovery with m = 5, 10, 20 trees

All 266 x’s Top 25 x’s

43

Page 44:

A Sketch of the BART MCMC Algorithm

Y = g(x;T1,M1) + g(x;T2,M2) + ... + g(x;Tm,Mm) + σ z

The "parameter" is θ = ((T1,M1), ..., (Tm,Mm), σ).

"Simple" Gibbs sampler:

(1)  Iteratively sample each (Tj,Mj) given Y, σ, and the other (Tj,Mj)'s (Bayesian backfitting):
subtract all but g(x;Tj,Mj) from Y to update (Tj,Mj).

(2)  Sample σ given Y and all the (Tj,Mj)'s:
subtract all the g(x;Tj,Mj)'s from Y to update σ.
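As a reading aid, a schematic sketch of that backfitting loop (not the BayesTree internals); draw_tree() and draw_sigma() are hypothetical stand-ins for the tree move and the σ update described on the next slides.

```r
bart_mcmc_sketch <- function(y, x, m, n_iter, draw_tree, draw_sigma) {
  fits  <- matrix(0, nrow = length(y), ncol = m)   # g(x; Tj, Mj) evaluated at the data
  trees <- vector("list", m)
  sigma <- sd(y)
  for (it in seq_len(n_iter)) {
    for (j in seq_len(m)) {
      R <- y - rowSums(fits[, -j, drop = FALSE])        # partial residuals
      trees[[j]] <- draw_tree(R, x, sigma, trees[[j]])  # update (Tj, Mj) given R, sigma
      fits[, j]  <- trees[[j]]$fitted                   # new g(x; Tj, Mj)
    }
    sigma <- draw_sigma(y - rowSums(fits))              # update sigma given full residuals
  }
  list(trees = trees, sigma = sigma)
}
```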

44

Page 45:

45

Simulating each pair from p(T, M | data)

The simulation of each pair (T, M) in step (1) is done via

p(T, M | data) = p(T | data) p(M | T, data)

That is, we first margin out M and draw T, then draw M given T.

Sampling M from p(M | T, data) is routine: we just simulate µ's from the posterior under our conjugate prior. Sampling from p(T | data) is much more involved: we essentially apply our earlier Bayesian CART algorithm to the residuals obtained as Y minus the sum of the other g(x;Tj,Mj)'s.

Page 46:

Simulating p(T | data) with the Bayesian CART Algorithm

Because p(T | data) is available in closed form (up to a norming constant), we use a Metropolis-Hastings algorithm. Our proposal moves around tree space by proposing local modifications such as:

•  propose a more complex tree
•  propose a simpler tree

Such modifications are accepted according to their compatibility with p(T | data).

46

Page 47:

The Overall BART MCMC Algorithm

Each iteration of the BART algorithm provides a complete update of the BART parameters (T1,M1), ..., (Tm,Mm) and σ.

This is a Markov chain converging to the posterior, with surprisingly quick burn-in and convergence.

The Redundant, Dynamic Random Basis in Action

As the chain runs, each tree contributes a small part to the fit, and the fit is swapped around from tree to tree.

We often observe that an individual tree grows quite large and then collapses back to a single node. This illustrates how each tree is dimensionally adaptive.

47

Page 48:

Using the MCMC Output to Draw Inference

At iteration i we draw from the posterior of f:

f̂i(·) = g(·;Ti1,Mi1) + g(·;Ti2,Mi2) + ... + g(·;Tim,Mim)

To estimate f(x) we simply average the draws f̂i(x) at x.

Posterior uncertainty is captured by the variation of the f̂i(x); e.g., a 95% HPD region is estimated by the middle 95% of the values.

Similarly, it is straightforward to estimate functionals of f, such as the partial dependence functions

fk(xk) = (1/n) ∑i f(xk, xic), where xc = x \ xk.
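In code, these summaries are simple reductions over the draw matrices returned by BayesTree::bart(); the sketch below is indicative only, and the pdbart() call for partial dependence uses argument names as I recall the package's interface.

```r
## fit is a bart() object fit with x.test supplied; yhat.test has one row
## per posterior draw f_i and one column per test point x.
f.hat <- colMeans(fit$yhat.test)                             # average the draws at each x
f.int <- apply(fit$yhat.test, 2, quantile, c(0.025, 0.975))  # middle 95% of the draws

## Partial dependence of f on one predictor (pdbart is in BayesTree).
pd <- pdbart(x.train = x, y.train = y, xind = 3)   # e.g., the third predictor
plot(pd)
```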

48

Page 49:

A Sketch of the Prior

First, introduce prior independence as follows:

π((T1,M1), ..., (Tm,Mm), σ) = [ Π π(Tj,Mj) ] π(σ) = [ Π π(µij | Tj) π(Tj) ] π(σ)

Thus we only need to choose π(T), π(µ | T) = π(µ), and π(σ).

49

Page 50:

π(T)

We specify a process that grows trees:

Step 1) Grow a tree structure with successive biased coin flips
Step 2) Randomly assign variables to decision nodes
Step 3) Randomly assign splitting rules to decision nodes

[Histogram: the induced marginal prior on the number of bottom nodes. Hyperparameters are chosen to put prior weight on small trees!!]
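As an aside, the "biased coin flips" of Step 1 can be sketched as follows; the depth-dependent split probability base·(1 + d)^(-power), with base = 0.95 and power = 2, is the standard BART choice (and the BayesTree default), assumed here because the slide does not give the formula.

```r
## Simulate the number of bottom nodes under the tree-generating prior.
sim_bottom_nodes <- function(base = 0.95, power = 2) {
  grow <- function(depth) {
    if (runif(1) < base * (1 + depth)^(-power)) {
      grow(depth + 1) + grow(depth + 1)   # node splits: count leaves in both children
    } else {
      1                                   # terminal (bottom) node
    }
  }
  grow(0)
}

set.seed(1)
table(replicate(10000, sim_bottom_nodes()))   # prior mass piles up on small trees
```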

50

Page 51:

π(µ | T)

For each bottom node µ, let

µ ~ N(µµ, σµ²),

which induces E[Y | x] ~ N(m µµ, m σµ²).

Since we are pretty sure that E[Y | x] ∈ (ymin, ymax), we choose µµ and σµ such that, for a prechosen k (such as 2 or 3),

(m µµ − k √m σµ,  m µµ + k √m σµ) = (ymin, ymax).

This is conveniently done by shifting and rescaling y so that (ymin, ymax) = (−0.5, 0.5) and then setting µµ = 0 and k √m σµ = 0.5, i.e., σµ = 0.5 / (k √m).

Note how the prior adapts to m: σµ gets smaller as m gets larger. The default choice is k = 2.

51

Page 52:

π(σ)

Let σ² ~ νλ / χ²ν.

To set λ, we use a rough overestimate σ̂ of σ based on the data (such as sd(y) or the LS estimate from the saturated linear regression). We determine λ by setting a quantile such as .75, .95 or .99 of the prior at this rough estimate, and consider ν = 3, 5 or 10.

[Figure: the three priors we would consider when σ̂ = 2.]
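For reference, a hedged sketch of how (ν, q, k, m) map onto the BayesTree::bart() arguments; the values shown are the package defaults, not necessarily the talk's.

```r
fit <- bart(x.train = x, y.train = y,
            sigdf = 3, sigquant = 0.90,   # sigma prior: nu = 3, quantile q = 0.90
            k = 2,                        # mu prior scaling: k = 2
            ntree = 200)                  # m = 200 trees
```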

52

Page 53:

53

Weakening the Error Distribution Assumption

Simply replace

εi ~ N(0, σ²) iid

by

εi ~ N(0, σi²),   σi ~ G,   G ~ DP(α, G0),

where DP is the Dirichlet process.

G0 is the same prior we used for the default σ prior. The BART MCMC is still well behaved. This is not a model for heteroscedasticity (future work!!), just a more general error distribution.

Page 54:

54

So, we now have our fully flexible model specification:

Y = f(x) + ε, ε symmetric with mean 0

Issue: Is the richer version of BART too flexible to work? Back to the Friedman example.

Original case: σi ≡ 1
Mixture case: σi = 1 with probability .75, σi = 10 with probability .25
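A hedged sketch of generating the mixture-error version of the Friedman data described above, reusing the illustrative friedman_f() generator from the earlier sketch:

```r
set.seed(2)
n <- 100
x <- matrix(runif(n * 10), n, 10)
sigma.i <- ifelse(runif(n) < 0.75, 1, 10)   # sigma_i = 1 w.p. .75, 10 w.p. .25
y <- friedman_f(x) + sigma.i * rnorm(n)
```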

Page 55:

55

RMSE Comparison of Normal vs DP Models

Top pair: normal errors. Bottom pair: mixture errors.
Left boxplot: normal model. Right boxplot: DP model.
Similar performance for normal errors; DP is much better for mixture errors.

Page 56:

BART Extensions and Applications

ABREVEYA & MCCULLOCH (2006). Reversal of fortune: A statistical analysis of penalty calls in the national hockey league.
ABU-NIMEH, NAPPA, WANG & NAIR (2008). Bayesian Additive Regression Trees-Based Spam Detection for Enhanced Email Privacy. 2008 Third International Conference on Availability, Reliability and Security.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Hardening Email Security using Bayesian Additive Regression Trees. Machine Learning.
ABU-NIMEH, NAPPA, WANG & NAIR (2009). Distributed Phishing Detection by Applying Variable Selection using Bayesian Additive Regression Trees. IEEE International Conference on Communications.
CHIPMAN, GEORGE, LEMP & MCCULLOCH (2010). Bayesian Flexible Modeling of Trip Durations. Transportation Research Part B: Methodological.
ELIASBERG, HUI & ZHANG (2010). Green-lighting Movie Scripts: Revenue Forecasting and Risk Management.
GREEN & KERN (2010). Modeling heterogeneous treatment effects in large-scale experiments using Bayesian Additive Regression Trees.

56

Page 57:

57

BART Extensions and Applications (cont.)

HILL (2010). Bayesian Nonparametric Modeling for Causal Inference. JCGS.
HILL & ZUBIZARRETA (2010). Causal Inference in Observational Studies with Missing Data: A Bayesian Nonparametric Approach.
KOURTELLOS, TAN & ZHANG (2007). Is the Relationship between Aid and Economic Growth Nonlinear? Journal of Macroeconomics.
WANG & KARR (2010). Preserving data utility via BART. JSPI.
YU, LI, FANG & PENG (2010). An adaptive sampling scheme guided by BART, with an application to predict processor performance. Canadian Journal of Statistics.
YU, MACEACHERN & PERUGGIA (2009). Bayesian Synthesis.
ZHANG & HAERDLE (2010). The Bayesian additive classification tree applied to credit risk modelling. Computational Statistics & Data Analysis.
ZHANG, SHIH & MULLER (2007). A spatially-adjusted Bayesian additive regression tree model to merge two datasets. Bayesian Analysis.
ZHOU & LIU (2008). Extracting sequence features to predict protein-DNA binding: A comparative study. Nucleic Acids Research.

Page 58:

Plenty More to Do

•  Further Error Structure Magic
•  Monotone BART
•  Model Building via BART

58

Page 59:

Thank You!

59