brian reich north carolina state university june, 2019bjreich/talks/deepdensereg.pdf · 2019. 6....

Density regression via deep learning

Brian Reich

North Carolina State University

June, 2019


Collaborators

This is largely work of PhD students Neal Grantham and Rui Li

Others: Howard Bondell, Eric Laber, Krishna Pacifici, Rob Dunn


Density regression

I Most statistical analyses are mean regressions:

$$E(Y \mid X) = \sum_{j=1}^{p} X_j \beta_j$$

I This is a reasonable first-order approximation and leads to a simple and interpretable model

I Density regression allows the entire distribution of the response to depend on covariates

I For example, a covariate might affect the mean, variance, skewness, etc.

I This provides a more comprehensive study of covariate effects and more realistic prediction distributions


Density regression

I Density regression is more challenging to fit than mean regression

I This is especially true when there are many covariates

I In this talk we invoke deep learning for density regression

I Deep learning is great for prediction but poor for inference

I We apply this method to two environmental applications where the objective is prediction


Application 1: Geolocation using microbiome data

I The microbiome is the community of microbial organisms

I Sequencing technology makes it possible to affordably identify microbes

I Our collaborators collected dust samples from the outer door frames of 1,300 homes in the US

I DNA sequencing revealed 50,000 fungal taxa

I Can we use the microbiome of the sample (X) to predict its location of origin (Y)?


Application 2: Solar energy forecasting

I Solar and wind energy forecasting is big business

I We use recent meteorology and numerical forecasts for short-term prediction

I Stochastic forecasts that assess uncertainty are crucial

I This allows for

I prediction of features like exceeding thresholds

I propagating uncertainty into economic models


Application 2: Solar energy forecasting


Density regression with deep learning

I In both problems there are many predictors that might affect the predictive distribution

I We use the same general strategy for both problems:

1. Randomly partition the prediction domain

2. Train a deep learning classifier on the partitions

3. Repeat many times and aggregate


Forensic geolocation model

I Let $Y \in D \subset \mathbb{R}^2$ be the spatial location of a sample

I The predictors X are the p binary indicators of the presence of each taxon in the sample

I To approximate the density of Y | X, we randomly partition the spatial domain into K tiles

I Let $v_1, \ldots, v_K \overset{iid}{\sim} \mathrm{Uniform}(D)$ be “seeds” that define tiles

$$P_k = \{ s : \|s - v_k\| < \|s - v_l\| \text{ for all } l \neq k \}$$

I Y is reduced to the label g, where g = k means $Y \in P_k$
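The seed-and-tile construction can be sketched in a few lines. This is a minimal illustration, assuming the domain D is the unit square; the function names `draw_seeds` and `tile_label` are hypothetical and labels are 0-based.

```python
import math
import random

def draw_seeds(k, rng, xlim=(0.0, 1.0), ylim=(0.0, 1.0)):
    """Draw k seeds uniformly over a rectangular domain D (an assumed shape)."""
    return [(rng.uniform(*xlim), rng.uniform(*ylim)) for _ in range(k)]

def tile_label(y, seeds):
    """Return the 0-based index k of the seed nearest to location y, i.e. y in P_k."""
    return min(range(len(seeds)), key=lambda k: math.dist(y, seeds[k]))

rng = random.Random(0)
seeds = draw_seeds(5, rng)                                   # v_1, ..., v_K with K = 5
locations = [(rng.random(), rng.random()) for _ in range(8)]  # toy sample locations Y
labels = [tile_label(y, seeds) for y in locations]            # class labels g
```

Each location is reduced to the index of its nearest seed, which is exactly membership in the corresponding Voronoi tile.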


A random partition



I We then regress the labels g onto X using a multi-class classification algorithm

I Deep learning turned out to be the best classifier

I Let π̂k (X) be the fitted probability of Y ∈ Pk given X

I Assuming a uniform density within each tile gives the predictive density

$$p(y \mid X) = \sum_{k=1}^{K} \hat{\pi}_k(X) \frac{1}{|P_k|} I(y \in P_k)$$

where $|P_k|$ is the area of tile k
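As a quick sanity check (not on the original slide), the tile-based predictive density integrates to one over D whenever the fitted tile probabilities sum to one, because the tiles partition D:

```latex
\int_D p(y \mid X)\, dy
  = \sum_{k=1}^{K} \hat{\pi}_k(X) \frac{1}{|P_k|} \int_D I(y \in P_k)\, dy
  = \sum_{k=1}^{K} \hat{\pi}_k(X) \frac{|P_k|}{|P_k|}
  = \sum_{k=1}^{K} \hat{\pi}_k(X) = 1 .
```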



I Pro: Can be fit quickly with standard software

I Cons: Reliance on the number and configuration of the tiles, and the predictive density is discontinuous

I Solution: Repeat many times with a different number of random seeds and average the predictive densities

I We call this method “Deep space”

I Properties: As J, K → ∞ it can approximate any continuous conditional density function


The deep space algorithm

For j = 1, ..., J

1. Draw the number of tiles Kj ∼ Uniform(a,b)

2. Draw the seeds vj1, ..., vjKj ∼ Uniform(D)

3. Train a classifier to obtain tile probabilities π̂j1(X), ..., π̂jKj (X)

The final predictive density is

$$p(y \mid X) = \frac{1}{J} \sum_{j=1}^{J} \sum_{k=1}^{K_j} \hat{\pi}_{jk}(X) \frac{1}{|P_{jk}|} I(y \in P_{jk})$$
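The loop above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the domain is the unit square, tile areas are estimated by Monte Carlo, and a smoothed nearest-neighbour vote (`knn_tile_probs`, a hypothetical stand-in) replaces the deep neural network classifier used in the talk. All function names and default values are invented for the example.

```python
import math
import random

def nearest(y, seeds):
    """0-based index of the tile (nearest seed) containing location y."""
    return min(range(len(seeds)), key=lambda k: math.dist(y, seeds[k]))

def tile_areas(seeds, rng, n_mc=2000):
    """Monte Carlo estimate of each tile's area on the unit square."""
    counts = [0] * len(seeds)
    for _ in range(n_mc):
        counts[nearest((rng.random(), rng.random()), seeds)] += 1
    return [c / n_mc for c in counts]

def knn_tile_probs(x, X, labels, n_tiles, m=10, smooth=0.1):
    """Stand-in classifier: smoothed label frequencies among the m nearest X."""
    order = sorted(range(len(X)), key=lambda i: math.dist(x, X[i]))[:m]
    counts = [smooth] * n_tiles
    for i in order:
        counts[labels[i]] += 1
    total = sum(counts)
    return [c / total for c in counts]

def deep_space_density(y, x, X, Y, J=20, a=3, b=8, seed=0):
    """Average over J random partitions of pi_hat_jk(x) / |P_jk| for the tile containing y."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(J):
        K = rng.randint(a, b)                                   # step 1: number of tiles
        seeds = [(rng.random(), rng.random()) for _ in range(K)]  # step 2: seeds
        labels = [nearest(yi, seeds) for yi in Y]
        probs = knn_tile_probs(x, X, labels, K)                 # step 3: tile probabilities
        areas = tile_areas(seeds, rng)
        k = nearest(y, seeds)
        total += probs[k] / max(areas[k], 1e-9)                 # guard against a zero area estimate
    return total / J

# Toy data: the first coordinate of the location tracks the single covariate.
rng = random.Random(1)
clip = lambda v: min(max(v, 0.0), 1.0)
X = [(rng.random(),) for _ in range(200)]
Y = [(clip(x[0] + rng.gauss(0, 0.05)), clip(0.5 + rng.gauss(0, 0.05))) for x in X]
near = deep_space_density((0.2, 0.5), (0.2,), X, Y)
far = deep_space_density((0.9, 0.9), (0.2,), X, Y)
```

For a covariate value of 0.2, the averaged density should be much higher near (0.2, 0.5) than at (0.9, 0.9). Averaging over partitions is what smooths out the tile boundaries of any single fit.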


National analysis

We compared the following models using cross-validation:

1. NN: Nearest neighbors analysis

2. RF: Geolocation using random forests

3. Net: Geolocation using a shallow neural network

4. DeepSpace (DS): Geolocation using a deep neural network with three hidden layers

5. State DNN: A deep neural network with US states as tiles

6. BDA: Naive Bayes classifier based on kernel-smoothed occurrence probability for each taxon

Methods 2-4 use K ∼ Uniform(0.05n,0.50n)


Bayesian discriminant analysis (BDA)


As seen on TV!


Cross-validation results

Model       Median error (km)   Coverage   State match (%)   County match (%)   City match (%)
DeepSpace         97.8            96.3          60.2              23.6              19.4
Net              113.3            94.3          58.2              23.9              19.7
State DNN        211.0             -            57.1               -                 -
RF               213.7            98.6          47.6              17.0              14.2
NN               247.9            90.0          44.6              14.6              12.1
BDA              263.7            91.0          31.9               1.6               0.8


Average errors for deep space


Regional analysis

I We also analyze the n = 116 samples from the central North Carolina counties of Wake, Durham and Orange

I By focusing on a small geographic area we can isolate the ability of the models to predict the origin of a sample when biogeographic differences are held relatively constant

I In this analysis we seek to determine if there is a limit to the resolution at which one may geolocate samples using fungal occupancy data



Model        Median error (km)   Coverage   County match (%)   City match (%)
County DNN         18.0             -            53.4               -
Net                19.2            90.5          49.1              25.9
BDA                19.5            90.5          40.5              19.0
DeepSpace          20.0            90.5          40.5              18.1
RF                 20.2            93.1          36.2              19.0
NN                 20.4            84.5          43.1              24.1


Global analysis

I With the generous funding of the US Department of Defense we gathered n = 399 samples from 28 countries

I The data span Eastern Europe, Middle East, Africa, Asia, Oceania, and the Americas

I There were 10-20 sampling locations in each country

I Samples within each country often stem from a single major city

I We therefore compared the models only on their ability to detect a sample’s country of origin



Model         Classification accuracy
DeepSpace            89.5%
Country DNN          84.7%
Net                  84.2%
RF                   74.9%
NN                   62.7%


Deep space confusion matrix


Summary

I The method works well at continental and global scales, but not at regional scales

I We have worked with the Department of Defense to implement this method

I Future work is to:

I Incorporate covariates such as climate and land-cover

I Analyze samples of mixed origin

I Generalize and study theory (next!)


Non-spatial applications

I Geolocation is an important but narrow problem

I After completing this we began generalizing the method to non-spatial problems

I In general density regression we have a univariate response Y, scaled so Y ∈ [0,1]

I We apply the deep space algorithm to regress the density of Y onto covariates X

I We call this the Deep Density Regression (DDR) method


The DDR algorithm

For j = 1, ..., J

1. Draw the K − 1 cutpoints $v_{j1}, \ldots, v_{j,K-1} \overset{iid}{\sim} \mathrm{Uniform}(0,1)$

2. Sort the cutpoints so $y_{jk} = v_{j(k)}$ (with $y_{j0} = 0$ and $y_{jK} = 1$)

3. Assign observations to bins: g = k if $Y \in (y_{j,k-1}, y_{jk}) = P_{jk}$

4. Train a classifier to obtain bin probabilities $\hat{\pi}_{j1}(X), \ldots, \hat{\pi}_{jK}(X)$

The final predictive density is

$$p(y \mid X) = \frac{1}{J} \sum_{j=1}^{J} \sum_{k=1}^{K} \hat{\pi}_{jk}(X) \frac{1}{|P_{jk}|} I(y \in P_{jk})$$
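A minimal sketch of the random-histogram steps and the averaged density follows. The helper names are hypothetical, bins are 0-based, and the "classifier" is a stand-in that returns the exact bin probabilities for a Uniform(0,1) response (so the averaged density should equal 1 everywhere); a real fit would substitute trained networks.

```python
import bisect
import random

def draw_cutpoints(K, rng):
    """K - 1 sorted Uniform(0,1) cutpoints with the endpoints 0 and 1 added."""
    return [0.0] + sorted(rng.random() for _ in range(K - 1)) + [1.0]

def bin_index(y, cuts):
    """0-based bin k with cuts[k] <= y < cuts[k + 1]; y = 1 goes to the last bin."""
    return min(bisect.bisect_right(cuts, y) - 1, len(cuts) - 2)

def ddr_density(y, prob_fns, cut_sets, x):
    """Average over J histograms of pi_hat_jk(x) / |P_jk| for the bin containing y."""
    total = 0.0
    for probs_of, cuts in zip(prob_fns, cut_sets):
        k = bin_index(y, cuts)
        total += probs_of(x)[k] / (cuts[k + 1] - cuts[k])
    return total / len(cut_sets)

rng = random.Random(0)
J, K = 50, 6
cut_sets = [draw_cutpoints(K, rng) for _ in range(J)]
# Hypothetical stand-in for trained classifiers: exact bin probabilities
# when Y | X ~ Uniform(0, 1), i.e. each bin's probability is its width.
prob_fns = [lambda x, c=c: [c[k + 1] - c[k] for k in range(len(c) - 1)] for c in cut_sets]
density = ddr_density(0.3, prob_fns, cut_sets, x=None)
```

Because each histogram uses different cutpoints, the jumps of the individual step functions fall in different places and the average is smoother than any single histogram.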


Loss function 1 - Multiclass regression

I Parameterize the bin probabilities as πk (X;θ)

I We model πk using a deep neural network with softmax (multinomial logistic) link

I In this model θ includes the biases and weights

I These parameters are estimated to minimize

$$-\sum_{i=1}^{n} \sum_{k=1}^{K} I(y_{k-1} < Y_i < y_k) \log[\pi_k(X_i; \theta)]$$

I Then π̂k (X) = πk (X; θ̂)
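For concreteness, the multiclass loss is just the negative log of the probability assigned to each observation's own bin, summed over observations. A small sketch with made-up probabilities (the function name and numbers are illustrative, and bin labels are 0-based):

```python
import math

def multiclass_loss(bin_labels, probs):
    """Negative log-likelihood: -sum_i log pi_{g_i}(X_i), with 0-based bin labels g_i."""
    return -sum(math.log(p[g]) for g, p in zip(bin_labels, probs))

labels = [0, 2]                                 # bins containing Y_1 and Y_2
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]      # fitted bin probabilities for X_1 and X_2
loss = multiclass_loss(labels, probs)           # equals -(log 0.7 + log 0.6)
```

Only the probability of the correct bin enters the loss, so how the remaining mass is spread over the other bins does not matter.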


Loss function 2 - Binary cross-entropy loss

I The previous loss ignores bin ordering and is sensitive to K

I Denote the CDF as

$$\mathrm{Prob}(Y \le y_k \mid X) = F_k(X; \theta) = \sum_{l=1}^{k} \pi_l(X; \theta)$$

and $\bar{F}_k(X; \theta) = 1 - F_k(X; \theta)$

I The loss function is then

$$-\sum_{k=1}^{K} \sum_{i=1}^{n} \left\{ I(Y_i \le y_k) \log[F_k(X_i; \theta)] + I(Y_i > y_k) \log[\bar{F}_k(X_i; \theta)] \right\}$$

I As before, π̂k (X) = πk (X; θ̂)
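The sketch below illustrates why the CDF-based loss respects bin ordering. It is an illustrative implementation, not the authors' code: bins are 0-based, the final cutpoint is skipped because its CDF value is 1 and it contributes nothing, and `eps` is an assumed clipping constant to avoid log(0).

```python
import math
from itertools import accumulate

def cdf_loss(bin_labels, probs, eps=1e-12):
    """Sum over interior cutpoints and observations of the binary cross-entropy
    between I(Y_i <= y_k) and the fitted CDF value F_k(X_i)."""
    total = 0.0
    for g, p in zip(bin_labels, probs):
        cdf = list(accumulate(p))
        for k in range(len(p) - 1):          # last cutpoint has F_K = 1, so it is skipped
            below = g <= k                   # 0-based: Y_i lies at or below cutpoint k + 1
            f = min(max(cdf[k], eps), 1 - eps)
            total -= math.log(f) if below else math.log(1 - f)
    return total

# The true bin is 0 in both cases and both predictions give it probability 0.1,
# so the multiclass loss cannot tell them apart.
near_miss = [0.1, 0.9, 0.0, 0.0]   # remaining mass in the adjacent bin
far_miss = [0.1, 0.0, 0.0, 0.9]    # remaining mass three bins away
loss_near = cdf_loss([0], [near_miss])
loss_far = cdf_loss([0], [far_miss])
```

Here `loss_far` is roughly three times `loss_near`, because the far miss is wrong at every interior cutpoint while the near miss is wrong at only one.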


Theory without covariates

I Theory has already been worked out for the histogram without covariates (e.g., Wasserman 2013)

I Assume the true PDF has bounded second derivative

I The fixed-bin histogram is consistent if n, K → ∞ and K/n → 0

I The optimal number of bins is $K = O(n^{1/3})$

I This holds for the random histogram without covariates
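A sketch of where the $n^{1/3}$ rate comes from, using the standard risk approximation for a fixed-bin histogram. The bin width h = 1/K on [0,1] and the constant A are notation introduced here, not on the slide, and the approximation holds under smoothness conditions of the kind assumed above:

```latex
R(h) \approx \frac{h^2}{12} A + \frac{1}{nh},
\qquad A = \int \{f'(y)\}^2 \, dy ,
\\[4pt]
\frac{dR}{dh} = \frac{hA}{6} - \frac{1}{nh^2} = 0
\;\Longrightarrow\;
h^{*} = \left( \frac{6}{nA} \right)^{1/3},
\qquad
K^{*} = \frac{1}{h^{*}} = \left( \frac{nA}{6} \right)^{1/3} = O(n^{1/3}) .
```

The first term is the squared bias (wide bins oversmooth) and the second is the variance (narrow bins hold few observations); the optimal K balances the two.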


Theory with covariates

I Assume there are K fixed and equally-sized bins

I The true probability in bin k is $\pi_k(X) = \int_{P_k} f(y \mid X)\, dy$

I Let π̂k (X) be an estimator of πk (X)

I Assume

1. f(y | X) has bounded second derivative

2. n, K → ∞ and K/n → 0

3. $\mathrm{Bias}(\hat{\pi}_k(X)) = o(1/K)$ for all k

4. $\mathrm{Var}(\hat{\pi}_k(X)) = o(1/K^2)$ for all k

I Then the conditional density estimator

$$\hat{f}(y \mid X) = \sum_{k=1}^{K} \frac{1}{|P_k|} I(y \in P_k)\, \hat{\pi}_k(X)$$

is consistent


Theory with covariates

I This theorem is model agnostic

I If we assume that

$$\pi_k(X) \approx \frac{\exp(X^T \beta_k)}{\sum_{l=1}^{K} \exp(X^T \beta_l)}$$

the parametric multinomial logistic linear model with a fixed number of covariates satisfies the theorem’s conditions

I We are still exploring the theoretical properties of deep learning
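The multinomial logistic (softmax) form of the bin probabilities can be computed as follows. The coefficient values are made up for illustration, and subtracting the maximum score is a standard numerical-stability step that does not change the result.

```python
import math

def softmax_probs(x, betas):
    """Bin probabilities exp(x'beta_k) / sum_l exp(x'beta_l) for one covariate vector x."""
    scores = [sum(xj * bj for xj, bj in zip(x, beta)) for beta in betas]
    top = max(scores)                              # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values: two covariates and K = 3 bins.
probs = softmax_probs([1.0, 2.0], [[0.0, 0.0], [0.5, -0.2], [-1.0, 0.3]])
```

The linear scores here are 0, 0.1 and -0.4, so the second bin receives the largest probability.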


Simulation study

I We compare fixed and random histograms with the two loss functions for various numbers of breakpoints K

I We also compare with quantile random forests

I In all cases there are n = 6,000 training observations

I Models are compared using the continuous ranked probability score (CRPS) averaged over 1,000 test set observations

I Coverage is not included here, but is close to the nominal level for all methods with sufficiently large K
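For reference, the CRPS of a predictive CDF F at an observation y is the integral of (F(t) − I(t ≥ y))² over t, with smaller values being better. The sketch below evaluates it numerically for a histogram forecast on [0,1]; the function names and the midpoint-rule grid are choices made for this example, not details from the talk.

```python
def histogram_cdf(t, cuts, probs):
    """CDF at t of a histogram density with bin edges cuts and bin probabilities probs."""
    total = 0.0
    for k, p in enumerate(probs):
        lo, hi = cuts[k], cuts[k + 1]
        if t >= hi:
            total += p                          # the whole bin lies below t
        elif t > lo:
            total += p * (t - lo) / (hi - lo)   # uniform mass within the bin
    return total

def crps(y_obs, cuts, probs, n_grid=2000):
    """Midpoint-rule approximation of the integral of (F(t) - I(t >= y_obs))^2 over [0, 1]."""
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * h
        step = 1.0 if t >= y_obs else 0.0
        total += (histogram_cdf(t, cuts, probs) - step) ** 2 * h
    return total

flat = crps(0.5, [0.0, 1.0], [1.0])                          # uniform forecast
sharp = crps(0.5, [0.0, 0.4, 0.6, 1.0], [0.05, 0.9, 0.05])   # mass concentrated near 0.5
```

For the uniform forecast and an observation at 0.5 the exact value is 1/12; the sharper forecast centred on the observation scores lower, as it should.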


Simulation study 1 - mixture of nonlinear regressions

Y = [sin(X1) + ε1]π1 + [2 sin(1.5X1 + 1) + ε2](1− π1)

I X1 ∼ Uniform(0,10)

I π1 ∼ Bernoulli(0.5)

I ε1 ∼ Normal(0,0.09)

I ε2 ∼ Normal(0,0.64)
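A sketch of a data generator for this design, useful for reproducing the setup. It assumes the second argument of Normal(·,·) is a variance (so the standard deviations are 0.3 and 0.8), which the slide does not state, and it uses a min-max rescaling to [0,1] as one possible way to meet the DDR scaling requirement; both function names are hypothetical.

```python
import math
import random

def simulate_study1(n, rng):
    """Draw (X1, Y) pairs from the two-component mixture of nonlinear regressions."""
    data = []
    for _ in range(n):
        x1 = rng.uniform(0, 10)
        if rng.random() < 0.5:                          # pi_1 = 1 with probability 0.5
            y = math.sin(x1) + rng.gauss(0, 0.3)        # sd 0.3, assuming variance 0.09
        else:
            y = 2 * math.sin(1.5 * x1 + 1) + rng.gauss(0, 0.8)  # sd 0.8, assuming variance 0.64
        data.append((x1, y))
    return data

def rescale_unit(values):
    """Min-max rescale so that the response lies in [0, 1] (an assumed preprocessing step)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

rng = random.Random(2019)
train = simulate_study1(6000, rng)
y_scaled = rescale_unit([y for _, y in train])
```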


Fitted PDF - Fixed bins and X = 2


Fitted PDF - Random bins and X = 2


Fitted CDF - Random bins and X = 2


Simulation study 1 - Mixture of nonlinear regressions


Simulation study 2 - Heteroskedastic linear model

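$$Y \mid \beta_1, \beta_2 \sim \mathrm{Normal}\left( X^T \beta_1, \exp(X^T \beta_2) \right)$$

I $X_1, \ldots, X_5 \overset{iid}{\sim} \mathrm{Normal}(0,1)$

I $\beta_1 \sim \mathrm{Normal}(0, I_5)$

I $\beta_2 \sim \mathrm{Normal}(0, 0.45 I_5)$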


Simulation study 2 - Heteroskedastic linear model



$$Y = [10 \sin(2\pi X_1 X_2) + 10 X_4 + \varepsilon_1]\, \pi_1 + [20 (X_3 - 0.5)^2 + 5 X_5 + \varepsilon_2]\, (1 - \pi_1)$$

I $X_1, \ldots, X_{10} \overset{iid}{\sim} \mathrm{Uniform}(0,1)$

I π1 ∼ Bernoulli(0.5)

I ε1 ∼ Normal(0,2.25)

I ε2 ∼ Normal(0,1)


Simulation study 4 - Nonlinear non-Gaussian model

$$Y = 10 \sin(2\pi X_1 X_2) + 20 (X_3 - 0.5)^2 + 10 X_4 + 5 X_5 + \varepsilon$$

I $X_1, \ldots, X_{10} \overset{iid}{\sim} \mathrm{Uniform}(0,1)$

I ε ∼ SkewNormal(0,1,−5)


Simulation study 4 - Nonlinear non-Gaussian model


Summary of the simulation study

I The random histogram is slightly better than fixed bins

I The binary cross-entropy loss function is far superior to the multi-class loss function

I In particular, the binary cross-entropy loss is much less sensitive to K

I Predictive intervals attain the nominal coverage for large K


Solar power forecasting

I Global Energy Forecasting Competition 2014 was an IEEE-sponsored competition on probabilistic forecasting

I The n = 25K responses Y are the amount of solar power generation

I The competition organizers normalized Y ∈ [0,1]

I There are 45 covariates X including

I Solar irradiance, temperature, wind speed and direction, relative humidity, air pressure, etc.

I Dummy variables for hour of the day and season

I Dummy variables for the solar farm ID


Solar power forecasting cross validation


Example forecasts

The model was trained at the beginning of each month


Summary

I Density regression is under-utilized

I DDR is simple to implement using existing software

I We are working on a Python package

I DDR provides prediction intervals for deep learning

I This work was supported by the US DOD

