bayesian nonparametrics, applications to biology, ecology, and marketing

Bayesian Nonparametrics: Applications to biology, ecology, and marketing. Antonio Canale, Università di Torino & Collegio Carlo Alberto. StaTalk, 19 February 2016.

Upload: julyan-arbel

Post on 09-Feb-2017


Bayesian Nonparametrics: Applications to biology, ecology, and marketing

Antonio Canale

Università di Torino & Collegio Carlo Alberto

StaTalk, 19 February 2016

Outline

Applications to

1 Toxicology

2 Ecology

3 Marketing

4 Human fertility

5 More applications

Toxicology Ecology Marketing Human fertility More applications

Developmental toxicity studies

• Developmental toxicity is any alteration, caused by environmental factors, that interferes with normal growth.

• Environmental factors include drugs, lifestyle factors such as alcohol and smoking, and toxic environmental chemicals or physical factors.

• Typical settings involve animal experiments.


Ethylene glycol

• Ethylene glycol is used in many industrial processes, e.g. as an antifreeze, an industrial humectant, and a solvent in the paint and plastics industries.

• We consider data from a developmental toxicity study of ethylene glycol in mice conducted by the National Toxicology Program (Price et al. 1985).

• Pregnant mice were assigned to dose groups of 0, 750, 1500, or 3000 mg/kg/day, with the number of implants measured for each mouse at the end of the experiment.

• The scientific interest lies in studying a dose-response trend in the distribution of the number of implants.


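Ethylene glycol data (control group, mean 13.32, variance 4.89)

[Figure: frequencies of the number of implants in the control group; horizontal axis from 5 to 20, vertical axis "frequencies" from 0 to 6.]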


Let’s go nonparametric!

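Clearly we cannot estimate the pmf of the number of implants with

$$y_i \sim \mathrm{Pois}(\lambda), \qquad \lambda \sim \mathrm{Ga}(a, b),$$

since the sampling model is too restrictive: it cannot produce a variance smaller than the mean, while the control group has mean 13.32 and variance 4.89.

Hence we have a good reason to be nonparametric.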
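Why the Poisson-gamma model is too restrictive for these data can be made explicit with a short calculation. This is a sketch that assumes the rate parametrization of $\mathrm{Ga}(a, b)$, which the slide does not state. By the laws of total expectation and total variance,

$$
\begin{aligned}
\mathbb{E}(y_i) &= \mathbb{E}(\lambda) = \frac{a}{b},\\
\operatorname{Var}(y_i) &= \mathbb{E}\{\operatorname{Var}(y_i \mid \lambda)\} + \operatorname{Var}\{\mathbb{E}(y_i \mid \lambda)\} = \frac{a}{b} + \frac{a}{b^2} > \mathbb{E}(y_i).
\end{aligned}
$$

The marginal variance always exceeds the marginal mean, so no choice of $(a, b)$ can reproduce a sample with mean 13.32 and variance 4.89.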


Simple approach

• A draw from the Dirichlet process (DP) is almost surely a discrete distribution.

• We may think of assuming

$$y_i \mid P \sim P, \qquad P \sim \mathrm{DP}(\alpha, P_0).$$


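• The posterior is available in closed form:

$$(P \mid y_1, \dots, y_n) \sim \mathrm{DP}\left(\alpha + n,\ \frac{\alpha P_0 + \sum_{i=1}^{n} \delta_{y_i}}{\alpha + n}\right),$$

• which is actually quite unappealing, since it does not allow borrowing of information about local deviations from $P_0$.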
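The lack of borrowing can be seen from the posterior mean of the pmf, $\mathbb{E}\{P(\{j\}) \mid y_1, \dots, y_n\} = \{\alpha P_0(j) + n_j\}/(\alpha + n)$, where $n_j$ is the number of observations equal to $j$. The sketch below is illustrative only: the data are hypothetical, and the Poisson base measure and the value of `alpha` are assumptions, not choices taken from the talk.

```python
from collections import Counter
from math import exp, lgamma, log


def poisson_pmf(j, lam):
    """Poisson pmf, used here only as an illustrative base measure P0."""
    return exp(j * log(lam) - lam - lgamma(j + 1))


def dp_posterior_mean_pmf(j, data, alpha, base_pmf):
    """E[P({j}) | data] = (alpha * P0(j) + n_j) / (alpha + n)."""
    n_j = Counter(data)[j]
    return (alpha * base_pmf(j) + n_j) / (alpha + len(data))


# Hypothetical counts, not the ethylene glycol data.
data = [12, 13, 13, 14, 16]
alpha = 1.0


def base(j):
    return poisson_pmf(j, 13.0)


# 15 was never observed, although its neighbours 14 and 16 were.
for j in (13, 15):
    print(j, round(dp_posterior_mean_pmf(j, data, alpha, base), 4))
```

The unobserved value 15 receives only its share of the prior mass, however many observations fall right next to it: each observed count informs the estimate at that exact count and nowhere else.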


[Figure: pmf under the simple DP approach; horizontal axis from 5 to 20, vertical axis "pmf" from 0.00 to 0.20.]


Mixture of Poisson

• An alternative is

$$\Pr(Y = j) = \int \mathrm{Poi}(j; \lambda)\, dP(\lambda), \qquad P \sim \mathrm{DP}(\alpha P_0).$$

• A DP mixture (DPM) of Poissons seems extremely flexible, and a natural modification of the DPM of Gaussians;

• however, the resulting prior on the count distribution is actually quite inflexible:

• under-dispersed distributions cannot be approximated, because any mixture of Poissons has variance at least as large as its mean.
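The under-dispersion limitation follows from the law of total variance: for a Poisson mixture, $\operatorname{Var}(Y) = \mathbb{E}(\lambda) + \operatorname{Var}(\lambda) \ge \mathbb{E}(Y)$. A minimal numerical sketch, with hypothetical weights and rates chosen only for illustration:

```python
def poisson_mixture_moments(weights, rates):
    """Mean and variance of a finite mixture sum_h w_h * Poisson(rate_h)."""
    mean = sum(w * lam for w, lam in zip(weights, rates))
    # Law of total variance: E[Var(Y | lambda)] + Var(E[Y | lambda]).
    var_of_rates = sum(w * (lam - mean) ** 2 for w, lam in zip(weights, rates))
    return mean, mean + var_of_rates


# Hypothetical mixing distribution over the Poisson rates.
mean, var = poisson_mixture_moments([0.5, 0.3, 0.2], [10.0, 13.0, 18.0])
print(mean, var)
```

Whatever weights and rates are supplied, the second returned value can never fall below the first, so a sample with variance 4.89 and mean 13.32 is out of reach.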


Round a continuous distribution

• Take a continuous density $f$

• Define $a_0 = 0, a_1 = 1, \dots$

• Calculate $p(j) = \int_{a_j}^{a_{j+1}} f(x)\, dx$

• Obtain the discrete count distribution
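The four steps above can be sketched in a few lines. The Weibull density and its parameters are assumptions made only so that the cdf has a closed form; the talk does not say which density appears in the figure.

```python
from math import exp


def weibull_cdf(x, shape=1.5, scale=1.5):
    """CDF of a Weibull density on [0, inf); an illustrative choice of f."""
    return 1.0 - exp(-((x / scale) ** shape)) if x > 0 else 0.0


def rounded_pmf(cdf, max_count):
    """p(j) = F(a_{j+1}) - F(a_j) with thresholds a_j = j, for j = 0..max_count."""
    return [cdf(j + 1) - cdf(j) for j in range(max_count + 1)]


pmf = rounded_pmf(weibull_cdf, max_count=30)
print([round(p, 3) for p in pmf[:5]], sum(pmf))
```

Each probability is the area under the density between two consecutive thresholds, so the probabilities sum to the mass that the density places on $[0, a_{31})$, which is essentially one here.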

[Figure: the continuous density f(x), plotted for x between 0 and 5.]


[Figure: the resulting discrete distribution p(y), plotted for y between 0 and 4.]


Rounded Gaussian Mixture (Canale and Dunson, 2011)

$$p(\cdot\,; P) = \int \mathrm{RG}(\cdot\,; \mu, \tau^{-1})\, dP(\mu, \tau^{-1}), \qquad P \sim \mathrm{DP}(\alpha P_0).$$
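A finite-mixture sketch of the rounded Gaussian kernel shows that, unlike a Poisson mixture, it can be under-dispersed. The thresholds ($a_0 = -\infty$, $a_j = j - 1$ for $j \ge 1$), the two components, and the use of standard deviations in place of the precision $\tau$ are all assumptions made for illustration.

```python
from math import erf, inf, sqrt


def norm_cdf(x, mu, sd):
    """Gaussian cdf, with the lower limit -inf handled explicitly."""
    if x == -inf:
        return 0.0
    return 0.5 * (1.0 + erf((x - mu) / (sd * sqrt(2.0))))


def threshold(j):
    """Assumed thresholds: a_0 = -inf and a_j = j - 1 for j >= 1."""
    return -inf if j == 0 else float(j - 1)


def rg_pmf(j, mu, sd):
    """Rounded Gaussian kernel: mass of N(mu, sd^2) between a_j and a_{j+1}."""
    return norm_cdf(threshold(j + 1), mu, sd) - norm_cdf(threshold(j), mu, sd)


def rg_mixture_pmf(j, weights, mus, sds):
    """Finite mixture of rounded Gaussian kernels evaluated at count j."""
    return sum(w * rg_pmf(j, m, s) for w, m, s in zip(weights, mus, sds))


# Hypothetical two-component mixture, chosen only for illustration.
weights, mus, sds = [0.8, 0.2], [13.5, 11.0], [1.5, 2.5]
pmf = [rg_mixture_pmf(j, weights, mus, sds) for j in range(60)]
mean = sum(j * p for j, p in enumerate(pmf))
var = sum((j - mean) ** 2 * p for j, p in enumerate(pmf))
print(round(mean, 2), round(var, 2))
```

Because each kernel has its own variance parameter, the variance of the count distribution is no longer tied to its mean.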


Rounded Gaussian Mixture

[Figure, two panels: (1) estimated pmf (blue) and empirical pmf (black), with "pmf" on the vertical axis over counts from 5 to 20; (2) change in the number of implants plotted against quantile, from 0 to 1.]


Animal abundance


[Figure: histogram of animal abundance; horizontal axis "animal abundance" from 0 to 250.]


Another reason to avoid Poisson mixtures

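• We compare

$$p(\cdot\,; P) = \int \mathrm{RG}(\cdot\,; \mu, \tau^{-1})\, dP(\mu, \tau^{-1}) \quad \text{and} \quad p(\cdot\,; P) = \int \mathrm{Poi}(\cdot\,; \lambda)\, dP(\lambda).$$

• The DP is highly sensitive to the prior specification of $\alpha$, which has a major impact on the total number of clusters.

• A more general nonparametric prior, such as the Pitman-Yor (PY) process, can lead to more accurate estimates, especially for the number of mixture components (Ishwaran and James, 2001; Lijoi et al. 2005, 2007):

$$P \sim \mathrm{PY}(\theta, \sigma, P_0).$$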
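How $\sigma$ changes the prior number of clusters can be sketched with the predictive rule of the PY process: with $i$ observations and $K_i$ clusters, the next observation opens a new cluster with probability $(\theta + \sigma K_i)/(\theta + i)$. Taking expectations gives an exact recursion for $E(K_n)$. The values of `n` and `theta` below are hypothetical.

```python
def expected_clusters(n, theta, sigma=0.0):
    """Prior E(K_n) under PY(theta, sigma); sigma = 0 gives the DP with alpha = theta."""
    ek = 0.0
    for i in range(n):
        # With i observations so far, the next one opens a new cluster
        # with probability (theta + sigma * K_i) / (theta + i).
        ek += (theta + sigma * ek) / (theta + i)
    return ek


n = 100
for sigma in (0.0, 0.25, 0.5, 0.75):
    print(sigma, round(expected_clusters(n, theta=1.0, sigma=sigma), 2))
```

For $\sigma = 0$ the recursion reduces to the DP formula $\sum_{i=1}^{n} \alpha/(\alpha + i - 1)$ with $\alpha = \theta$. The extra parameter $\sigma$ lets the prior on $K_n$ be tuned separately from $\theta$.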


Improving rounded mixtures (Canale and Prünster, 2016)

Figure: Posterior mean number of distinct clusters $E(K_n \mid -)$ for the Okaloosa darters dataset, under the Poisson mixture and the RG mixture, for $\sigma = 0, 0.25, 0.5, 0.75$ and different prior expected numbers of components $E(K_n)$. Two panels, each plotting $E(K_n \mid -)$ (from 10 to 50) against $\sigma$.


Marketing application

• We focus on data from 2,050 SIM cards belonging to customers with a prepaid contract, observed in a single period;

• $y_i = (y_{i1}, \dots, y_{i5})$, where $y_{i1}$, $y_{i2}$, and $y_{i3}$ are the numbers of outgoing calls to fixed numbers, to mobile numbers of competing operators, and to mobile numbers of the same operator, while $y_{i4}$ and $y_{i5}$ are the total numbers of MMS and SMS sent.


• The RK (rounded kernel) method can be adapted to the multivariate context;

• it is able to characterize the entire joint distribution;

• the use of underlying Gaussian mixtures allows the joint modeling of variables on different measurement scales (continuous, categorical, binary, and counts); see also Canale and Dunson (2015);

• we can do inference on different objects: the whole multivariate density, the marginals, the conditionals;

• there are not many alternatives for modeling a multivariate count distribution!


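Each of the previous concepts can be generalized to its multivariate counterpart:

$$\Pr(y = J) = \int \mathrm{RK}_p(J; \Theta)\, dP(\Theta), \qquad P \sim \mathrm{DP}(\alpha P_0),$$

with $J \in \mathbb{N}^p$ and

$$\mathrm{RK}_p(J; \Theta) = \int_{A_J} K(y^*; \Theta)\, dy^*,$$

where the sets $A_J = \{y^* : a_{1,J_1} \le y^*_1 < a_{1,J_1+1}, \dots, a_{p,J_p} \le y^*_p < a_{p,J_p+1}\}$ define a disjoint partition of the sample space of $y^*$.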
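The multivariate rounding can be read generatively: draw a latent vector $y^*$ from a mixture of Gaussian kernels, then report the index $J$ of the set $A_J$ that contains it. The sketch below does this for $p = 2$. The thresholds ($a_{k,0} = -\infty$, $a_{k,j} = j - 1$ for $j \ge 1$) and all mixture parameters are assumptions for illustration, not values from the talk.

```python
import random
from math import floor, sqrt


def round_latent(y_star):
    """Index J of the set A_J containing y_star, under the assumed thresholds."""
    return tuple(0 if v < 0 else floor(v) + 1 for v in y_star)


def sample_bivariate_counts(n, weights, means, sds, rhos, rng):
    """Draw n count pairs from a rounded mixture of bivariate Gaussians."""
    out = []
    for _ in range(n):
        h = rng.choices(range(len(weights)), weights=weights)[0]
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        r = rhos[h]
        # Correlated latent pair built from two independent standard normals.
        y1 = means[h][0] + sds[h][0] * z1
        y2 = means[h][1] + sds[h][1] * (r * z1 + sqrt(1 - r * r) * z2)
        out.append(round_latent((y1, y2)))
    return out


rng = random.Random(0)
counts = sample_bivariate_counts(
    5, weights=[0.6, 0.4], means=[(3.0, 8.0), (12.0, 2.0)],
    sds=[(1.0, 2.0), (3.0, 1.0)], rhos=[0.5, -0.3], rng=rng,
)
print(counts)
```

Dependence between the counts is inherited from the correlation of the latent Gaussian components, which is what allows the joint distribution to be modeled as a whole.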


Marketing application

• We focus on forecasting $y_{i1}$ using data on $y_{i2}, \dots, y_{i5}$;

• we split the dataset into a training and a test subset;

• the approach is compared with prediction under a generalized additive model (GAM) with spline smoothing functions;

• the rounded mixture approach attains a smaller out-of-sample MAD (8.08 vs 8.76);

• side predictions, e.g. $\mathrm{pr}(y_1 = 0)$ or $\mathrm{pr}(y_1 > T)$, are accommodated automatically.
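Side predictions come for free because the method returns a whole predictive pmf rather than a point forecast. A minimal sketch, using a hypothetical predictive pmf stored as a list indexed by count, and reading MAD as mean absolute deviation (an assumption, since the slide does not expand the acronym):

```python
def prob_zero(pmf):
    """pr(y = 0) for a pmf stored as a list indexed by count."""
    return pmf[0]


def prob_exceeds(pmf, threshold):
    """pr(y > T) for a pmf stored as a list indexed by count."""
    return sum(pmf[threshold + 1:])


def mean_absolute_deviation(y_true, y_pred):
    """Average absolute gap between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical predictive pmf over the counts 0..4, for illustration only.
pmf = [0.10, 0.20, 0.30, 0.25, 0.15]
print(prob_zero(pmf), prob_exceeds(pmf, 2))
print(mean_absolute_deviation([3, 0, 5], [2, 1, 5]))
```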


Human reproductive functioning

• We now focus on female reproductive functioning;

• the data refer to basal body temperature (bbt), recorded across the menstrual cycle;

• bbt curves follow a characteristic trajectory: during the follicular phase of the cycle, leading up to ovulation, bbt values tend to be low, while after ovulation bbt rises progressively before dropping prior to the next cycle.


bbt curves


bbt curves model

• We model the data assuming

$$f_{ij}(t) = \eta_{ij}(t) + \varepsilon_{ijt},$$

where $f_{ij}(t)$ is the bbt recorded for cycle $j$ of woman $i$ at day $t$, and $\eta_{ij}$ is the underlying bbt curve. The curve is observed with random noise $\varepsilon_{ijt}$.


bbt curves: abnormal cycles


bbt curves mixture model

• We use the mixture model

$$\eta_{ij} \sim P, \qquad P = \sum_{h=1}^{\infty} \pi_h \delta_{\eta_h},$$

with a stick-breaking prior on the weights $\pi$ and atoms $\eta_h \sim P_0$ drawn from a suitable base measure (note that the atoms here are curves);

• but the regular shape of the bbt curve of a healthy woman is well known a priori;

• we are Bayesians: we can include this prior information!
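A truncated stick-breaking construction with curve-valued atoms can be sketched as follows. The DP-type $\mathrm{Beta}(1, \alpha)$ sticks, the truncation level, and the logistic-shaped family standing in for $P_0$ (including the temperature values) are all assumptions made for illustration; the talk does not specify them.

```python
import random
from math import exp


def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking: V_h ~ Beta(1, alpha), pi_h = V_h * prod_{l<h} (1 - V_l)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last weight absorbs the leftover stick
    return weights


def draw_curve_atom(rng):
    """Hypothetical P0: logistic-shaped curves with a random shift day and rise."""
    shift, rise = rng.uniform(10, 18), rng.uniform(0.2, 0.6)
    return lambda t: 36.4 + rise / (1.0 + exp(-(t - shift)))


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=25, rng=rng)
atoms = [draw_curve_atom(rng) for _ in weights]
eta = rng.choices(atoms, weights=weights)[0]  # one curve eta_ij drawn from P
print(round(eta(5), 2), round(eta(25), 2))
```

Since $P$ is discrete, several cycles can share exactly the same curve, which is what produces clusters of cycles.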


Atomic base measure

• It is sufficient to assume that

$$P = w\,\delta_{\eta_0} + (1 - w) \sum_{h=1}^{\infty} \pi_h \delta_{\eta_h},$$

with $\eta_0$ representing the S-shaped trajectory known a priori;

• there are technical challenges in assuming an atomic base measure, which we are trying to solve (Canale, Nipoti, Lijoi and Prünster, 20??).
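The effect of the atom at $\eta_0$ can be sketched by sampling from $P$ directly: with probability $w$ a cycle follows the known trajectory exactly, otherwise it is drawn from the remaining mixture. Labels stand in for curves, and the weight `w`, the finite list of atoms, and their weights are hypothetical.

```python
import random


def draw_from_spiked_mixture(w, eta0, weights, atoms, rng):
    """Draw from P = w * delta_{eta0} + (1 - w) * sum_h pi_h * delta_{eta_h}."""
    if rng.random() < w:
        return eta0
    return rng.choices(atoms, weights=weights)[0]


# Hypothetical labels stand in for curves to keep the sketch short.
rng = random.Random(1)
eta0 = "regular S-shape"
atoms, weights = ["abnormal A", "abnormal B", "abnormal C"], [0.6, 0.3, 0.1]
draws = [draw_from_spiked_mixture(0.7, eta0, weights, atoms, rng) for _ in range(10_000)]
share_regular = draws.count(eta0) / len(draws)
print(share_regular)
```

A positive share of cycles coincides exactly with $\eta_0$, which would have probability zero if $\eta_0$ were merely one draw from a non-atomic base measure.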


Image reconstruction

(Wang, Canale, and Dunson 2016)


Brain-network data analysis

(Durante, Canale, and Dunson 201?)


Demand-supply model

(Canale and Ruggiero 2016)


To conclude

• BNP provides a set of useful tools for challenging applications

• the Bayesian approach allows us to include prior information

• the nonparametric approach lets the data speak

• today there are thousands of situations generating interesting data with complex structures

• “The best thing about being a statistician is that you get to play in everyone else’s backyard.” (John Tukey)


That’s all folks!

Thanks for your attention!