A Statistician's View of Upcoming Grand Challenges

A Statistician's View of Upcoming Grand Challenges. Alanna Connors. Imputed by Xiao-Li Meng. Joint work with Alex Blocker, Paul Baines, Vinay Kashyap, Pavlos Protopapas, and Andreas Zezas (all members of CBAS, a.k.a. CHASC)

Upload: louis-davidson

Post on 31-Dec-2015

TRANSCRIPT

Page 1: A Statistician's View  of Upcoming Grand Challenges

A Statistician's View of Upcoming Grand Challenges

Alanna Connors

Imputed by Xiao-Li Meng
Joint work with Alex Blocker, Paul Baines, Vinay Kashyap, Pavlos Protopapas, and Andreas Zezas

(all members of CBAS, a.k.a. CHASC)

Page 2: A Statistician's View  of Upcoming Grand Challenges

I. Assessing Uncertainty When We Have No Idea What We Are Doing!

OK, maybe we know a little bit or a little piece of it
Genuine replications are NOT possible
Create pseudo-replications:
Bootstrap (the Green Book by Efron and Tibshirani, 1994)
Posterior Predictive Replications (Rubin, 1984, Annals of Statistics; Gelman, Meng and Stern, Statistica Sinica, 1996)
Data Perturbation: taking the derivative with respect to the data
“On Measuring and Correcting the Effects of Data Mining and Model Selection” (J. Ye, 1998, 120-131, J. of American Statistical Association)
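As a rough illustration of the first kind of pseudo-replication listed above, here is a minimal sketch of the nonparametric bootstrap. The data values, the choice of the median as the statistic, and the function name `bootstrap_replicates` are assumptions made up for this example; they do not come from the talk.

```python
import random
import statistics

def bootstrap_replicates(data, statistic, n_rep=2000, seed=0):
    """Return n_rep values of `statistic`, each computed on a resampled copy of data."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_rep):
        # Resample n points with replacement: one pseudo-replication of the data set.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(resample))
    return reps

# Hypothetical measurements, invented for illustration only.
data = [2.1, 3.4, 1.9, 5.6, 2.8, 3.1, 4.4, 2.2, 3.9, 3.0]
reps = bootstrap_replicates(data, statistics.median)
boot_se = statistics.stdev(reps)  # bootstrap standard error of the median
```

The spread of `reps` stands in for the spread we would see across genuine replications, which is exactly what is unavailable when there is only one data set.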

Page 3: A Statistician's View  of Upcoming Grand Challenges

II. “Black Box” Inference and Computation

The likelihood is given as a “black box” (either as a computer routine or a look-up table);
The prior is given the same way or we can simulate from a prior;
And we want samples from the Bayesian posterior.
Easy, right? Use Metropolis-Hastings, with the prior as the proposal and the likelihood ratio as the M-H ratio …
Useless, since the posterior typically will be quite different (we at least hope!) from the prior, so the Markov chain won’t converge/mix, especially for high-dimensional problems …
So what do we do?
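To make the "easy but useless" recipe concrete, the following sketch runs an independence Metropolis-Hastings sampler that proposes from the prior, so the acceptance ratio reduces to a likelihood ratio. The Gaussian stand-in for the black-box likelihood, the N(0, 1) prior, and all numeric settings are assumptions chosen only to show how the acceptance rate falls as the dimension grows.

```python
import math
import random

def log_likelihood(theta, center=1.0, scale=0.5):
    """Stand-in for the 'black box': an independent Gaussian in every coordinate."""
    return -0.5 * sum(((t - center) / scale) ** 2 for t in theta)

def prior_proposal_mh(dim, n_iter=5000, seed=1):
    """Independence M-H with N(0, 1) prior draws as proposals; returns the acceptance rate."""
    rng = random.Random(seed)
    current = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    current_ll = log_likelihood(current)
    accepted = 0
    for _ in range(n_iter):
        proposal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proposal_ll = log_likelihood(proposal)
        # With the prior as proposal, the M-H ratio is just the likelihood ratio.
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < proposal_ll - current_ll:
            current, current_ll = proposal, proposal_ll
            accepted += 1
    return accepted / n_iter

rate_1d = prior_proposal_mh(dim=1)
rate_10d = prior_proposal_mh(dim=10)
```

Even with a likelihood only mildly more concentrated than the prior, `rate_10d` should come out far below `rate_1d`: prior draws almost never land where the posterior mass is.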

Page 4: A Statistician's View  of Upcoming Grand Challenges

We need to adaptively blend many advanced methods

Parallel Tempering (Geyer, 1991, Proc. 23rd Symposium of CS & Stat Interface)
Equi-energy Sampling (Kou, Zhou & Wong, 2006, with discussions, Annals of Statistics)
Ancillarity-Sufficiency Interweaving Strategy (ASIS) (Yu and Meng, 2010, with discussions, J. Computational and Graphical Statistics)
AND, we need to know how to “cut corners” …
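For readers unfamiliar with the first method on this list, a minimal parallel tempering sketch follows. It is a textbook-style version under stated assumptions, not the implementation used in the talk: the bimodal target, the temperature ladder, and the proposal scales are all hypothetical.

```python
import math
import random

def log_target(x):
    """Hypothetical bimodal target: equal mixture of N(-3, 0.7^2) and N(3, 0.7^2)."""
    a = -0.5 * ((x + 3.0) / 0.7) ** 2
    b = -0.5 * ((x - 3.0) / 0.7) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp

def parallel_tempering(temps=(1.0, 3.0, 9.0, 27.0), n_iter=20000, seed=2):
    """Run one chain per temperature and return the draws of the T = 1 chain."""
    rng = random.Random(seed)
    states = [-3.0 for _ in temps]  # every chain starts in the left mode
    cold_samples = []
    for _ in range(n_iter):
        # Within-chain random-walk Metropolis on the tempered target p(x)^(1/T).
        for k, temp in enumerate(temps):
            proposal = states[k] + rng.gauss(0.0, math.sqrt(temp))
            log_ratio = (log_target(proposal) - log_target(states[k])) / temp
            if math.log(1.0 - rng.random()) < log_ratio:
                states[k] = proposal
        # Propose swapping the states of one random adjacent pair of temperatures.
        k = rng.randrange(len(temps) - 1)
        log_swap = (1.0 / temps[k] - 1.0 / temps[k + 1]) * (
            log_target(states[k + 1]) - log_target(states[k]))
        if math.log(1.0 - rng.random()) < log_swap:
            states[k], states[k + 1] = states[k + 1], states[k]
        cold_samples.append(states[0])
    return cold_samples

cold = parallel_tempering()
frac_right = sum(x > 0 for x in cold) / len(cold)  # share of cold draws in the right mode
```

The hot chains cross between modes easily, and the swap moves pass those crossings down the ladder so that the cold chain can visit both modes.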

Page 5: A Statistician's View  of Upcoming Grand Challenges

Example: Color-Magnitude Diagrams
(Baines, Zezas, Kashyap)

Goal: Estimate the mass, age (and possibly metallicity) of a cluster of stars
Parameters: Mass, Age, Metallicity
Data: Photometric data
Theory/Likelihood: Isochrones (Tables)
The isochrones connect the scientifically interesting parameters to the observed data via a complicated mapping

Page 6: A Statistician's View  of Upcoming Grand Challenges

A Colorful But Ugly Likelihood

Page 7: A Statistician's View  of Upcoming Grand Challenges

We want an Equi-Energy (EE) Sampler …

Jump between points of equal density/probability (or “energy”)
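The jump described above can be sketched as follows: previously visited points are sorted into "energy rings", and a jump proposes a stored point from the same ring as the current state. This shows only the proposal step; the full sampler of Kou, Zhou & Wong also runs chains at several temperatures and applies an acceptance test, both omitted here. The double-well `energy` function, the ring boundaries, and the function names are hypothetical.

```python
import bisect
import random

def energy(x):
    """Hypothetical double-well energy with minima at x = -3 and x = 3."""
    return min((x + 3.0) ** 2, (x - 3.0) ** 2) / 2.0

def build_energy_rings(samples, energy_fn, levels):
    """Group samples into rings; ring j holds energies in [levels[j-1], levels[j])."""
    rings = [[] for _ in range(len(levels) + 1)]
    for x in samples:
        rings[bisect.bisect_right(levels, energy_fn(x))].append(x)
    return rings

def equi_energy_proposal(x, rings, energy_fn, levels, rng):
    """Propose a stored point from the same energy ring as x (None if the ring is empty)."""
    ring = rings[bisect.bisect_right(levels, energy_fn(x))]
    return rng.choice(ring) if ring else None

# Pretend these grid points were visited earlier by a higher-temperature chain.
samples = [i / 10.0 for i in range(-50, 51)]
levels = [0.5, 2.0]  # assumed energy cut-offs separating the rings
rings = build_energy_rings(samples, energy, levels)
rng = random.Random(3)
partner = equi_energy_proposal(-3.2, rings, energy, levels, rng)
```

Because the lowest ring contains points near both wells, a state sitting near -3 can be offered a partner near +3 at comparable energy, which is how the sampler moves between separated modes.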

Page 8: A Statistician's View  of Upcoming Grand Challenges

Approximate EE by “Equi-Expectation”

Implementing the Equi-Energy Sampler in high dimensions is impractical
Idea: Use the structure of the problem to construct a low-dimensional and efficient approximation to EE
For Gaussian-like data, “Equi-Expectation” clusters approximate “Equi-Energy” clusters
“Equi-Expectation” clusters are data (e.g. star) independent, and hence require only a one-time pre-MCMC step

Page 9: A Statistician's View  of Upcoming Grand Challenges

The original parameter space (e.g., magnitude & color)

Page 10: A Statistician's View  of Upcoming Grand Challenges

The “rocking boat” represents the “Expectation Space”

Page 11: A Statistician's View  of Upcoming Grand Challenges

Clustering on the “Expectation Space”

Page 12: A Statistician's View  of Upcoming Grand Challenges

Creating approximate “equi-energy” clusters on the original space

Page 13: A Statistician's View  of Upcoming Grand Challenges

III. Many Frustrations!!!

Outliers, really extreme ones!
Large, long-tailed measurement errors
Strong dependence
Non-linear trends (or whatever you want to call them)
Confounding signals (e.g., quasi-periodic)
High dimensions
Too much data
Too many variables (large p, small n)
Too little data (there is always ONE observable universe and ONE entire history!)
Too little funding, too little time …

Page 14: A Statistician's View  of Upcoming Grand Challenges

Example: Event Detection in Time Series
(Alex Blocker and Pavlos Protopapas)

Page 15: A Statistician's View  of Upcoming Grand Challenges

Use all your tools, but in the right order!

Do some pre-processing (e.g., scan statistics) to reduce computational burden, but with GREAT CAUTION
Be aware of the artifacts innocent-looking methods may introduce (e.g., spurious correlations); always try on test data first!
Let more rigorous statistical models take care of complications first whenever the computation is feasible
Take advantage of more ad-hoc methods when the signal is relatively strong and the computational gain is great
Don’t forget to do model checking and uncertainty assessment via pseudo-replications!

Page 16: A Statistician's View  of Upcoming Grand Challenges

A two-stage approach for event detection

We fit a statistical model to separate low-frequency trends L, medium-frequency “candidates” M (event or quasi-periodic), and white noise N; we use a t-model with small df (e.g., 3) to deal with outliers:

$$Y(t) = \sum_{i=1}^{I} a_i M_i(t) + \sum_{j} b_j L_j(t) + N(t)$$

Once the data are reduced to a cleaner (e.g., outliers and non-linear trends removed) and lower-dimensional feature vector $\{a_i,\ i = 1, \ldots, I\}$, we can use a classifier to separate isolated events from quasi-periodic signals by training on previously identified light curves from each category.
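To see why a t-model with small df copes with extreme outliers, here is a deliberately simplified sketch: it fits only a location and a scale by the standard EM reweighting for Student-t errors, rather than the full regression on the M and L components above. The data values and the function name `t_location_em` are assumptions for illustration.

```python
import statistics

def t_location_em(y, df=3.0, n_iter=200):
    """EM estimate of the location and squared scale under Student-t errors with fixed df."""
    mu = statistics.median(y)
    sigma2 = sum((v - mu) ** 2 for v in y) / len(y)
    for _ in range(n_iter):
        # E-step: points far from mu (relative to the scale) receive small weights.
        w = [(df + 1.0) / (df + (v - mu) ** 2 / sigma2) for v in y]
        # M-step: weighted mean, then weighted squared scale.
        mu = sum(wi * v for wi, v in zip(w, y)) / sum(w)
        sigma2 = sum(wi * (v - mu) ** 2 for wi, v in zip(w, y)) / len(y)
    return mu, sigma2, w

# Hypothetical flux values clustered near 10, plus one extreme outlier.
y = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 60.0]
mu_t, sigma2_t, weights = t_location_em(y)
mu_gauss = statistics.mean(y)  # the Gaussian (least-squares) answer, for comparison
```

The sample mean is dragged well away from 10 by the single outlier, while the t-based estimate stays near the bulk of the data because the outlier's weight shrinks toward zero.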

Page 17: A Statistician's View  of Upcoming Grand Challenges

Cutting Corners: even the simple Haar wavelets might do the job …
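As a reminder of how simple the Haar basis is, the sketch below computes a full orthonormal Haar decomposition of a short series. The toy light curve with a brief dip is invented for illustration, and how the talk's analysis actually used the coefficients is not specified here.

```python
import math

def haar_transform(signal):
    """Full orthonormal Haar decomposition of a signal whose length is a power of 2."""
    approx = list(signal)
    details = []  # detail coefficients, finest scale first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2.0) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2.0) for a, b in pairs]
    return approx[0], details

# Hypothetical light curve: flat baseline with a short dip at samples 4 and 5.
curve = [1.0, 1.0, 1.0, 1.0, 0.2, 0.2, 1.0, 1.0]
overall, details = haar_transform(curve)
```

Here the finest-scale details are all zero and the dip shows up as a single non-zero coefficient at the next scale, so a localized event is captured by very few numbers.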

Page 18: A Statistician's View  of Upcoming Grand Challenges

The Grandest Challenge of All …

We need many more future talents who are passionate about quantitative sciences
And who will stay away from Wall Street regardless of the economy!
So what do we do?
Better teaching and training!
“Desired and Feared – What Do We Do Now and Over the Next 50 Years?” Am. Stat., 2009, Aug.
“Real-life statistics: Your Chance for Happiness (or Misery)” Amstat News, 2009, Sept.

Page 19: A Statistician's View  of Upcoming Grand Challenges


Page 20: A Statistician's View  of Upcoming Grand Challenges

They are intoxicated by …