Long Live (Big Data-Fied) Statistics!

Norm Matloff
University of California at Davis

Joint Statistical Meetings
Montreal, August 4, 2013


Prolog

Where are we with Big Data?

• Role of statistics?

• Role of parallel computation?

• Interactions between the two?


Where are we?

Attitudes and worries:

• “With Big Data, you don’t need inference methods.”

• “With Machine Learning, you don’t need statistics.”

• Stat community left out of the Big Data revolution (e.g. Amstat News, June 2013).


Yes, stat is still needed!

Actually, stat is needed more than ever, e.g.:

• Inference is an issue even for big n, once one considers subsets, where n becomes smaller. The same holds if we do not have p << n.

• Almost all machine learning techniques are revivals of old stat methods. And if you don’t understand stat, you won’t be able to use ML methods effectively.

• An old stat technique, nonparametric curve estimation, is now more useful than ever, for Big Data graphics.

• The Curse of Dimensionality hasn’t gone away. It is impossible to understand without stat.
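A rough illustration of the first point, as a sketch only: the standard error of a sample mean shrinks like 1/sqrt(n), so a subgroup carved out of a huge data set can still have a wide confidence interval. The sample sizes and the unit standard deviation below are made-up values, and `standard_error` is a hypothetical helper name.

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean, given standard deviation sd and n cases."""
    return sd / math.sqrt(n)

# Hypothetical sizes: a very large data set and one narrowly defined subset of it.
full_n = 10_000_000
subset_n = 400
sd = 1.0  # assumed standard deviation of the variable of interest

se_full = standard_error(sd, full_n)
se_subset = standard_error(sd, subset_n)

# Approximate 95% confidence interval half-widths (normal approximation).
half_width_full = 1.96 * se_full
half_width_subset = 1.96 * se_subset

print(f"full data: n={full_n}, half-width={half_width_full:.5f}")
print(f"subset:    n={subset_n}, half-width={half_width_subset:.5f}")
```

With these assumed numbers the subset's interval is more than a hundred times wider than the full-data interval, which is why inference does not become optional with big n.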


Setting

Consider the classical (though not universal) data format:

• n observations/cases/instances/...

• p variables/features/attributes/...

• Assumed i.i.d.

(Here I’m trying to include terminology from the nonstat communities.)

Does “Big” Data mean big n or big p or both?

This talk will contain one “big n” section and two “big p” sections.


Part I: Big-n Problem


Big-n can be handled

Big n can generally be handled:

• Many (though certainly not all) computations are “additive,” thus “embarrassingly parallel.”

• Thus amenable to parallel computation, especially distributed data, MapReduce, etc.

• “Chunks averaging method” (CAM) (Fan et al, 2007; Matloff, 2010; etc.) can turn most statistical computations into embarrassingly parallel ones. (“Software alchemy.”)
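To make “additive” concrete, here is a minimal sketch of a mean and sample variance computed from per-chunk summaries. Each chunk is reduced to (count, sum, sum of squares), and the summaries simply add, which is the map/reduce pattern. The function names and the toy data are hypothetical, and the chunks are processed serially here; in a real setting each chunk could live on a different worker.

```python
import statistics
from functools import reduce

def chunk_summary(chunk):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Reduce step: summaries are additive, so combining is elementwise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(chunks):
    """Mean and sample variance from the combined chunk summaries."""
    n, s, ss = reduce(combine, map(chunk_summary, chunks))
    mean = s / n
    # Sum-of-squares formula: fine for this toy data, but it can lose
    # precision on data with a large mean relative to its spread.
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance

# Toy data split into four equal chunks (assumed layout, for illustration).
data = [float(i % 7) for i in range(1000)]
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]

mean, variance = mean_and_variance(chunks)

# Reference values computed on the undivided data.
ref_mean = statistics.mean(data)
ref_variance = statistics.variance(data)
print(mean, ref_mean)
print(variance, ref_variance)
```

No chunk ever needs to see another chunk's observations; only the three-number summaries travel.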


Chunk Averaging Method

• Key point: CAM converts non-embarrassingly parallel algorithms to additive ones that are statistically equivalent (same standard errors).

• Example: Regression.

  • Break observations into chunks.
  • Fit regression equation to each chunk.
  • Average the results.

• Produces statistically equivalent results for large n.

• Applies to essentially any i.i.d.-based method, e.g. quantile regression, hazard function estimation, tree methods, etc.

• Superlinear speedup, e.g. 5.31X for quantile regression on 4 threads. Can be faster even on just one core.
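The three regression steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation from the cited papers: the data are simulated, the model is a one-predictor least-squares line, the chunks are equal-sized and fit serially, and `fit_line` and `chunk_average_fit` are hypothetical helper names.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def chunk_average_fit(xs, ys, n_chunks):
    """Fit each equal-sized chunk separately, then average the coefficients."""
    size = len(xs) // n_chunks  # assumes len(xs) is divisible by n_chunks
    fits = [
        fit_line(xs[i * size:(i + 1) * size], ys[i * size:(i + 1) * size])
        for i in range(n_chunks)
    ]
    a = sum(f[0] for f in fits) / n_chunks
    b = sum(f[1] for f in fits) / n_chunks
    return a, b

# Simulated i.i.d. data with assumed true intercept 1 and slope 2.
rng = random.Random(0)
n = 20000
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]

full = fit_line(xs, ys)              # fit on all observations at once
cam = chunk_average_fit(xs, ys, 4)   # chunk, fit, average

print("full-data fit (a, b):", full)
print("chunk-averaged (a, b):", cam)
```

The per-chunk fits are independent of one another, so in practice each could run on its own thread or machine; only the final averaging step needs the results together.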


Part II: A New Graphical Approach to Big-p Problem


Graphing Lots of Variables

Issues:

• Can deal (a little bit) better with big-p if we can display lots of variables on the same graph.

• Can’t display all at once, but try to get at least several.

• Problems:

  • Displaying > 2 variables on a 2-dimensional device.
  • “Black screen problem”: with big n, at least parts of the screen become solid black.
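The black screen problem is easy to reproduce in a small simulation. The sketch below scatters n points uniformly over a 100 x 100 pixel grid and reports the fraction of pixels hit at least once. The grid size, the uniform distribution, and the function name `occupied_fraction` are assumptions chosen for illustration; real data would be clustered, blackening dense regions even sooner.

```python
import random

def occupied_fraction(n_points, width=100, height=100, seed=0):
    """Fraction of pixels hit by at least one of n_points uniform random points."""
    rng = random.Random(seed)
    lit = set()
    for _ in range(n_points):
        lit.add((rng.randrange(width), rng.randrange(height)))
    return len(lit) / (width * height)

# Hypothetical sample sizes, from sparse to far more points than pixels.
fractions = {n: occupied_fraction(n) for n in (1_000, 10_000, 100_000)}

for n, frac in fractions.items():
    print(f"n={n:>7}: {frac:.3f} of pixels occupied")
```

Under the uniform assumption the expected occupied fraction is 1 - (1 - 1/m)^n for m pixels, so once n is several times m nearly every pixel is filled and the plot carries no visible structure.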

Graphing Lots of Variables

Some existing methods:

• Grand tours: Show a sequence of rotations and projections, e.g. tourr in R. Nice, visually appealing. But hard to discern exact relations, and suffers from the black-screen problem.

• Parallel coordinates: Draw one vertical axis for each variable. Draw a set of connecting lines for each data point. Hard to understand noncontiguous axes; also suffers from the black-screen problem.
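To make the parallel-coordinates construction concrete, here is a small sketch of the geometry only (no drawing). It turns each data point into one polyline with a vertex on every axis. Rescaling each variable to [0, 1] is an assumption made here so that all axes share a vertical range; the function name is hypothetical.

```python
def parallel_coordinates(rows):
    """Map each data point to a polyline [(axis_index, scaled_value), ...]."""
    p = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(p)]
    highs = [max(r[j] for r in rows) for j in range(p)]
    lines = []
    for r in rows:
        line = []
        for j in range(p):
            span = highs[j] - lows[j]
            # Rescale to [0, 1]; a constant column is placed mid-axis.
            y = (r[j] - lows[j]) / span if span > 0 else 0.5
            line.append((j, y))
        lines.append(line)
    return lines

if __name__ == "__main__":
    for line in parallel_coordinates([(1, 10, 100), (2, 20, 300), (3, 15, 200)]):
        print(line)
```

With n data points there are n polylines, each crossing every axis, which is why the display fills in for big n. Relations are only directly visible between axes that happen to be drawn next to each other.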

Statistics to the rescue!

An obvious solution to the black-screen problem: nonparametric curve estimation.

Example: Scatter plots.

• Ordinary plot would fill the screen.

• Solution: Draw the nonparametric 2-dim. density estimate instead.

• That makes 3 dimensions, but code the third dimension (density height) via color.

• E.g. smoothScatter() in R.
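A minimal sketch of the idea, under stated assumptions: the density estimate here is a plain 2-dim. histogram (the R function referred to above smooths with a kernel estimate instead), and "color" is stood in for by ASCII shades. Function names and the bin count are illustrative choices.

```python
import random

def binned_density(xs, ys, bins=20):
    """2-dim. histogram density estimate on a bins x bins grid.

    Returns (density, dx, dy); density[j][i] is the estimate in row j (Y), column i (X).
    """
    x_lo, x_hi = min(xs), max(xs)
    y_lo, y_hi = min(ys), max(ys)
    dx = (x_hi - x_lo) / bins or 1.0
    dy = (y_hi - y_lo) / bins or 1.0
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        i = min(int((x - x_lo) / dx), bins - 1)
        j = min(int((y - y_lo) / dy), bins - 1)
        counts[j][i] += 1
    n = len(xs)
    # Divide by n * cell area so the estimate integrates to 1.
    return [[c / (n * dx * dy) for c in row] for row in counts], dx, dy

def shade(density, levels=" .:-=+*#%@"):
    """Code density height as a shade; rows are printed with large Y on top."""
    top = max(max(row) for row in density) or 1.0
    rows = []
    for row in reversed(density):
        rows.append("".join(
            levels[min(int(v / top * len(levels)), len(levels) - 1)] for v in row))
    return "\n".join(rows)

if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0, 1) for _ in range(50_000)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    density, _, _ = binned_density(xs, ys)
    print(shade(density))
```

However large n becomes, the picture keeps a fixed number of cells, and more data only sharpens the estimate rather than blackening the screen.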

Displaying 3 Vars. in 2 Dims.

The smoothScatter() example actually shows how to display 3 variables in 2 dimensions:

Say we have variables X, Y, Z.

• Plot the regression function of Z, color coded, against X and Y.

• Regression function: m(s,t) = E(Z | X = s, Y = t) (i.e. general, not assuming a parametric model).

• Use nonparametric estimation, e.g. nearest-neighbor.

• 3 vars. in 2 dims.! (No perspective plotting.)
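The nearest-neighbor estimate of m(s,t) can be sketched in a few lines. This is an illustrative version, not the code behind the talk: it assumes X and Y are on comparable scales (plain Euclidean distance), and k = 25 is an arbitrary choice.

```python
import random

def knn_regression(data, s, t, k=25):
    """Estimate m(s, t) = E(Z | X = s, Y = t) as the mean Z of the k nearest points.

    data is a list of (x, y, z) rows.
    """
    nearest = sorted(data, key=lambda r: (r[0] - s) ** 2 + (r[1] - t) ** 2)[:k]
    return sum(r[2] for r in nearest) / len(nearest)

def regression_grid(data, s_values, t_values, k=25):
    """Evaluate the estimate on a grid; each cell value would then be mapped to a color."""
    return [[knn_regression(data, s, t, k) for s in s_values] for t in t_values]

if __name__ == "__main__":
    rng = random.Random(1)
    data = []
    for _ in range(2000):
        x, y = rng.random(), rng.random()
        data.append((x, y, x + y + rng.gauss(0, 0.1)))  # synthetic: m(s, t) = s + t
    grid = regression_grid(data, [0.2, 0.5, 0.8], [0.2, 0.5, 0.8])
    for row in grid:
        print([round(v, 2) for v in row])
```

The grid has one value per (s, t) cell, so the plot stays 2-dimensional and Z enters only through the color of each cell.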

Proposed Boundary Method

I introduce here a new approach to plotting multiple variables in 2 dims., based on “boundary curves.”

First, again consider 3 variables, X, Y and Z.

• For user-chosen b, the boundary is the set

{(s, t) : E(Z | X = s, Y = t) = b}    (1)

• User might set b = E(Z) (overall, unconditional mean).

• Plot an estimate of the boundary curve.

• Displaying 3 vars. in 2 dims.
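One way to estimate the boundary set in (1) is sketched below. This is a hedged illustration rather than the method's actual implementation: it plugs in a nearest-neighbor estimate of the regression function, scans t on a grid for each s, and records the first place the estimate crosses b (so it assumes at most one crossing of interest per s). All names and the choice k = 50 are hypothetical.

```python
import random

def knn_mean(data, s, t, k=50):
    """Nearest-neighbor estimate of E(Z | X = s, Y = t); data rows are (x, y, z)."""
    nearest = sorted(data, key=lambda r: (r[0] - s) ** 2 + (r[1] - t) ** 2)[:k]
    return sum(r[2] for r in nearest) / len(nearest)

def boundary_curve(data, s_values, t_values, b=None, k=50):
    """For each s, find the first t at which the estimated regression function crosses b."""
    if b is None:
        b = sum(r[2] for r in data) / len(data)  # default: b = sample mean of Z
    curve = []
    for s in s_values:
        diffs = [knn_mean(data, s, t, k) - b for t in t_values]
        for j in range(1, len(t_values)):
            if diffs[j - 1] == 0:
                curve.append((s, t_values[j - 1]))
                break
            if diffs[j - 1] * diffs[j] < 0:
                # Sign change: interpolate linearly between the two grid points.
                w = diffs[j - 1] / (diffs[j - 1] - diffs[j])
                curve.append((s, t_values[j - 1] + w * (t_values[j] - t_values[j - 1])))
                break
    return curve

if __name__ == "__main__":
    rng = random.Random(2)
    data = []
    for _ in range(3000):
        x, y = rng.random(), rng.random()
        # Synthetic 0/1 response whose true boundary at b = 0.5 is the line t = s.
        data.append((x, y, 1 if y + rng.gauss(0, 0.05) > x else 0))
    t_grid = [i / 20 for i in range(21)]
    for s, t in boundary_curve(data, [0.3, 0.5, 0.7], t_grid, b=0.5):
        print(round(s, 2), round(t, 2))
```

The output is a list of (s, t) points, i.e. a single curve in the X-Y plane, which is what keeps the display free of the black-screen problem no matter how large n is.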

Boundary Plot Example

• Bank account data, UCI repository.

• X = age of customer, Y = current bank account balance, Z = says Yes to opening a new type of account

• b = E(Z) = P(Z = 1), since Z is a 0/1 indicator

Bank Example

• Above the line means an above-average probability of signing up for the new account.

• Near retirement ⇒ “hardest sell”!

• Those around 60 need a large balance before being willing to try the new account.

More Than 3 Vars. in 2 Dims.

• Plotting boundaries has been done before.

• But the idea here is to display several boundaries at once, so as to display more variables in one 2-dim. graph.

Example: Adult Data

• UCI Adult data

• X = age, Y = education, Z = high income

• But now add a 4th variable: W = gender

• Plot 2 boundary curves, one male and one female.

• Thus display 4 variables in 2 dims.
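Adding a 4th, categorical variable amounts to estimating one boundary per level of W and overlaying the curves. The sketch below is an assumption-laden illustration: it uses a shared threshold b (the overall mean of Z) for all groups, a nearest-neighbor estimate within each group, and takes the first grid value of t at which the estimate reaches b, which presumes the regression function increases in t. Names are hypothetical.

```python
import random

def knn_mean(points, s, t, k):
    """Nearest-neighbor estimate of E(Z | X = s, Y = t); points are (x, y, z) rows."""
    nearest = sorted(points, key=lambda r: (r[0] - s) ** 2 + (r[1] - t) ** 2)[:k]
    return sum(r[2] for r in nearest) / len(nearest)

def crossing(points, s, t_values, b, k):
    """First grid value t at which the estimate reaches b, or None if it never does."""
    for t in t_values:
        if knn_mean(points, s, t, k) >= b:
            return t
    return None

def boundaries_by_group(data, s_values, t_values, k=50):
    """data rows are (x, y, z, w); returns {w: [(s, t), ...]}, one curve per level of W."""
    b = sum(r[2] for r in data) / len(data)  # assumption: one shared threshold
    curves = {}
    for w in sorted({r[3] for r in data}):
        points = [r[:3] for r in data if r[3] == w]
        found = [(s, crossing(points, s, t_values, b, k)) for s in s_values]
        curves[w] = [(s, t) for s, t in found if t is not None]
    return curves

if __name__ == "__main__":
    rng = random.Random(3)
    data = []
    for i in range(3000):
        x, y = rng.random(), rng.random()
        w = "a" if i % 2 == 0 else "b"
        # Synthetic data: group "b" needs a larger Y than group "a" for Z = 1.
        data.append((x, y, 1 if y > (0.3 if w == "a" else 0.7) else 0, w))
    t_grid = [i / 20 for i in range(21)]
    for w, curve in boundaries_by_group(data, [0.3, 0.5, 0.7], t_grid).items():
        print(w, curve)
```

Each group contributes one curve, so the vertical gap between curves at a given X shows how much more Y one group needs than the other to reach the same level of E(Z).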

Adult Example

• Above line means higher-than-avg. prob. of high income.

• Before age 35, not much difference.

• After age 35, women need much more education than men to likely have high income.

Example: Flight Lateness

• Airline lateness data.

• X = departure delay, Y = distance, Z = arrival lateness, W = originating airport (here, SFO, IAD, IAH), so again, displaying 4 variables in 2 dims.

• 3 curves, one for each airport.

• Could add V = daytime vs. evening, for 6 curves, thus displaying 5 variables in 2 dims.

• Could plot straight regressions too, but boundaries always enable us to plot “one more variable.”

Airline Example

• Above line means higher-than-avg. mean delay.

• SFO seems to be doing better. Need a very long flight to have above-avg. delay, relative to the others.

Computation

• R’s contour() not used (don’t want “islands”).

• Estimate regression (via fast kNN, FNN library).

• Find “boundary band,” all points near the estimated boundary.

• Smooth the band.
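The three steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the data are simulated, a brute-force kNN written in base R stands in for the FNN library named on the slide, and `knn_reg`, `k`, and `eps` are hypothetical names and tuning values chosen for the example. The boundary is taken to be the set of (X, Y) points where the estimated regression value equals the overall mean of Z, matching the "above line means higher than average" reading used in the examples.

```r
# Sketch only: brute-force kNN in base R stands in for the FNN library
# named on the slide; k, eps and the simulated data are arbitrary choices.
set.seed(1)
n <- 2000
x <- runif(n)                                  # e.g. age (rescaled)
y <- runif(n)                                  # e.g. education (rescaled)
z <- rbinom(n, 1, plogis(4 * (x + y - 1)))     # e.g. high-income indicator

# Step 1: kNN estimate of the regression function E(Z | X, Y) at each data point
knn_reg <- function(x, y, z, k) {
  d <- as.matrix(dist(cbind(x, y)))            # n x n Euclidean distances
  apply(d, 1, function(row) mean(z[order(row)[1:k]]))
}
zhat <- knn_reg(x, y, z, k = 50)

# Step 2: boundary band = points whose estimate is close to the overall mean
eps <- 0.05
band <- abs(zhat - mean(z)) < eps

# Step 3: smooth the band into a single curve
bdry <- lowess(x[band], y[band])
plot(x, y, col = "grey", pch = ".")
lines(bdry, lwd = 2)
```

To show one more variable, the same three steps would be repeated on each subgroup (for instance each gender or each airport) and the resulting curves overlaid with further calls to `lines()`.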

Parallel Computation

Computation can be voluminous.

• Parallel processing.

• Take advantage of superlinearity from CAM.

• Break into chunks, but only find near neighbors within chunks, not across chunks.

• The “A” part of CAM comes in the smoothing of the band.
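A minimal sketch of the chunking step, under stated assumptions: the data are simulated, `knn_reg` is a hypothetical brute-force helper in base R rather than the FNN routine, and the chunk count is arbitrary. One common way to see where a superlinear gain could come from: a brute-force neighbor search on m points needs on the order of m² distance computations, so r chunks of size n/r cost about n²/r in total, and about n²/r² in wall-clock time if the r chunks run on separate workers.

```r
# Sketch only: chunk count, k and the data are arbitrary; lapply() is used for
# portability, and parallel::mclapply() is a drop-in replacement on Unix-alikes.
set.seed(2)
n <- 4000; k <- 25; nchunks <- 4
x <- runif(n); y <- runif(n)
z <- rbinom(n, 1, plogis(4 * (x + y - 1)))

# kNN regression estimate using only the points passed in
knn_reg <- function(x, y, z, k) {
  d <- as.matrix(dist(cbind(x, y)))
  apply(d, 1, function(row) mean(z[order(row)[1:k]]))
}

# Assign each observation to a chunk at random
chunk <- sample(rep(seq_len(nchunks), length.out = n))

# Neighbors are searched within a chunk only, never across chunks
pieces <- lapply(seq_len(nchunks), function(i) {
  idx <- which(chunk == i)
  data.frame(idx = idx, zhat = knn_reg(x[idx], y[idx], z[idx], k))
})

# Reassemble the per-chunk estimates in the original row order
res <- do.call(rbind, pieces)
zhat <- numeric(n)
zhat[res$idx] <- res$zhat
```

The pooled `zhat` values would then feed the band-finding and smoothing steps in the usual way.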

Part III

Part III: Big p and the Curse of Dimensionality

Exorcizing the Curse of Dimensionality

Some small steps in that direction.

Big p

• Theoretical considerations imply that we should have p < √n in the regression case (Portnoy, 1968).

• Yet today p >> n is commonplace.

• This causes “multiple inference” problems (e.g. familywise error rates).

• So, e.g., CI radii 1.96 × std.err.(θ̂) might NOT be “essentially 0.” I.e., Big n not big after all.

• And the ever-present Curse of Dimensionality.
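To make the multiple-inference point concrete, here is a small sketch of how a confidence-interval radius changes once a familywise correction is applied. The Bonferroni adjustment is used purely as one standard illustration (the slides do not name a specific method), the estimator is taken to be a sample mean with known σ, and `ci_radius` and all numeric settings are hypothetical.

```r
# Sketch only: Bonferroni is one standard familywise correction, used here as
# an illustration; sigma, n and the values of p are arbitrary.
ci_radius <- function(n, p, sigma = 1, alpha = 0.05) {
  std_err <- sigma / sqrt(n)                 # std. error of a sample mean
  mult <- qnorm(1 - alpha / (2 * p))         # equals 1.96 (approx.) when p = 1
  mult * std_err
}

n <- 1000
for (p in c(1, 100, 10000, 1000000)) {
  cat(sprintf("p = %7d   radius = %.4f\n", p, ci_radius(n, p)))
}
```

The multiplier replaces 1.96 and grows with p, so for a fixed n the simultaneous intervals widen as more parameters are examined.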

Principal Components Analysis

• What sizes of p relative to n might be problematic for PCA?

• Sample covariance matrix V has p(p-1)/2 distinct off-diagonal entries.

• Data matrix has np entries.

• So V is completely determined (except roundoff error) if np = p(p-1)/2.

• So, have problem if p > 2n, roughly.
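Taking the slide's two counts at face value, the break-even point follows from one line of algebra. This is a back-of-the-envelope counting step, not a rigorous identifiability argument:

```latex
np = \frac{p(p-1)}{2}
\;\Longrightarrow\;
n = \frac{p-1}{2}
\;\Longrightarrow\;
p = 2n + 1 \approx 2n .
```

For example, with n = 500 the counts balance at p = 1001.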

PCA Experiment

Simulation experiment:

• Y1, Y2 indep. N(0,1); X1 = Y1 + Y2, X2 = Y1 − Y2, X3, ..., Xp iid N(0,1), indep. of X1, X2.

• First PC should be (1,0,0,...) or (0,1,0,...).

sim <- function(n, p) {
   # X1, X2 are built from two independent N(0,1) variables
   y1 <- rnorm(n); y2 <- rnorm(n)
   x1 <- y1 + y2; x2 <- y1 - y2; p2 <- p - 2
   # append p-2 pure-noise N(0,1) columns
   x <- cbind(x1, x2, matrix(rnorm(n * p2), ncol = p2))
   cvr <- cov(x)
   # index of the largest-magnitude loading in the first PC
   which.max(abs(eigen(cvr, symmetric = TRUE)$vectors[, 1]))
}

Simulation, cont’d.

Return value from sim() should be 1 or 2. Let’s see:

> sim(500,400)
[1] 1
> sim(500,800)
[1] 1
> sim(500,800)
[1] 2
> sim(500,1200)
[1] 439
> sim(500,1200)
[1] 2
> sim(500,1200)
[1] 1
> sim(500,1200)
[1] 905
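Single runs are noisy, so one natural follow-up is to repeat the experiment and record how often the largest loading of the first PC lands on a pure-noise variable. The sketch below restates `sim()` so that it is self-contained; the wrapper `phantom_rate`, the number of replications, and the small sizes (picked so the loop finishes quickly) are all assumptions for illustration, not settings from the talk.

```r
# Sketch only: the replication wrapper and the sizes below are illustrative,
# chosen small so the loop runs quickly; they are not the talk's settings.
sim <- function(n, p) {
  y1 <- rnorm(n); y2 <- rnorm(n)
  x1 <- y1 + y2; x2 <- y1 - y2; p2 <- p - 2
  x <- cbind(x1, x2, matrix(rnorm(n * p2), ncol = p2))
  cvr <- cov(x)
  which.max(abs(eigen(cvr, symmetric = TRUE)$vectors[, 1]))
}

# Fraction of runs in which the largest loading of the first PC falls on
# some variable other than X1 or X2
phantom_rate <- function(n, p, nreps = 20) {
  hits <- replicate(nreps, sim(n, p))
  mean(hits > 2)
}

set.seed(3)
n <- 100
for (p in c(50, 100, 200, 400)) {
  cat("n =", n, " p =", p, " phantom rate =", phantom_rate(n, p), "\n")
}
```

Raising `nreps` gives a steadier estimate of the rate at the cost of more eigendecompositions.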

Simulation, cont’d.

• When n < p/2 (very common in practice!), sometimes right but sometimes get phantom PCs.

• On the other hand, results of Johnstone (2000) suggest that as long as n > p/2 we might be OK.

• Moreover, in practice the variables are correlated, often very highly so, in regular patterns. I suspect this makes it “more OK.”

Exorcizing the Curse?

• The term curse of dimensionality goes back 50 years.

• In the last 10-15 years, it has gotten scarier: Berry (1999) proved that any 2 points are approximately the same distance from each other!

• So, e.g., nearest-neighbor methods look iffy.

• My own rough derivation:

• Suppose the p distance components are iid.

• $\sqrt{\mathrm{Var}(\text{distance})} / E(\text{distance}) \to 0$ as $p \to \infty$.

• So, distances are approximately constant.

• So, distances are approximately constant.
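
The rough derivation can be checked numerically. If the distance is a sum of p iid components, each with mean μ and variance σ², then E(distance) = pμ and Var(distance) = pσ², so the ratio is σ/(μ√p), which shrinks like 1/√p. The sketch below is only an illustration under assumptions that are not in the slides: the points are uniform on the unit cube, the distance is squared Euclidean (so the components are the squared coordinate differences), and the function name `distance_cv` is hypothetical.

```python
import numpy as np

def distance_cv(p, n_pairs=1000, seed=0):
    """Estimate sqrt(Var(D)) / E(D), where D is the squared Euclidean
    distance between two random points in p dimensions.

    Assumption for illustration: coordinates are iid Uniform(0, 1), so the
    p squared coordinate differences are iid distance components.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_pairs, p))
    y = rng.uniform(size=(n_pairs, p))
    d = np.sum((x - y) ** 2, axis=1)  # one distance per random pair
    return d.std() / d.mean()

# The ratio should fall roughly like 1 / sqrt(p) as p grows.
for p in (2, 20, 200, 2000):
    print(p, round(distance_cv(p), 3))
```

With equal weights on every component, the printed ratio keeps falling as p grows, which is the "distances are approximately constant" effect.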

Some Hope

Some Hope:

• But all that involves equally-weighted components in the distance.

• Yet, arguably we should have weights, according to the importance of the variables.

• Then the above problem goes away. (Coef. of var. does not go to 0.)

• But how do we set the weights?

• Stay tuned...
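
The effect of weighting can be seen in a small simulation. For a weighted distance with weights w_i on iid components, the coefficient of variation is (σ/μ) · √(Σ w_i²) / Σ w_i, which stays away from 0 when the weights decay quickly enough for Σ w_i to stay bounded. The sketch below is an illustration only: the weights 1/i² are a hypothetical choice (the slides do not say how the weights should be set), the data are assumed uniform on the unit cube, and `weighted_distance_cv` is a made-up name.

```python
import numpy as np

def weighted_distance_cv(p, weights, n_pairs=1000, seed=0):
    """Estimate sqrt(Var(D)) / E(D) for D = sum_i weights[i] * (x_i - y_i)^2.

    Assumption for illustration: coordinates are iid Uniform(0, 1).
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_pairs, p))
    y = rng.uniform(size=(n_pairs, p))
    d = ((x - y) ** 2) @ weights  # weighted sum of the p components
    return d.std() / d.mean()

for p in (2, 20, 200, 2000):
    equal = np.ones(p)
    # Hypothetical importance weights: variable i gets weight 1 / i^2.
    decaying = 1.0 / np.arange(1, p + 1) ** 2
    print(p,
          round(weighted_distance_cv(p, equal), 3),
          round(weighted_distance_cv(p, decaying), 3))
```

Under these assumptions the equal-weight column shrinks toward 0 while the decaying-weight column levels off, so distances under the weighted metric remain distinguishable. Weights that decay too slowly would not give this protection.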

Misc.

Online materials: The visualization code is available for your use and comments/suggestions: http://heather.cs.ucdavis.edu/BigDataVis.html

These slides are there too.

Acknowledgements: The author would like to thank Noah Gift, Marnie Dunsmore, Nicholas Lewin-Koh, and David Scott for helpful discussions, and Hao Chen and Bill Hsu for use of their high-performance computing equipment.
