object orie’d data analysis, last time

38
Object Orie’d Data Analysis, Last Time • Cornea Data & Robust PCA – Elliptical PCA • Big Picture PCA – Optimization View – Gaussian Likelihood View – Correlation PCA • Finding Clusters with PCA – Can be useful – But can also miss some

Upload: milly

Post on 11-Feb-2016

24 views

Category:

Documents


0 download

DESCRIPTION

Object Orie’d Data Analysis, Last Time. Cornea Data & Robust PCA Elliptical PCA Big Picture PCA Optimization View Gaussian Likelihood View Correlation PCA Finding Clusters with PCA Can be useful But can also miss some. Participant Presentations. 36 participants ~ 15 minutes each - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Object Orie’d Data Analysis, Last Time

Object Orie’d Data Analysis, Last Time

• Cornea Data & Robust PCA– Elliptical PCA

• Big Picture PCA– Optimization View– Gaussian Likelihood View– Correlation PCA

• Finding Clusters with PCA– Can be useful– But can also miss some

Page 2: Object Orie’d Data Analysis, Last Time

Participant Presentations• 36 participants• ~ 15 minutes each• 4 per class meeting• Requires 9 class meetings …• Have revised Schedule on Class Web Page• Please sign up (email) for preferred time

(1st come, 1st served)• Please help for next week(if you can just pull something off the shelf)

Page 3: Object Orie’d Data Analysis, Last Time

PCA to find clustersPCA of Mass Flux Data:

Page 4: Object Orie’d Data Analysis, Last Time

PCA to find clustersReturn to Investigation of PC1 Clusters:• Can see 3 bumps in smooth

histogramMain Question:

Important structureor

sampling variability?

Approach: SiZer(SIgnificance of ZERo crossings of deriv.)

Page 5: Object Orie’d Data Analysis, Last Time

SiZer BackgroundTwo Major Settings: • 2-d scatterplot smoothing • 1-d histograms

(continuous data, not discrete bar plots)

Central Question:Which features are really there?

• Solution, Part 1: Scale space• Solution, Part 2: SiZer

Page 6: Object Orie’d Data Analysis, Last Time

SiZer BackgroundBralower’s Fossil Data - Global

Climate

Page 7: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSmooths - Suggest Structure -

Real?

Page 8: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSmooths of Fossil Data (details given

later)• Dotted line: undersmoothed

(feels sampling variability)• Dashed line: oversmoothed

(important features missed?)• Solid line: smoothed about right?

Central question:Which features are “really there”?

Page 9: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSmoothing Setting 2: Histograms

Family Income Data: British Family Expenditure Survey

• For the year 1975 • Distribution of Family Incomes • ~ 7000 families

Page 10: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFamily Income Data, Histogram Analysis: • Again under- and over- smoothing

issues • Perhaps 2 modes in data? • Histogram Problem 1:

Binwidth (well known)

Central question:Which features are “really there”?

• e.g. 2 modes?• Same problem as existence of

“clusters” in PCA

Page 11: Object Orie’d Data Analysis, Last Time

SiZer BackgroundWhy not use (conventional)

histograms?

Histogram Problem 2: Bin shift (less well known)

• For same binwidth • Get much different impression • By only “shifting grid location“• Get it right by chance?

Page 12: Object Orie’d Data Analysis, Last Time

SiZer BackgroundWhy not use (conventional) histograms? • Solution to binshift problem:

Average over all shifts • 1st peak all in one bin: bimodal • 1st peak split between bins:

unimodal Smooth histo’m provides understanding,

So should use for data analysisAnother name: Kernel Density Estimate

Page 13: Object Orie’d Data Analysis, Last Time

SiZer BackgroundWhy not use (conventional) histograms? • Solution to binshift problem:

Average over all shifts • 1st peak all in one bin: bimodal • 1st peak split between bins:

unimodal Smooth histo’m provides understanding,

So should use for data analysisAnother name: Kernel Density Estimate

Page 14: Object Orie’d Data Analysis, Last Time

SiZer BackgroundKernel density estimation Recommended Reference ( many

books):Wand, M. P. and Jones, M. C. (1995)

View 1: Smooth histogram View 2: Distribute probability mass,

according to data E.g. Chondrite data

(from how many sources?)

Page 15: Object Orie’d Data Analysis, Last Time

SiZer BackgroundKernel density estimation (cont.) • Central Issue: width of window, i.e.

“bandwidth”, • E.g. Incomes data: Controls critical amount of smoothing

• Old Approach: Data based bandwidth selection

• Recommended reference: Jones et al (1996)

• New Approach: "scale space" (look at all of them)

h

Page 16: Object Orie’d Data Analysis, Last Time

SiZer BackgroundScale Space – Idea from Computer Vision • Conceptual basis:

Oversmoothing = “view from afar”(macroscopic)

Undersmoothing = “zoomed in view”(microscopic)

Main idea: all smooths contain useful information, so study “full spectrum”

(i. e. all smoothing levels)Recommended reference: Lindeberg

(1994)

Page 17: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFun Scale Spaces Views

(of Family Incomes Data)

Spectrum Movie

Page 18: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFun Scale Spaces Views (Incomes

Data)Spectrum Overlay

Page 19: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFun Scale Spaces Views (Incomes

Data)Surface View

Page 20: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFun Scale Spaces Views

(of Family Incomes Data)

Note: The scale space viewpoint makes

Data Dased Bandwidth SelectionMuch less important

(than I once thought….)

Page 21: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer: • Significance of Zero crossings, of

the derivative, in scale space• Combines:

– needed statistical inference – novel visualization

• To get: a powerful exploratory data analysis method

• Main reference:Chaudhuri & Marron (1999)

Page 22: Object Orie’d Data Analysis, Last Time

SiZer BackgroundBasic idea: a bump is characterized by: • an increase• followed by a decrease

Generalization: Many features of interest captured by sign of the slope of the smooth

Foundation of SiZer: Statistical inference on slopes,

over scale space

Page 23: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer Visual presentation: • Color map over scale space: • Blue: slope significantly upwards

(derivative CI above 0) • Red: slope significantly downwards

(derivative CI below 0) • Purple: slope insignificant

(derivative CI contains 0)

Page 24: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer analysis of Fossils data: • Upper Left: Scatterplot, family of

smooths, 1 highlighted• Upper Right: Scale space rep’n of

family, with SiZer colors • Lower Left: SiZer map, more easy

to view • Lower Right: SiCon map – replace

slope by curvature • Slider (in movie viewer) highlights

different smoothing levels

Page 25: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer analysis of Fossils data (cont.): Oversmoothed (top of SiZer map): • Decreases at left, not on right Medium smoothed (middle of SiZer map): • Main valley significant, and left most

increase • Smaller valley not statistically

significant Undersmoothed (bottom of SiZer map): • “noise wiggles” not significant Additional SiZer color: gray - not enough

data for inference

Page 26: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer analysis of Fossils data (cont.): Common Question: Which is right?• Decreases on left, then flat

(top of SiZer map)• Up, then down, then up again

(middle of SiZer map)• No significant features

(bottom of SiZer map)Answer: All are right• Just different scales of view, • i.e. levels of resolution of data

Page 27: Object Orie’d Data Analysis, Last Time

SiZer BackgroundSiZer analysis of British Incomes data: • Oversmoothed: Only one mode • Medium smoothed: Two modes,

statistically significant Confirmed by Schmitz & Marron, (1992)• Undersmoothed: many noise

wiggles, not significant Again: all are correct,

just different scales

Page 28: Object Orie’d Data Analysis, Last Time

SiZer Background

Historical Note & Acknowledgements:

Scale Space: S. M. Pizer

SiZer: Probal Chaudhuri

Main Reference: Chaudhuri & Marron

(1999)

Page 29: Object Orie’d Data Analysis, Last Time

SiZer BackgroundToy E.g. - Marron & Wand Trimodal

#9

Increasing n

Only 1 Signif’t Mode

Now 2 Signif’t Modes

Finally all 3 Modes

Page 30: Object Orie’d Data Analysis, Last Time

SiZer BackgroundE.g. - Marron & Wand Discrete Comb

#15

Increasing n

Coarse Bumps Only

Now Fine bumps too

Someday: “draw” localBandwidth on SiZer map

Page 31: Object Orie’d Data Analysis, Last Time

SiZer BackgroundFinance "tick data":

(time, price) of single stock transactions

Idea: "on line" version of SiZerfor viewing and understanding trends

Notes: • trends depend heavily on scale • double points and more • background color transition

(flop over at top)

Page 32: Object Orie’d Data Analysis, Last Time

SiZer BackgroundInternet traffic data analysis:

SiZer analysis oftime series of packet times

at internet hub (UNC)• across very wide range of scales • needs more pixels than screen allows • thus do zooming view

(zoom in over time) – zoom in to yellow bd’ry in next frame – readjust vertical axis

Page 33: Object Orie’d Data Analysis, Last Time

SiZer BackgroundInternet traffic data analysis (cont.)

Insights from SiZer analysis:• Coarse scales:

amazing amount of significant structure • Evidence of self-similar fractal type

process? • Fewer significant features at small scales • But they exist, so not Poisson process • Poisson approximation OK at small scale??? • Smooths (top part) stable at large scales?

Page 34: Object Orie’d Data Analysis, Last Time

Dependent SiZerPark, Marron, and Rondonotti (2004)• SiZer compares data with white noise• Inappropriate in time series• Dependent SiZer compares data with

an assumed model• Visual Goodness of Fit test

Page 35: Object Orie’d Data Analysis, Last Time

Dep’ent SiZer : 2002 Apr 13 Sat 1 pm – 3 pm

Page 36: Object Orie’d Data Analysis, Last Time

Zoomed view (to red region, i.e. “flat top”)

Page 37: Object Orie’d Data Analysis, Last Time

Further Zoom: finds very periodic behavior!

Page 38: Object Orie’d Data Analysis, Last Time

Possible Physical Explanation

IP “Port Scan”• Common device of hackers• Searching for “break in points”• Send query to every possible

(within UNC domain):– IP address– Port Number

• Replies can indicate system weaknesses

Internet Traffic is hard to model