Object Orie’d Data Analysis, Last Time

Upload: anastasia-daniel

Post on 20-Jan-2016


Page 1: Object Orie’d Data Analysis, Last Time Cornea Data –Images (on the disk) as data objects –Zernike basis representations Outliers in PCA (have major influence)

Object Orie’d Data Analysis, Last Time

• Cornea Data
– Images (on the disk) as data objects
– Zernike basis representations
• Outliers in PCA (have major influence)
• Robust PCA (downweight outliers)
– Eigen-analysis of robust covariance matrix
– Projection Pursuit
– Spherical PCA

Page 2:

Cornea Data

Cornea: Outer surface of the eye

Driver of Vision: Curvature of Cornea

Sequence of Images

Objects: Images on the unit disk

Curvature as “Heat Map”

Special Thanks to K. L. Cohen, N. Tripoli, UNC Ophthalmology

Page 3:

PCA of Cornea Data

PC2 Affected by Outlier:

How bad is this problem?

View 1: Statistician: Arrggghh!!!!

• Outliers are very dangerous
• Can give arbitrary and meaningless dir’ns
• What does 4% of MR SS mean???

Page 4:

Robust PCA

What is multivariate median?

• There are several! (“median” generalizes in different ways)

i. Coordinate-wise median

$$\begin{pmatrix} \operatorname{median}_i X_{1,i} \\ \vdots \\ \operatorname{median}_i X_{d,i} \end{pmatrix}$$

• Often worst
• Not rotation invariant (2-d data uniform on “L”)
• Can lie on convex hull of data (same example)
• Thus poor notion of “center”
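The two complaints about the coordinate-wise median can be checked numerically. The sketch below is an illustration only: the "L"-shaped point set, the 45 degree rotation, and the helper names `coordinatewise_median` and `rotation` are assumptions made for this example, not taken from the slides.

```python
import numpy as np

def coordinatewise_median(X):
    """Median of each coordinate; rows of X are observations."""
    return np.median(np.asarray(X, dtype=float), axis=0)

def rotation(theta):
    """2-d rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Hypothetical toy data: points on an "L" with legs along the two axes.
t_h = np.linspace(0.0, 1.0, 50)   # horizontal leg (t, 0)
t_v = np.linspace(0.0, 1.0, 51)   # vertical leg (0, t)
L_data = np.vstack([
    np.column_stack([t_h, np.zeros_like(t_h)]),
    np.column_stack([np.zeros_like(t_v), t_v]),
])

# The median lands on the corner of the "L", a vertex of the convex hull.
med = coordinatewise_median(L_data)

# Rotate the data, take the median, rotate the answer back.
# A rotation invariant center would return the same point as `med`.
R = rotation(np.pi / 4)
med_rotated = coordinatewise_median(L_data @ R.T)
med_back = R.T @ med_rotated

print("median in original coordinates:", med)
print("median computed after rotating:", med_back)
```

The two printed centers differ, which is the lack of rotation invariance the slide refers to.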

Page 5:

Robust PCA

$L^1$ M-estimate:

“Slide sphere around until mean (of projected data) is at center”

Page 6:

Robust PCA

$L^1$ M-estimate (cont.):

Additional literature:

• Called “geometric median” (long before Huber) by: Haldane (1948)
• Shown unique for $d > 1$ by: Milasevic and Ducharme (1987)
• Useful iterative algorithm: Gower (1974) (see also Sec. 3.2 of Huber)

Cornea Data experience: works well for $d = 66$
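The "slide the sphere until the mean of the projected data is at the center" description corresponds to a fixed-point condition that can be iterated directly. The sketch below uses a Weiszfeld-type reweighting step, which is a standard way to compute the geometric median; it is not claimed to be the exact algorithm of Gower (1974), and the function names and the five-point example are assumptions made for illustration.

```python
import numpy as np

def geometric_median(X, tol=1e-10, max_iter=1000):
    """Weiszfeld-type iteration for the L1 M-estimate (geometric median).

    Rows of X are observations. Starts from the sample mean.
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    for _ in range(max_iter):
        dists = np.linalg.norm(X - center, axis=1)
        dists = np.maximum(dists, 1e-12)      # guard against division by zero
        weights = 1.0 / dists
        new_center = (weights[:, None] * X).sum(axis=0) / weights.sum()
        step = np.linalg.norm(new_center - center)
        center = new_center
        if step < tol:
            break
    return center

def mean_of_projections(X, center):
    """Mean of the data after projection onto the unit sphere around `center`.

    Assumes no data point coincides exactly with `center`.
    """
    diffs = np.asarray(X, dtype=float) - center
    return (diffs / np.linalg.norm(diffs, axis=1, keepdims=True)).mean(axis=0)

# Hypothetical example: four corners of a unit square plus one far outlier.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])
gm = geometric_median(pts)
print("sample mean:      ", pts.mean(axis=0))
print("geometric median: ", gm)
print("mean of projected data at the solution:", mean_of_projections(pts, gm))
```

At the solution the mean of the projected data is (numerically) zero, which is the "mean is at center" condition, while the sample mean is dragged far toward the outlier.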

Page 7:

Robust PCA

Approaches to Robust PCA:

1. Robust Estimation of Covariance Matrix

2. Projection Pursuit

3. Spherical PCA

Page 8:

Robust PCA

Robust PCA 3: Spherical PCA

Page 9:

Robust PCA

Robust PCA 3: Spherical PCA

• Idea: use “projection to sphere” idea from M-estimation
• In particular project data to centered sphere
• “Hot Dog” of data becomes “Ice Caps”
• Easily found by PCA (on proj’d data)
• Outliers pulled in to reduce influence
• Radius of sphere unimportant
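A minimal sketch of the spherical PCA recipe in the bullets above: center, project every point to the unit sphere, then run ordinary PCA on the projected points. The function names, the toy data, and the default center (the coordinate-wise median, used here only to keep the example short; the slides center at the $L^1$ M-estimate) are assumptions made for illustration.

```python
import numpy as np

def project_to_sphere(X, center):
    """Project rows of X onto the unit sphere around `center`."""
    diffs = np.asarray(X, dtype=float) - center
    norms = np.linalg.norm(diffs, axis=1)
    keep = norms > 0          # a point exactly at the center has no direction
    return diffs[keep] / norms[keep, None]

def spherical_pca(X, center=None):
    """Ordinary PCA applied to the sphere-projected data.

    Returns eigenvalues (descending) and eigenvectors (as columns).
    """
    X = np.asarray(X, dtype=float)
    if center is None:
        # Assumption for this sketch: a simple robust stand-in for the center.
        center = np.median(X, axis=0)
    U = project_to_sphere(X, center)
    evals, evecs = np.linalg.eigh(np.cov(U, rowvar=False))
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Hypothetical toy data: a "hot dog" along the x-axis plus one huge outlier in y.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 2)) * np.array([3.0, 0.3])
toy = np.vstack([toy, [0.0, 200.0]])

_, classical_dirs = np.linalg.eigh(np.cov(toy, rowvar=False))
classical_pc1 = classical_dirs[:, -1]       # eigh sorts ascending
_, spherical_dirs = spherical_pca(toy)
spherical_pc1 = spherical_dirs[:, 0]

print("classical PC1:", classical_pc1)      # points at the outlier
print("spherical PC1:", spherical_pc1)      # follows the bulk of the data
```

Because every projected point has length one, the outlier contributes no more than any other observation, which is why the radius of the sphere does not matter for the directions found.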

Page 10:

Robust PCA

Spherical PCA for Toy Example:

Curve Data With an Outlier

First recall Conventional PCA

Page 11:

Robust PCA

Spherical PCA for Toy Example:

Now do Spherical PCA

Better result?

Page 12:

Robust PCA

Spherical PCA for Toy Data:

• Mean looks “smoother”
• PC1 nearly “flat” (unaffected by outlier)
• PC2 is nearly “tilt” (again unaffected by outlier)
• PC3 finally strongly driven by outlier
• OK, since all other directions “about equal in variation”
• Energy Plot, no longer ordered (outlier drives SS, but not directions)

Page 13:

Robust PCA

Spherical PCA for Toy Example:

Check out Later Components

Page 14:

Robust PCA

Higher Order Components:

• Neither PC3 nor PC4 catch all of outlier
• An effect of this down-weighting method
• PC lines show power of sphered data
• PC symbols show “SS for curves” (reflects visual impression)
• Latter are not monotonic!
• Reflects reduced influence property of spherical PCA

2R2R

Page 15:

Robust PCA

Recall $L^1$ M-estimate for Cornea Data:

Figure panels: Sample Mean; M-estimate

• Definite improvement
• But outliers still have some influence
• Projection onto sphere distorts the data

Page 16:

Robust PCA

Useful View: Parallel Coordinates Plot

X-axis: Zernike Coefficient Number

Y-axis: Coefficient

Page 17:

Robust PCA

Cornea Data, Parallel Coordinates Plot:

Top Plot: Zernike Coefficients

All n = 43 very Similar

Most Action in few Low Freq. Coeffs.

Page 18:

Robust PCA

Cornea Data, Parallel Coordinates Plot

Middle Plot: Zernike Coefficients – median

• Most Variation in lowest frequencies
• E.g. as in Fourier compression of smooth signals
• Projecting on sphere will destroy this
• By magnifying high frequency behavior

Bottom Plot: discussed later

Page 19:

Robust PCA

Spherical PCA Problem: Magnification of High Freq. Coeff’s

Solution: Elliptical Analysis

• Main idea: project data onto suitable ellipse, not sphere

• Which ellipse? (in general, this is problem that PCA solves!)

• Simplification: Consider ellipses parallel to coordinate axes

Page 20:

Robust PCA

Diagram labels: Rescale Coords; Spherical PCA; Unscale Coords

Page 21:

Robust PCA

Elliptical Analysis (cont.):

• Simple Implementation, via coordinate axis rescaling
• Divide each axis by MAD
• Project Data to sphere (in transformed space)
• Return to original space (mul’ply by orig’l MAD) for analysis
• Where MAD = Median Absolute Deviation (from median) (simple, high breakdown, outlier resistant, measure of “scale”)

$$\mathrm{MAD} = \operatorname{median}_i \left| x_i - \operatorname{median}_j x_j \right|$$
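The rescale, project, unscale steps listed above can be sketched as follows. This is an illustration under stated assumptions: the function names and toy data are hypothetical, the center defaults to the coordinate-wise median for brevity, and every coordinate is assumed to have a nonzero MAD.

```python
import numpy as np

def mad(X):
    """Column-wise median absolute deviation from the median."""
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    return np.median(np.abs(X - med), axis=0)

def elliptical_projection(X, center=None):
    """Divide each axis by its MAD, project to the unit sphere, multiply back."""
    X = np.asarray(X, dtype=float)
    if center is None:
        center = np.median(X, axis=0)      # assumption: simple robust center
    scale = mad(X)                         # assumed nonzero in every coordinate
    U = (X - center) / scale               # rescaled coordinates
    norms = np.linalg.norm(U, axis=1)
    keep = norms > 0
    U = U[keep] / norms[keep, None]        # projection to sphere
    return center + U * scale              # back to original coordinates

def elliptical_pca(X, center=None):
    """Ordinary PCA of the ellipse-projected data (eigenvalues descending)."""
    Z = elliptical_projection(X, center)
    evals, evecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Hypothetical toy data: very different scales per axis, plus one outlier.
rng = np.random.default_rng(0)
toy = rng.normal(size=(60, 2)) * np.array([5.0, 0.5])
toy = np.vstack([toy, [0.0, 500.0]])

Z = elliptical_projection(toy)
_, dirs = elliptical_pca(toy)
print("elliptical PC1:", dirs[:, 0])
```

Unlike projection to a sphere, the projected points keep the relative scale of each coordinate, so low-variance coordinates are not magnified.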

Page 22:

Robust PCA

Elliptical Estimate of “center”: Do $L^1$ M-estimation in transformed space (then transform back)

Results for cornea data:

Figure panels: Sample Mean; Spherical Center; Elliptical Center

• Elliptical clearly best
• Nearly no edge effect

Page 23:

Robust PCA

Elliptical PCA for cornea data:

Original PC1, Elliptical PC1

• Still finds overall curvature & correlated astigmatism
• Minor edge effects almost completely gone

Original PC2, Elliptical PC2

• Huge edge effects dramatically reduced
• Still finds steeper superior vs. inferior

Page 24:

Robust PCA

Elliptical PCA for Cornea Data (cont.):

Original PC3, Elliptical PC3

• Edge effects greatly diminished
• But some of the against-the-rule astigmatism is also lost
• Price paid for robustness

Original PC4, Elliptical PC4

• Now looks more like variation on astigmatism???

Page 25:

Robust PCA

Current state of the art:

• Spherical & Elliptical PCA are a kludge

• Put together by Robustness Amateurs

• To solve this HDLSS problem

• Good News: Robustness Pros are now in the game:

Hubert et al. (2005)

Page 26:

Robust PCA

Disclaimer on robust analy’s of Cornea Data:

• Critical parameter is “radius of analysis”, $R_0$
• $R_0 = 4$ mm: Shown above, Elliptical PCA very effective
• $R_0 = 4.2$ mm: Stronger edge effects, Elliptical PCA less useful
• $R_0 = 3.5$ mm: Edge effects weaker, don’t need robust PCA

Page 27:

Big Picture View of PCA

Above View: PCA finds optimal directions in point cloud

• Maximize projected variation
• Minimize residual variation (same by Pythagorean Theorem)

Notes:

• Get useful insights about data
• Shows PCA can be computed for any point cloud
• But there are other views.
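The equivalence of "maximize projected variation" and "minimize residual variation" can be checked numerically: for any unit direction, the total sum of squares about the mean splits exactly into projected plus residual parts. The data and the helper name `split_ss` below are assumptions made for illustration.

```python
import numpy as np

def split_ss(Xc, direction):
    """Projected and residual sums of squares of centered data along a direction."""
    u = direction / np.linalg.norm(direction)
    scores = Xc @ u
    projected_ss = (scores ** 2).sum()
    residual_ss = ((Xc - np.outer(scores, u)) ** 2).sum()
    return projected_ss, residual_ss

# Hypothetical point cloud with correlated coordinates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
total_ss = (Xc ** 2).sum()

# PC1 is the first right singular vector of the centered data matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_pc1, resid_pc1 = split_ss(Xc, Vt[0])

# Any other direction gives the same total, split less favorably.
proj_rand, resid_rand = split_ss(Xc, rng.normal(size=5))

print("total SS:", total_ss)
print("PC1:      projected + residual =", proj_pc1 + resid_pc1)
print("random:   projected + residual =", proj_rand + resid_rand)
```

Since the total is fixed, the direction with the largest projected sum of squares is automatically the one with the smallest residual sum of squares.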

Page 28:

Big Picture View of PCA

Alternate Viewpoint: Gaussian Likelihood

• When data are multivariate Gaussian

• PCA finds major axes of ellipt’al contours of Probability Density

• Maximum Likelihood Estimate

Mistaken idea:

PCA only useful for Gaussian data

Page 29:

Big Picture View of PCA

Simple check for Gaussian distribution:

• Standardized parallel coordinate plot
• Subtract coordinate wise median (robust version of mean) (not good as “point cloud center”, but now only looking at coordinates)
• Divide by MAD / MAD(N(0,1)) (put on same scale as “standard deviation”)
• See if data stays in range –3 to +3
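The standardization described above can be sketched in a few lines. The constant is the MAD of a standard normal (its 0.75 quantile, about 0.6745); the function names and the example arrays are assumptions made for illustration.

```python
import numpy as np

# MAD of N(0,1): the 0.75 quantile of the standard normal distribution.
MAD_STD_NORMAL = 0.6744897501960817

def robust_standardize(X):
    """Subtract the coordinate-wise median, divide by MAD / MAD(N(0,1)).

    Rows of X are observations; every column is assumed to have nonzero MAD.
    """
    X = np.asarray(X, dtype=float)
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    return (X - med) / (mad / MAD_STD_NORMAL)

def fraction_outside(Z, limit=3.0):
    """Share of standardized entries outside [-limit, +limit]."""
    return np.mean(np.abs(Z) > limit)

# Hand-checkable column: median 3, MAD 1.
small = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
z_small = robust_standardize(small)

# Hypothetical Gaussian sample: almost everything should stay within -3 to +3.
rng = np.random.default_rng(0)
z_gauss = robust_standardize(rng.normal(size=(2000, 3)))
print("fraction outside +-3 (Gaussian sample):", fraction_outside(z_gauss))
```

Plotting each row of the standardized array against coordinate number gives the standardized parallel coordinate plot; heavy-tailed data show many entries far outside the –3 to +3 band.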

Page 30:

Big Picture View of PCA

E.g. Cornea Data:

Standardized Parallel Coordinate Plot

Shown before

Page 31:

Big Picture View of PCA

Figure panels:

• Raw Cornea Data
• Data – Median
• (Data – Median) / MAD

Page 32:

Big Picture View of PCA

Check for Gaussian dist’n: Stand’zed Parallel Coord. Plot

• E.g. Cornea data (recall image view of data)
• Several data points > 20 “s.d.s” from the center
• Distribution clearly not Gaussian
• Strong kurtosis (“heavy tailed”)
• But PCA still gave strong insights

Page 33:

Correlation PCA

A related (& better known) variation of PCA:

Replace cov. matrix with correlation matrix

I.e. do eigen analysis of

$$R = \begin{pmatrix} 1 & \rho_{1,2} & \cdots & \rho_{1,d} \\ \rho_{1,2} & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho_{d-1,d} \\ \rho_{1,d} & \cdots & \rho_{d-1,d} & 1 \end{pmatrix}$$

Where

$$\rho_{i,j} = \frac{\operatorname{cov}(X_i, X_j)}{\sqrt{\operatorname{var}(X_i)\,\operatorname{var}(X_j)}}$$

Page 34:

Correlation PCA

Why use correlation matrix?

Reason 1: makes features “unit free”

• e.g. M-reps: mix “lengths” with “angles” (degrees? radians?)
• Are “directions in point cloud” meaningful or useful?
• Will unimportant directions dominate?

Page 35:

Correlation PCA

Alternate view of correlation PCA:

Ordinary PCA on standardized (whitened) data

I.e. SVD of data matrix

$$\tilde{X} = \frac{1}{\sqrt{n-1}} \begin{pmatrix} \dfrac{X_{1,1} - \bar{X}_1}{s_1} & \cdots & \dfrac{X_{1,n} - \bar{X}_1}{s_1} \\ \vdots & \ddots & \vdots \\ \dfrac{X_{d,1} - \bar{X}_d}{s_d} & \cdots & \dfrac{X_{d,n} - \bar{X}_d}{s_d} \end{pmatrix}$$

Distorts “point cloud” along coord. dir’ns
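A short numerical check of the alternate view: eigen-analysis of the correlation matrix gives the same spectrum as the SVD of the standardized, $1/\sqrt{n-1}$-scaled data matrix. The toy data are an assumption, and the array here stores observations as rows (n by d), the transpose of the d by n layout on the slide.

```python
import numpy as np

# Hypothetical data with wildly different units per coordinate.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 1000.0])
X[:, 1] += 5.0 * X[:, 0]                     # introduce some correlation
n = len(X)

# View 1: eigen-analysis of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
corr_evals = np.linalg.eigvalsh(R)[::-1]     # descending order

# View 2: SVD of the standardized data matrix, scaled by 1/sqrt(n-1).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
X_tilde = Z / np.sqrt(n - 1)
svals = np.linalg.svd(X_tilde, compute_uv=False)

print("eigenvalues of R:        ", corr_evals)
print("squared singular values: ", svals ** 2)

# For contrast, covariance PCA is dominated by the large-unit coordinate.
cov_evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("share of variance in covariance PC1:", cov_evals[0] / cov_evals.sum())
```

The last line shows the motivation for Reason 1: without standardization, the coordinate measured in the largest units takes essentially all of the variation.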

Page 36:

Correlation PCA

Reason 2 for correlation PCA:

• “Whitening” can be a useful operation (e.g. M-rep Corp. Call. data)

• Caution: sometimes this is not helpful (can lose important structure this way)

E.g. 1: Cornea data, Elliptical vs. Spherical PCA

Page 37:

Correlation PCA

E.g. 2: Fourier Boundary Corp. Call. Data

• Recall Standard PC1, PC2, PC3:
– Gave useful insights
• Correlation PC1, PC2, PC3
– Not useful directions
– No insights about population
– Driven by high frequency noise artifacts
– Reason: whitening has damped the important structure
– By magnifying high frequency noise

Page 38:

Correlation PCA

Parallel coordinates show what happened:

Most Variation in low frequencies

Whitening gives major distortion

Skews PCA towards noise directions

Page 39:

Correlation PCA

Summary on correlation PCA:

• Can be very useful (especially with noncommensurate units)

• Not always, can hide important structure

• To make choice: Decide whether whitening is useful

• My personal use of correlat’n PCA is rare

• Other people use it most of the time

Page 40:

Clusters in data

Common Statistical Task: Find Clusters in Data

• Interesting sub-populations?

• Important structure in data?

• How to do this?

PCA provides very simple approach

There is a large literature of other methods

(will study more later)

Page 41:

PCA to find clusters

Revisit Toy Example (2 clusters)

Page 42:

PCA to find clusters

Toy Example (2 clusters)

• Dominant direction finds very distinct clusters

• Skewer through meatballs (in point cloud space)

• Shows up clearly in scores plot

• An important use of scores plot is finding such structure
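The "skewer through meatballs" picture can be reproduced with a small simulation. The two-cluster data below are hypothetical (10 dimensions, clusters separated along one coordinate), and the class labels are used only afterwards to check what the PC1 scores reveal.

```python
import numpy as np

# Hypothetical toy data: two Gaussian clusters separated along coordinate 0.
rng = np.random.default_rng(3)
d, n_per = 10, 50
shift = np.zeros(d)
shift[0] = 5.0
labels = np.repeat([0, 1], n_per)
X = rng.normal(size=(2 * n_per, d)) + np.where(labels[:, None] == 1, shift, -shift)

# PCA via SVD of the mean-centered data; PC1 scores are projections on Vt[0].
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ Vt[0]

# Split the scores at zero and compare with the true labels.
# The sign of a PC direction is arbitrary, so accept either orientation.
guess = (pc1_scores > 0).astype(int)
agreement = max(np.mean(guess == labels), np.mean(guess != labels))

print("PC1 direction (first 3 entries):", Vt[0][:3])
print("agreement of PC1-score split with true clusters:", agreement)
```

A histogram or scatterplot of `pc1_scores` would show two well-separated groups, which is the structure a scores plot is meant to expose.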

Page 43:

PCA to find clusters

Recall Toy Example with more clusters:

Page 44:

PCA to find clusters

Best revealed by 2d scatterplots (4 clusters):

Page 45:

PCA to find clusters

Caution: there are limitations…

• Revealed by NCI60 data
– Recall Microarray data
– With 8 known different cancer types
• Could separate out clusters
– Using specialized DWD directions
• But mostly not found using PCA
– Recall only finds dir’ns of max variation
– Does not use class label information

Page 46:

PCA to find clusters

Specific DWD directions, for NCI60 Data:

Good Separation Of Cancer Types

Page 47:

PCA to find clusters

PCA directions, for NCI60 Data:

PC2: Melanoma

PC1-3: Leukemia

Page 48:

PCA to find clusters

Off Diagonal PC1-4 & 5-8, NCI60 Data:

PC1 & 5: Renal, Leukemia (separated)

PC2: Melanoma

Page 49:

PCA to find clusters

Main Lesson: PCA is limited at finding clusters

• Revealed by NCI60 data
– Recall Microarray data
– With 8 known different cancer types
• PCA does not find all 8 clusters
– Recall only finds dir’ns of max variation
– Does not use class label information

Page 50:

PCA to find clusters

A deeper example: Mass Flux Data

• Data from Enrica Bellone, National Center for Atmospheric Research

• Mass Flux for quantifying cloud types

• How does mass change when moving into a cloud

Page 51:

PCA to find clusters

PCA of Mass Flux Data:

Page 52:

PCA to find clusters

Summary of PCA of Mass Flux Data:

• Mean: Captures general mountain shape
• PC1: Generally overall height of peak
– shows up nicely in mean +- plot (2nd col)
– 3 apparent clusters in scores plot
– Are those “really there”?
– If so, could lead to interesting discovery
– If not, could waste effort in investigation

Page 53:

PCA to find clusters

Summary of PCA of Mass Flux Data:

PC2:

Location of peak

• again mean +- plot very useful here

PC3:

Width adjustment

• again see most clearly in mean +- plot

Maybe non-linear modes of variation???

Page 54:

PCA to find clusters

Return to Investigation of PC1 Clusters:

• Can see 3 bumps in smooth histogram

Main Question: Important structure or sampling variability?

Approach: SiZer (SIgnificance of ZERo crossings of deriv.)
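The question of whether bumps are "really there" depends on how much the histogram is smoothed. The sketch below is not SiZer itself (it performs no significance test); it only illustrates the underlying scale-space idea by counting zero crossings of the derivative of a Gaussian kernel density estimate at different bandwidths. The scores, the grid, the bandwidths, and the function names are all assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_on_grid(x, grid, h):
    """Gaussian kernel density estimate of sample x, evaluated on grid."""
    z = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

def count_bumps(x, grid, h):
    """Number of local maxima: downward zero crossings of the slope."""
    slope_sign = np.sign(np.diff(gaussian_kde_on_grid(x, grid, h)))
    slope_sign = slope_sign[slope_sign != 0]     # ignore exactly flat steps
    return int(np.sum(np.diff(slope_sign) == -2))

# Hypothetical "PC1 scores": three tight groups centered at -5, 0 and 5.
scores = np.concatenate([c + np.linspace(-0.5, 0.5, 21) for c in (-5.0, 0.0, 5.0)])
grid = np.linspace(-8.0, 8.0, 801)

for h in (0.5, 5.0):
    print(f"bandwidth {h}: {count_bumps(scores, grid, h)} bump(s)")
```

A small bandwidth shows three bumps and a large one shows a single bump, so the visual impression alone cannot settle the question. SiZer addresses this by assessing, across bandwidths, which slope sign changes are statistically significant.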