
Object Orie’d Data Analysis, Last Time

• Classical Discrimination (aka Classification)
  – FLD & GLR very attractive
  – MD never better, sometimes worse

• HDLSS Discrimination
  – FLD & GLR fall apart
  – MD much better

• Maximal Data Piling
  – HDLSS space is a strange place

Kernel Embedding
Aizerman, Braverman and Rozonoer (1964)

• Motivating idea: extend the scope of linear discrimination by adding nonlinear components to the data
  (embedding in a higher dimensional space)

• Better use of the name: nonlinear discrimination?

Kernel Embedding
Stronger effects for higher order polynomial embedding:

E.g. for cubic, embed $x \longmapsto (x, x^2, x^3) \in \mathbb{R}^3$:

linear separation can give 4 parts (or fewer)
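A quick way to see this (a minimal numpy sketch of my own, not code from the slides): a linear rule in the embedded space $(x, x^2, x^3)$ is a cubic polynomial in $x$, so its sign can change at up to three roots, cutting the line into at most four classified pieces.

```python
import numpy as np

w = np.array([-6.0, 11.0, -6.0, 1.0])    # (b, w1, w2, w3): score = x^3 - 6x^2 + 11x - 6

def embed(x):
    # cubic polynomial embedding of 1-d data: x -> (1, x, x^2, x^3)
    return np.column_stack([np.ones_like(x), x, x**2, x**3])

x = np.linspace(-1.0, 5.0, 13)
labels = np.sign(embed(x) @ w)            # sign changes at x = 1, 2, 3
print(np.column_stack([x, labels]))       # the line is cut into 4 classified segments
```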

Kernel Embedding
General View: for original data matrix

$$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \end{pmatrix}$$

add rows (squares, cross products, …):

$$\begin{pmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{d1} & \cdots & x_{dn} \\ x_{11}^2 & \cdots & x_{1n}^2 \\ \vdots & & \vdots \\ x_{d1}^2 & \cdots & x_{dn}^2 \\ x_{11} x_{21} & \cdots & x_{1n} x_{2n} \\ \vdots & & \vdots \end{pmatrix}$$

i.e. embed in Higher Dimensional Space, then slice with a hyperplane.
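As an illustration of "adding rows" (a minimal numpy sketch of my own; the function name is mine, not from the slides), here is a quadratic embedding of a $d \times n$ data matrix:

```python
import numpy as np
from itertools import combinations

def embed_quadratic(X):
    """X has shape (d, n): one column per case. Add rows of squares and cross products."""
    d, n = X.shape
    squares = X ** 2
    crosses = [X[i] * X[j] for i, j in combinations(range(d), 2)]
    return np.vstack([X, squares] + crosses)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))              # d = 3 variables, n = 5 cases
print(embed_quadratic(X).shape)          # (9, 5): original rows, squares, cross products
```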

Kernel Embedding
Embedded Fisher Linear Discrimination:

Choose Class 1, for any $x_0 \in \mathbb{R}^d$, when:

$$x_0^t \, \hat{\Sigma}_w^{-1} \left( \bar{X}^{(1)} - \bar{X}^{(2)} \right) \;\ge\; \tfrac{1}{2} \left( \bar{X}^{(1)} + \bar{X}^{(2)} \right)^t \hat{\Sigma}_w^{-1} \left( \bar{X}^{(1)} - \bar{X}^{(2)} \right)$$

in embedded space.

• image of class boundaries in original space is nonlinear
• allows more complicated class regions
• Can also do Gaussian Lik. Rat. (or others)
• Compute image by classifying points from original space
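A minimal numpy sketch of this rule (my own code and toy data, for illustration only), applicable in any original or embedded feature space:

```python
import numpy as np

def fld_classify(x0, X1, X2):
    """Classify x0 into class 1 or 2 by the FLD rule above.
    X1, X2: class data matrices of shape (d, n1) and (d, n2), columns = cases."""
    m1, m2 = X1.mean(axis=1), X2.mean(axis=1)
    n1, n2 = X1.shape[1], X2.shape[1]
    # pooled within-class covariance estimate, Sigma_hat_w
    Sw = ((n1 - 1) * np.cov(X1) + (n2 - 1) * np.cov(X2)) / (n1 + n2 - 2)
    w = np.linalg.solve(Sw, m1 - m2)                  # Sigma_w^{-1} (Xbar1 - Xbar2)
    return 1 if x0 @ w >= 0.5 * (m1 + m2) @ w else 2

rng = np.random.default_rng(1)
X1 = rng.normal(loc=+1.0, size=(3, 20))               # class 1: shifted Gaussian
X2 = rng.normal(loc=-1.0, size=(3, 20))               # class 2
print(fld_classify(np.ones(3), X1, X2))               # expect 1
```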

Kernel Embedding
Visualization for Toy Examples:

• Have linear discrimination in embedded space
• Study effect in original data space
• Via implied nonlinear regions

Approach (see the sketch below):
• Use test set in original space (dense equally spaced grid)
• Apply embedded discrimination rule
• Color using the result
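A minimal sketch of this grid-coloring visualization (my own code; scikit-learn's PolynomialFeatures + LinearDiscriminantAnalysis stand in for "FLD in embedded space", and the toy data are mine):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-1.0, 0.0], 0.5, (50, 2)),
               rng.normal([+1.0, 0.0], 0.5, (50, 2))])
y = np.repeat([0, 1], 50)

# FLD on a cubic polynomial embedding of the 2-d data
clf = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                    LinearDiscriminantAnalysis())
clf.fit(X, y)

# dense equally spaced grid in the ORIGINAL space, colored by the embedded rule
xx, yy = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
zz = clf.predict(np.column_stack([xx.ravel(), yy.ravel()])).reshape(xx.shape)

plt.pcolormesh(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor="k")
plt.title("Implied nonlinear class regions in the original space")
plt.show()
```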

Kernel Embedding
Polynomial Embedding, Toy Example 1: Parallel Clouds

Kernel Embedding
Polynomial Embedding, Toy Example 1: Parallel Clouds

• PC 1:
  – always bad
  – finds "embedded greatest var." only
• FLD:
  – stays good
• GLR:
  – OK discrimination at data
  – but overfitting problems

Kernel Embedding
Polynomial Embedding, Toy Example 2: Split X

Kernel Embedding
Polynomial Embedding, Toy Example 2: Split X

• FLD:

– Rapidly improves with higher degree

• GLR:

– Always good

– but never ellipse around blues…

Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut

Kernel Embedding
Polynomial Embedding, Toy Example 3: Donut

• FLD:
  – Poor fit for low degree
  – then good
  – no overfit
• GLR:
  – Best with No Embed
  – Square shape for overfitting?

Kernel Embedding

Drawbacks to polynomial embedding:

• too many extra terms create spurious structure
• i.e. have "overfitting"
• HDLSS problems typically get worse

Kernel Embedding
Hot Topic Variation: "Kernel Machines"

Idea: replace polynomials by other nonlinear functions

e.g. 1: sigmoid functions from neural nets

e.g. 2: radial basis functions – Gaussian kernels
        (related to "kernel density estimation"; recall: smoothed histogram)

Kernel Embedding
Radial Basis Functions:

Note: there are several ways to embed:

• Naïve Embedding (equally spaced grid)

• Explicit Embedding (evaluate at data)

• Implicit Embedding (inner prod. based)

(everybody currently does the latter)

Kernel Embedding
Naïve Embedding, Radial Basis Functions:

At some "grid points" $g_1, \ldots, g_k \in \mathbb{R}^d$,

For a "bandwidth" (i.e. standard deviation) $\sigma$,

Consider ($k$ dimensional) functions: $\varphi_\sigma(x - g_1), \ldots, \varphi_\sigma(x - g_k)$

Replace the data matrix with:

$$\begin{pmatrix} \varphi_\sigma(X_1 - g_1) & \cdots & \varphi_\sigma(X_n - g_1) \\ \vdots & & \vdots \\ \varphi_\sigma(X_1 - g_k) & \cdots & \varphi_\sigma(X_n - g_k) \end{pmatrix}$$

Kernel Embedding
Naïve Embedding, Radial Basis Functions:

For discrimination: work in the radial basis function space,

With a new data vector $X_0$ represented by:

$$\begin{pmatrix} \varphi_\sigma(X_0 - g_1) \\ \vdots \\ \varphi_\sigma(X_0 - g_k) \end{pmatrix}$$
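A minimal numpy sketch of this naïve radial basis embedding (my own code; Gaussian bumps at equally spaced grid points, with rows = cases rather than the slide's columns = cases):

```python
import numpy as np

def rbf_features(X, grid, sigma):
    """Entry (i, j) is phi_sigma(X_i - g_j), a Gaussian bump of bandwidth sigma."""
    sq_dists = ((X[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# equally spaced grid over [-2, 2]^2 -- the "naive" part, which scales badly in d
g = np.linspace(-2, 2, 5)
grid = np.array(np.meshgrid(g, g)).reshape(2, -1).T      # 25 grid points in 2-d

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))                             # 10 cases
print(rbf_features(X, grid, sigma=0.7).shape)            # (10, 25) embedded data matrix
```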

Kernel Embedding
Naïve Embedding, Toy E.g. 1: Parallel Clouds

• Good at data
• Poor outside

Kernel Embedding
Naïve Embedding, Toy E.g. 2: Split X

• OK at data
• Strange outside

Kernel Embedding
Naïve Embedding, Toy E.g. 3: Donut

• Mostly good
• Slight mistake for one kernel

Kernel Embedding
Naïve Embedding, Radial Basis Functions:

Toy Examples, Main lessons:

• Generally good in regions with data
• Unpredictable where data are sparse

Kernel Embedding
Toy Example 4: Checkerboard

Very challenging!

Linear method?
Polynomial embedding?

Kernel Embedding
Toy Example 4: Checkerboard

Polynomial Embedding:

• Very poor for linear
• Slightly better for higher degrees
• Overall very poor
• Polynomials don't have the needed flexibility

Kernel Embedding
Toy Example 4: Checkerboard

Radial Basis Embedding + FLD is excellent!
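A minimal sketch in that spirit (my own code and toy data, not the original Matlab example; grid size and bandwidth are guesses): checkerboard data, a naïve Gaussian radial basis embedding on a grid, then Fisher Linear Discrimination in the embedded space.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = rng.uniform(0, 4, size=(800, 2))
y = ((np.floor(X[:, 0]) + np.floor(X[:, 1])) % 2).astype(int)   # checkerboard labels

g = np.linspace(0, 4, 9)                                        # equally spaced grid
grid = np.array(np.meshgrid(g, g)).reshape(2, -1).T             # 81 Gaussian centers
sigma = 0.5
Phi = np.exp(-((X[:, None, :] - grid[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))

fld = LinearDiscriminantAnalysis().fit(Phi, y)                  # FLD in the embedded space
print("training accuracy:", fld.score(Phi, y))                  # expected to be high here,
                                                                # unlike low-degree polynomials
```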

Kernel Embedding
Drawbacks to naïve embedding:

• Equally spaced grid too big in high $d$
• Not computationally tractable ($g^d$ grid points)

Approach:

• Evaluate only at data points
• Not on full grid
• But where data live

Kernel Embedding
Other types of embedding:

• Explicit
• Implicit

Will be studied soon, after an introduction to Support Vector Machines…

Kernel Embedding
Generalizations of this idea to other types of analysis, & some clever computational ideas.

E.g. "Kernel based, nonlinear Principal Components Analysis"

Ref: Schölkopf, Smola and Müller (1998)
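A minimal sketch of the idea (my own example; scikit-learn's KernelPCA implements kernel PCA in the spirit of that reference, it is not the paper's own code):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 300)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(0, 0.05, (300, 2))

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=4.0)
scores = kpca.fit_transform(X)          # nonlinear principal component scores
print(scores.shape)                     # (300, 2)
```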

Support Vector Machines
Motivation:

• Find a linear method that "works well" for embedded data
• Note: embedded data are very non-Gaussian
• Suggests the value of a really new approach

Support Vector Machines
Classical References:

• Vapnik (1982)

• Boser, Guyon & Vapnik (1992)

• Vapnik (1995)

Excellent Web Resource:

• http://www.kernel-machines.org/

Support Vector Machines
Recommended tutorial:

• Burges (1998)

Recommended Monographs:

• Cristianini & Shawe-Taylor (2000)

• Schölkopf & Smola (2002)

Support Vector Machines
Graphical View, using Toy Example:

• Find separating plane
• To maximize distances from data to plane
• In particular the smallest distance
• Data points closest are called support vectors
• Gap between them is called the margin

SVMs, Optimization Viewpoint

Formulate optimization problem, based on:

• Data (feature) vectors $x_1, \ldots, x_n$
• Class labels $y_i = \pm 1$
• Normal vector $w$
• Location (determines intercept) $b$
• Residuals (right side) $r_i = y_i \left( x_i^t w + b \right)$
• Residuals (wrong side) $\xi_i = -r_i$
• Solve (convex problem) by quadratic programming

SVMs, Optimization Viewpoint

Lagrange Multipliers primal formulation (separable case):

• Minimize:

$$L_P(w, b, \alpha) = \tfrac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( x_i^t w + b \right) - 1 \right]$$

  where $\alpha_1, \ldots, \alpha_n \ge 0$ are Lagrange multipliers

Dual Lagrangian version:
• Maximize:

$$L_D(\alpha) = \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

Get classification function:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \langle x, x_i \rangle + b$$
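A minimal sketch connecting this dual to software output (my own example; scikit-learn's SVC exposes the support vectors and the products $\alpha_i y_i$):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, (30, 2)), rng.normal(+2, 1, (30, 2))])
y = np.repeat([-1, 1], 30)

svm = SVC(kernel="linear", C=10.0).fit(X, y)
print("support vectors :", svm.support_vectors_.shape)   # the x_i with alpha_i > 0
print("alpha_i * y_i   :", svm.dual_coef_)               # nonzero dual variables
print("intercept b     :", svm.intercept_)
```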

SVMs, Computation

Major Computational Point:
• Classifier only depends on data through inner products!
• Thus enough to only store inner products
• Creates big savings in optimization
• Especially for HDLSS data
• But also creates variations in kernel embedding (interpretation?!?)
• This is almost always done in practice

SVMs, Comput'n & Embedding

For an "Embedding Map" $\Phi(x)$, e.g. $\Phi(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}$

Explicit Embedding:

Maximize:

$$L_D(\alpha) = \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left\langle \Phi(x_i), \Phi(x_j) \right\rangle$$

Get classification function:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \left\langle \Phi(x), \Phi(x_i) \right\rangle + b$$

• Straightforward application of embedding
• But loses inner product advantage

SVMs, Comput'n & Embedding

Implicit Embedding:

Maximize:

$$L_D(\alpha) = \sum_{i} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \Phi\!\left( \langle x_i, x_j \rangle \right)$$

Get classification function:

$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \, \Phi\!\left( \langle x, x_i \rangle \right) + b$$

• Still defined only via inner products
• Retains optimization advantage
• Thus used very commonly
• Comparison to explicit embedding?
• Which is "better"???
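A minimal comparison sketch (my own scikit-learn example, not from the slides): the explicit route materializes the embedded features and runs a linear SVM there, while the implicit route uses a polynomial kernel, which is a function of the original inner products only.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)        # circular class boundary

# explicit: materialize the embedded features, then a LINEAR SVM in that space
Phi = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
explicit = SVC(kernel="linear", C=1.0).fit(Phi, y)

# implicit: degree-2 polynomial kernel, inner products only, Phi never materialized
implicit = SVC(kernel="poly", degree=2, coef0=1.0, C=1.0).fit(X, y)

print("explicit training accuracy:", explicit.score(Phi, y))
print("implicit training accuracy:", implicit.score(X, y))
```

The two classifiers are not identical (feature scalings differ), which is part of the "which is better?" question raised above.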

SVMs & Robustness

Usually not severely affected by outliers, but a possible weakness:
can have very influential points.

Toy E.g.: only 2 points drive the SVM

Notes:
• Huge range of chosen hyperplanes
• But all are "pretty good discriminators"
• Only happens when whole range is OK???
• Good or bad?

SVMs & Robustness

Effect of violators (toy example):

• Depends on distance to plane

• Weak for violators nearby

• Strong as they move away

• Can have major impact on plane

• Also depends on tuning parameter C

SVMs, Computation

Caution: available algorithms are not created equal

Toy Example:

• Gunn's Matlab code

• Todd's Matlab code

Serious errors in Gunn's version; it does not find the real optimum…

SVMs, Tuning Parameter

Recall Regularization Parameter C:

• Controls penalty for violation

• I.e. lying on wrong side of plane

• Appears in slack variables

• Affects performance of SVM

Toy Example:

d = 50, Spherical Gaussian data

SVMs, Tuning Parameter

Toy Example: d = 50, Spherical Gaussian data
X-axis: Opt. Dir'n    Other: SVM Dir'n

• Small C:
  – Where is the margin?
  – Small angle to optimal (generalizable)
• Large C:
  – More data piling
  – Larger angle (less generalizable)
  – Bigger gap (but maybe not better???)
• Between: Very small range

(A small numerical sketch of the angle effect follows.)
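A minimal sketch in the spirit of this toy example (my own code; the "optimal" direction is taken as the known mean-difference direction for spherical Gaussian classes):

```python
import numpy as np
from sklearn.svm import SVC

def angle_deg(u, v):
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

d, n = 50, 20
rng = np.random.default_rng(8)
shift = np.zeros(d); shift[0] = 2.2                      # optimal direction: first coordinate
X = np.vstack([rng.normal(size=(n, d)) + shift, rng.normal(size=(n, d))])
y = np.repeat([1, -1], n)

for C in [1e-3, 1e3]:                                    # small vs large tuning parameter
    w = SVC(kernel="linear", C=C).fit(X, y).coef_.ravel()
    print(f"C = {C:g}: angle(SVM direction, optimal) = {angle_deg(w, shift):.1f} deg")
```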

SVMs, Tuning Parameter

Toy Example: d = 50, Spherical Gaussian data

Careful look at small C: put MD on the horizontal axis.

• Shows SVM and MD same for C small
  – Mathematics behind this?
• Separates for large C
  – No data piling for MD

Distance Weighted Discrim'n

Improvement of SVM for HDLSS Data

Toy e.g. (similar to earlier movie): spherical Gaussian $N(0,1)$ data, $d = 50$, $n_1 = n_2 = 20$, class mean difference 2.2

Distance Weighted Discrim'n

Toy e.g.: Maximal Data Piling Direction
- Perfect separation
- Gross overfitting
- Large angle
- Poor gen'ability

Distance Weighted Discrim'n

Toy e.g.: Support Vector Machine Direction
- Bigger gap
- Smaller angle
- Better gen'ability
- Feels support vectors too strongly???
- Ugly subpops?
- Improvement?

Distance Weighted Discrim'n

Toy e.g.: Distance Weighted Discrimination Direction
- Addresses these issues
- Smaller angle
- Better gen'ability
- Nice subpops
- Replaces min dist. by avg. dist.

Distance Weighted Discrim'n

Based on Optimization Problem:

$$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$$

More precisely: work in an appropriate penalty for violations (one common way to write the full problem is sketched below)

Optimization Method: Second Order Cone Programming

• "Still convex" gen'n of quadratic programming
• Allows fast greedy solution
• Can use available fast software (SDPT3, Michael Todd, et al)
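For reference, a sketch of the penalized DWD problem (my paraphrase of the standard formulation in the DWD literature, not copied from these slides; $C$ is the violation-penalty tuning parameter and the $\xi_i$ are slack variables):

$$
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \sum_{i=1}^{n} \frac{1}{r_i} \;+\; C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & r_i = y_i \left( x_i^{t} w + b \right) + \xi_i \;\ge\; 0, \qquad \xi_i \ge 0, \qquad \|w\| \le 1 .
\end{aligned}
$$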

Distance Weighted Discrim'n

$d = 2$ Visualization:

$$\min_{w,b} \sum_{i=1}^{n} \frac{1}{r_i}$$

• Pushes plane away from data
• All points have some influence


DWD Batch and Source Adjustment

Recall from Class Meeting, 9/6/05: for Perou's Stanford Breast Cancer Data, analysis in Benito, et al (2004), Bioinformatics
https://genome.unc.edu/pubsup/dwd/

Use DWD as a useful direction vector to:

Adjust for Source Effects
  – Different sources of mRNA

Adjust for Batch Effects
  – Arrays fabricated at different times

(A schematic sketch of the adjustment step follows.)
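A minimal sketch of the adjustment idea (my own code): find a direction separating the two batches, then rigidly shift each batch along that direction so the projected batch means agree. The slides use the DWD direction; here a linear SVM direction stands in, since no particular DWD package is assumed.

```python
import numpy as np
from sklearn.svm import SVC

def batch_adjust(X, batch, direction):
    """Shift each batch along 'direction' so the projected batch means coincide."""
    w = direction / np.linalg.norm(direction)
    proj = X @ w
    Xa = X.copy()
    for b in np.unique(batch):
        idx = batch == b
        Xa[idx] -= (proj[idx].mean() - proj.mean()) * w    # rigid shift along w
    return Xa

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.0, 1, (30, 100)), rng.normal(1.5, 1, (30, 100))])  # 2 batches
batch = np.repeat([0, 1], 30)

direction = SVC(kernel="linear").fit(X, batch).coef_.ravel()   # stand-in for the DWD direction
X_adj = batch_adjust(X, batch, direction)
print("projected batch-mean gap before:",
      abs((X @ direction)[:30].mean() - (X @ direction)[30:].mean()))
print("projected batch-mean gap after :",
      abs((X_adj @ direction)[:30].mean() - (X_adj @ direction)[30:].mean()))
```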


DWD Adj: Biological Class Colors & Symbols


DWD Adj: Source Colors


DWD Adj: Source Adj'd, PCA view


DWD Adj: Source Adj'd, Class Colored


DWD Adj: S. & B. Adj'd, Adj'd PCA


Why not adjust using SVM?

Major Problem: Proj'd Distrib'al Shape

Triangular Dist'ns (oppositely skewed)

Does not allow a sensible rigid shift


Why not adjust using SVM?

Nicely Fixed by DWD

Projected Dist’ns near Gaussian

Sensible to shift


Why not adjust by means?

DWD is complicated: value added?

Xuxin Liu example…

Key is sizes of biological subtypes

Differing ratios trip up the mean

But DWD more robust

(although still not perfect)


Twiddle ratios of subtypes

Link to Movie


DWD in Face Recognition, I

Face Images as Data

(with M. Benito & D. Peña)

Registered using landmarks

Male – Female Difference?

Discrimination Rule?


DWD in Face Recognition, II

DWD Direction

Good separation

Images "make sense"

Garbage at ends? (extrapolation effects?)


DWD in Face Recognition, III

Interesting summary:

Jump between means (in DWD direction)

Clear separation of Maleness vs. Femaleness


DWD in Face Recognition, IV

Fun Comparison:

Jump between means (in SVM direction)

Also distinguishes Maleness vs. Femaleness

But not as well as DWD


DWD in Face Recognition, V

Analysis of difference: project onto normals

SVM has a "small gap" (feels noise artifacts?)

DWD "more informative" (feels real structure?)


DWD in Face Recognition, VI

Current Work:

Focus on “drivers”:

(regions of interest)

Relation to Discr’n?

Which is “best”?

Lessons for human perception?

• Fix links on face movies

• Next Topics:

• DWD outcomes, from SAMSI below

• DWD simulations, from SAMSI below

• Windup from FDA04-22-02.doc
  – General Conclusion
  – Validation

• Also SVMoverviewSAMSI09-06-03.doc

• Multi-Class SVMs
  – Lee, Y., Lin, Y. and Wahba, G. (2002) "Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data", U. Wisc. TR 1064.
  – So far only have "implicit" version
  – "Direction based" variation is unknown
