
Hebrew University 1

A New Paradigm for Feature Selection
With some surprising results

Amnon Shashua

School of Computer Science & Eng., The Hebrew University

Joint work with Lior Wolf

Wolf & Shashua, ICCV’03


Hebrew University 2

Problem Definition

Given a sample of feature measurements arranged in a matrix whose columns are the data points $M_1, \ldots, M_q$ and whose rows are the feature vectors $m_1^T, \ldots, m_n^T$:

$M = [M_1, \ldots, M_q] \in \mathbb{R}^{n \times q}$,  with the rows normalized, $\|m_i\| = 1$.

Find a subset of features $m_{i_1}, \ldots, m_{i_s}$ which are most "relevant" with respect to an inference (learning) task.

Comments:

• Data points can represent images (pixels), feature attributes, wavelet coefficients, ...
• The task is to select a subset of coordinates of the data points such that the accuracy, confidence, and training sample size of a learning algorithm would be optimal - ideally.
• Need to define a "relevance" score.
• Need to overcome the exponential nature of subset selection.
• If a "soft" selection approach is used, need to make sure the solution is "sparse".


Hebrew University 3

Examples:

• Text Classification: typically $10^4$–$10^7$ features representing word-frequency counts; only a small fraction is expected to be relevant. Typical examples include automatic sorting of URLs into a web directory and detection of spam email.

• Visual Recognition: small image patches serve as features, compared via

  $similarity(p_1, p_2) = e^{-\|h_1 - h_2\|_1}$

  (the $L_1$ distance between the patches' color histograms; see the examples later in the talk).

• Genomics:

(figure: gene-expression matrix; rows = gene expressions, columns = tissue samples)

Goal: recognizing the relevant genes which separate between normal and tumor cells, between different sub classes of cancer, and so on.


Hebrew University 4

Why Select Features?

• Most learning algorithms do not scale well with the growth of irrelevant features.
  ex1: the number of training examples required by some supervised learning methods grows exponentially. A typical sample-complexity bound has the form
  $m \;\ge\; \frac{1}{\varepsilon}\left(\log\frac{1}{\delta} + d\,\log\frac{1}{\varepsilon}\right)$
  (a worked instance of this bound follows the list below).
  ex2: for classifiers which can optimize their "capacity" (e.g., large-margin hyperplanes), the effective VC dimension d grows fast with irrelevant coordinates - faster than the capacity increase.

• Computational efficiency considerations when the number of coordinates is very high.

• Run-time of the (already trained) inference engine on new test examples.

• Structure of data gets obscured with large amounts of irrelevant coordinates.
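To make the dependence on d concrete, here is a worked instance of the sample-complexity bound above, with illustrative values $\varepsilon = \delta = 0.1$ (the numbers are mine, not from the slides, and the logarithms are read as natural logs):

$m \;\ge\; \frac{1}{\varepsilon}\left(\log\frac{1}{\delta} + d\,\log\frac{1}{\varepsilon}\right) \;=\; 10\,(\ln 10 + d\,\ln 10) \;\approx\; 23\,(1 + d)$

So an effective dimension of d = 20 asks for roughly 480 training examples, while d = 200 (the same problem padded with irrelevant coordinates) asks for roughly 4,600: the required sample size grows with every irrelevant coordinate that inflates d.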


Hebrew University 5

Existing Approaches

• Filter methods: pre-processing of the data, independent of the inference engine. Examples: mutual information measures, correlation coefficients, clustering, ...

• Embedded / Wrapper methods: select the features that are useful for building a good predictor/inference. Example: run SVM training on every candidate subset of features; a computationally expensive approach in general.


Hebrew University 6

Feature Subset Relevance - Key Idea

Selecting a subset of l rows (features) maps each data point $M_i \in \mathbb{R}^n$ to a reduced data point $\hat{M}_i \in \mathbb{R}^{l}$, with $l \le n$:

$M_1, \ldots, M_q \in \mathbb{R}^n \;\longrightarrow\; \hat{M}_1, \ldots, \hat{M}_q \in \mathbb{R}^{l}$

Working Assumption: the relevant subset of rows induces columns that are coherently clustered.

Note: we are assuming an unsupervised setting (data points are not labeled). The framework can easily apply to supervised settings as well.


Hebrew University 7

• How to measure cluster coherency? We wish to avoid explicit clustering for each subset of rows, and we want a measure amenable to continuous functional analysis.

key idea: use spectral information from the affinity matrix

Let $\hat{M} = [\hat{M}_1, \ldots, \hat{M}_q]$ denote the reduced data matrix and consider its affinity matrix $\hat{M}^T \hat{M}$.

• How to represent $\hat{M}^T \hat{M}$? For a subset of features $s = \{i_1, \ldots, i_l\}$,

$A_\alpha = \hat{M}^T \hat{M} = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$,  where  $\alpha_i = 1$ if $i \in s$ and $\alpha_i = 0$ otherwise.


Hebrew University 8

Definition of Relevancy: The Standard Spectrum

General Idea:

Select a subset of rows from the sample matrix M such that the resulting affinity matrix will have high values associated with the first k eigenvalues.

For a subset of features $s = \{i_1, \ldots, i_l\}$ (rows $m_{i_1}^T, \ldots, m_{i_l}^T$ of M), the resulting affinity matrix is

$A_\alpha = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$,  with  $\alpha_i = 1$ if $i \in s$ and $\alpha_i = 0$ otherwise,

and $Q_\alpha$ consists of the first k eigenvectors of $A_\alpha$. The relevance score is

$rel(m_{i_1}, \ldots, m_{i_l}) = trace\!\left(Q_\alpha^T A_\alpha^T A_\alpha Q_\alpha\right) = \sum_{j=1}^{k} \lambda_j^2$
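A minimal numpy sketch of this relevance score, assuming the rows of M are the feature vectors $m_i^T$; the function name and the use of eigvalsh are my own choices, not from the slides:

```python
import numpy as np

def relevance(M, alpha, k):
    """rel(alpha) = trace(Q^T A^T A Q) = sum of the squares of the k largest
    eigenvalues of A_alpha = sum_i alpha_i m_i m_i^T (rows of M are the m_i^T)."""
    A = (M.T * alpha) @ M                # A_alpha: q x q symmetric matrix
    eigvals = np.linalg.eigvalsh(A)      # eigenvalues in ascending order
    return np.sum(eigvals[-k:] ** 2)     # sum_{j=1}^{k} lambda_j^2
```

For a hard subset s, alpha is simply the 0/1 indicator vector of s; the soft versions below relax it to real-valued weights.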


Hebrew University 9

(unsupervised) Optimization Problem

Let $A_\alpha = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$.

$\max_{Q,\,\alpha_1,\ldots,\alpha_n}\; trace\!\left(Q^T A_\alpha^T A_\alpha Q\right)$   subject to   $Q^T Q = I$,  $\alpha \in \{0,1\}^n$

The optimization is too difficult to be considered in practice (mixed integer and continuous programming).


Hebrew University 10

(unsupervised) Optimization Problem: Soft Version I

Let $A_\alpha = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$.

$\max_{Q,\,\alpha_1,\ldots,\alpha_n}\; trace\!\left(Q^T A_\alpha^T A_\alpha Q\right) + h(\alpha)$   subject to   $Q^T Q = I$,  $\alpha_i \ge 0$,  $\sum_i \alpha_i = 1$

The non-linear function $h(\alpha)$ penalizes a uniform $\alpha$.

The result is a non-linear programming problem - could be quite difficult to solve.


Hebrew University 11

(unsupervised) Optimization Problem

Let $A_\alpha = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$ for some unknown real scalars $\alpha = (\alpha_1, \ldots, \alpha_n)^T$.

$\max_{Q,\,\alpha_1,\ldots,\alpha_n}\; trace\!\left(Q^T A_\alpha^T A_\alpha Q\right)$   subject to   $Q^T Q = I$,  $\alpha^T \alpha = 1$

Note: this optimization definition ignores the requirements:

1. $\alpha_i \ge 0$
2. The weight vector $\alpha$ should be sparse.

Motivation: from spectral clustering it is known that the eigenvectors tend to be discontinuous, and that may lead to an effortless sparsity property.


Hebrew University 12

The Q-α Algorithm

$\max_{Q,\,\alpha_1,\ldots,\alpha_n}\; trace\!\left(Q^T A_\alpha^T A_\alpha Q\right)$   subject to   $Q^T Q = I$,  $\alpha^T \alpha = 1$

If $\alpha$ were known, then $A_\alpha$ is known and Q is simply the matrix of the first k eigenvectors of $A_\alpha$.

If Q were known, then the problem becomes

$\max_{\alpha_1,\ldots,\alpha_n}\; \alpha^T G \alpha$   subject to   $\alpha^T \alpha = 1$,   where   $G_{ij} = (m_i^T m_j)\,(m_i^T Q Q^T m_j)$,

and $\alpha$ is the largest eigenvector of G.


Hebrew University 13

The Q-α Algorithm: Power-Embedded Version

1. Let $G^{(r)}$ be defined by $G^{(r)}_{ij} = (m_i^T m_j)\,(m_i^T Q^{(r-1)} Q^{(r-1)T} m_j)$.
2. Let $\alpha^{(r)}$ be the largest eigenvector of $G^{(r)}$.
3. Let $A^{(r)} = \sum_{i=1}^{n} \alpha_i^{(r)} m_i m_i^T$.
4. Let $Z^{(r)} = A^{(r)} Q^{(r-1)}$.
5. $Z^{(r)} \xrightarrow{\;QR\;} Q^{(r)} R^{(r)}$  ("QR" factorization step).
6. Increment r and repeat.

Convergence proof (this is an orthogonal iteration): take k = 1 for example. Steps 4, 5 become

$q^{(r)} = \frac{A\, q^{(r-1)}}{\|A\, q^{(r-1)}\|}$

Need to show: $q^{(r)T} A^2 q^{(r)} \ge q^{(r-1)T} A^2 q^{(r-1)}$, which reduces to

$\frac{q^T A^4 q}{q^T A^2 q} \;\ge\; q^T A^2 q$

for all symmetric matrices A and unit vectors q - this follows from convexity.
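A compact numpy sketch of the six steps above. The random orthonormal initialization, the fixed iteration count, and the sign convention for alpha are my own choices, not prescribed by the slides:

```python
import numpy as np

def q_alpha(M, k, n_iter=50, seed=0):
    """Power-embedded Q-alpha iteration (sketch).

    M : (n, q) matrix whose rows m_i^T are the feature vectors.
    k : number of leading eigenvectors (assumed clusters).
    Returns the feature weights alpha (n,) and the orthonormal Q (q, k).
    """
    rng = np.random.default_rng(seed)
    n, q = M.shape
    Q, _ = np.linalg.qr(rng.standard_normal((q, k)))   # random orthonormal start
    inner = M @ M.T                                    # m_i^T m_j, computed once
    for _ in range(n_iter):
        P = M @ Q
        G = inner * (P @ P.T)                          # G_ij = (m_i^T m_j)(m_i^T Q Q^T m_j)
        _, V = np.linalg.eigh(G)
        alpha = V[:, -1]                               # largest eigenvector of G
        alpha *= np.sign(alpha.sum() or 1.0)           # pick the (expected) positive sign
        A = (M.T * alpha) @ M                          # A_alpha = sum_i alpha_i m_i m_i^T
        Q, _ = np.linalg.qr(A @ Q)                     # Z = A Q, then the "QR" step
    return alpha, Q
```

Features are then ranked (or thresholded) by the weights alpha, which the following slides argue come out sparse and positive.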


Hebrew University 14

Positivity and Sparsity of α: a Hand-Waving Argument

$\arg\max_{Q,\,\alpha_1,\ldots,\alpha_n}\; trace\!\left(Q^T A_\alpha^T A_\alpha Q\right) \;=\; \arg\min_{Q,\,\alpha_1,\ldots,\alpha_n}\; \left\| A_\alpha - A_\alpha Q Q^T \right\|_F^2$   (add redundant terms)

The Frobenius residual is minimized when rank($A_\alpha$) = k.

$A_\alpha = \sum_{i=1}^{n} \alpha_i\, m_i m_i^T$ is a sum of rank-1 matrices. If we would like rank($A_\alpha$) to be small, we shouldn't add too many rank-1 terms; therefore $\alpha$ should be sparse.

Note: this argument does not say anything with regard to why $\alpha$ should be positive.


Hebrew University 15

Positivity and Sparsity of α

The key to the emergence of a sparse and positive α has to do with the way the entries of G are defined:

$G_{ij} = (m_i^T m_j)\,(m_i^T Q Q^T m_j) = \sum_{l=1}^{k} (m_i^T m_j)(q_l^T m_i)(q_l^T m_j)$

Consider k = 1 for example; then each entry is of the form

$f = (a^T b)(a^T c)(b^T c)$,   with   $\|a\| = \|b\| = \|c\| = 1$.

Clearly $-1 < f \le 1$: reaching $f = -1$ would require $(a^T b) = (a^T c) = (b^T c) = -1$, or $(a^T b) = -1$, $(a^T c) = (b^T c) = 1$ (and permutations), which cannot happen.

Expected values of the entries of G are therefore biased towards a positive number.


Hebrew University 16

Positivity and Sparsity of α

1. What is the minimal value of $f = (a^T b)(a^T c)(b^T c)$ when $a, b, c$ vary over the n-dimensional unit hypersphere?   Answer: $-\frac{1}{8} \le f \le 1$.

2. Given a uniform sampling of $a, b, c$ over the n-dimensional unit hypersphere, what are the mean $\mu$ and variance $\sigma^2$ of $f$?   Answer: $\mu = \frac{1}{6}$, $\sigma^2 = \frac{1}{18}$.

3. Given that $G_{ij} \sim N(\mu, \sigma^2)$, what is the probability that the first eigenvector of G is strictly non-negative (all entries of the same sign)?


Hebrew University 17

Proposition 3:

$G_{ij} \sim \begin{cases} N(\mu,\, \sigma^2) & i \ne j \\ N(\tfrac{1}{2},\, \sigma^2) & i = j \end{cases}$   with $\mu > 0$ and an infinitesimal $\sigma$.

Let $x$ be the largest eigenvector, $G x = \lambda x$. Then

$p(x > 0) \;=\; \left[\Phi_{[0,\,\sigma/\mu]}\!\left(\tfrac{1}{n}\right)\right]^{n} \;\xrightarrow{\;n \to \infty\;}\; 1$

where $\Phi_{[\mu,\,\sigma^2]}(x)$ is the cumulative distribution function of $N(\mu, \sigma^2)$.

(figure: the quantity $\left[\Phi_{[0,\,\sigma/\mu]}(1/n)\right]^{n}$ plotted against n, together with the empirical probability)


Hebrew University 18

Proposition 4: (with O. Zeitouni and M. Ben-Or)

$p(x > 0) \;\xrightarrow{\;n \to \infty\;}\; 1$   for any value of $\sigma$, whenever $\mu > 0$.
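A quick Monte-Carlo sanity check of this behavior. The matrix construction (i.i.d. Gaussian off-diagonal entries, diagonal mean 1/2 as in Proposition 3) and the parameter choices are mine; the propositions themselves concern the asymptotic regime:

```python
import numpy as np

def prob_one_signed_top_eigvec(n, mu, sigma, trials=200, seed=0):
    """Estimate how often the leading eigenvector of a symmetric matrix with
    N(mu, sigma^2) off-diagonal entries (and N(1/2, sigma^2) diagonal) has all
    entries of the same sign."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        upper = np.triu(rng.normal(mu, sigma, size=(n, n)), 1)
        G = upper + upper.T                          # symmetric, N(mu, sigma^2) off-diagonals
        np.fill_diagonal(G, rng.normal(0.5, sigma, size=n))
        x = np.linalg.eigh(G)[1][:, -1]              # leading eigenvector
        hits += bool((x >= 0).all() or (x <= 0).all())
    return hits / trials

# e.g. prob_one_signed_top_eigvec(200, mu=0.15, sigma=0.3) comes out close to 1,
# while mu = 0 leaves the sign pattern essentially random.
```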


Hebrew University 19

Sparsity Gap: Definition

Let $0 < p < 1$ be the fraction of relevant features and $q = 1 - p$.

Let $G = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix}$, where the entries of the $np \times np$ block A are drawn from $N(\mu_a, \sigma^2)$ and the entries of B and C are drawn from $N(\mu_b, \sigma^2)$ with $\mu_b = \frac{1}{6}$.

Let $x = (x_1, x_2)^T$ be the largest eigenvector of G, where $x_1$ holds the first np entries and $x_2$ holds the remaining nq entries.

The sparsity gap corresponding to G is the ratio $\rho = \frac{\bar{x}_1}{\bar{x}_2}$, where $\bar{x}_1$ is the mean of $x_1$ and $\bar{x}_2$ is the mean of $x_2$.
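A small numpy sketch that measures the sparsity gap empirically for this block model; the defaults echo the example on the next slide, but the function and its parameter values are my own illustration:

```python
import numpy as np

def sparsity_gap(n=100, p=0.15, mu_a=0.85, mu_b=1/6, sigma=0.1, seed=0):
    """Build the block matrix G (relevant np x np block with mean mu_a, the rest
    with mean mu_b), take its leading eigenvector x = (x1, x2), and return
    rho = mean(x1) / mean(x2)."""
    rng = np.random.default_rng(seed)
    n_rel = int(round(p * n))
    G = rng.normal(mu_b, sigma, size=(n, n))
    G[:n_rel, :n_rel] = rng.normal(mu_a, sigma, size=(n_rel, n_rel))
    G = (G + G.T) / 2                              # symmetrize (halves the noise variance)
    x = np.linalg.eigh(G)[1][:, -1]
    x *= np.sign(x.sum())                          # fix the overall sign
    return x[:n_rel].mean() / x[n_rel:].mean()

# A larger relevant-block mean mu_a (or a smaller fraction p) widens the gap rho.
```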


Hebrew University 20

Sparsity Gap: Proposition 4

Let G be the $2 \times 2$ block-constant mean matrix

$G = \begin{bmatrix} \mu_a \mathbf{1}_{np \times np} & \mu_b \mathbf{1}_{np \times nq} \\ \mu_b \mathbf{1}_{nq \times np} & \mu_b \mathbf{1}_{nq \times nq} \end{bmatrix}$

and let $x = (x_1, x_2)^T$ be the largest eigenvector of G.

The sparsity gap corresponding to G is:

$\rho \;=\; \frac{\bar{x}_1 + N(0,\, \sigma^2 np)}{\bar{x}_2 + N(0,\, \sigma^2 nq)}$

Example: $\mu_a = 0.85$, $\mu_b = \frac{1}{6}$, $n = 100$, which gives $\rho$ in closed form as a function of the fraction of relevant features p.

(figure: $\rho$ plotted as a function of p for these parameters)


Hebrew University 21

The feature selection for SVM benchmark

• Two synthetic data sets were proposed in a paper by Weston, Mukherjee, Chapelle, Pontil, Poggio and Vapnik from NIPS 2001.

• The data sets were designed to have a few features which are separable by SVM, combined with many irrelevant features.

• The data sets were designed for the labeled case.

The linear dataset

• The linear data set is almost linearly separable once the correct features are recovered.

• There were 6 relevant features and 196 irrelevant ones.

• With probability 0.7 the data is almost separable by the first 3 relevant features and not separable by the remaining 3 relevant features; with probability 0.3 the second group of relevant features is the separable one. The remaining 196 features were drawn from N(0, 20).


Results – linear data set


The unsupervised algorithm started to be effective only from 80 data points upward, and is not shown here.

Results – non-linear data set


Hebrew University 24

There are two species of frogs in this figure:


Hebrew University 25

American toad / Green frog (Rana clamitans)


Hebrew University 26

Automatic separation

• We use small image patches as basic features.

• In order to compare patches we use the L1 norm on their color histograms:

$similarity(p_1, p_2) = e^{-\|h_1 - h_2\|_1}$,   where $h_1, h_2$ are the color histograms of patches $p_1, p_2$.


Hebrew University 27

The matrix A: many features over ~40 images

The similarity between an image A and a patch B is the maximum over all similarities between the patches p of the image A and the patch B:

$similarity(A, B) = \max_{p \in A}\; similarity(p, B)$
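A small sketch of these two similarity measures, assuming RGB patches with values in [0, 255] and a simple binned color histogram; the helper names and the bin count are mine, not from the slides:

```python
import numpy as np

def color_histogram(patch, bins=8):
    """L1-normalized 3-D color histogram of an RGB patch of shape (H, W, 3)."""
    hist, _ = np.histogramdd(patch.reshape(-1, 3), bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def patch_similarity(p1, p2, bins=8):
    """similarity(p1, p2) = exp(-||h1 - h2||_1) on the color histograms."""
    h1, h2 = color_histogram(p1, bins), color_histogram(p2, bins)
    return np.exp(-np.abs(h1 - h2).sum())

def image_patch_similarity(image_patches, patch, bins=8):
    """Similarity between an image (given as its list of patches) and a patch:
    the maximum over all patch-to-patch similarities."""
    return max(patch_similarity(p, patch, bins) for p in image_patches)
```

Each column of the matrix on this slide can then be filled by evaluating image_patch_similarity for one image against every candidate patch feature.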


Hebrew University 28

American toad / Green frog (Rana clamitans)

Selected features

Using these features the clustering was correct on 80% of the samples – compared to 55% correct clustering using conventional spectral clustering


Hebrew University 29

Sea elephant / Elephant

Another example

Using these features the clustering was correct on 90% of the samples – compared to 65% correct clustering using conventional spectral clustering


Hebrew University 30

Genomics

(figure: gene-expression matrix; rows = gene expressions, columns = tissue samples)

The microarray technology provides many measurements of gene expressions for different sample tissues.

Goal: recognizing the relevant genes that separate between cells with different biological characteristics (normal vs. tumor, different subclasses of tumor cells).

• Classification of tissue samples (type of cancer, normal vs. tumor)
• Find novel subclasses (unsupervised)
• Find genes responsible for classification (new insights for drug design)

Few samples (~50) and large dimension (~10,000)


Hebrew University 31

The synthetic dataset of Ben-Dor, Friedman, and Yakhini

• The model consists of 6 parameters:

Parameter description Leukemia

a # class A Samples 25

b # class B Samples 47

m # features 600

e,(1-e) % irrelevant/relevant 72%,28%

(3d) Size of interval of means d=555

s Std coefficient .75

• A relevant feature is sampled from $N(\mu_A, s\mu_A)$ or $N(\mu_B, s\mu_B)$, where the class means $\mu_A, \mu_B$ are sampled uniformly from $[-1.5d, 1.5d]$.

• An irrelevant feature is sampled from $N(0, s)$.
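A hedged generator sketch following this description. The slide leaves the exact role of the std coefficient ambiguous, so the standard deviation of a relevant feature is read here as s·|class mean| (an assumption), and the defaults mirror the Leukemia column of the table above:

```python
import numpy as np

def ben_dor_synthetic(a=25, b=47, m=600, e=0.72, d=555, s=0.75, seed=0):
    """Synthetic two-class expression data in the spirit of the
    Ben-Dor/Friedman/Yakhini model described on the slide.
    Returns X of shape (m, a + b) (features x samples) and the 0/1 labels."""
    rng = np.random.default_rng(seed)
    labels = np.array([0] * a + [1] * b)
    n_rel = m - int(round(e * m))                     # (1 - e) of the features are relevant
    X = np.empty((m, a + b))
    for i in range(n_rel):                            # relevant features
        mu_A, mu_B = rng.uniform(-1.5 * d, 1.5 * d, size=2)
        X[i, labels == 0] = rng.normal(mu_A, s * abs(mu_A), size=a)
        X[i, labels == 1] = rng.normal(mu_B, s * abs(mu_B), size=b)
    X[n_rel:] = rng.normal(0.0, s, size=(m - n_rel, a + b))   # irrelevant features
    return X, labels
```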


Hebrew University 32

Param. | description           | Leuk. | MSA      | Q-α               | remarks
a      | # class A samples     | 25    |          |                   |
b      | # class B samples     | 47    |          |                   |
m      | # features            | 432   | <250     | <5                | MSA uses redundancy
e      | % irrelevant features | 168   | >95%     | >99.5%            | easy data set
d      | size of interval      | 555   | [1,1000] | at least [1,1000] | data is normalized
s      | spread                | .75   | <2       | <1000             | MSA needs good separation

The synthetic dataset of Ben-Dor, Friedman, and Yakhini

• MSA – max surprise algorithm of Ben-Dor, Friedman, and Yakhini.

• Results of simulations done by varying one parameter out of m, d, e, s.


Hebrew University 33

Follow Up Work

Feature selection with “side” information:

Given the "main" data M and the "side" data W, find weights $\alpha = (\alpha_1, \ldots, \alpha_n)^T$ such that

$\sum_{i=1}^{n} \alpha_i\, m_i m_i^T$   has k coherent clusters, and

$\sum_{i=1}^{n} \alpha_i\, w_i w_i^T$   has low cluster coherence (a single cluster).

Shashua & Wolf, ECCV’04


Hebrew University 34

Follow Up Work

"Kernelizing" Q-α:

$m_i \rightarrow \phi(m_i)$   (a high-dimensional mapping),   with   $\phi(m_i)^T \phi(m_j) = k(m_i, m_j)$

$A_\alpha = \sum_{i=1}^{n} \alpha_i\, \phi(m_i)\, \phi(m_i)^T$

Rather than having inner-products we have outer-products.

Shashua & Wolf, ECCV’04


Hebrew University 35

END