Similarity-based Classifiers: Problems and Solutions


TRANSCRIPT

Page 1: Similarity-based Classifiers: Problems and Solutions

Similarity-based Classifiers:

Problems and Solutions

Page 2: Similarity-based Classifiers: Problems and Solutions

2

Classifying based on similarities:

Van Gogh or Monet?

Van Gogh

Monet

Page 3: Similarity-based Classifiers: Problems and Solutions

3

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$ (paintings), $y_i \in \mathcal{G}$ (painter), $i = 1, \ldots, n$

Page 4: Similarity-based Classifiers: Problems and Solutions

4

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \ldots, n$

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$

Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \; \cdots \; y_n]^T$

Page 5: Similarity-based Classifiers: Problems and Solutions

5

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \ldots, n$

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$

Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \; \cdots \; y_n]^T$

Test similarities: $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$ and $\psi(x, x)$

Problem: estimate the class label $y$ for the test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
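As a concrete (if naive) baseline for this setup, here is a minimal sketch, my own illustration rather than anything from the talk: predict the label of the single most similar training sample, using only $s$ and $y$.

```python
import numpy as np

def most_similar_label(s, y):
    """Label of the training sample most similar to the test sample.

    s : (n,) similarities psi(x, x_i) to the n training samples
    y : (n,) training labels
    """
    return y[np.argmax(s)]

# Toy usage: 4 training samples, 2 classes
y = np.array([0, 0, 1, 1])
s = np.array([0.2, 0.1, 0.7, 0.4])
print(most_similar_label(s, y))   # -> 1
```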

Page 6: Similarity-based Classifiers: Problems and Solutions

6

Examples of Similarity Functions

Computational biology:
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)

Computer vision:
– Tangent distance (Duda et al., 2001)
– Earth mover’s distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval:
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)

Page 7: Similarity-based Classifiers: Problems and Solutions

7

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 9: Similarity-based Classifiers: Problems and Solutions

9

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Page 10: Similarity-based Classifiers: Problems and Solutions

10

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Example inner product: $\langle x, z \rangle = x^T z$.

Properties of an inner product $\langle x, z \rangle$: conjugate symmetric and real; linear: $\langle a x, z \rangle = a \langle x, z \rangle$; positive definite: $\langle x, x \rangle > 0$ unless $x = 0$.

An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.

Page 11: Similarity-based Classifiers: Problems and Solutions

11

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Inner products are similarities.

Are our notions of similarity always inner products? No!

Page 12: Similarity-based Classifiers: Problems and Solutions

12

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figure: the 96 × 96 matrix S of training similarities]

Inner product-like?

Page 13: Similarity-based Classifiers: Problems and Solutions

13

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figure: the 96 × 96 matrix S of training similarities]

$\phi(\text{HTF}, \text{Bishop}) = 3$ but $\phi(\text{Bishop}, \text{HTF}) = 8$: asymmetric!

Page 14: Similarity-based Classifiers: Problems and Solutions

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figures: the 96 × 96 matrix S and its eigenvalues plotted against eigenvalue rank; several eigenvalues are negative]

Not PSD!
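A quick diagnostic along these lines (an illustration with a made-up matrix, not the actual Amazon data) shows how the asymmetry and the negative eigenvalues would be detected:

```python
import numpy as np

def diagnose_similarity_matrix(S, tol=1e-8):
    """Report whether a square similarity matrix is symmetric and PSD."""
    asym = np.max(np.abs(S - S.T))
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))   # eigenvalues of the symmetrized S
    print(f"max |S - S^T| entry      : {asym:.3g}")
    print(f"most negative eigenvalue : {eigvals.min():.3g}")
    print(f"PSD (up to tol)          : {eigvals.min() >= -tol}")

# Toy asymmetric, indefinite example (not the real Amazon data)
S_toy = np.array([[5.0, 3.0],
                  [8.0, 5.0]])
diagnose_similarity_matrix(S_toy)
```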

Page 15: Similarity-based Classifiers: Problems and Solutions

15

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \ldots, \max(\lambda_n, 0))\, U^T$

[Figure: $S_{\mathrm{clip}}$ as the projection of $S$ onto the PSD cone]

$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.

Page 16: Similarity-based Classifiers: Problems and Solutions

16

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_n|)\, U^T$

(similar effect: $S_{\mathrm{new}} = S^T S$)

Page 17: Similarity-based Classifiers: Problems and Solutions

17

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Shift: $S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$

Page 18: Similarity-based Classifiers: Problems and Solutions

18

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

$S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$

Flip, clip, or shift? Best bet is clip.
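As a rough sketch (mine, not the authors' code), the three spectrum modifications follow directly from the eigendecomposition above:

```python
import numpy as np

def make_psd(S, method="clip"):
    """Symmetrize S, then modify its spectrum: clip, flip, or shift."""
    S = 0.5 * (S + S.T)                       # symmetrize
    lam, U = np.linalg.eigh(S)                # S = U diag(lam) U^T
    if method == "clip":                      # zero out negative eigenvalues
        lam = np.maximum(lam, 0.0)
    elif method == "flip":                    # take absolute values
        lam = np.abs(lam)
    elif method == "shift":                   # add |min(lambda_min, 0)| to all
        lam = lam + abs(min(lam.min(), 0.0))
    else:
        raise ValueError(f"unknown method: {method}")
    return (U * lam) @ U.T                    # U diag(lam) U^T
```

The clipped matrix can then be handed to any kernel method, e.g. `sklearn.svm.SVC(kernel="precomputed")` fit on `make_psd(S, "clip")`; how the test similarities are treated consistently at prediction time is a separate subtlety (see the training/test consistency slide near the end).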

Page 19: Similarity-based Classifiers: Problems and Solutions

19

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Learn the best kernel matrix for the SVM (Luss, NIPS 2007; Chen et al., ICML 2009):

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

Page 20: Similarity-based Classifiers: Problems and Solutions

20

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 21: Similarity-based Classifiers: Problems and Solutions

21

Let the similarities to the training samples be features

– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

Let $[\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T \in \mathbb{R}^n$ be the feature vector for $x$.

$\min_{\alpha} \; \tfrac{1}{2}\|y - S\alpha\|_2^2 + \epsilon\|\alpha\|_1 + \gamma\|\alpha\|_1$

Asymptotically, does this work? Our results suggest you need to choose a slow-growing subset of the n training samples.
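A minimal sketch of the "similarities as features" idea using scikit-learn (my own illustration, not the experimental code behind the tables below): each row of S is used as the feature vector of the corresponding training sample.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_sim_as_feature_svm(S, y, C=1.0):
    """Linear SVM that treats row i of S as the feature vector of training sample x_i."""
    clf = LinearSVC(C=C)
    clf.fit(S, y)
    return clf

def predict_sim_as_feature_svm(clf, s):
    """s : (n,) similarities of a test sample to the n training samples."""
    return clf.predict(np.asarray(s).reshape(1, -1))[0]
```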

Page 22: Similarity-based Classifiers: Problems and Solutions

22

                              Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting (VDM)
# classes                         47           2            101          139       10        2
# samples                        204         100           8677          945     3090      435
SVM (clip)                     81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)    76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)       75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                          70.12       14.25          34.23         4.05    63.81     5.34

Page 23: Similarity-based Classifiers: Problems and Solutions

23

                                    Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting (VDM)
# classes                               47           2            101          139       10        2
# samples                              204         100           8677          945     3090      435
SVM-kNN (clip) (Zhang et al. 2006)   17.56       13.75          36.82         4.23    61.25     5.23
SVM (clip)                           81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)          76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)             75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                                70.12       14.25          34.23         4.05    63.81     5.34

Page 24: Similarity-based Classifiers: Problems and Solutions

24

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 25: Similarity-based Classifiers: Problems and Solutions

25

Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Algorithmic parallel of the exemplar model of human learning.

Page 26: Similarity-based Classifiers: Problems and Solutions

26

Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Algorithmic parallel of the exemplar model of human learning.

For $w_i \geq 0$ and $\sum_i w_i = 1$, get a class posterior estimate: $\hat{P}(Y = g \mid X = x) = \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Good for asymmetric costs, good for interpretation, good for system integration.
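The weighted vote itself is a few lines; this sketch (mine) returns both the predicted label and the posterior estimate described above, for any rule that maps the k selected similarities to nonnegative weights:

```python
import numpy as np

def weighted_knn_vote(s, y, weights_fn, k=5):
    """Weighted k-NN from similarities.

    s          : (n,) similarities of the test sample to the training samples
    y          : (n,) training labels
    weights_fn : maps the k selected similarities to nonnegative weights
    """
    idx = np.argsort(s)[::-1][:k]                 # k most similar training samples
    w = np.asarray(weights_fn(s[idx]), dtype=float)
    w = w / w.sum()                               # normalize -> posterior estimate
    classes = np.unique(y)
    posterior = {g: float(w[y[idx] == g].sum()) for g in classes}
    y_hat = max(posterior, key=posterior.get)
    return y_hat, posterior

# Example: affinity weights, w_i proportional to psi(x, x_i)
# y_hat, post = weighted_knn_vote(s, y, weights_fn=lambda sims: np.maximum(sims, 0), k=3)
```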

Page 27: Similarity-based Classifiers: Problems and Solutions

27

Design Goals for the Weights

?

Page 28: Similarity-based Classifiers: Problems and Solutions

28

Design Goals for the Weights

Design Goal 1 (Affinity): $w_i$ should be an increasing function of $\psi(x, x_i)$.

?

Page 29: Similarity-based Classifiers: Problems and Solutions

29

Design Goals for the Weights

?

Page 30: Similarity-based Classifiers: Problems and Solutions

30

Design Goals for the Weights (Chen et al. JMLR 2009)

Design Goal 2 (Diversity): $w_i$ should be a decreasing function of $\psi(x_i, x_j)$.

?

Page 31: Similarity-based Classifiers: Problems and Solutions

31

Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

[Figure: $x$ inside the convex hull of $x_1, \ldots, x_4$: non-unique solution]

Page 32: Similarity-based Classifiers: Problems and Solutions

32

Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

[Figure: $x$ inside the convex hull of $x_1, \ldots, x_4$: non-unique solution]

[Figure: $x$ outside the convex hull of $x_1, \ldots, x_4$: no solution]

Page 33: Similarity-based Classifiers: Problems and Solutions

33

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Page 35: Similarity-based Classifiers: Problems and Solutions

35

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Maximum entropy pushes the weights toward being equal.

Page 36: Similarity-based Classifiers: Problems and Solutions

36

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Maximum entropy gives an exponential-form solution; consistent (Friedlander & Gupta, IEEE IT 2005); noise averaging.
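Assuming a generic convex solver such as cvxpy is acceptable, the LIME objective can be transcribed almost verbatim (the entropy term enters through `cp.entr`); this is a sketch of the optimization, not the authors' implementation:

```python
import numpy as np
import cvxpy as cp

def lime_weights(X, x, lam=0.1):
    """Linear interpolation with maximum entropy (LIME) weights.

    X   : (d, k) matrix whose columns are the k neighbors x_1, ..., x_k
    x   : (d,) test point
    lam : entropy regularization strength (lambda in the objective above)
    """
    k = X.shape[1]
    w = cp.Variable(k)
    # sum_i w_i log w_i == -sum(entr(w)), so the objective below matches the slide
    objective = cp.Minimize(cp.sum_squares(X @ w - x) - lam * cp.sum(cp.entr(w)))
    constraints = [w >= 0, cp.sum(w) == 1]
    cp.Problem(objective, constraints).solve()
    return w.value
```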

Page 37: Similarity-based Classifiers: Problems and Solutions

37

Kernelize Linear Interpolation (Chen et al. JMLR 2009)

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Page 38: Similarity-based Classifiers: Problems and Solutions

38

Kernelize Linear Interpolation

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

The ridge term regularizes the variance of the weights.

Page 39: Similarity-based Classifiers: Problems and Solutions

39

Kernelize Linear Interpolation

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Only inner products are needed, so they can be replaced with a kernel or with similarities!

Page 40: Similarity-based Classifiers: Problems and Solutions

40

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Page 41: Similarity-based Classifiers: Problems and Solutions

41

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Affinity: $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$, so $w_i$ is high if $\psi(x, x_i)$ is high.

Page 42: Similarity-based Classifiers: Problems and Solutions

42

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Diversity: $\tfrac{1}{2} w^T S w = \tfrac{1}{2}\sum_{i,j} \psi(x_i, x_j)\, w_i w_j$, which discourages putting large weight on pairs of highly similar training samples.

Page 43: Similarity-based Classifiers: Problems and Solutions

43

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Make S PSD, and the problem is a QP with box constraints; it can be solved with SMO.
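Since the problem is a QP once S is made PSD, it can also be handed to a generic convex solver rather than SMO; a small cvxpy sketch (mine), which factors S so the quadratic term stays recognizably convex:

```python
import numpy as np
import cvxpy as cp

def kri_weights(S, s, lam=0.1):
    """Kernel ridge interpolation (KRI) weights for one test sample.

    S   : (k, k) similarity matrix among the k nearest neighbors (made PSD first)
    s   : (k,) similarities of the test sample to those neighbors
    lam : ridge regularization strength
    """
    k = len(s)
    # Factor the PSD part of S as S = R R^T, so that w^T S w = ||R^T w||^2.
    evals, U = np.linalg.eigh(0.5 * (S + S.T))
    R = U * np.sqrt(np.maximum(evals, 0.0))
    w = cp.Variable(k)
    objective = cp.Minimize(0.5 * cp.sum_squares(R.T @ w)
                            - s @ w + 0.5 * lam * cp.sum_squares(w))
    constraints = [w >= 0, cp.sum(w) == 1]
    cp.Problem(objective, constraints).solve()
    return w.value
```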

Page 44: Similarity-based Classifiers: Problems and Solutions

44

KRI Weights Satisfy Design Goals

$\arg\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Remove the constraints on the weights:

$\arg\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \;\equiv\; (S + \lambda I)^{-1} s$

Can show this is equivalent to local ridge regression: KRR weights.
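The unconstrained KRR weights are a one-liner; a sketch with a pseudo-inverse fallback for a singular $S + \lambda I$:

```python
import numpy as np

def krr_weights(S, s, lam=0.1):
    """KRR weights w = (S + lam I)^{-1} s, with a pseudo-inverse fallback."""
    A = S + lam * np.eye(len(s))
    try:
        return np.linalg.solve(A, s)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(A) @ s
```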

Page 45: Similarity-based Classifiers: Problems and Solutions

45

Weighted k-NN: Example 1

$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]
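As a sanity check on Example 1, S is diagonal here, so the KRR weights have the closed form $w_i = s_i / (5 + \lambda)$, which gives the ordering $w_1 > w_2 > w_3 > w_4$ seen in the plot; a couple of lines reproduce the numbers:

```python
import numpy as np

S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
for lam in (0.01, 1.0, 100.0):
    w = np.linalg.solve(S + lam * np.eye(4), s)      # KRR weights
    print(f"lambda={lam:>6}: w = {np.round(w, 3)}")  # each w_i = s_i / (5 + lambda)
```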

Page 46: Similarity-based Classifiers: Problems and Solutions

46

Weighted k-NN: Example 2

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]

Page 47: Similarity-based Classifiers: Problems and Solutions

47

Weighted k-NN: Example 3

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]

Page 48: Similarity-based Classifiers: Problems and Solutions

48

                              Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting
# samples                        204         100           8677          945     3090      435
# classes                         47           2            101          139       10        2
LOCAL
k-NN                           16.95       17.00          41.55         4.23    61.21     5.80
affinity k-NN                  15.00       15.00          39.20         4.23    61.15     5.86
KRI k-NN (clip)                17.68       14.00          30.13         4.15    61.20     5.29
KRR k-NN (pinv)                16.10       15.25          29.90         4.31    61.18     5.52
SVM-KNN (clip)                 17.56       13.75          36.82         4.23    61.25     5.23
GLOBAL
SVM sim-as-kernel (clip)       81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)    76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)       75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                          70.12       14.25          34.23         4.05    63.81     5.34

Page 52: Similarity-based Classifiers: Problems and Solutions

52

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 53: Similarity-based Classifiers: Problems and Solutions

53

Generative Classifiers

Model the probability of what you see given each class:

Linear discriminant analysis, quadratic discriminant analysis, Gaussian mixture models, ...

Pro: Produces class probabilities

Page 54: Similarity-based Classifiers: Problems and Solutions

54

Generative Classifiers

Model the probability of what you see given each class:

Linear discriminant analysis, quadratic discriminant analysis, Gaussian mixture models, ...

Our goal: model $P(T(s) \mid g)$, where $T(s)$ gives class-descriptive statistics of $s$.

We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \ldots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.

Page 55: Similarity-based Classifiers: Problems and Solutions

55

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Assume the G centroid similarities are class-conditionally independent.

Reduce model bias by applying the model locally (local SDA).

Reduce estimation variance by regularizing over localities.

Model $P(T(s) \mid g)$.

Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.
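A heavily simplified sketch of the basic SDA idea under the assumptions above (my own reading, omitting the local and regularized variants of the cited papers): each class centroid is taken as the training sample with the largest total within-class similarity, and each $\psi(x, \mu_h)$ is modeled per class by an exponential density matched to its empirical mean.

```python
import numpy as np

def fit_sda(S, y):
    """Simplified SDA: exponential models of similarity-to-class-centroid statistics.

    S : (n, n) training similarity matrix, S[i, j] = psi(x_i, x_j)
    y : (n,) training labels
    """
    classes = np.unique(y)
    # Assumed centroid choice: the sample with the largest total within-class similarity.
    centroids = {}
    for h in classes:
        idx = np.where(y == h)[0]
        centroids[h] = idx[np.argmax(S[np.ix_(idx, idx)].sum(axis=1))]
    priors = {g: float(np.mean(y == g)) for g in classes}
    # Empirical mean of psi(x, mu_h) over class-g samples -> exponential with that mean.
    means = {(g, h): float(S[y == g, centroids[h]].mean())
             for g in classes for h in classes}
    return classes, centroids, priors, means

def predict_sda(model, s):
    """s : (n,) similarities of a test sample to the training samples."""
    classes, centroids, priors, means = model
    def log_posterior(g):
        lp = np.log(priors[g])
        for h in classes:                       # class-conditional independence
            m = max(means[(g, h)], 1e-12)
            t = max(float(s[centroids[h]]), 0.0)
            lp += -np.log(m) - t / m            # log Exp(mean=m) density at t
        return lp
    return max(classes, key=log_posterior)
```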

Page 56: Similarity-based Classifiers: Problems and Solutions

56

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Assume the G centroid similarities are class-conditionally independent.

Reduce model bias by applying the model locally (local SDA).

Reduce estimation variance by regularizing over localities.

Model $P(T(s) \mid g)$.

Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.

Regularized local SDA performance: competitive.

Page 57: Similarity-based Classifiers: Problems and Solutions

57

Some Conclusions

Performance depends heavily on the oddities of each dataset.

Weighted k-NN with affinity-diversity weights works well.

Preliminary: regularized local SDA works well.

Probabilities are useful.

Local models are useful: less approximating (it is hard to model the entire space; is there an underlying manifold?), and always feasible.

Page 62: Similarity-based Classifiers: Problems and Solutions

62

Lots of Open Questions

Making S PSD.

Fast k-NN search for similarities

Similarity-based regression

Relationship with learning on graphs

Try it out on real data

Fusion with Euclidean features (see our FUSION 2009 papers)

Open theoretical questions (Chen et al. JMLR 2009, Balcan et al. ML 2008)

Page 63: Similarity-based Classifiers: Problems and Solutions

Code/Data/Papers: idl.ee.washington.edu/similaritylearning

Similarity-based Classification by Chen et al., JMLR 2009

Page 64: Similarity-based Classifiers: Problems and Solutions

64

Training and Test Consistency

For a test sample $x$, given $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$, shall we classify $x$ as $\hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)$?

No! If a training sample were used as a test sample, this could change its class!

Page 65: Similarity-based Classifiers: Problems and Solutions

65

Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein data sets]

Page 66: Similarity-based Classifiers: Problems and Solutions

66

Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Voting, Yeast-5-7, and Yeast-5-12 data sets]

Page 67: Similarity-based Classifiers: Problems and Solutions

67

SVM Review

Empirical risk minimization (ERM) with regularization:

$\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$

SVM primal:

$\min_{c, b, \xi} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0.$

[Figure: hinge loss and 0-1 loss as functions of $y f(x)$]

Page 68: Similarity-based Classifiers: Problems and Solutions

68

Learning the Kernel Matrix

Find the best K for classification, regularized toward S:

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

SVM that learns the full kernel matrix:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

Page 69: Similarity-based Classifiers: Problems and Solutions

69

Related Work

SVM dual:

$\max_{\alpha} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha \quad \text{subject to } y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}.$

Robust SVM (Luss & d’Aspremont, 2007):

$\max_{\alpha} \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2 \quad \text{subject to } y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}.$

“This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K.”

Page 70: Similarity-based Classifiers: Problems and Solutions

70

Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as

$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and let $f(\mu, \nu)$ be a function on $M \times N$ that is quasiconcave in $\mu$, quasiconvex in $\nu$, upper semi-continuous in $\mu$ for each $\nu \in N$, and lower semi-continuous in $\nu$ for each $\mu \in M$. Then

$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$

Page 71: Similarity-based Classifiers: Problems and Solutions

71

Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as

$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

By Sion's minimax theorem, the robust SVM is equivalent to:

$\min_{K \succeq 0} \; \max_{\alpha \in \mathcal{A}} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

Compare:

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

[Figure: $L(x, \lambda^\star)$ or $f(x)$ versus $L(x^\star, \lambda)$ or $g(\lambda)$: zero duality gap]

Page 72: Similarity-based Classifiers: Problems and Solutions

72

Learning the Kernel Matrix

It is not trivial to directly solve:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then $\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$ if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^{\dagger} z \geq 0$.

Let $z = Kc$, and notice that $c^T K c = z^T K^{\dagger} z$ since $K K^{\dagger} K = K$.

Page 73: Similarity-based Classifiers: Problems and Solutions

73

Learning the Kernel Matrix

It is not trivial to directly solve:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

However, it can be expressed as a convex conic program:

$\min_{z, b, \xi, K, u, v} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta u + \gamma v \quad \text{subject to } \mathrm{diag}(y)(z + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0,\; \|K - S\|_F \leq v.$

– We can recover the optimal $c^\star$ by $c^\star = (K^\star)^{\dagger} z^\star$.
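The conic program can be transcribed into cvxpy essentially line by line; this is a sketch assuming a generic SDP solver is acceptable and labels y in {-1, +1} (the paper's own solver and scaling are not reproduced):

```python
import numpy as np
import cvxpy as cp

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    """Jointly learn a PSD kernel matrix K near S and an SVM that uses it.

    S : (n, n) symmetrized similarity matrix
    y : (n,) labels in {-1, +1}
    """
    n = S.shape[0]
    # M = [[K, z], [z^T, u]] >= 0 encodes u >= c^T K c with z = K c (Schur complement)
    M = cp.Variable((n + 1, n + 1), PSD=True)
    K, z, u = M[:n, :n], M[:n, n], M[n, n]
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    v = cp.Variable()
    constraints = [cp.multiply(y, z + b) >= 1 - xi,
                   cp.norm(K - S, "fro") <= v]
    objective = cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v)
    cp.Problem(objective, constraints).solve()
    c_star = np.linalg.pinv(K.value) @ z.value   # recover c* = (K*)^+ z*
    return K.value, c_star, b.value
```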

Page 74: Similarity-based Classifiers: Problems and Solutions

74

Learning the Spectrum Modification

Concerns about learning the full kernel matrix:

– Though the problem is convex, the number of variables is O(n²).

– The flexibility of the model may lead to overfitting.