Similarity-based Classifiers: Problems and Solutions


TRANSCRIPT

Page 1: Similarity-based Classifiers: Problems and Solutions

Similarity-based Classifiers:

Problems and Solutions

Page 2: Similarity-based Classifiers: Problems and Solutions

2

Classifying based on similarities:

Van Gogh or Monet?

Van Gogh

Monet

Page 3: Similarity-based Classifiers: Problems and Solutions

3

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$ (paintings), $y_i \in \mathcal{G}$ (painter), $i = 1, \ldots, n$

Page 4: Similarity-based Classifiers: Problems and Solutions

4

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \ldots, n$

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$

Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \; \cdots \; y_n]^T$

Page 5: Similarity-based Classifiers: Problems and Solutions

5

The Similarity-based Classification Problem

Training samples: $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \Omega$, $y_i \in \mathcal{G}$, $i = 1, \ldots, n$

Underlying similarity function: $\psi : \Omega \times \Omega \to \mathbb{R}$

Training similarities: $S = [\psi(x_i, x_j)]_{n \times n}$, $y = [y_1 \; \cdots \; y_n]^T$

Test similarities: $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$ and $\psi(x, x)$

Problem: estimate the class label $y$ for the test sample $x$ given $S$, $y$, $s$, and $\psi(x, x)$.
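As a concrete (if naive) baseline for this setup, here is a minimal sketch, my own illustration rather than anything from the talk: predict the label of the single most similar training sample, using only $s$ and $y$.

```python
import numpy as np

def most_similar_label(s, y):
    """Label of the training sample most similar to the test sample.

    s : (n,) similarities psi(x, x_i) to the n training samples
    y : (n,) training labels
    """
    return y[np.argmax(s)]

# Toy usage: 4 training samples, 2 classes
y = np.array([0, 0, 1, 1])
s = np.array([0.2, 0.1, 0.7, 0.4])
print(most_similar_label(s, y))   # -> 1
```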

Page 6: Similarity-based Classifiers: Problems and Solutions

6

Examples of Similarity Functions

Computational biology:
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)

Computer vision:
– Tangent distance (Duda et al., 2001)
– Earth mover’s distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)

Information retrieval:
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)

Page 7: Similarity-based Classifiers: Problems and Solutions

7

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 9: Similarity-based Classifiers: Problems and Solutions

9

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Page 10: Similarity-based Classifiers: Problems and Solutions

10

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Example inner product: $\langle x, z \rangle = x^T z$.

Properties of an inner product $\langle x, z \rangle$: conjugate symmetric and real; linear: $\langle a x, z \rangle = a \langle x, z \rangle$; positive definite: $\langle x, x \rangle > 0$ unless $x = 0$.

An inner product implies a norm: $\|x\| = \sqrt{\langle x, x \rangle}$.

Page 11: Similarity-based Classifiers: Problems and Solutions

11

Can we treat similarities as kernels?

Kernels are inner products in some Hilbert space.

Inner products are similarities.

Are our notions of similarity always inner products? No!

Page 12: Similarity-based Classifiers: Problems and Solutions

12

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figure: the 96 × 96 matrix S of training similarities]

Inner product-like?

Page 13: Similarity-based Classifiers: Problems and Solutions

13

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figure: the 96 × 96 matrix S of training similarities]

$\phi(\text{HTF}, \text{Bishop}) = 3$ but $\phi(\text{Bishop}, \text{HTF}) = 8$: asymmetric!

Page 14: Similarity-based Classifiers: Problems and Solutions

Example: Amazon similarity

$\Omega$ = space of all books; $\phi(A, B)$ = percentage of users who buy book A after viewing book B on Amazon.

[Figures: the 96 × 96 matrix S and its eigenvalues plotted against eigenvalue rank; several eigenvalues are negative]

Not PSD!
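A quick diagnostic along these lines (an illustration with a made-up matrix, not the actual Amazon data) shows how the asymmetry and the negative eigenvalues would be detected:

```python
import numpy as np

def diagnose_similarity_matrix(S, tol=1e-8):
    """Report whether a square similarity matrix is symmetric and PSD."""
    asym = np.max(np.abs(S - S.T))
    eigvals = np.linalg.eigvalsh(0.5 * (S + S.T))   # eigenvalues of the symmetrized S
    print(f"max |S - S^T| entry      : {asym:.3g}")
    print(f"most negative eigenvalue : {eigvals.min():.3g}")
    print(f"PSD (up to tol)          : {eigvals.min() >= -tol}")

# Toy asymmetric, indefinite example (not the real Amazon data)
S_toy = np.array([[5.0, 3.0],
                  [8.0, 5.0]])
diagnose_similarity_matrix(S_toy)
```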

Page 15: Similarity-based Classifiers: Problems and Solutions

15

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Clip: $S_{\mathrm{clip}} = U\, \mathrm{diag}(\max(\lambda_1, 0), \ldots, \max(\lambda_n, 0))\, U^T$

[Figure: $S_{\mathrm{clip}}$ as the projection of $S$ onto the PSD cone]

$S_{\mathrm{clip}}$ is the PSD matrix closest to $S$ in terms of the Frobenius norm.

Page 16: Similarity-based Classifiers: Problems and Solutions

16

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Flip: $S_{\mathrm{flip}} = U\, \mathrm{diag}(|\lambda_1|, \ldots, |\lambda_n|)\, U^T$

(similar effect: $S_{\mathrm{new}} = S^T S$)

Page 17: Similarity-based Classifiers: Problems and Solutions

17

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Shift: $S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$

Page 18: Similarity-based Classifiers: Problems and Solutions

18

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

$S_{\mathrm{shift}} = U (\Lambda + |\min(\lambda_{\min}(S), 0)|\, I)\, U^T$

Flip, clip, or shift? Best bet is clip.
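As a rough sketch (mine, not the authors' code), the three spectrum modifications follow directly from the eigendecomposition above:

```python
import numpy as np

def make_psd(S, method="clip"):
    """Symmetrize S, then modify its spectrum: clip, flip, or shift."""
    S = 0.5 * (S + S.T)                       # symmetrize
    lam, U = np.linalg.eigh(S)                # S = U diag(lam) U^T
    if method == "clip":                      # zero out negative eigenvalues
        lam = np.maximum(lam, 0.0)
    elif method == "flip":                    # take absolute values
        lam = np.abs(lam)
    elif method == "shift":                   # add |min(lambda_min, 0)| to all
        lam = lam + abs(min(lam.min(), 0.0))
    else:
        raise ValueError(f"unknown method: {method}")
    return (U * lam) @ U.T                    # U diag(lam) U^T
```

The clipped matrix can then be handed to any kernel method, e.g. `sklearn.svm.SVC(kernel="precomputed")` fit on `make_psd(S, "clip")`; how the test similarities are treated consistently at prediction time is a separate subtlety (see the training/test consistency slide near the end).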

Page 19: Similarity-based Classifiers: Problems and Solutions

19

Well, let’s just make S be a kernel matrix

First, symmetrize: $S \leftarrow \tfrac{1}{2}(S + S^T) \Rightarrow S = U \Lambda U^T$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$

Learn the best kernel matrix for the SVM (Luss, NIPS 2007; Chen et al., ICML 2009):

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

Page 20: Similarity-based Classifiers: Problems and Solutions

20

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 21: Similarity-based Classifiers: Problems and Solutions

21

Let the similarities to the training samples be features

– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)

Let $[\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T \in \mathbb{R}^n$ be the feature vector for $x$.

$\min_{\alpha} \; \tfrac{1}{2}\|y - S\alpha\|_2^2 + \epsilon\|\alpha\|_1 + \gamma\|\alpha\|_1$

Asymptotically, does this work? Our results suggest you need to choose a slow-growing subset of the n training samples.
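A minimal sketch of the "similarities as features" idea using scikit-learn (my own illustration, not the experimental code behind the tables below): each row of S is used as the feature vector of the corresponding training sample.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_sim_as_feature_svm(S, y, C=1.0):
    """Linear SVM that treats row i of S as the feature vector of training sample x_i."""
    clf = LinearSVC(C=C)
    clf.fit(S, y)
    return clf

def predict_sim_as_feature_svm(clf, s):
    """s : (n,) similarities of a test sample to the n training samples."""
    return clf.predict(np.asarray(s).reshape(1, -1))[0]
```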

Page 22: Similarity-based Classifiers: Problems and Solutions

22

                              Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting (VDM)
# classes                         47           2            101          139       10        2
# samples                        204         100           8677          945     3090      435
SVM (clip)                     81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)    76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)       75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                          70.12       14.25          34.23         4.05    63.81     5.34

Page 23: Similarity-based Classifiers: Problems and Solutions

23

                                    Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting (VDM)
# classes                               47           2            101          139       10        2
# samples                              204         100           8677          945     3090      435
SVM-kNN (clip) (Zhang et al. 2006)   17.56       13.75          36.82         4.23    61.25     5.23
SVM (clip)                           81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)          76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)             75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                                70.12       14.25          34.23         4.05    63.81     5.34

Page 24: Similarity-based Classifiers: Problems and Solutions

24

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 25: Similarity-based Classifiers: Problems and Solutions

25

Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Algorithmic parallel of the exemplar model of human learning.

Page 26: Similarity-based Classifiers: Problems and Solutions

26

Weighted Nearest-Neighbors

Take a weighted vote of the k nearest neighbors:

$\hat{y} = \arg\max_{g \in \mathcal{G}} \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Algorithmic parallel of the exemplar model of human learning.

For $w_i \geq 0$ and $\sum_i w_i = 1$, get a class posterior estimate: $\hat{P}(Y = g \mid X = x) = \sum_{i=1}^{k} w_i\, I\{y_i = g\}$

Good for asymmetric costs, good for interpretation, good for system integration.
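The weighted vote itself is a few lines; this sketch (mine) returns both the predicted label and the posterior estimate described above, for any rule that maps the k selected similarities to nonnegative weights:

```python
import numpy as np

def weighted_knn_vote(s, y, weights_fn, k=5):
    """Weighted k-NN from similarities.

    s          : (n,) similarities of the test sample to the training samples
    y          : (n,) training labels
    weights_fn : maps the k selected similarities to nonnegative weights
    """
    idx = np.argsort(s)[::-1][:k]                 # k most similar training samples
    w = np.asarray(weights_fn(s[idx]), dtype=float)
    w = w / w.sum()                               # normalize -> posterior estimate
    classes = np.unique(y)
    posterior = {g: float(w[y[idx] == g].sum()) for g in classes}
    y_hat = max(posterior, key=posterior.get)
    return y_hat, posterior

# Example: affinity weights, w_i proportional to psi(x, x_i)
# y_hat, post = weighted_knn_vote(s, y, weights_fn=lambda sims: np.maximum(sims, 0), k=3)
```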

Page 27: Similarity-based Classifiers: Problems and Solutions

27

Design Goals for the Weights

?

Page 28: Similarity-based Classifiers: Problems and Solutions

28

Design Goals for the Weights

Design Goal 1 (Affinity): $w_i$ should be an increasing function of $\psi(x, x_i)$.

?

Page 29: Similarity-based Classifiers: Problems and Solutions

29

Design Goals for the Weights

?

Page 30: Similarity-based Classifiers: Problems and Solutions

30

Design Goals for the Weights (Chen et al. JMLR 2009)

Design Goal 2 (Diversity): $w_i$ should be a decreasing function of $\psi(x_i, x_j)$.

?

Page 31: Similarity-based Classifiers: Problems and Solutions

31

Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

[Figure: $x$ inside the convex hull of $x_1, \ldots, x_4$: non-unique solution]

Page 32: Similarity-based Classifiers: Problems and Solutions

32

Linear Interpolation Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

[Figure: $x$ inside the convex hull of $x_1, \ldots, x_4$: non-unique solution]

[Figure: $x$ outside the convex hull of $x_1, \ldots, x_4$: no solution]

Page 33: Similarity-based Classifiers: Problems and Solutions

33

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Page 35: Similarity-based Classifiers: Problems and Solutions

35

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Maximum entropy pushes the weights toward being equal.

Page 36: Similarity-based Classifiers: Problems and Solutions

36

LIME Weights

Linear interpolation weights will meet these goals:

$\sum_i w_i x_i = x$, such that $w_i \geq 0$, $\sum_i w_i = 1$

Linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006):

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Maximum entropy gives an exponential-form solution; consistent (Friedlander & Gupta, IEEE IT 2005); noise averaging.
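Assuming a generic convex solver such as cvxpy is acceptable, the LIME objective can be transcribed almost verbatim (the entropy term enters through `cp.entr`); this is a sketch of the optimization, not the authors' implementation:

```python
import numpy as np
import cvxpy as cp

def lime_weights(X, x, lam=0.1):
    """Linear interpolation with maximum entropy (LIME) weights.

    X   : (d, k) matrix whose columns are the k neighbors x_1, ..., x_k
    x   : (d,) test point
    lam : entropy regularization strength (lambda in the objective above)
    """
    k = X.shape[1]
    w = cp.Variable(k)
    # sum_i w_i log w_i == -sum(entr(w)), so the objective below matches the slide
    objective = cp.Minimize(cp.sum_squares(X @ w - x) - lam * cp.sum(cp.entr(w)))
    constraints = [w >= 0, cp.sum(w) == 1]
    cp.Problem(objective, constraints).solve()
    return w.value
```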

Page 37: Similarity-based Classifiers: Problems and Solutions

37

Kernelize Linear Interpolation (Chen et al. JMLR 2009)

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Page 38: Similarity-based Classifiers: Problems and Solutions

38

Kernelize Linear Interpolation

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

The ridge term regularizes the variance of the weights.

Page 39: Similarity-based Classifiers: Problems and Solutions

39

Kernelize Linear Interpolation

LIME weights:

$\min_{w} \; \Big\| \sum_{i=1}^{k} w_i x_i - x \Big\|_2^2 + \lambda \sum_{i=1}^{k} w_i \log w_i \quad \text{subject to } \sum_{i=1}^{k} w_i = 1,\; w_i \geq 0,\; i = 1, \ldots, k.$

Let $X = [x_1 \; \cdots \; x_k]$, rewrite with matrices, and change to a ridge regularizer:

$\min_{w} \; \tfrac{1}{2} w^T X^T X w - x^T X w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Only inner products are needed, so they can be replaced with a kernel or with similarities!

Page 40: Similarity-based Classifiers: Problems and Solutions

40

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Page 41: Similarity-based Classifiers: Problems and Solutions

41

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Affinity: $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$, so $w_i$ is high if $\psi(x, x_i)$ is high.

Page 42: Similarity-based Classifiers: Problems and Solutions

42

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Diversity: $\tfrac{1}{2} w^T S w = \tfrac{1}{2}\sum_{i,j} \psi(x_i, x_j)\, w_i w_j$, which discourages putting large weight on pairs of highly similar training samples.

Page 43: Similarity-based Classifiers: Problems and Solutions

43

KRI Weights Satisfy Design Goals

Kernel ridge interpolation (KRI) weights:

$\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Make S PSD, and the problem is a QP with box constraints; it can be solved with SMO.
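Since the problem is a QP once S is made PSD, it can also be handed to a generic convex solver rather than SMO; a small cvxpy sketch (mine), which factors S so the quadratic term stays recognizably convex:

```python
import numpy as np
import cvxpy as cp

def kri_weights(S, s, lam=0.1):
    """Kernel ridge interpolation (KRI) weights for one test sample.

    S   : (k, k) similarity matrix among the k nearest neighbors (made PSD first)
    s   : (k,) similarities of the test sample to those neighbors
    lam : ridge regularization strength
    """
    k = len(s)
    # Factor the PSD part of S as S = R R^T, so that w^T S w = ||R^T w||^2.
    evals, U = np.linalg.eigh(0.5 * (S + S.T))
    R = U * np.sqrt(np.maximum(evals, 0.0))
    w = cp.Variable(k)
    objective = cp.Minimize(0.5 * cp.sum_squares(R.T @ w)
                            - s @ w + 0.5 * lam * cp.sum_squares(w))
    constraints = [w >= 0, cp.sum(w) == 1]
    cp.Problem(objective, constraints).solve()
    return w.value
```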

Page 44: Similarity-based Classifiers: Problems and Solutions

44

KRI Weights Satisfy Design Goals

$\arg\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \quad \text{subject to } w \geq 0,\; \mathbf{1}^T w = 1.$

Remove the constraints on the weights:

$\arg\min_{w} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w \;\equiv\; (S + \lambda I)^{-1} s$

Can show this is equivalent to local ridge regression: KRR weights.
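The unconstrained KRR weights are a one-liner; a sketch with a pseudo-inverse fallback for a singular $S + \lambda I$:

```python
import numpy as np

def krr_weights(S, s, lam=0.1):
    """KRR weights w = (S + lam I)^{-1} s, with a pseudo-inverse fallback."""
    A = S + lam * np.eye(len(s))
    try:
        return np.linalg.solve(A, s)
    except np.linalg.LinAlgError:
        return np.linalg.pinv(A) @ s
```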

Page 45: Similarity-based Classifiers: Problems and Solutions

45

Weighted k-NN: Example 1

$S = \begin{bmatrix} 5 & 0 & 0 & 0 \\ 0 & 5 & 0 & 0 \\ 0 & 0 & 5 & 0 \\ 0 & 0 & 0 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 1 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]
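As a sanity check on Example 1, S is diagonal here, so the KRR weights have the closed form $w_i = s_i / (5 + \lambda)$, which gives the ordering $w_1 > w_2 > w_3 > w_4$ seen in the plot; a couple of lines reproduce the numbers:

```python
import numpy as np

S = 5.0 * np.eye(4)
s = np.array([4.0, 3.0, 2.0, 1.0])
for lam in (0.01, 1.0, 100.0):
    w = np.linalg.solve(S + lam * np.eye(4), s)      # KRR weights
    print(f"lambda={lam:>6}: w = {np.round(w, 3)}")  # each w_i = s_i / (5 + lambda)
```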

Page 46: Similarity-based Classifiers: Problems and Solutions

46

Weighted k-NN: Example 2

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 3 \\ 3 \\ 3 \\ 3 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]

Page 47: Similarity-based Classifiers: Problems and Solutions

47

Weighted k-NN: Example 3

$S = \begin{bmatrix} 5 & 1 & 1 & 1 \\ 1 & 5 & 4 & 2 \\ 1 & 4 & 5 & 2 \\ 1 & 2 & 2 & 5 \end{bmatrix}, \quad s = \begin{bmatrix} 2 \\ 4 \\ 3 \\ 3 \end{bmatrix}$

$w_{\mathrm{KRI}} = \arg\min_{w \geq 0,\, \mathbf{1}^T w = 1} \; \tfrac{1}{2} w^T S w - s^T w + \tfrac{\lambda}{2} w^T w, \qquad w_{\mathrm{KRR}} = (S + \lambda I)^{-1} s$

[Figures: KRI weights and KRR weights $w_1, \ldots, w_4$ plotted as functions of $\lambda$]

Page 48: Similarity-based Classifiers: Problems and Solutions

48

                              Amazon-47   Aural Sonar   Caltech-101   Face Rec   Mirex   Voting
# samples                        204         100           8677          945     3090      435
# classes                         47           2            101          139       10        2
LOCAL
k-NN                           16.95       17.00          41.55         4.23    61.21     5.80
affinity k-NN                  15.00       15.00          39.20         4.23    61.15     5.86
KRI k-NN (clip)                17.68       14.00          30.13         4.15    61.20     5.29
KRR k-NN (pinv)                16.10       15.25          29.90         4.31    61.18     5.52
SVM-KNN (clip)                 17.56       13.75          36.82         4.23    61.25     5.23
GLOBAL
SVM sim-as-kernel (clip)       81.24       13.00          33.49         4.18    57.83     4.89
SVM sim-as-feature (linear)    76.10       14.25          38.18         4.29    55.54     5.40
SVM sim-as-feature (RBF)       75.98       14.25          38.16         3.92    55.72     5.52
P-SVM                          70.12       14.25          34.23         4.05    63.81     5.34

Page 52: Similarity-based Classifiers: Problems and Solutions

52

Approaches to Similarity-based Classification

[Diagram of approaches: similarities as kernels (SVM), similarities as features, weighted k-NN, generative models (SDA), MDS, theory]

Classify $x$ given $S$, $y$, $s$, and $\psi(x, x)$.

Page 53: Similarity-based Classifiers: Problems and Solutions

53

Generative Classifiers

Model the probability of what you see given each class:

Linear discriminant analysis, quadratic discriminant analysis, Gaussian mixture models, ...

Pro: Produces class probabilities

Page 54: Similarity-based Classifiers: Problems and Solutions

54

Generative Classifiers

Model the probability of what you see given each class:

Linear discriminant analysis, quadratic discriminant analysis, Gaussian mixture models, ...

Our goal: model $P(T(s) \mid g)$, where $T(s)$ gives class-descriptive statistics of $s$.

We use $T(s) = [\psi(x, \mu_1), \psi(x, \mu_2), \ldots, \psi(x, \mu_G)]$, where $\mu_h$ is a centroid for each class.

Page 55: Similarity-based Classifiers: Problems and Solutions

55

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Assume the G centroid similarities are class-conditionally independent.

Reduce model bias by applying the model locally (local SDA).

Reduce estimation variance by regularizing over localities.

Model $P(T(s) \mid g)$.

Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.
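A heavily simplified sketch of the basic SDA idea under the assumptions above (my own reading, omitting the local and regularized variants of the cited papers): each class centroid is taken as the training sample with the largest total within-class similarity, and each $\psi(x, \mu_h)$ is modeled per class by an exponential density matched to its empirical mean.

```python
import numpy as np

def fit_sda(S, y):
    """Simplified SDA: exponential models of similarity-to-class-centroid statistics.

    S : (n, n) training similarity matrix, S[i, j] = psi(x_i, x_j)
    y : (n,) training labels
    """
    classes = np.unique(y)
    # Assumed centroid choice: the sample with the largest total within-class similarity.
    centroids = {}
    for h in classes:
        idx = np.where(y == h)[0]
        centroids[h] = idx[np.argmax(S[np.ix_(idx, idx)].sum(axis=1))]
    priors = {g: float(np.mean(y == g)) for g in classes}
    # Empirical mean of psi(x, mu_h) over class-g samples -> exponential with that mean.
    means = {(g, h): float(S[y == g, centroids[h]].mean())
             for g in classes for h in classes}
    return classes, centroids, priors, means

def predict_sda(model, s):
    """s : (n,) similarities of a test sample to the training samples."""
    classes, centroids, priors, means = model
    def log_posterior(g):
        lp = np.log(priors[g])
        for h in classes:                       # class-conditional independence
            m = max(means[(g, h)], 1e-12)
            t = max(float(s[centroids[h]]), 0.0)
            lp += -np.log(m) - t / m            # log Exp(mean=m) density at t
        return lp
    return max(classes, key=log_posterior)
```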

Page 56: Similarity-based Classifiers: Problems and Solutions

56

Similarity Discriminant Analysis (Cazzanti and Gupta, ICML 2007, 2008, 2009)

Assume the G centroid similarities are class-conditionally independent.

Reduce model bias by applying the model locally (local SDA).

Reduce estimation variance by regularizing over localities.

Model $P(T(s) \mid g)$.

Estimate $P(\psi(x, \mu_h) \mid g)$ as the maximum-entropy distribution given the empirical mean; the result is exponential.

Regularized local SDA performance: competitive.

Page 57: Similarity-based Classifiers: Problems and Solutions

57

Some Conclusions

Performance depends heavily on the oddities of each dataset.

Weighted k-NN with affinity-diversity weights works well.

Preliminary: regularized local SDA works well.

Probabilities are useful.

Local models are useful: less approximating (it is hard to model the entire space; is there an underlying manifold?), and always feasible.

Page 62: Similarity-based Classifiers: Problems and Solutions

62

Lots of Open Questions

Making S PSD.

Fast k-NN search for similarities

Similarity-based regression

Relationship with learning on graphs

Try it out on real data

Fusion with Euclidean features (see our FUSION 2009 papers)

Open theoretical questions (Chen et al. JMLR 2009, Balcan et al. ML 2008)

Page 63: Similarity-based Classifiers: Problems and Solutions

Code/Data/Papers: idl.ee.washington.edu/similaritylearning

Similarity-based Classification by Chen et al., JMLR 2009

Page 64: Similarity-based Classifiers: Problems and Solutions

64

Training and Test Consistency

For a test sample $x$, given $s = [\psi(x, x_1) \; \cdots \; \psi(x, x_n)]^T$, shall we classify $x$ as $\hat{y} = \mathrm{sgn}((c^\star)^T s + b^\star)$?

No! If a training sample were used as a test sample, this could change its class!

Page 65: Similarity-based Classifiers: Problems and Solutions

65

Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Amazon, Aural Sonar, and Protein data sets]

Page 66: Similarity-based Classifiers: Problems and Solutions

66

Data Sets

[Figures: similarity matrices and eigenvalue spectra (eigenvalue vs. eigenvalue rank) for the Voting, Yeast-5-7, and Yeast-5-12 data sets]

Page 67: Similarity-based Classifiers: Problems and Solutions

67

SVM Review

Empirical risk minimization (ERM) with regularization:

$\min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2$

Hinge loss: $L(f(x), y) = \max(1 - y f(x), 0)$

SVM primal:

$\min_{c, b, \xi} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0.$

[Figure: hinge loss and 0-1 loss as functions of $y f(x)$]

Page 68: Similarity-based Classifiers: Problems and Solutions

68

Learning the Kernel Matrix

Find the best K for classification, regularized toward S:

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

SVM that learns the full kernel matrix:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

Page 69: Similarity-based Classifiers: Problems and Solutions

69

Related Work

SVM dual:

$\max_{\alpha} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha \quad \text{subject to } y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}.$

Robust SVM (Luss & d’Aspremont, 2007):

$\max_{\alpha} \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2 \quad \text{subject to } y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}.$

“This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K.”

Page 70: Similarity-based Classifiers: Problems and Solutions

70

Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as

$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

Theorem (Sion, 1958). Let M and N be convex spaces, one of which is compact, and let $f(\mu, \nu)$ be a function on $M \times N$ that is quasiconcave in $\mu$, quasiconvex in $\nu$, upper semi-continuous in $\mu$ for each $\nu \in N$, and lower semi-continuous in $\nu$ for each $\mu \in M$. Then

$\sup_{\mu \in M} \inf_{\nu \in N} f(\mu, \nu) = \inf_{\nu \in N} \sup_{\mu \in M} f(\mu, \nu).$

Page 71: Similarity-based Classifiers: Problems and Solutions

71

Related Work

Let $\mathcal{A} = \{\alpha \in \mathbb{R}^n \mid y^T\alpha = 0,\; 0 \leq \alpha \leq C\mathbf{1}\}$ and rewrite the robust SVM as

$\max_{\alpha \in \mathcal{A}} \; \min_{K \succeq 0} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

By Sion's minimax theorem, the robust SVM is equivalent to:

$\min_{K \succeq 0} \; \max_{\alpha \in \mathcal{A}} \; \mathbf{1}^T\alpha - \tfrac{1}{2}\alpha^T \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\,\alpha + \rho\|K - S\|_F^2$

Compare:

$\min_{K \succeq 0} \; \min_{f \in \mathcal{H}_K} \; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i) + \eta \|f\|_K^2 + \gamma \|K - S\|_F$

[Figure: $L(x, \lambda^\star)$ or $f(x)$ versus $L(x^\star, \lambda)$ or $g(\lambda)$: zero duality gap]

Page 72: Similarity-based Classifiers: Problems and Solutions

72

Learning the Kernel Matrix

It is not trivial to directly solve:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

Lemma (Generalized Schur Complement). Let $K \in \mathbb{R}^{n \times n}$, $z \in \mathbb{R}^n$, and $u \in \mathbb{R}$. Then $\begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0$ if and only if $K \succeq 0$, $z$ is in the range of $K$, and $u - z^T K^{\dagger} z \geq 0$.

Let $z = Kc$, and notice that $c^T K c = z^T K^{\dagger} z$ since $K K^{\dagger} K = K$.

Page 73: Similarity-based Classifiers: Problems and Solutions

73

Learning the Kernel Matrix

It is not trivial to directly solve:

$\min_{c, b, \xi, K} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta\, c^T K c + \gamma \|K - S\|_F \quad \text{subject to } \mathrm{diag}(y)(Kc + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; K \succeq 0.$

However, it can be expressed as a convex conic program:

$\min_{z, b, \xi, K, u, v} \; \tfrac{1}{n}\mathbf{1}^T \xi + \eta u + \gamma v \quad \text{subject to } \mathrm{diag}(y)(z + b\mathbf{1}) \geq \mathbf{1} - \xi,\; \xi \geq 0,\; \begin{bmatrix} K & z \\ z^T & u \end{bmatrix} \succeq 0,\; \|K - S\|_F \leq v.$

– We can recover the optimal $c^\star$ by $c^\star = (K^\star)^{\dagger} z^\star$.
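The conic program can be transcribed into cvxpy essentially line by line; this is a sketch assuming a generic SDP solver is acceptable and labels y in {-1, +1} (the paper's own solver and scaling are not reproduced):

```python
import numpy as np
import cvxpy as cp

def learn_kernel_svm(S, y, eta=1.0, gamma=1.0):
    """Jointly learn a PSD kernel matrix K near S and an SVM that uses it.

    S : (n, n) symmetrized similarity matrix
    y : (n,) labels in {-1, +1}
    """
    n = S.shape[0]
    # M = [[K, z], [z^T, u]] >= 0 encodes u >= c^T K c with z = K c (Schur complement)
    M = cp.Variable((n + 1, n + 1), PSD=True)
    K, z, u = M[:n, :n], M[:n, n], M[n, n]
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    v = cp.Variable()
    constraints = [cp.multiply(y, z + b) >= 1 - xi,
                   cp.norm(K - S, "fro") <= v]
    objective = cp.Minimize(cp.sum(xi) / n + eta * u + gamma * v)
    cp.Problem(objective, constraints).solve()
    c_star = np.linalg.pinv(K.value) @ z.value   # recover c* = (K*)^+ z*
    return K.value, c_star, b.value
```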

Page 74: Similarity-based Classifiers: Problems and Solutions

74

Learning the Spectrum Modification

Concerns about learning the full kernel matrix:

– Though the problem is convex, the number of variables is O(n²).

– The flexibility of the model may lead to overfitting.