
Page 1

Building Sparse Support Vector Machines for Multi-Instance Classification

Zhouyu Fu, Guojun Lu, Kai Ming Ting, Dengsheng Zhang

Gippsland School of IT, Monash University

September 8, 2011

Fu et al. (Monash) sparseMI September 8, 2011 1 / 37

Page 2

Outline

1 Background
   - Multi-Instance Classification
   - Sparse Support Vector Machines

2 Sparse Classifier Design for MI Classification
   - "Label-Mean" Formulation
   - Sparse-MI Algorithm

3 Experimental Results
   - Synthetic Data
   - Real-world Data

4 Conclusions and Future Work

Fu et al. (Monash) sparseMI September 8, 2011 2 / 37

Page 3

Problem Definition

Input

A data set of labelled bags $\{(X_i, y_i) \mid i = 1, \ldots, l\}$ with $X_i = \{x_{i,1}, \ldots, x_{i,n_i}\}$ and $y_i \in \{-1, 1\}$ such that $y_i = \max_{p=1,\ldots,n_i} y_{i,p}$

MI Assumption

A positive bag contains at least one positive instance

A negative bag contains negative instances only

Output

A classifier $f : 2^{\mathbb{R}^d} \to \mathbb{R}$ for bag-level label prediction that minimizes the bag classification error
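For concreteness, a minimal sketch of how such a bag data set might be represented in code; the names and toy values are illustrative assumptions, not from the paper:

```python
import numpy as np

# Each bag X_i is an (n_i, d) array of instance feature vectors; y_i is in {-1, +1}.
rng = np.random.default_rng(0)
bags = [rng.normal(size=(n, 2)) for n in (3, 5, 2)]   # X_1, X_2, X_3 with d = 2
labels = np.array([1, -1, 1])                          # bag labels y_1, y_2, y_3
```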

Fu et al. (Monash) sparseMI September 8, 2011 3 / 37

Page 4

An Example of MI Classification

Fu et al. (Monash) sparseMI September 8, 2011 4 / 37

Page 5

Applications of MI Classification

Drug Activity Prediction

Region-based Image Classification

Fu et al. (Monash) sparseMI September 8, 2011 5 / 37

Page 6

Prediction Function for Kernel SVM

\[ f(x) = \sum_{i=1,\ \alpha_i \neq 0}^{N} \alpha_i\, \kappa(x_i, x) + b \]

Prediction cost is proportional to the number of SVs (the $x_i$'s with $\alpha_i \neq 0$)

The number of SVs in turn depends on the training data size and the complexity of the decision boundary

It is desirable to have SVM classifiers with few SVs when prediction speed is a major concern
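A minimal sketch of this prediction function, assuming a Gaussian kernel (function names are illustrative); it makes explicit that the cost is one kernel evaluation per SV:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel: kappa(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_predict(x, support_vectors, alphas, b, gamma=1.0):
    """f(x) = sum over SVs of alpha_i * kappa(x_i, x) + b.
    Only the x_i with alpha_i != 0 are kept, so cost grows with the number of SVs."""
    return sum(a * rbf_kernel(sv, x, gamma)
               for a, sv in zip(alphas, support_vectors)) + b
```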

Fu et al. (Monash) sparseMI September 8, 2011 6 / 37

Page 7

Why is Sparsity Important for MI Classification?

A larger number of feature vectors to deal with compared to standard classification

Bag-level prediction usually involves going through all instances in the bag and accumulating the results

Challenge

To build a sparse SVM classifier for MI classification comprising fewer SVs without compromising performance

Fu et al. (Monash) sparseMI September 8, 2011 7 / 37

Page 8

“Label-Max” and Existing MI Formulations

MI Assumption

\[ y_i = \max_{p} y_{i,p}, \qquad y_i \in \{-1, 1\},\ y_{i,p} \in \{-1, 1\} \]

This naturally translates to the "Label-Max" constraint in existing MI formulations (MI-SVM, mi-SVM, etc.)

\[ F(X_i) = \max_{p} f(x_{i,p}) \]

Difficult optimization problem (non-differentiable, non-convex)

Learning a sparse MI classifier further complicates the issue, with additional constraints on $f$

Fu et al. (Monash) sparseMI September 8, 2011 8 / 37

Page 9

The “Label-Mean” Surrogate

Target Function

\[ \min_{f}\ \|f\|^2 + C \sum_{i} \ell(F(X_i), y_i) \]

“Label-Mean” Constraints

\[ F(X_i) = \frac{1}{n_i} \sum_{p} f(x_{i,p}) \]

Pros and Cons

Simple optimization problem with optimality guarantee

Violation of MI assumption, wrong model?

Not necessarily the case.

Fu et al. (Monash) sparseMI September 8, 2011 9 / 37


Page 11

Connections with MI-Kernel

Key results

Lemma 1: Training MI-SVM with Label-Mean is equivalent to training a standard SVM at bag level with the normalized set kernel.

Theorem 1: If positive and negative instances are separable with respect to the feature map $\phi$ in the RKHS induced by kernel $\kappa$, then for a sufficiently large integer $r$, positive and negative bags are separable using the Label-Mean prediction function with instance-level kernel $\kappa^r$.
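For reference, the normalized set kernel mentioned in Lemma 1 averages the instance-level kernel over all instance pairs of two bags; a small sketch under that standard definition (treat the exact normalization as an assumption), with the powered kernel $\kappa^r$ from Theorem 1 exposed as a parameter:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def normalized_set_kernel(X, Y, gamma=1.0, r=1):
    """K(X, Y) = (1 / (|X| |Y|)) * sum_{x in X} sum_{y in Y} kappa(x, y)^r.
    With r > 1 this uses the powered instance kernel kappa^r from Theorem 1."""
    return float(np.mean([rbf(x, y, gamma) ** r for x in X for y in Y]))
```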

Fu et al. (Monash) sparseMI September 8, 2011 11 / 37

Page 12

Intuitions

If instances are linearly separable, then bags are separable by apolynomial kernel.

If instances are separable by a Gaussian kernel, then bags areseparable by a Gaussian kernel with larger bandwidth.

Note that the instance prediction function $f$ is a kernel classifier (a linear classifier is unlikely to work at bag level)

We used the Gaussian kernel in our work

Fu et al. (Monash) sparseMI September 8, 2011 12 / 37

Page 13

Prediction with “Label-Mean”

Instance classifier

\[ f(x) = \sum_{i=1}^{N} \sum_{p=1}^{|X_i|} \alpha_i\, y_i\, \kappa(x_{i,p}, x) + b \]

Bag classifier

\[ F(X) = \frac{1}{|X|} \sum_{x \in X} f(x) \]

The complexity depends on:

the size of the testing bag

the nonzero coefficients $\alpha_i$

the size of the training bags with nonzero coefficients
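A direct transcription of the two formulas above into code, assuming a Gaussian kernel; names are illustrative, and the normalization follows this slide rather than the set-kernel form on the next one:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def f_instance(x, train_bags, alphas, labels, b, gamma=1.0):
    """f(x) = sum_i sum_p alpha_i * y_i * kappa(x_{i,p}, x) + b."""
    s = b
    for Xi, ai, yi in zip(train_bags, alphas, labels):
        if ai != 0:                       # only bags with nonzero alpha_i contribute
            s += ai * yi * sum(rbf(xp, x, gamma) for xp in Xi)
    return s

def F_bag(X, train_bags, alphas, labels, b, gamma=1.0):
    """F(X) = (1/|X|) * sum_{x in X} f(x): the average of the instance scores."""
    return float(np.mean([f_instance(x, train_bags, alphas, labels, b, gamma) for x in X]))
```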

Fu et al. (Monash) sparseMI September 8, 2011 13 / 37

Page 14

Prediction with “Label-Mean”

\[ F(X) = \frac{1}{|X|} \sum_{q=1}^{|X|} \sum_{i=1}^{l} \alpha_i\, y_i\, \frac{1}{n_i} \sum_{p=1}^{n_i} \kappa(x_{i,p}, x_q) + b = \frac{1}{|X|} \sum_{q=1}^{|X|} \sum_{j=1}^{N} \beta_j\, \kappa(z_j, x_q) + b \]

\[ \{z_1, \ldots, z_{n_1+1}, \ldots, z_N\} \Longleftrightarrow \{x_{1,1}, \ldots, x_{2,1}, \ldots, x_{l,n_l}\} \]

\[ \{\beta_1, \ldots, \beta_{n_1+1}, \ldots, \beta_N\} \Longleftrightarrow \left\{\tfrac{1}{n_1} y_1 \alpha_1, \ldots, \tfrac{1}{n_2} y_2 \alpha_2, \ldots, \tfrac{1}{n_l} y_l \alpha_l\right\} \]

We want to reduce the number of $z$'s in the expansion!
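The re-indexing above is purely mechanical; a small sketch of how the flattened expansion $\{z_j, \beta_j\}$ could be collected from a bag-level solution (an illustrative helper, assuming the $\alpha_i$ and $y_i$ are available):

```python
import numpy as np

def flatten_expansion(train_bags, alphas, labels):
    """Enumerate all training instances as z_1, ..., z_N with weights
    beta_j = (1/n_i) * y_i * alpha_i for every instance of bag i."""
    zs, betas = [], []
    for Xi, ai, yi in zip(train_bags, alphas, labels):
        for xp in Xi:
            zs.append(xp)
            betas.append(yi * ai / len(Xi))
    return np.array(zs), np.array(betas)
```

Sparse-MI aims to replace this full list (one $z$ per instance of every SV bag) with a much smaller, learned set.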

Fu et al. (Monash) sparseMI September 8, 2011 14 / 37

Page 15

The Sparse-MI Problem

Objective

Learn $f(\cdot) = \sum_{j=1}^{N_{sv}} \beta_j\, \kappa(z_j, \cdot)$ with small $N_{sv}$ for MI classification

Alternatives

RSVM: select $N_{sv}$ points $z_j$ randomly from the training set

RS: approximate a learned classifier $\hat{f}$ by minimizing $\|\hat{f} - f\|^2$

MILES: enforce sparsity on the coefficients using the $L_1$ norm (cannot explicitly specify $N_{sv}$)

Fu et al. (Monash) sparseMI September 8, 2011 15 / 37

Page 16

Proposed Approach

\[ \min_{\beta, b, Z}\ Q(\beta, b, Z) = \beta^{T} K_Z \beta + C \sum_{i} \ell(y_i, F(X_i)) \]

\[ F(X_i) = \frac{1}{n_i} \sum_{p=1}^{n_i} \Big( \sum_{j=1}^{N_{sv}} \beta_j\, \kappa(z_j, x_{i,p}) + b \Big) \]

The squared hinge loss is used: $\ell(y, F) = \max(0,\ 1 - yF)^2$

Joint learning of classifier weights and SVs in a discriminative fashion

SVs do not necessarily overlap with training instances

Non-convex, but convex and differentiable in β and b given Z
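A minimal sketch of evaluating this objective with a Gaussian kernel, assuming the reconstruction above where the bias $b$ sits outside the inner sum (names are illustrative):

```python
import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    """Pairwise Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def sparse_mi_objective(beta, b, Z, bags, labels, C=1.0, gamma=1.0):
    """Q(beta, b, Z) = beta^T K_Z beta + C * sum_i max(0, 1 - y_i F(X_i))^2,
    where F(X_i) averages f(x) = sum_j beta_j * kappa(z_j, x) + b over the bag."""
    reg = float(beta @ rbf_matrix(Z, Z, gamma) @ beta)
    loss = 0.0
    for Xi, yi in zip(bags, labels):
        Fi = float(np.mean(rbf_matrix(np.asarray(Xi), Z, gamma) @ beta + b))
        loss += max(0.0, 1.0 - yi * Fi) ** 2
    return reg + C * loss
```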

Fu et al. (Monash) sparseMI September 8, 2011 16 / 37

Page 17

Optimization Scheme

Reformulation of the optimization problem

\[ \min_{Z}\ g(Z), \qquad \text{where } g(Z) = \min_{\beta, b}\ Q(\beta, b, Z) \]

What's special about $g(Z)$? It depends on the solution of another optimization problem.

Isn't it more difficult to optimize $g(Z)$?

Answer: Not necessarily!

Fu et al. (Monash) sparseMI September 8, 2011 17 / 37


Page 19

The Optimal Value Function

For our problem, g(Z) not only exists but is also differentiable

Conditions for differentiability (Bonnans 1998)

Uniqueness of the optimal solution (strict convexity) in $\beta$ and $b$ given $Z$: a unique value of $g(Z)$ at each $Z$

Continuous differentiability of $Q$ over $\beta$ and $b$: avoids drastic changes of $g(Z)$ over a local neighbourhood, i.e. differentiability

Fu et al. (Monash) sparseMI September 8, 2011 19 / 37

Page 20

Computation of derivatives

We can compute the derivative of $g(Z)$ for each $Z$ as if it did not depend on $\beta$ and $b$

\[ \frac{\partial g}{\partial z_j} = \sum_{i=1}^{N_{sv}} \beta_i \beta_j\, \frac{\partial \kappa(z_i, z_j)}{\partial z_j} + C \sum_{i=1}^{m} \frac{1}{n_i}\, \ell'(y_i, f_i)\, \beta_j \sum_{p=1}^{n_i} \frac{\partial \kappa(x_{i,p}, z_j)}{\partial z_j} \]

The $\beta_j$'s above are the optimal values of the coefficients for the current $Z$ (i.e. the solution of the inner problem)
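A sketch of this gradient for the Gaussian kernel, for which $\partial \kappa(x, z_j)/\partial z_j = 2\gamma (x - z_j)\, \kappa(x, z_j)$; it follows the formula above term by term, with $\beta$ and $b$ held at the optimum of the inner problem (names are illustrative):

```python
import numpy as np

def rbf_matrix(A, B, gamma=1.0):
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def grad_g_wrt_zj(j, beta, b, Z, bags, labels, C=1.0, gamma=1.0):
    """Gradient of g with respect to z_j, treating beta and b as fixed at their
    optimal values for the current Z (squared hinge: l'(y, F) = -2y * max(0, 1 - yF))."""
    zj = Z[j]
    # Regularizer term: sum_i beta_i * beta_j * d kappa(z_i, z_j) / d z_j
    kz = np.exp(-gamma * np.sum((Z - zj) ** 2, axis=1))
    grad = ((beta * beta[j] * kz)[:, None] * (2 * gamma * (Z - zj))).sum(axis=0)
    # Loss term: C * sum_i (1/n_i) * l'(y_i, F_i) * beta_j * sum_p d kappa(x_{i,p}, z_j) / d z_j
    for Xi, yi in zip(bags, labels):
        Xi = np.asarray(Xi)
        Fi = float(np.mean(rbf_matrix(Xi, Z, gamma) @ beta + b))
        lprime = -2.0 * yi * max(0.0, 1.0 - yi * Fi)
        kx = np.exp(-gamma * np.sum((Xi - zj) ** 2, axis=1))
        grad += C * (1.0 / len(Xi)) * lprime * beta[j] * (kx[:, None] * (2 * gamma * (Xi - zj))).sum(axis=0)
    return grad
```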

Fu et al. (Monash) sparseMI September 8, 2011 20 / 37

Page 21

Algorithm
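The algorithm box from this slide is not reproduced in the transcript. Based on the reformulation $\min_Z g(Z)$ with $g(Z) = \min_{\beta,b} Q(\beta,b,Z)$ and the gradient above, a plausible outline is sketched below; the initialization and step-size details are assumptions, not taken from the slides:

```python
# Sparse-MI, high-level sketch (illustrative pseudocode, not the paper's exact listing):
#
# 1. Initialize Z with Nsv points, e.g. a random subset of the training instances.
# 2. Repeat until g(Z) stops decreasing (or a maximum number of iterations is reached):
#      a. Inner problem: solve min_{beta, b} Q(beta, b, Z) in the primal
#         (convex and differentiable, so Newton or gradient methods apply).
#      b. Compute dg/dz_j at the optimal (beta, b) for every z_j (previous slide).
#      c. Update Z by a descent step on g(Z), e.g. gradient descent with line search.
# 3. Return (beta, b, Z), giving the sparse classifier f(x) = sum_j beta_j * kappa(z_j, x) + b.
```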

Fu et al. (Monash) sparseMI September 8, 2011 21 / 37

Page 22

Remarks

The same optimization strategy has been adopted in solving other problems (SimpleMKL, SKLA)

Sparse-MI is most related to SKLA, with two main differences:
  - SKLA cannot be used to handle MI classification problems (we need an MI formulation with a unique optimal solution)
  - SKLA performs optimization and the update of SVs in the dual formulation (optimization in the primal is much more efficient)

Fu et al. (Monash) sparseMI September 8, 2011 22 / 37

Page 23

Multi-class Sparse MI Classifier

One-vs-all scheme: converts the task into multiple binary MI classification problems

The $z_j$'s are learned jointly: the same XVs are shared across the different binary classifiers

\[ \frac{\partial g}{\partial z_j} = \sum_{c=1}^{M} \sum_{i=1}^{N_{sv}} \beta_i^{c} \beta_j^{c}\, \frac{\partial \kappa(z_i, z_j)}{\partial z_j} + C \sum_{c=1}^{M} \sum_{i=1}^{m} \frac{1}{n_i}\, \ell'(y_i^{c}, f_i)\, \beta_j^{c} \sum_{p=1}^{n_i} \frac{\partial \kappa(x_{i,p}, z_j)}{\partial z_j} \]

Fu et al. (Monash) sparseMI September 8, 2011 23 / 37

Page 24

Synthetic Data 1

Data Set

Fu et al. (Monash) sparseMI September 8, 2011 24 / 37

Page 25

Synthetic Data 1

Iteration 1

Fu et al. (Monash) sparseMI September 8, 2011 25 / 37

Page 26

Synthetic Data 1

Iteration 10

Fu et al. (Monash) sparseMI September 8, 2011 26 / 37

Page 27

Synthetic Data 1

Function values over iterations

Fu et al. (Monash) sparseMI September 8, 2011 27 / 37

Page 28

Synthetic Data 2

Data Set

Fu et al. (Monash) sparseMI September 8, 2011 28 / 37

Page 29

Synthetic Data 2

Iteration 1

Fu et al. (Monash) sparseMI September 8, 2011 29 / 37

Page 30

Synthetic Data 2

Iteration 10

Fu et al. (Monash) sparseMI September 8, 2011 30 / 37

Page 31

Synthetic Data 2

Function values over iterations

Fu et al. (Monash) sparseMI September 8, 2011 31 / 37

Page 32

Data Set Descriptions

Drug Activity Classification
  - MUSK1: 47 positive and 45 negative bags, 5.2 instances per bag
  - MUSK2: 39 positive and 63 negative bags, 64.7 instances per bag

Image Classification
  - COREL10: 10 classes, 100 images per class
  - COREL20: 20 classes, 100 images per class
  - 2-13 regions per image

Music Classification
  - GENRE: 10 classes, 100 songs per class, 30 segments per song

Fu et al. (Monash) sparseMI September 8, 2011 32 / 37

Page 33

Performance Comparison

Data set           MUSK1     MUSK2     COREL10   COREL20   GENRE

mi-SVM    (acc.)   87.30%    80.52%    75.62%    52.15%    80.28%
          (#SVs)   400.2     2029.2    1458.2    2783.3    12439

MI-SVM    (acc.)   77.10%    83.20%    74.35%    55.37%    72.48%
          (#SVs)   277.1     583.4     977       2300.1    3639.8

MI-Kernel (acc.)   89.93%    90.26%    84.30%    73.19%    77.05%
          (#SVs)   362.4     3501      1692.6    3612.7    13844

MILES     (acc.)   85.73%    87.64%    82.86%    69.16%    69.87%
          (#SVs)   40.8      42.5      379.1     868       1127.9

Nsv = 10
  RS               88.68%    86.39%    75.13%    55.20%    54.68%
  RSVM             74.66%    77.70%    69.25%    48.53%    46.01%
  SparseMI         88.44%    88.52%    80.10%    62.49%    71.28%

Nsv = 100
  RS               90.18%    89.16%    78.81%    65.35%    67.77%
  RSVM             89.02%    88.26%    77.63%    63.82%    67.41%
  SparseMI         90.40%    87.98%    84.31%    72.22%    76.06%

Fu et al. (Monash) sparseMI September 8, 2011 33 / 37

Page 34

Convergence Results

COREL20

Fu et al. (Monash) sparseMI September 8, 2011 34 / 37

Page 35

Convergence Results

GENRE

Fu et al. (Monash) sparseMI September 8, 2011 35 / 37

Page 36

Conclusions

A sparse SVM classifier proposed for MI classification

Joint optimization of SVs and classifier weights

Efficient learning in the primal formulation

With controlled sparsity

Comparable performance but much more efficient in testing

Fu et al. (Monash) sparseMI September 8, 2011 36 / 37

Page 37

Future Work

Sparse MI classification with non i.i.d. instance distributions

Integration with alternative convex MI classifiers

Constrained SV selection

Standard sparse SVM learning from the primal formulation

Fu et al. (Monash) sparseMI September 8, 2011 37 / 37