Selecting from an infinite set of features in SVM



Rémi Flamary, Florian Yger, Alain Rakotomamonjy
LITIS EA 4108, Université de Rouen, 76800 Saint Étienne du Rouvray, France

How to extract features?

- Continuous parameters for feature extraction ⇒ an infinite set of candidate features.
- Selecting from a finite number of values by Cross-Validation or MKL [1] is limited to a small number of parameters.
- Infinite MKL [2] handles continuous parameters but is limited to small-scale datasets.
- We propose an active set algorithm for feature extraction and classifier learning: learning from continuously parametrized features on large-scale datasets.

Examples of infinite sets

- 2D Gabor functions for texture recognition (see the sketch after this list):
  $$g(u, v) = e^{-\left(\frac{u^2}{2\sigma_1} + \frac{v^2}{2\sigma_2}\right)}\, e^{i \pi f (u \cos\theta + v \sin\theta)}$$
  4 parameters: $\theta$, $f$, $\sigma_1$, $\sigma_2$.
- Signal filtering for Brain-Computer Interfaces: for Motor Imagery, a $[f_{\min}, f_{\max}]$ bandpass filter is applied to the signals. 2 parameters: $f_{\min}$, $f_{\max}$.
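As a concrete illustration, here is a minimal NumPy sketch of the Gabor feature extraction above. How the complex filter response is aggregated into a scalar feature is not specified on the poster; taking the magnitude of the correlation with the patch is our assumption.

import numpy as np

def gabor_feature(patch, theta, f, sigma1, sigma2):
    """One scalar Gabor feature of a 2D patch for parameters (theta, f, sigma1, sigma2)."""
    h, w = patch.shape
    # Centered coordinate grid (u horizontal, v vertical) over the patch.
    v, u = np.meshgrid(np.arange(h) - h / 2.0, np.arange(w) - w / 2.0, indexing="ij")
    # g(u, v) = exp(-(u^2/(2 sigma1) + v^2/(2 sigma2))) * exp(i pi f (u cos(theta) + v sin(theta)))
    g = (np.exp(-(u**2 / (2.0 * sigma1) + v**2 / (2.0 * sigma2)))
         * np.exp(1j * np.pi * f * (u * np.cos(theta) + v * np.sin(theta))))
    # Magnitude of the correlation between the filter and the patch (assumption).
    return np.abs(np.vdot(g, patch))

Each parameter vector (theta, f, sigma1, sigma2), evaluated over all training patches, then yields one column of the feature map $\Phi_\theta$.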

Framework

- $n$ training examples $\{x_i, y_i\}_{i=1}^n$ with $x_i \in \mathcal{X}$ and $y_i \in \{-1, 1\}$.
- $\phi_\theta(\cdot)$ is a feature extraction parametrized by $\theta$.
- The decision function is
  $$f(x) = \sum_{j=1}^{N} \langle w_j, \phi_{\theta_j}(x) \rangle_{\mathcal{X}_{\theta_j}} \qquad (1)$$
  where some of the $w_j$ are $0$.
- $\Phi$ is the matrix of feature maps, resulting from the concatenation of the $N$ matrices $\{\Phi_{\theta_j}\}$.
- $\Phi$ is normalized to unit norm and $\tilde{\Phi} = \mathrm{diag}(y)\,\Phi$ (see the sketch below).
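A minimal sketch of how $\Phi$ and $\tilde{\Phi}$ can be assembled, assuming "unit norm" means unit-norm columns (the poster does not say which normalization is meant):

import numpy as np

def build_design(Phi_blocks, y):
    """Concatenate per-parameter feature maps and form tilde(Phi) = diag(y) Phi."""
    Phi = np.hstack(Phi_blocks)                             # concatenate the Phi_theta_j
    Phi = Phi / np.linalg.norm(Phi, axis=0, keepdims=True)  # unit-norm columns (assumption)
    return y[:, None] * Phi                                 # diag(y) Phi without forming diag(y)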

Fixed number of features

- Optimization problem:
  $$\min_{w,b}\ J(w) = \frac{C}{2n}\, (\mathbb{1} - \tilde{\Phi} w)_+^T (\mathbb{1} - \tilde{\Phi} w)_+ + \Omega(w) \qquad (2)$$
  where $[\tilde{\Phi} w]_i = f(x_i)$, $\mathbb{1}$ is the vector of ones, $(\cdot)_+ = \max(0, \cdot)$ is the element-wise positive part of a vector, and $\Omega$ is an $\ell_1$-$\ell_2$ norm.

- Optimality conditions are:
  $$-r_i + \frac{w_i}{\|w_i\|_2} = \vec{0} \quad \forall i,\ w_i \neq \vec{0}$$
  $$\|r_i\|_2 \leq 1 \quad \forall i,\ w_i = \vec{0} \qquad (3)$$
  with $r_i = \frac{C}{n} \tilde{\Phi}_i^T (\mathbb{1} - \tilde{\Phi} w)_+$ (a sketch of this check follows below).
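The residuals $r_i$ and the check of conditions (3) translate directly into code. This is a sketch under one layout assumption: the columns of $\tilde{\Phi}$ are stored as a list of per-group blocks, one block per feature parameter $\theta_i$.

import numpy as np

def block_residuals(Phi_t_blocks, Phi_t_w, C, n):
    """r_i = (C/n) * tilde(Phi)_i^T (1 - tilde(Phi) w)_+, one vector per group i."""
    slack = np.maximum(0.0, 1.0 - Phi_t_w)   # element-wise positive part
    return [(C / n) * B.T @ slack for B in Phi_t_blocks]

def satisfies_conditions_3(r_list, w_list, tol=1e-6):
    """Check (3): active groups need r_i = w_i/||w_i||_2, inactive ones ||r_i||_2 <= 1."""
    for r, wi in zip(r_list, w_list):
        nw = np.linalg.norm(wi)
        if nw > 0.0 and np.linalg.norm(r - wi / nw) > tol:
            return False
        if nw == 0.0 and np.linalg.norm(r) > 1.0 + tol:
            return False
    return True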

Active Set Algorithm

1: Set A = ∅ (initial active set)
2: Set w = 0
3: repeat
4:   w ← solve problem (2) with features from A
5:   (r, i) ← max over i ∈ A^c of ||r_i||_2 (value and argmax)
6:   if r > 1 then
7:     A = A ∪ {i}
8:   end if
9: until r ≤ 1

- The most violating feature is added to speed up convergence (Line 5).
- The sub-problem is solved quickly with a Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [3] (Line 4); a sketch of the full loop follows below.
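Putting the pieces together, here is a hedged sketch of the active set loop with FISTA [3] on the sub-problem. The proximal operator of the $\ell_1$-$\ell_2$ norm is block soft-thresholding; the step size $1/L$ and the iteration counts are our choices, not specified on the poster.

import numpy as np

def prox_l1l2(w, groups, t):
    """Block soft-thresholding: prox of t * sum_i ||w_i||_2."""
    out = w.copy()
    for g in groups:                               # g is an index slice of one group
        nrm = np.linalg.norm(w[g])
        out[g] = 0.0 if nrm <= t else (1.0 - t / nrm) * w[g]
    return out

def fista_subproblem(Phi_t, groups, C, n_iter=200):
    """FISTA on (2): squared-hinge data term plus the l1-l2 penalty."""
    n, d = Phi_t.shape
    L = (C / n) * np.linalg.norm(Phi_t, 2) ** 2    # Lipschitz constant of the loss gradient
    w = np.zeros(d)
    z = w.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = -(C / n) * Phi_t.T @ np.maximum(0.0, 1.0 - Phi_t @ z)
        w_new = prox_l1l2(z - grad / L, groups, 1.0 / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = w_new + ((t - 1.0) / t_new) * (w_new - w)
        w, t = w_new, t_new
    return w

def active_set(thetas, feature_fn, C, n, max_feat=100):
    """Active set loop; thetas is a list of hashable parameter values and
    feature_fn(theta) returns the normalized n x d_theta block of tilde(Phi)."""
    active, blocks = [], []
    slack = np.ones(n)                             # (1 - tilde(Phi) w)_+ for w = 0
    while len(active) < max_feat:
        cands = [t for t in thetas if t not in active]
        scores = [np.linalg.norm((C / n) * feature_fn(t).T @ slack) for t in cands]
        k = int(np.argmax(scores))                 # Line 5: most violating feature
        if scores[k] <= 1.0:                       # Line 9: optimality reached
            break
        active.append(cands[k])                    # Line 7
        blocks.append(feature_fn(cands[k]))
        Phi_t = np.hstack(blocks)
        offs = np.cumsum([0] + [b.shape[1] for b in blocks])
        groups = [slice(offs[i], offs[i + 1]) for i in range(len(blocks))]
        w = fista_subproblem(Phi_t, groups, C)     # Line 4
        slack = np.maximum(0.0, 1.0 - Phi_t @ w)
    return active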

Extension to the infinite set

- Aim: find a finite set $\Theta$ of features minimizing $J(w)$.
- The new optimality conditions are:
  $$-r_i + \frac{w_i}{\|w_i\|_2} = \vec{0} \quad \forall i,\ w_i \neq \vec{0}$$
  $$\|r_i\|_2 \leq 1 \quad \forall i,\ w_i = \vec{0}$$
  $$\|\tilde{\Phi}_{\theta_s}^T (\mathbb{1} - \tilde{\Phi} w)_+\|_2 \leq 1 \quad \forall\, \theta_s \notin \Theta \qquad (4)$$
- It is not possible to check optimality for all $\theta$.
- Optimality is instead checked on a randomly drawn finite set of $\theta_s \notin \Theta$ (Line 5).
- The most violating feature from this random subset is added to the active set (see the sketch below).
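A sketch of this randomized check: draw a finite candidate set, score each draw by the left-hand side of the last condition in (4) (scaled by $C/n$ for consistency with the definition of $r_i$ in (3)), and return the most violating one. draw_theta and feature_fn are hypothetical helpers standing in for the parameter sampler and the feature extraction.

import numpy as np

def most_violating_random(rng, n_draws, draw_theta, feature_fn, slack, C, n):
    """Randomized check of (4); slack = (1 - tilde(Phi) w)_+ from the current model."""
    thetas = [draw_theta(rng) for _ in range(n_draws)]   # candidates theta_s outside Theta
    scores = [np.linalg.norm((C / n) * feature_fn(t).T @ slack) for t in thetas]
    k = int(np.argmax(scores))
    return thetas[k], scores[k]     # a score above 1 violates (4): add thetas[k]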

Figure 1: Textures D29 (left) and D92 (right) from the Brodatz Dataset

Texture Recognition Dataset

- Classifying 16×16 patches from Brodatz textures D29 and D92.
- Fixed ("Sampled" in Figure 2) and random 2D Gabor marginal features are compared.
- C is set to 10.

Figure 2: Accuracy (0.80 to 0.90) vs. number of training examples (10^2 to 10^5) for Sampled and Random features, with 81 Gabor features (left) and 1296 (right).

BCI Dataset

- Dataset IIa from BCI Competition IV.
- Comparison between a fixed [8, 30] Hz bandpass and a random bandpass of width at least 20 Hz inside [8, 30] Hz (see the sketch below).
- A CSP [4] is applied to the filtered signals and the most discriminant spatial filters are kept.
- The number of selected filters and C are chosen through Cross-Validation.
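A sketch of the random bandpass draw; the filter family and order (a zero-phase Butterworth of order 4 via scipy.signal) are our assumptions, as the poster does not state them.

import numpy as np
from scipy.signal import butter, filtfilt

def random_bandpass(x, fs, rng, f_lo=8.0, f_hi=30.0, min_width=20.0, order=4):
    """Draw [f_min, f_max] with f_max - f_min >= min_width inside [f_lo, f_hi] Hz,
    then bandpass-filter the signal x (sampling rate fs) before applying CSP [4]."""
    f_min = rng.uniform(f_lo, f_hi - min_width)
    f_max = rng.uniform(f_min + min_width, f_hi)
    b, a = butter(order, [f_min, f_max], btype="bandpass", fs=fs)
    return filtfilt(b, a, x, axis=-1), (f_min, f_max)

For example, random_bandpass(x, fs=250.0, rng=np.random.default_rng(0)) draws one band per call; Dataset IIa signals are sampled at 250 Hz.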

Method     S1     S2     S3     S4     S5     S6     S7     S8     S9     Avg
CSP [4]    88.89  51.39  96.53  70.14  54.86  71.53  81.25  93.75  93.74  78.01
Fixed      88.19  53.47  96.53  63.89  60.42  69.44  79.17  97.92  93.06  78.01
Random     90.97  52.78  95.14  73.61  62.50  72.92  82.64  97.22  92.36  80.01

Table 1: Classification accuracy (%) on the test set for the classical CSP approach and for fixed and random bandpass filtering for feature extraction on the BCI dataset.

Conclusion

- An active set algorithm for feature extraction and classifier learning.
- Handles large-scale problems.
- Automated selection of continuous parameters.

References
[1] G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[2] P. Gehler and S. Nowozin. Let the kernel figure it out: Principled learning of pre-processing for kernel classifiers. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009.
[3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2:183–202, 2009.
[4] F. Lotte and C. Guan. Regularizing common spatial patterns to improve BCI designs: Unified theory and new algorithms. IEEE Transactions on Biomedical Engineering, to appear, 2010.

http://www.litislab.eu
{remi.flamary,florian.yger,alain.rakoto}@insa-rouen.fr