Neurocomputing 275 (2018) 1752–1768
Contents lists available at ScienceDirect. Neurocomputing. Journal homepage: www.elsevier.com/locate/neucom

Feature weighting for multinomial kernel logistic regression and application to action recognition

Ouiza Ouyed a, Mohand Said Allili a,b,∗

a Department of Computer Science and Engineering, University of Quebec in Outaouais, Gatineau J8X 3X7, Quebec, Canada
b Département d'informatique et d'ingénierie, Université du Québec en Outaouais, 101, Rue St-Jean-Bosco, Local B-2022, Gatineau J8X 3X7, Québec, Canada

Article history: Received 13 January 2017; Revised 13 October 2017; Accepted 23 October 2017; Available online 7 November 2017. Communicated by Dr. Xin Luo.

Keywords: Multinomial kernel logistic regression; Feature relevance; Sparse models; Video action recognition

Abstract: Multinomial kernel logistic regression (MKLR) is a supervised classification method designed for separating classes with non-linear boundaries. However, it relies on the assumption that all features are equally important, which may decrease classification performance when dealing with high-dimensional and noisy data. We propose an approach for embedding feature relevance in multinomial kernel logistic regression. Our approach, coined fr-MKLR, generalizes MKLR by introducing a feature weighting scheme in the Gaussian kernel and using the so-called $\ell_0$-"norm" as sparsity-promoting regularization. Therefore, the contribution of each feature is tuned according to its relevance for classification, which leads to more generalizable and interpretable sparse models for classification. Application of our approach to several standard datasets and to video action recognition has provided very promising results compared to other methods.

© 2017 Elsevier B.V. All rights reserved.

∗ Corresponding author at: Department of Computer Science and Engineering, University of Quebec in Outaouais, Gatineau J8X 3X7, Quebec, Canada. E-mail address: [email protected] (M.S. Allili).
https://doi.org/10.1016/j.neucom.2017.10.024
0925-2312/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Sparse models for classification have received increasing attention for supervised learning [41,59,60]. These include the least absolute shrinkage and selection operator (LASSO) [72,83,94], the import vector machine (IVM) [97], sparse representation-based classification (SRC) [33,93,96] and many others. Generally speaking, model sparsity in classifiers can come in two different flavours. First (instance selection), sparsity can refer to selecting subsets (or instances) of learning data that are representative for the classification at hand [10,33,93,96]. This approach is used, for example, in support vector machines (SVM) [86] and IVM [97] to produce compact sets of data (i.e., support vectors) used for classification. In SRC [93,96], classification is based on minimizing the reconstruction error with regard to class data. Second (feature selection), sparsity refers to reducing the feature space to a small subset of dimensions carrying sufficient information for classification; this paper deals with this type of sparsity. Worth mentioning are also methods for learning representations using deep neural networks (DNNs), where relevant features are automatically extracted for recognition problems [8,29]. However, given the huge number of parameters required for representation learning in DNNs, a large amount of training data and computation time is required [48,81]. To reduce the number of parameters, sparse models for DNNs have been proposed in [28,71]. These methods, although gaining computation efficiency at the expense of some accuracy, still require a huge number of data for representation learning.

Recently, kernel-based methods have shown considerable success in the literature of supervised classification [77]. They are based on implicit non-linear embedding of data into high-dimensional spaces using the kernel trick to enable linear separation of classes in the target space, which translates into non-linear boundaries in the original space [86]. To reduce the effect of high dimensionality, methods have been proposed to introduce sparsity in kernel-based classification. For instance, [97] proposed the import vector machine (IVM) method based on kernel logistic regression. In a similar way to SVM, IVM uses only a fraction of the training data to index the kernel basis functions. However, thanks to its probabilistic formulation, IVM is more easily extendable to multi-class classification. Following the success of the SRC approach applied to face recognition [90], other methods were proposed to obtain sparse classifiers by minimizing reconstruction errors with regard to class data [33,93,96]. These methods use mainly $\ell_1$-based regularization to achieve sparsity in model learning, but they are computationally intensive. Noteworthy are also methods using multiple kernel learning (MKL) for achieving better classification generalization [13,47,64]. Motivated by multi-resolution wavelet theory, these methods allow learning complex class boundaries and induce sparsity accommodating both fine details and large smooth class boundaries [4,47,50].


Most of the above methods embed sparsity in the kernel space spanned by the instances of the learning data. That is, sparsity comes in terms of data instances after implicit kernel transformation using the inner product. However, the effect of noisy features of the original space on classification remains unaffected [33,64]. To alleviate this effect, some methods [15,91] have attempted to weight features in kernel spaces using the RELIEF algorithm [80]. Although these methods can efficiently rank features according to their class discrimination, they do not give optimal thresholds for assigning final class labels [19]. Other methods use anisotropic kernels in SVM-based classifiers by weighting features according to their relevance for classification. Feature weights are determined in these methods by direct optimization using evolutionary algorithms [24,25] or gradient descent methods [16,27,76]. Kernel alignment is also used for efficient gradient-based optimization in multi-scale kernel modeling [27,43,70]. However, since the same weights are assigned for all classes, these methods are more adapted to binary classification. Indeed, in multi-class problems, relevant features for discriminating one class may not be relevant for discriminating other classes [35,36].

In this paper, we propose the fr-MKLR approach (feature relevance in multinomial kernel logistic regression), which incorporates feature relevance in non-linear and multi-class logistic regression. Unlike SVM and MKLR, fr-MKLR produces sparse classification models by embedding feature relevance (FR) directly in the kernel function and using an appropriate penalization term in the likelihood function, inducing feature sparsity in the final model. Besides formulating FR for each class data, fr-MKLR can deal with arbitrary numbers of classes and data dimensions and produces more interpretable classification models by selecting the most discriminative features for each class. Our contributions can be summarized in the following points:

• We propose an extension of multinomial kernel logistic regression (MKLR) that includes a feature-based sparsity scheme alleviating the effect of noisy features and enhancing the capability of classification generalization. To the best of our knowledge, sparse MKLR in terms of features has not been investigated in the past. Similarly to MKLR, fr-MKLR produces an estimate of the posterior probability for class data through the logit regression, which makes it easy to extend to multi-class data [86,97]. However, it yields better classification generalization than MKLR since the effect of irrelevant features is reduced.

• Contrarily to kernel methods ranking features based on explicit criteria such as the RELIEF algorithm [15,91], fr-MKLR is an embedded method allowing a seamless integration of feature weighting in the logistic regression model. In addition, it assigns different weights to each class, allowing the selection of the most discriminative features for each class. Finally, as in methods using anisotropic kernels in SVM classifiers, either through multi-scale kernel learning (MKL) [13,47,64] or direct feature weighting [16,43,76], fr-MKLR can fit both smooth and finely detailed class boundaries for better class discrimination. However, having a probabilistic formulation which is readily extendable to multi-class problems, fr-MKLR is more flexible than these methods [16,24,65,76,89].

• When compared to several methods such as LASSO, KSVM and SMLR, fr-MKLR provides better generalization in the case of scarce and overlapping class data with non-linear boundaries. Our approach has first been validated on several synthetic and real-world standard datasets and has demonstrated better generalization performance than the compared methods. We then applied fr-MKLR to human action recognition in videos. Action recognition is a very challenging task due to the high dimensionality and variability of video data, the diversity of human actions and inter-class overlapping [1]. Our application is based on a new action representation using shape context analysis of human silhouettes. Validation of our approach on standard action datasets (e.g., KTH, UIUC and I3DPost) has yielded better performance than recent methods based on naive Bayes, SMLR, KSVM and LASSO.

Early results of this work have been published in [66]. Herein, we give an in-depth theoretical analysis and literature review for our approach. Moreover, extensive validation results are presented with comparison to other classification methods.

The rest of this paper is organized as follows. Section 2 describes some related work. Section 3 describes MKLR and our model background. Section 4 describes our approach for feature relevance using MKLR. Section 5 provides a comparative evaluation of our method against some existing approaches. We end the paper with a conclusion and future work perspectives.

2. Related works

Sparse modeling (SM) has received an important focus for supervised and unsupervised learning. For unsupervised learning, SM has been used mainly for signal reconstruction [61], compressed sensing [14,21] and sparse coding [31], whereas for supervised learning, it has been used mainly for regression [83,84] and classification [62,88]. Since our work deals with classification, we limit the scope of our related work to methods promoting sparsity in supervised classification.

The literature of sparse models for classification starts with early approaches performing feature selection (FS), which can be mainly grouped into three categories. Filters select subsets of variables based on measures such as information gain [55,82] and statistical correlation analysis [35]. They are usually fast but can be biased since variables are chosen independently of the classification model. Wrappers choose subsets of features using the classification model as a black box [49]. Although they are less biased than filters, wrappers are computationally intensive. Embedded methods bridge the gap between filters and wrappers by embedding FS in the structure of the classifiers [36,42,54]. Embedded methods for FS can be roughly categorized into three main groups [52]: 1) forward-backward methods iteratively add/remove variables according to specific criteria such as the rate of change of an objective function [34,39,54,65] or the sensitivity to the leave-one-out error [67]; 2) scaling factor methods use hyper-parameters that are adjusted by model selection [2,3,30] or by minimizing a generalization error bound [89]; and 3) direct optimization methods incorporate sparsity-promoting terms based on the $\ell_0$ and $\ell_1$ norms that cause irrelevant feature weights to vanish [38]. Among the methods in this category we can find the $\ell_1$-regularized SVM [65,89] and sparse multinomial logistic regression (SMLR) [51] proposed for linear and kernel-based classification.

Sparsity in linear classifiers has been investigated in SVM [12,89] and sparse multinomial logistic regression (SMLR) [51] using $\ell_1$-based regularization. With the advent of kernel methods, new possibilities have arisen for improving classification algorithms in high-dimensional spaces [77]. Methods have attempted to combine sparsity and kernel methods to achieve two types of data reduction (instance and feature selection). In a similar way to SVM [86], Zhu and Hastie [97] have proposed the IVM algorithm based on the MKLR formulation. The IVM uses a greedy search to select instances of data to build support vectors that provide better classification generalization. Though a better performance than MKLR and SVM is reported, the algorithm is computationally intensive. In a similar way, Weston et al. [88] have used an approximate $\ell_0$-"norm" to yield sparsity in SVM classification. This approach has shown better performance than using SVM with $\ell_1$- or $\ell_2$-based regularization.


Sparse representations for recognition can be obtained using dictionary learning [23,33,93,96] and deep learning [5,29,44,46]. Approaches based on dictionary learning approximate an input signal using sparse linear combinations of data instances after projection into kernel spaces. Class assignment is then performed by minimizing the reconstruction error of newly observed data with regard to labeled ones for each class [23]. This approach has been used in [33] for action recognition in videos. Recently, representation learning using deep neural networks (DNNs) has received much interest for recognition problems [29]. Convolutional neural networks (CNNs) are one type of DNN that has shown good promise for object recognition in still images [20,48]. CNNs have also been extended to 3D configurations for action recognition in videos [5,17,44,46,68]. However, to ensure good success, these methods, as for dictionary learning, require a huge number of training data. In addition, DNNs are computationally intensive given the number of layers and parameters involved in deep architectures [45,48].

To ensure good spatial fitting of class boundaries, multiple kernel learning (MKL) has been recently proposed for supervised classification [4,6,47]. MKL is one particular type of multiple kernel methods [13,78] which combines kernels with different scales through a multi-kernel learning framework. This enables better classification generalization in spatially scattered and/or dense regions of the learning data. For MKL, a critical issue is to determine the different kernel coefficients weighting the contribution of each kernel function in the final decision boundary [6]. Methods using direct optimization have been proposed for MKL in SVM-based classification [6,32,47]. However, increasing the number of kernel functions can greatly increase the complexity of the optimization. Another way to improve classification generalization is by using anisotropic kernels to reduce the influence of irrelevant features [16,24,25,43,76]. They consist in assigning different kernel weights to the different dimensions of the data so that each feature contributes according to its discrimination capability. Optimization techniques are used to fix each dimension weight based on the learning data. However, most methods using this approach are limited to binary classification [43,70].

Most of the sparse methods have been designed for binary classification, and their formulation is not easily extendable to multi-class problems. To overcome this limitation, we propose a sparse model (fr-MKLR) based on multinomial kernel logistic regression (MKLR) which embeds feature relevance in kernel-based classification. Contrarily to methods implementing sparsity in terms of kernel functions (i.e., MKL [6,32,47]) or data instances (i.e., dictionary learning [23,33,93,96]), our method promotes feature sparsity using anisotropic kernels fitting different types of class boundaries. Given its probabilistic formulation, fr-MKLR performs soft classification while remaining easily extendable to multi-class problems.

Contrarily to SMLR [51], which weights features explicitly in the original or kernel spaces, our method embeds feature weighting directly in kernel construction and separately for each class data. This enables fitting non-linear class boundaries according to each class's discriminative features. Unlike sparse representation learning using dictionary or deep learning, our model does not require a huge number of data for ensuring good efficiency. Moreover, by performing direct optimization using a gradient-descent method, our model is computationally more efficient. Several experiments have demonstrated that our method compares favorably with regard to other sparse methods for several classification problems.

3. Multinomial kernel logistic regression

Multinomial kernel logistic regression (MKLR) is a supervised learning method that produces non-linear classification boundaries by transforming an input variable space into another space using a positive-definite kernel $K(\cdot,\cdot)$. The relationship between SVM and regularized function estimation in reproducing kernel Hilbert spaces (RKHS) has been established in [22]. By replacing the hinge loss function of SVM with the negative log-likelihood (NLL) of the binomial distribution, the same relation can be established with MKLR [38,97].

More specifically, let us have $n$ instances of training data $\mathbf{x}_i \in \mathbb{R}^d$, $i \in \{1,\ldots,n\}$, with $d$ measured features for each instance, which are generated from $m$ classes ($m \geq 2$). We associate an encoding vector $\mathbf{y}_i = [y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(m)}]^T$ with each data point $\mathbf{x}_i$, such that $y_i^{(j)} = 1$ if $\mathbf{x}_i$ belongs to class $j$ and $y_i^{(j)} = 0$ otherwise. The symbol $[\cdot]^T$ denotes the transpose of vectors/matrices. For binary classification ($m = 2$), we have $y_i \in \{0, 1\}$ and fitting a decision boundary is equivalent to searching for a function $f$ minimizing the NLL [97]:

$$L(f) = \sum_{i=1}^{n} \left\{ -y_i f(\mathbf{x}_i) + \ln\left[1 + \exp(f(\mathbf{x}_i))\right] \right\} + \frac{\lambda}{2}\, \|f\|^2_{\mathcal{H}_K}, \qquad (1)$$

where $\mathcal{H}_K$ is the RKHS generated by the kernel $K(\cdot,\cdot)$ and $\lambda$ controls the contribution of the regularization term $\|f\|^2_{\mathcal{H}_K}$ that smoothes $f$. Note that the NLL (1) is obtained by setting $p(y_i = 1\,|\,\mathbf{x}_i) = \exp(f(\mathbf{x}_i))/[1 + \exp(f(\mathbf{x}_i))]$ and $p(y_i = 0\,|\,\mathbf{x}_i) = 1/[1 + \exp(f(\mathbf{x}_i))]$. The optimal solution $f(\mathbf{x})$ minimizing (1) has the form:

$$f(\mathbf{x}) = \sum_{i=1}^{n} a_i K(\mathbf{x}, \mathbf{x}_i), \qquad (2)$$

where $a_i$, $i \in \{1,\ldots,n\}$, are real-valued coefficients. By defining the two vectors $\mathbf{a} = [a_1,\ldots,a_n]^T$ and $\mathbf{y} = [y_1,\ldots,y_n]^T$, function (1) can be re-written in a compact form as follows:

$$L(\mathbf{a}) = -\mathbf{y}^T \mathbf{K}\mathbf{a} + \mathbf{1}^T \ln\left[\mathbf{1} + \exp(\mathbf{K}\mathbf{a})\right] + \frac{\lambda}{2}\, \mathbf{a}^T \mathbf{K}\mathbf{a}, \qquad (3)$$

where $\mathbf{1} = [1, 1, \ldots, 1]^T$ is an $n$-dimensional vector of ones and $\mathbf{K}$ is a kernel matrix of dimension $n \times n$ with entries given by $K_{r,s} = K(\mathbf{x}_r, \mathbf{x}_s)$, $r, s \in \{1,\ldots,n\}$. Also, we have adopted the compact form $\ln[\mathbf{1} + \exp(\mathbf{K}\mathbf{a})]$ for $[\ln(1 + \exp(K(1,\cdot)\mathbf{a})), \ldots, \ln(1 + \exp(K(n,\cdot)\mathbf{a}))]^T$, with $K(i,\cdot)$, $i \in \{1,\ldots,n\}$, designating the $i$-th row of $\mathbf{K}$.

In a multi-class setting ($m > 2$), MKLR gives the posterior probability of class $j$ given an observation $\mathbf{x}_i$, which is written as $p_i^{(j)} = p(y_i^{(j)} = 1\,|\,\mathbf{x}_i)$. By defining a separate function $f_j(\mathbf{x})$ for each class $j$, the posterior probability of class $j$ given $\mathbf{x}_i$ can be written as follows:

$$p(y_i^{(j)} = 1\,|\,\mathbf{x}_i) = \frac{\exp(f_j(\mathbf{x}_i))}{\sum_{h=1}^{m} \exp(f_h(\mathbf{x}_i))}, \quad j = 1,\ldots,m, \qquad (4)$$

where $f_j(\mathbf{x}_i) \in \mathcal{H}_K$ is defined as $f_j(\mathbf{x}) = \sum_{i=1}^{n} a_{ij} K(\mathbf{x}, \mathbf{x}_i)$.

We put the coefficients of each function $f_j$ into a separate vector $\mathbf{a}_j = [a_{1j},\ldots,a_{nj}]^T$. Because $\sum_{j=1}^{m} p_i^{(j)} = 1$, we have $p_i^{(m)} = 1 - \sum_{j=1}^{m-1} p_i^{(j)}$. Thus, by setting $\mathbf{a}_m = \mathbf{0}$ as for linear logistic regression [51], only the set of parameters $A = \{\mathbf{a}_1,\ldots,\mathbf{a}_{m-1}\}$ has to be learned. For each data point $\mathbf{x}_i$, therefore, we associate a vector containing the class posterior probabilities $\mathbf{p}_i = [p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(m-1)}]^T$ defined as follows:

$$p_i^{(j)} = p(y_i^{(j)} = 1\,|\,\mathbf{x}_i) = \frac{\exp(f_j(\mathbf{x}_i))}{1 + \sum_{h=1}^{m-1} \exp(f_h(\mathbf{x}_i))}, \quad j = 1,\ldots,m-1. \qquad (5)$$

Note that $p_i^{(m)} = 1 - \sum_{j=1}^{m-1} p_i^{(j)} = \left[1 + \sum_{h=1}^{m-1} \exp(f_h(\mathbf{x}_i))\right]^{-1}$. Therefore, the multi-class penalized NLL can be formulated as follows (see Appendix A):

$$L(A) = -\sum_{j=1}^{m-1} \mathbf{y}^{(j)T} \mathbf{K}\mathbf{a}_j + \mathbf{1}^T \ln\left[\mathbf{1} + \sum_{h=1}^{m-1} \exp(\mathbf{K}\mathbf{a}_h)\right] + \frac{\lambda}{2} \sum_{j=1}^{m-1} \mathbf{a}_j^T \mathbf{K}\mathbf{a}_j, \qquad (6)$$

where $\mathbf{y}^{(j)} = [y_1^{(j)}, y_2^{(j)}, \ldots, y_n^{(j)}]^T$.
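For concreteness, here is a minimal NumPy sketch of these formulas (the function names are ours and this is not the authors' implementation): it evaluates the penalized binary NLL of Eq. (3) for a precomputed kernel matrix K and computes the multi-class posteriors of Eq. (5) as a softmax over $[f_1, \ldots, f_{m-1}, 0]$.

```python
import numpy as np

def binary_nll(K, y, a, lam):
    """Penalized NLL of Eq. (3): -y^T K a + 1^T ln(1 + exp(K a)) + (lam/2) a^T K a."""
    f = K @ a                                            # f(x_i) = sum_l a_l K(x_i, x_l), Eq. (2)
    return -y @ f + np.sum(np.logaddexp(0.0, f)) + 0.5 * lam * a @ (K @ a)

def multiclass_posteriors(K, A):
    """Class posteriors of Eq. (5); A stacks a_1, ..., a_{m-1} as columns (a_m = 0)."""
    F = np.column_stack([K @ A, np.zeros(K.shape[0])])   # append f_m = 0 for the reference class
    F -= F.max(axis=1, keepdims=True)                    # shift for numerical stability
    P = np.exp(F)
    P /= P.sum(axis=1, keepdims=True)
    return P[:, :-1]                                     # columns give p_i^(1), ..., p_i^(m-1)
```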


4. Feature relevance for MKLR

Feature relevance is motivated by the fact that a good combination of features usually leads to better classification than using each feature individually [34]. To deal with a large number of features and improve the predictive performance of MKLR, we propose to directly weight the features in the kernel of the radial basis function (RBF) of the MKLR. In other words, we use a weighted distance in the kernel function where each feature is scaled according to its relevance for classification. Note that, in contrast with [97], where selection of data instances is performed for better classification, our approach aims at achieving sparsity in terms of features. In what follows, without loss of generality, we base our analysis on the Gaussian kernel defined as:

$$K(\mathbf{x}_r, \mathbf{x}_s) = \exp\left(-\|\mathbf{x}_r - \mathbf{x}_s\|^2 / (2\sigma^2)\right), \qquad (7)$$

with $r, s \in \{1,\ldots,n\}$, where $\sigma > 0$ controls the width of the kernel. In what follows, we give the formulation of fr-MKLR for binary classification and then generalize it to multi-class data.

4.1. Feature relevance in case (m = 2)

We use a weighting vector $\Psi = [\psi_1,\ldots,\psi_d]^T$ having the same dimension as our feature space and we plug it into the RBF (7) as follows:

$$\tilde{K}(\mathbf{x}_r, \mathbf{x}_s) = \exp\left(-\frac{1}{2}(\mathbf{x}_r - \mathbf{x}_s)^T \mathrm{diag}(\Psi)^2 (\mathbf{x}_r - \mathbf{x}_s)\right), \qquad (8)$$

where $\mathrm{diag}(\Psi)$ designates a diagonal matrix with diagonal entries containing the elements of $\Psi$. Note that in case all entries of $\Psi$ are equal, $\tilde{K}$ is isotropic and the model boils down to the standard MKLR. When the entries of $\Psi$ are different, $\tilde{K}$ is anisotropic and the contribution of each individual feature is weighted differently. As a feature weight decreases to zero, its contribution to the distance calculation in (8), and therefore to classification, becomes less important.

To encourage model sparsity, we propose to add a regularization on the weight vector $\Psi$ to the negative log-likelihood function (3) using the $\ell_0$-"norm" [88]. The $\ell_0$-"norm" of $\Psi$, defined as $\|\Psi\|_0 = \mathrm{card}\{k \,|\, \psi_k \neq 0,\ k = 1,\ldots,d\}$, gives the number of non-zero entries of the vector $\Psi$. Note that, unlike $\ell_q$ norms with $q > 0$, $\|\cdot\|_0$ is not a norm because the triangle inequality does not hold. Since the $\ell_0$-"norm" is not smooth, it is usually approximated using the following function [69] (see Fig. 1 for illustration):

$$\|\Psi\|_0 \approx \sum_{k=1}^{d} \left[1 - \exp(-\beta \psi_k)\right], \qquad (9)$$

where $\beta$ is an approximation parameter. It has been shown in [12,88] that for sufficiently high values of $\beta$, classifiers can lead to better generalization while maintaining a good model sparsity.
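As an illustration only (Python/NumPy, with our own helper names; this is not the authors' code), the weighted kernel of Eq. (8) and the $\ell_0$ surrogate of Eq. (9) can be computed as follows for a data matrix X of shape (n, d) and a non-negative weight vector psi of length d.

```python
import numpy as np

def weighted_gaussian_kernel(X, psi):
    """Anisotropic RBF of Eq. (8): exp(-0.5 * sum_k psi_k^2 (x_{r,k} - x_{s,k})^2)."""
    Xw = X * psi                                         # scale feature k by psi_k (psi_k >= 0)
    sq = np.sum(Xw ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Xw @ Xw.T)   # pairwise squared weighted distances
    return np.exp(-0.5 * np.clip(d2, 0.0, None))         # clip guards small negative round-off

def l0_surrogate(psi, beta):
    """Smooth approximation of ||psi||_0 from Eq. (9)."""
    return float(np.sum(1.0 - np.exp(-beta * psi)))
```

Setting all entries of psi to a common value $1/\sigma$ recovers the isotropic kernel of Eq. (7), which is the standard MKLR case mentioned above.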

Our aim in using this penalty term is to decrease the weights, and thus the contribution, of noisy features to classification. This can be achieved by substituting the kernel $\tilde{K}$ for $K$ in function (3) and minimizing the following penalized NLL:

$$L(\mathbf{a}, \Psi) = -\mathbf{y}^T \tilde{\mathbf{K}}\mathbf{a} + \mathbf{1}^T \ln\left[\mathbf{1} + \exp(\tilde{\mathbf{K}}\mathbf{a})\right] + \frac{\lambda}{2}\, \mathbf{a}^T \tilde{\mathbf{K}}\mathbf{a} + \mu \sum_{k=1}^{d} \left[1 - \exp(-\beta \psi_k)\right], \qquad (10)$$

where $\tilde{\mathbf{K}}$ is a kernel matrix of dimension $n \times n$ with entries $\tilde{K}(\mathbf{x}_r, \mathbf{x}_s)$ given by Eq. (8) and $\mu$ is a regularization parameter. This minimization can be performed through an iterative process that alternates between two steps until convergence. In the first step, we estimate the entries of the vector $\mathbf{a}$ (for binary classification, only one vector $\mathbf{a}$ is estimated). In the second step, for a given solution $\mathbf{a}$, we minimize function (10) with respect to the weighting vector $\Psi$.

We use the Newton–Raphson (N–R) method to estimate the entries of $\mathbf{a}$ and $\Psi$. Using matrix differentiation rules [37], the first-order differential of (10) with respect to $\mathbf{a}$ and $\Psi$ is given as follows (see Appendix B):

$$\mathbf{g} = \partial L / \partial \mathbf{a} = \tilde{\mathbf{K}}\mathbf{c}, \qquad \partial L / \partial \Psi = [\mathbf{c}^T \mathbf{Q}_1 \mathbf{a}, \ldots, \mathbf{c}^T \mathbf{Q}_d \mathbf{a}]^T + \mu\beta \exp(-\beta \Psi), \qquad (11)$$

where $\mathbf{c} = \left(-\mathbf{y} + \mathbf{p} + \frac{\lambda}{2}\mathbf{a}\right)$ and $\mathbf{p} = [p_1,\ldots,p_n]^T$, with $p_i = p(y_i = 1\,|\,\mathbf{x}_i)$. We define the matrix $\mathbf{Q}_k$, $k \in \{1,\ldots,d\}$, as the Hadamard product $\mathbf{Q}_k = \tilde{\mathbf{K}} \circ \mathbf{B}_k$, where $\mathbf{B}_k$ is an $n \times n$ matrix with entries defined by $B_k(r,s) = -\psi_k (x_{r,k} - x_{s,k})^2$, $r, s \in \{1,\ldots,n\}$. The final gradient vector of (10) is obtained by concatenating the terms of (11), which gives $\tilde{\mathbf{g}} = [\mathbf{g}^T, (\partial L/\partial \Psi)^T]^T$.

To calculate the Hessian of function (10), note that the Hessian of (10) with respect to the entries of $\mathbf{a}$ is $\tilde{\mathbf{K}}\mathbf{W}\tilde{\mathbf{K}} + \lambda\tilde{\mathbf{K}}$, with $\mathbf{W} = \mathrm{diag}[p_1(1-p_1), p_2(1-p_2), \ldots, p_n(1-p_n)]$. We also define the second derivatives of the NLL with respect to the parameters $\mathbf{a}$ and $\Psi$ as $\mathbf{T} = \frac{\partial^2 L(\mathbf{a},\Psi)}{\partial \Psi \partial \Psi^T}$ and $\mathbf{M} = \frac{\partial^2 L(\mathbf{a},\Psi)}{\partial \Psi \partial \mathbf{a}^T}$ (see Appendix C for the details of the calculation of $\mathbf{T}$ and $\mathbf{M}$). Therefore, the full Hessian matrix of the NLL is given by:

$$\tilde{\mathbf{H}} = \begin{pmatrix} \tilde{\mathbf{K}}\mathbf{W}\tilde{\mathbf{K}} + \lambda\tilde{\mathbf{K}} & \mathbf{M} \\ \mathbf{M}^T & \mathbf{T} \end{pmatrix}. \qquad (12)$$

Finally, the N–R update is done using the following iterative scheme:

$$\begin{pmatrix} \mathbf{a}^{(t+1)} \\ \Psi^{(t+1)} \end{pmatrix} = \begin{pmatrix} \mathbf{a}^{(t)} \\ \Psi^{(t)} \end{pmatrix} - \tilde{\mathbf{H}}^{-1}\tilde{\mathbf{g}}. \qquad (13)$$

4.2. Feature relevance in case (m > 2)

We generalize (10) to the multi-class case by associating a feature relevance vector $\Psi^{(j)} = [\psi_1^{(j)}, \psi_2^{(j)}, \ldots, \psi_d^{(j)}]^T$ with each class $j \in \{1,\ldots,m-1\}$. Thus, we associate a separate symmetric kernel $\tilde{\mathbf{K}}^{(j)}$ with each class $j$, encoding the class feature relevance. The kernel entries for a class $j$ are calculated as follows:

$$\tilde{K}^{(j)}(\mathbf{x}_r, \mathbf{x}_s) = \exp\left(-\frac{1}{2}(\mathbf{x}_r - \mathbf{x}_s)^T \mathrm{diag}(\Psi^{(j)})^2 (\mathbf{x}_r - \mathbf{x}_s)\right). \qquad (14)$$

The new posterior probabilities of the classes given an observation $\mathbf{x}_i$ are similar to those given in Eq. (5), obtained by substituting the kernel $\tilde{\mathbf{K}}^{(j)}$ for $\mathbf{K}$ for each class $j$. Using the $\ell_0$-"norm" penalization, the new NLL is given as follows:

$$L(A, \Psi) = -\sum_{j=1}^{m-1} \mathbf{y}^{(j)T} \tilde{\mathbf{K}}^{(j)}\mathbf{a}^{(j)} + \mathbf{1}^T \ln\left[\mathbf{1} + \sum_{h=1}^{m-1} \exp(\tilde{\mathbf{K}}^{(h)}\mathbf{a}^{(h)})\right] + \sum_{j=1}^{m-1} \left[\frac{\lambda}{2}\, \mathbf{a}^{(j)T} \tilde{\mathbf{K}}^{(j)}\mathbf{a}^{(j)} + \mu \sum_{k=1}^{d} \left(1 - \exp(-\beta \psi_k^{(j)})\right)\right], \qquad (15)$$

where $A = \{\mathbf{a}^{(1)},\ldots,\mathbf{a}^{(m-1)}\}$ and $\Psi = \{\Psi^{(1)},\ldots,\Psi^{(m-1)}\}$. Similarly to Eq. (11), we have, $\forall j \in \{1,\ldots,m-1\}$ and $\forall k \in \{1,\ldots,d\}$:

$$\mathbf{g}^{(j)} = \partial L / \partial \mathbf{a}^{(j)} = \tilde{\mathbf{K}}^{(j)}\mathbf{c}^{(j)}, \qquad \partial L / \partial \Psi^{(j)} = [\mathbf{c}^{(j)T} \mathbf{Q}_1^{(j)} \mathbf{a}^{(j)}, \ldots, \mathbf{c}^{(j)T} \mathbf{Q}_d^{(j)} \mathbf{a}^{(j)}]^T + \mu\beta \exp(-\beta \Psi^{(j)}), \qquad (16)$$

where we define $\mathbf{c}^{(j)} = \left(-\mathbf{y}^{(j)} + \mathbf{p}^{(j)} + \frac{\lambda}{2}\mathbf{a}^{(j)}\right)$ and $\mathbf{Q}_k^{(j)} = \tilde{\mathbf{K}}^{(j)} \circ \mathbf{B}_k^{(j)}$, with $\mathbf{B}_k^{(j)}$ an $n \times n$ matrix having entries defined by $B_k^{(j)}(r,s) = -\psi_k^{(j)}(x_{r,k} - x_{s,k})^2$.


Fig. 1. Iso-contour plots of: (first row) the $\ell_q$-norm with different values of $q$; (second row) the $\ell_0$-norm approximation of Eq. (9) with different values of $\beta$.

Algorithm 1: Parameter estimation for the fr-MKLR method.

Inputs: Data set $D = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$.
Output: Parameter vectors $(\mathbf{a}^{(j)}, \Psi^{(j)})$, $j \in \{1,\ldots,m-1\}$.

$\Psi^{(j)} \leftarrow \Psi^{(j)}_{(0)}$; $\mathbf{a}^{(j)} \leftarrow \mathbf{a}^{(j)}_{(0)}$; $t \leftarrow 1$;
repeat
  $E \leftarrow 0$;
  for $j = 1 \to m-1$ do
    Compute the gradient terms $\partial L/\partial \mathbf{a}^{(j)}$ and $\partial L/\partial \Psi^{(j)}$ using Eq. (16);
    Compute the Hessian using Eq. (19);
    Update the entries of $\mathbf{a}^{(j)}_{(t)}$ and $\Psi^{(j)}_{(t)}$ using Eq. (20);
    $E \leftarrow E + \|\mathbf{a}^{(j)}_{(t+1)} - \mathbf{a}^{(j)}_{(t)}\| + \|\Psi^{(j)}_{(t+1)} - \Psi^{(j)}_{(t)}\|$;
  end for
  $t \leftarrow t + 1$;
until ($E < \varepsilon$ OR $t > \mathrm{MAXITER}$)

It follows that the gradient of $L$ with respect to the vectors $\mathbf{a}^{(j)}$ and $\Psi^{(j)}$ is given by $\tilde{\mathbf{g}} = [\mathbf{g}^{(1)T}, \ldots, \mathbf{g}^{(m-1)T}, (\partial L/\partial \Psi^{(1)})^T, \ldots, (\partial L/\partial \Psi^{(m-1)})^T]^T$.

To calculate the Hessian of function (15), note that the Hessian with respect to the elements of $A$ is given by the matrix $\tilde{\mathbf{K}}^*\mathbf{W}^*\tilde{\mathbf{K}}^* + \lambda\tilde{\mathbf{K}}^*$, where we define $\tilde{\mathbf{K}}^* = \mathrm{diag}[\tilde{\mathbf{K}}^{(1)}, \ldots, \tilde{\mathbf{K}}^{(m-1)}]$. The operator $\mathrm{diag}[\cdot]$ builds a matrix with diagonal blocks made of the elements of its arguments. We also define the matrix $\mathbf{W}^*$ as follows:

$$\mathbf{W}^* = \begin{pmatrix} \mathbf{W}_{1,1} & \mathbf{W}_{1,2} & \cdots & \mathbf{W}_{1,m-1} \\ \mathbf{W}_{2,1} & \mathbf{W}_{2,2} & \cdots & \mathbf{W}_{2,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{m-1,1} & \mathbf{W}_{m-1,2} & \cdots & \mathbf{W}_{m-1,m-1} \end{pmatrix}, \qquad (17)$$

with:

$$\mathbf{W}_{j,\ell} = \begin{cases} \mathrm{diag}\left[p_1^{(j)}(1 - p_1^{(j)}), \ldots, p_n^{(j)}(1 - p_n^{(j)})\right] & \text{if } j = \ell, \\ \mathrm{diag}\left[-p_1^{(j)} p_1^{(\ell)}, \ldots, -p_n^{(j)} p_n^{(\ell)}\right] & \text{if } j \neq \ell. \end{cases} \qquad (18)$$

Similarly to the case of binary classification, we also need to calculate the matrices $\mathbf{T}^{(j)}$ and $\mathbf{M}^{(j)}$ for each class $j$, $j \in \{1,\ldots,m-1\}$, with elements defined as follows: $\mathbf{T}^{(j)} = \frac{\partial^2 L}{\partial \Psi^{(j)} \partial \Psi^{(j)T}}$ and $\mathbf{M}^{(j)} = \frac{\partial^2 L}{\partial \mathbf{a}^{(j)} \partial \Psi^{(j)T}}$. The full Hessian matrix with respect to all the parameters is given as follows:

$$\tilde{\mathbf{H}} = \begin{pmatrix} \tilde{\mathbf{K}}^*\mathbf{W}^*\tilde{\mathbf{K}}^* + \lambda\tilde{\mathbf{K}}^* & \mathbf{M}^* \\ \mathbf{M}^{*T} & \mathbf{T}^* \end{pmatrix}, \qquad (19)$$

where $\mathbf{M}^* = \mathrm{diag}[\mathbf{M}^{(1)}, \ldots, \mathbf{M}^{(m-1)}]$ and $\mathbf{T}^* = \mathrm{diag}[\mathbf{T}^{(1)}, \ldots, \mathbf{T}^{(m-1)}]$. Finally, the N–R update consists of the following iterative formula:

$$\begin{pmatrix} \tilde{\mathbf{a}}^{(t+1)} \\ \tilde{\Psi}^{(t+1)} \end{pmatrix} = \begin{pmatrix} \tilde{\mathbf{a}}^{(t)} \\ \tilde{\Psi}^{(t)} \end{pmatrix} - \tilde{\mathbf{H}}^{-1}\tilde{\mathbf{g}}, \qquad (20)$$

where $\tilde{\mathbf{a}} = [\mathbf{a}^{(1)T}, \mathbf{a}^{(2)T}, \ldots, \mathbf{a}^{(m-1)T}]^T$ and $\tilde{\Psi} = [\Psi^{(1)T}, \Psi^{(2)T}, \ldots, \Psi^{(m-1)T}]^T$. Algorithm 1 shows the steps for estimating the parameters of our model. The algorithm ends when the estimation reaches a certain precision $\varepsilon$ or a maximum number of iterations MAXITER.
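As an illustration of this estimation scheme, the sketch below (Python/NumPy; the function name and the step-size strategy are ours) fits the binary case of Section 4.1. It uses the gradients of Eq. (11) but replaces the exact Newton–Raphson step of Eq. (13) by simple damped gradient steps, since the Hessian blocks T and M are derived in the appendices and not reproduced here; it should be read as a sketch, not as the authors' implementation.

```python
import numpy as np

def fr_mklr_binary(X, y, lam=1.0, mu=1.0, beta=5.0, lr=0.05, iters=200):
    """Simplified binary fr-MKLR fit: damped gradient steps on Eq. (10) w.r.t. a and psi."""
    n, d = X.shape
    a, psi = np.zeros(n), np.ones(d)
    for _ in range(iters):
        # Weighted kernel of Eq. (8) for the current psi.
        Xw = X * psi
        sq = np.sum(Xw ** 2, axis=1)
        Kt = np.exp(-0.5 * np.clip(sq[:, None] + sq[None, :] - 2.0 * Xw @ Xw.T, 0.0, None))
        p = 1.0 / (1.0 + np.exp(-(Kt @ a)))              # p(y_i = 1 | x_i)
        c = -y + p + 0.5 * lam * a                       # vector c of Eq. (11)
        grad_a = Kt @ c                                  # g = K~ c
        grad_psi = np.empty(d)
        for k in range(d):                               # entries c^T Q_k a, with Q_k = K~ o B_k
            Bk = -psi[k] * (X[:, None, k] - X[None, :, k]) ** 2
            grad_psi[k] = c @ ((Kt * Bk) @ a)
        grad_psi += mu * beta * np.exp(-beta * psi)      # derivative of the l0 surrogate term
        a -= lr * grad_a
        psi = np.clip(psi - lr * grad_psi, 0.0, None)    # keep weights non-negative (our choice)
    return a, psi
```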

5. Experiments

We have evaluated our method on simulated and standard datasets as well as for human action recognition in videos. In each experiment, we used cross-validation (CV) to determine the values of the hyper-parameters $\lambda$, $\beta$ and $\mu$ in function (15) and to measure the performance of our method. More specifically, we randomly generated five groups for learning and five groups for validation (or testing) for each dataset. Classification accuracy (CA) is measured by averaging its values over the testing groups.

Let $N_l$ and $N_t$ be the sizes of the learning and testing data in a set containing $N$ data points ($N = N_l + N_t$). The CA is calculated as $1 - \left[\frac{1}{5}\sum_{i=1}^{5} \frac{n_l^{(i)}}{N_l}\right]$ for training and $1 - \left[\frac{1}{5}\sum_{i=1}^{5} \frac{n_t^{(i)}}{N_t}\right]$ for testing, where $n_l^{(i)}$ and $n_t^{(i)}$ are the numbers of misclassified points in the learning and testing sets generated in the $i$-th validation split, respectively. The results obtained using our method fr-MKLR are compared with the MKLR, SMLR, KSVM, LASSO and naive Bayes classifiers.


Table 1
GMM parameters used for generating the datasets of Tests I–VII.

Test I (N_l = 160, N_t = 2000):
  j = 1, L_1 = 1: μ_{1,1} = [1, 2]^T, Σ_{1,1} = diag([0.08, 0.1]).
  j = 2, L_2 = 1: μ_{2,1} = [1.75 + τδ, 2]^T, Σ_{2,1} = diag([0.2, 0.08]).

Test II (N_l = 20–200, N_t = 2000):
  j = 1, L_1 = 4: μ_{1,1} = [1.2, 4]^T, μ_{1,2} = [5.2, 4]^T, μ_{1,3} = [5.2, 1.5]^T, μ_{1,4} = [0.8, 1.5]^T;
                  Σ_{1,1} = diag([0.15, 0.23]), Σ_{1,2} = Σ_{1,3} = Σ_{1,4} = diag([0.28, 0.28]);
                  π_{1,1} = π_{1,2} = π_{1,3} = π_{1,4} = 0.25.
  j = 2, L_2 = 2: μ_{2,1} = [3, 2]^T, μ_{2,2} = [3, 4.2]^T; π_{2,1} = 0.90, π_{2,2} = 0.10;
                  Σ_{2,1} = diag([0.18, 0.33]), Σ_{2,2} = diag([0.11, 0.15]).

Test III (N_l = 80, N_t = 1200):
  j = 1, L_1 = 2: μ_{1,1} = [1, 5.5]^T, μ_{1,2} = [3, 5.5]^T; π_{1,1} = π_{1,2} = 0.5;
                  Σ_{1,1} = [[1, 0.75], [0.75, 1]], Σ_{1,2} = [[1, −0.75], [−0.75, 1]].
  j = 2, L_2 = 2: μ_{2,1} = [1, 8]^T, μ_{2,2} = [3, 8]^T; π_{2,1} = π_{2,2} = 0.5; Σ_{2,1} = Σ_{1,1}, Σ_{2,2} = Σ_{1,2}.

Test IV (N_l = 75, N_t = 3000):
  j = 1, L_1 = 1: μ_{1,1} = [1, 2]^T, Σ_{1,1} = diag([0.5, 0.5]).
  j = 2, L_2 = 1: μ_{2,1} = [1, 4]^T, Σ_{2,1} = diag([0.5, 0.5]).
  j = 3, L_3 = 1: μ_{3,1} = [4, 1]^T, Σ_{3,1} = diag([0.5, 0.5]).

Test V (N_l = 240, N_t = 4800):
  j = 1, L_1 = 1: μ_{1,1} = [3.5, 0.25]^T, Σ_{1,1} = diag([0.98, 0.10]).
  j = 2, L_2 = 3: μ_{2,1} = [1.2, 2]^T, μ_{2,2} = [2.5, 2]^T, μ_{2,3} = [5, 2]^T;
                  Σ_{2,1} = diag([0.2, 0.18]), Σ_{2,2} = Σ_{2,3} = diag([0.18, 0.10]).
  j = 3, L_3 = 1: μ_{3,1} = [4, 4.15]^T, Σ_{3,1} = diag([0.58, 0.20]).
  j = 4, L_4 = 1: μ_{4,1} = [1.5, 4.15]^T, Σ_{4,1} = Σ_{3,1}.

Test VI (N_l = 50–500, N_t = 1250):
  j = 1, L_1 = 1: μ_{1,1} = [6, 1, 1, 1]^T, Σ_{1,1} = diag([0.5, 0.5, 0.5, 0.5]).
  j = 2, L_2 = 1: μ_{2,1} = [6, 1.5, 3.5, 1]^T, Σ_{2,1} = Σ_{1,1}.
  j = 3, L_3 = 1: μ_{3,1} = [6, 1.5, 1, 3.5]^T, Σ_{3,1} = Σ_{1,1}.
  j = 4, L_4 = 2: μ_{4,1} = [1, 1, 1, 1]^T, μ_{4,2} = [1, 3.25, 1, 1]^T; Σ_{4,1} = Σ_{4,2} = Σ_{1,1}; π_{4,1} = π_{4,2} = 0.5.

Test VII (N_l = 100, N_t = 100):
  j = 1, L_1 = 1: μ_{1,1} = (0, 4, −5, 10, 0, ..., 0)^T; σ_{1,1} = (1, 1, 1, 2, 3, 1, 1, 5, 5, ..., 5)^T
                  (dots mean the same value as the previous dimension).
  j = 2, L_2 = 1: μ_{2,1} = (5, 11, −2, 20, 25, 10, 7, 3, ..., 3)^T; σ_{2,1} = σ_{1,1}
                  (dots mean the same value as the previous dimension).

5.1. Data description and presentation

1) Simulated data: We have used finite Gaussian mixture models (GMMs) to generate data for seven tests (Tests I–VII), with some features purposefully set to have no discrimination between classes. For each test, each class distribution has the following general form (parameter values are given in Table 1):

$$p(\mathbf{x}\,|\,y^{(j)} = 1) = \sum_{k=1}^{L_j} \pi_{j,k}\, p(\mathbf{x}\,|\,\mu_{j,k}, \Sigma_{j,k}), \quad j \in \{1,\ldots,m\}, \qquad (21)$$

where $L_j$ is the number of components of the GMM generating class $j$ of the data, and $\pi_{j,k}$, $\mu_{j,k}$ and $\Sigma_{j,k}$ are the a priori probability, the mean vector and the covariance matrix of the $k$-th component of the mixture, $k \in \{1,\ldots,L_j\}$ (a short sampling sketch is given after this list). Tests I–III are conducted for binary classification and show the generalization capability of our method under different scenarios of class data:

• Test I (overlapping classes): the data contain two overlapping classes where only one dimension is relevant for classification. Each class has been generated using a bivariate Gaussian. We vary the amount of class overlap by shifting the mean of the first class in one dimension by increments $\tau \in \{1,\ldots,10\}$ of a step $\delta = 0.25$.

• Test II (scarce learning data): shows the generalization capability of our algorithm when learning data are scarce. Each class has been generated using a mixture of bivariate Gaussians. We vary the number of learning data per Gaussian $N_g$ from 10 to 100.

• Test III (multi-modal classes): classes are multimodal and separated by non-linear boundaries. Each class has been generated using a mixture of bivariate Gaussians.

We then conducted three other tests (Tests IV–VI) using multi-class data (the parameters of the tests are given in Table 1). Tests IV and V use data with $m = 3$ and $m = 4$ classes, respectively. The data of each class have been generated using mixtures of bivariate Gaussians. Test VI uses data with $m = 4$ and $d = 4$, where $d$ is the dimensionality of the data. A projection of the data on two dimensions is shown in Fig. 2.

Finally, to demonstrate the effectiveness of using the $\ell_0$ norm, we conducted an illustrative test of binary classification (Test VII) using a simulated dataset of 25 features and $N = 200$ data points, where only 6 features have a clear discrimination between the two classes (see Fig. 3).

2) UCI real-world data: The datasets used in these experiments are taken from the UCI machine learning repository [57]. Tested datasets include Banana (BN), Breast cancer (BC), Diabetes (DB), Heart (HR), Flare-solar (FS), German (GR), Ringnorm (RG), Thyroid (TH) and Twonorm (TW) for binary classification, and EMG physical action (EMG), Wine quality (WQ), Ecoli (EC) and Image segmentation (SEG) for multi-class classification. The description of the UCI datasets is given in Table 2. Note that, unlike our synthetic data, the class data of the UCI datasets are not necessarily Gaussian.
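The sampling sketch referred to in item 1) above (Python/NumPy; the helper name is ours, and the example values reproduce the Test I row of Table 1 with τ = 0):

```python
import numpy as np

def sample_gmm_class(n, weights, means, covs, rng):
    """Draw n points from the class-conditional GMM of Eq. (21)."""
    comps = rng.choice(len(weights), size=n, p=weights)        # component k with prob. pi_{j,k}
    return np.array([rng.multivariate_normal(means[k], covs[k]) for k in comps])

rng = np.random.default_rng(0)
# Test I (tau = 0): two overlapping bivariate Gaussians, one per class, 80 points each.
class1 = sample_gmm_class(80, [1.0], [np.array([1.0, 2.0])], [np.diag([0.08, 0.10])], rng)
class2 = sample_gmm_class(80, [1.0], [np.array([1.75, 2.0])], [np.diag([0.20, 0.08])], rng)
```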

5.2. Numerical results and comparisons

1) Case of binary classification ($m = 2$): In Figs. 4 and 5, the first to last rows show the class boundaries obtained in Tests I–III using the MKLR, fr-MKLR and KSVM methods, respectively. We can see that fr-MKLR has clearly succeeded for both tests in selecting the best separating feature, which led to better generalization than the other methods. Fig. 6 shows the CA values obtained for Tests I and II (by varying the class overlap for Test I and the number of training data for Test II). Table 3 (first row) shows the results for Test III. We can observe that fr-MKLR outperformed the MKLR and KSVM methods in the three tests.

2) Case of multi-class classification ($m > 2$): The resulting classification boundaries obtained using MKLR, fr-MKLR and KSVM are shown in Figs. 7 and 8 for Tests IV and V, and the corresponding CA values are given in Table 3. Clearly, fr-MKLR has succeeded in selecting the best separating features for each class, which led to better generalization than the other methods.


Fig. 2. 2D projection of data of Test VI.


Fig. 3. Histograms of the 25 features used in Test VII: red (continuous) and blue (dashed) lines show feature distributions in classes j = 1 and j = 2 , respectively. (For

interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).

Table 2
Description of the UCI datasets used in our experiments.

Datasets               BN     BC    DB    HR    FS    GR     RG     TH    TW     EMG    WQ    EC    SEG
# of instances (N)     5300   699   768   270   144   1000   7400   250   7400   10⁴    178   336   2310
# of attributes (d)    2      10    8     13    9     24     20     5     20     8      13    8     19
# of classes (m)       2      2     2     2     2     2      2      2     2      20     3     8     7

Table 3
Average classification accuracy obtained by the compared methods (standard deviation in brackets).

           MKLR          fr-MKLR       – (ℓ1)        – (ℓ2)        SMLR          KSVM          Naive Bayes   LASSO
Test III   95.10 (3.2)   95.60 (1.4)   93.75 (1.2)   93.50 (1.5)   94.16         93.33 (1.2)   89.19 (2.1)   83.33 (8)
Test IV    94.89 (2.7)   97.26 (2.1)   93.40 (2.3)   92.87 (6.5)   94.21 (2.6)   91.19 (2.6)   94.30 (3.2)   66.56 (0.1)
Test V     91.73 (3.1)   94.33 (2)     91.21 (2.7)   89.77 (4.2)   85.93 (3.9)   94 (1.4)      96.39 (1.2)   77.17 (0.6)

For Test VI, the CA values obtained using learning and testing data are given in Fig. 6 by varying $N_l$ from 50 to 500. We can note that, when using learning data for validation, MKLR and KSVM had better performance than the other methods. When using testing data for validation, fr-MKLR and naive Bayes lead to better performance than the other methods. Moreover, for $N_l \geq 20$, fr-MKLR performed better than naive Bayes. This shows that, in the presence of high-dimensional and scarce learning data, fr-MKLR tends to provide better generalization. Finally, the results obtained for Test VI are shown in Fig. 6 and Table 3, respectively, where, again, fr-MKLR yielded better performance than the compared methods.

To show the effectiveness of using the $\ell_0$ norm, we created two other versions of fr-MKLR by replacing the sparsity-inducing $\ell_0$ norm with the $\ell_1$ and $\ell_2$ norms, respectively. Comparison results between the fr-MKLR versions using the different norms are shown for Tests I, II and VI in Fig. 6 and in Table 3 for Tests III–V. For all tests, using the $\ell_0$ norm yielded better results than the other norms.


Fig. 4. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test I data (first row), Test II data (second row) and Test III data (third row),

respectively. For each method, a 2D scatter is shown using learning data.

Table 4

Comparison of LOGREG-LASSO, fr-MKLR and SMLR on UCI datasets (standard deviation in brackets).

Dataset LOGREG-LASSO [72] (%) fr-MKLR (%) SMLR [51] (%)

Banana 89.3 (0.5) 91.52 (0.3) 81.66 (5.6)

Breast cancer 73.9 (4.6) 91.02 (3.7) 91 (1.2)

Diabetes 76.5 (1.9) 78.51 (1.3) 76 (3.6)

Heart 84 (3.1) 86.07 (2.4) 81.1 (8.2)

Flare-solar 66.73 (1.6) 70.15 (1) 55.43 (8)

German 76.4 (2.3) 85 (1.6) 74.5 (5.4)

Ringnorm 98.2 (0.3) 98.9 (0.2) 96.3 (3.4)

Thyroid 95.2 (2.3) 94.23 (1.4) 88.89 (6.5)

Twonorm 97.4 (0.2) 98.6 (0.2) 94.2 (3)

Finally, Fig. 9 presents a graph showing the number of retained relevant features (i.e., those with weight $\psi_k > 0$) as a function of the number of iterations in Test VII. Clearly, the $\ell_0$ norm has allowed isolating the exact number of relevant features, and with a smaller number of iterations than $\ell_1$ and $\ell_2$. All these tests demonstrate the effectiveness of using the $\ell_0$ norm for obtaining good sparse models for classification.

3) Results for the UCI datasets: In Tables 4 and 5, we present the comparative results obtained for the different UCI datasets. We first compared our method with the LOGREG-LASSO [72] and SMLR [51] methods on several datasets. The obtained results are shown in Table 4, where we used the classification accuracy averaged over 100 training/test splits, as suggested in the compared paper. We note that the results of [72] are taken directly as reported by the authors of that paper. We can see that, except for the Thyroid dataset, fr-MKLR gives noticeably better results than using LOGREG-LASSO and SMLR.


Fig. 5. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test I data (first row), Test II data (second row) and Test III data (third row),

respectively. For each method, a 2D scatter is shown using testing data.

Table 5

Average classification accuracy obtained by the compared methods on the UCI datasets (standard deviation in brackets).

                     EMG action (m > 2)   EMG action (m = 2)   Wine quality   Ecoli   Image segmentation

MKLR 68.65(2.1) 65.74 (2.2) 94.44 (1.6) 82.94 (2.6) 84.35 (2.2)

fr-MKLR( μ = 0 . 05 ) 68.38(2.5) 65.36 (2.2) 94.44(1.6) 83 (2.9) 84.35 (2.3)

– ( μ = 2 . 5 ) 72.52(0.8) 96.23 (1.6) 96.83 (0.9) 84.71 (2.5) 86.39 (1.6)

– ( μ = 4 . 5 ) 68.20(2.5) 91.35 (2.1) 93.52 (1.8) 82.9 (3.3) 83.67(2.8)

– ( � 1 ) 63.50 (8.8) 95.62 (1.4) 94.44 (2.4) 71.55 (3.4) 85.71 (1.9)

– ( � 2 ) 62.77 (7.5) 94.53 (5.8) 92.86 (2) 70.71 (20) 83.67 (3.8)

SMLR 68.45(0.9) 90.03 (0.9) 96.56 (4) 83.52 (8.6) 83.54 (8.4)

KSVM 68.98 (2.5) 95.50 (0.7) 94.44 (1.8) 80.33 (2.5) 72.79 (3.8)

Naive Bayes 85.90 (3.3) 93.67 (1.2) 93.65 (2.4) 66.10 (7.6) 78.91 (3.7)

LASSO 62.64 (1.6) – 62.17 (8.6) 54.81 (5.9) 64.24 (3.5)


Fig. 6. Classification accuracy (CA) obtained in Tests I, II and VI, respectively. For each test, the left figure shows the comparison of fr-MKLR with MKLR, NB, LASSO, SMLR and KSVM, and the right figure shows the comparison of fr-MKLR implementations using the $\ell_0$, $\ell_1$ and $\ell_2$ norms, respectively.

The prediction differences can be explained by the nature of the datasets and the structure of the classification models. In addition to promoting feature sparsity, fr-MKLR has the flexibility to fit non-linear class boundaries, which characterize the majority of the datasets used here. The results of LOGREG-LASSO and SMLR can be explained by the linear structure of LOGREG-LASSO and by the use of a component-wise procedure to select features in the case of SMLR, which is less effective than fr-MKLR on these datasets.

In Table 5, the comparison of our method with MKLR, SMLR, KSVM, NB and LASSO is presented for other UCI datasets. We also show results for different versions of fr-MKLR using different sparsity-inducing norms and different values of the sparsity coefficient $\mu$. For the sparsity coefficient, three values are tested, $\mu \in \{0.05, 2.5, 4.5\}$, among which the value $\mu = 2.5$ has been obtained by cross-validation. Note that when $\mu = 0.05$, almost the same results as MKLR are obtained for all datasets. When $\mu = 4.5$, the performance of fr-MKLR decreases, as shown in the table. Clearly, our model fr-MKLR, with $\mu = 2.5$ obtained by cross-validation, yielded the best results compared to the other methods. These results also demonstrate the ability of our method to perform well when dealing with non-Gaussian data.

Fig. 7. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test IV data (first row) and Test V data (second row) respectively. For each

method, a 2D scatter is shown using learning data.

Finally, to show the significance of the performance improvement of our method with regard to the compared ones, we used the Wilcoxon test [18] with confidence level $\alpha = 0.05$ and different numbers of datasets $N \in \{7, 8, 9, 10, 17\}$ taken from our UCI and synthetic datasets. In what follows, we give the critical values of the statistic $T$ for each number of datasets, as well as the observed values of $T$ when comparing algorithms on these datasets: $T_{N=7} = 2$ ($T_{\mathrm{LASSO}} = 0$), $T_{N=8} = 4$ ($T_{\mathrm{KSVM}} = 0$, $T_{\mathrm{fr\text{-}MKLR\text{-}}\ell_1} = 0$, $T_{\mathrm{fr\text{-}MKLR\text{-}}\ell_2} = 0$ and $T_{\mathrm{MKLR}} = 0$), $T_{N=9} = 6$ ($T_{\mathrm{Logreg}} = 2$), $T_{N=10} = 8$ ($T_{\mathrm{NB}} = 8$) and $T_{N=17} = 35$ ($T_{\mathrm{SMLR}} = 0$). We can note that the observed values of $T$ are always equal to or less than the critical values. Therefore, we can conclude that fr-MKLR brings a significant performance improvement over the compared methods.
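For reference, such a paired comparison can be run with SciPy's Wilcoxon signed-rank test; the sketch below uses placeholder per-dataset accuracy arrays, not the exact values of our tables.

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-dataset accuracies of two classifiers over the same N datasets (placeholder values).
acc_fr_mklr = np.array([0.96, 0.91, 0.79, 0.86, 0.70, 0.85, 0.99])
acc_lasso   = np.array([0.83, 0.67, 0.62, 0.55, 0.64, 0.62, 0.77])

# The test ranks the paired differences; the compared method is judged significantly
# different when the observed statistic T is at or below the critical value for N datasets.
stat, pvalue = wilcoxon(acc_fr_mklr, acc_lasso)
print(f"T = {stat}, p-value = {pvalue:.4f}")
```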

5.3. Application to video action recognition

Several methods in the past have cast human action recognition as a classification problem in high-dimensional spaces [1]. Methods have used classifiers on video descriptors such as KSVM on interest points [53], KNN on histograms of optical flow (HOF) [92] and dense trajectories + SVM [56] for action description. Most of these descriptions are generally very high-dimensional, whereas the relevant information for discrimination may lie only in a handful of dimensions.

Motivated by the performance obtained with fr-MKLR, we propose to use our model for action recognition in videos by exploiting sparsity to identify useful dimensions while operating in a multi-class setting. To capture the local and global information of actions, which is necessary for better action recognition [9], we propose a representation based on the shape context descriptor (SCD) [7,66]. To calculate the SCD, we first extract the video foregrounds using the algorithm proposed in [11]. Then, a shape histogram of 128 entries (i.e., bins) is calculated for each silhouette through the sequence. Contrarily to [7], we use one reference point, which is the gravity center of the silhouette, and each bin is quantized to the value 1 or 0, meaning the presence or absence of the object contour in the bin (see Fig. 10 for illustration). The final descriptor of each action consists of the mean and standard deviation of the SCD computed through the sequence. Finally, to ensure a high-level representation of actions, we first perform dimensionality reduction using restricted Boltzmann machines (RBM) [73]. The RBM uses 64 hidden units and the 128 shape context entries as input data.
e have evaluated our method for action recognition using three

atasets: KTH [53] , UIUC [59] and the I3DPost multi-view database

26] . Each dataset contain single person performing basic action in

controlled environment, examples of actions are given in Fig. 11 .

• KTH contains 6 types of actions (walking, jogging, running, boxing, hand waving, hand clapping) performed several times by 25 subjects in 4 different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors, for a total of 600 videos.

• UIUC consists of 14 actions (walking, running, jumping, waving, jumping jacks, clapping, jumping from sit-up, raising one hand, stretching out, turning, sitting to standing, crawling, pushing up, standing to sitting) performed by 8 subjects, for a total of 532 videos.

• I3DPost contains 768 multi-view action videos captured with 8 cameras and performed by 8 subjects (2 females and 6 males) for 12 actions (bend, hand wave, jump, jump in place, run, walk, run-fall, run-jump-walk, sit-stand-up, walk-sit, handshake, pull). Note that since we aim to classify only single-person actions, we removed the last two actions (handshake and pull) from our tests.

Table 6 gives the average values of CA obtained on the three datasets for the compared methods. These include fr-MKLR, MKLR, KSVM, and LASSO using our action description based on SCD, and


Fig. 8. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test IV data (first row) and Test V data (second row), respectively. For each method, a 2D scatter plot is shown using the testing data.

Fig. 9. Graph showing the number of retained relevant features for Test VII as a function of the number of iterations.

Fig. 10. Distribution of bins for the SCD used for action representation.


the baseline and recent methods for each dataset, namely: interest points + KSVM [53], interest points + pLSA and LDA [63], space-time features + SVM [74], optical flow + KNN [92], local space-time features + part-based model + multi-task learning [60], parameterized representation + discriminative classifiers [95], dense trajectories + SVM [56] and recent results using convolutional neural networks [17,46,68] for KTH; optical flow + correlated topic model (CTM) [85], spatiotemporal volumes + KNN [59], dense trajectories + motion boundary histograms + SVM [87] and depth map + skeleton structure + multi-kernel learning [58] for UIUC. For the I3DPost multi-view dataset, comparison is made with the results reported in [40] using 3D motion context, in [79] using motion features and SVM for action recognition, and in [26] using only 5 of the 12 actions of the dataset.

Results of the experiments are shown in Table 6. We can note that fr-MKLR, combined with the RBM features, outperforms all compared methods on the KTH and UIUC datasets. For KTH, the authors in [56,60,95] improved the video descriptors to boost the performance of action recognition and therefore obtained better performance than fr-MKLR alone. However, by applying the RBM, we obtained the best performance among all methods. The same observation holds for the UIUC dataset. For the I3DPost dataset, [79] obtained a higher performance than our method. This is partly due to the quality of the features used in [79], which take into account the multi-view setting of the dataset. Indeed, 3D motion context (MC) using all-view information


Fig. 11. Examples of actions from the UIUC, i3DPost and KTH datasets.

Table 6
Average classification accuracy obtained by the compared methods for three databases: KTH (6 classes), UIUC (14 classes) and I3DPost (10 classes); (–) means values not reported in the original papers.

Datasets    Methods                               Testing accuracy (%)
KTH         Banerjee et al. [5]                   90
            Charalampous et al. [17]              91.9
            Ji et al. [44]                        90.2
            Kai et al. [46]                       88.7
            Laptev et al. [53]                    91.8
            Li et al. [56]                        97.6
            Liu et al. [60]                       94.3
            Niebles et al. [63]                   81.5
            Ravanbakhsh et al. [68]               94.1
            Schuldt et al. [74]                   71.7
            Yang et al. [92]                      75
            Yuan et al. [95]                      96.3
            LASSO                                 40
            KSVM                                  76.6
            MKLR                                  87.7
            RBM + MKLR                            98.5
            fr-MKLR                               93.3
            RBM + fr-MKLR                         98.6
UIUC        Hong et al. [85]                      93.3
            Lin et al. [58]                       98.7
            Liu et al. [59]                       93.5
            Wang et al. [87]                      97.1
            LASSO                                 61
            KSVM                                  92.2
            MKLR                                  95.6
            RBM + MKLR                            96.2
            fr-MKLR                               98.8
            RBM + fr-MKLR                         99.5
I3DPost     Ghalelis et al. [26] (5 actions)      90
            Holte et al. [40], 3D-MC              80
            Holte et al. [40], 3D-MC-mean         77.5
            Holte et al. [40], HMC                76.2
            Holte et al. [40], HMC-mean           68.7
            Spurlock et al. [79], RT              73.7
            Spurlock et al. [79], MC              96.2
            LASSO                                 67.2
            KSVM                                  72
            MKLR                                  68.8
            RBM + MKLR                            88.6
            fr-MKLR                               77.6
            RBM + fr-MKLR                         90



in [40] and MC with discriminative views in [79] perform better than using the same features for all views. Nonetheless, the application of the RBM has considerably improved our results. Finally, we must emphasize the performance gap between RBM + MKLR and RBM + fr-MKLR. Indeed, the majority of deep learning methods apply softmax functions (e.g., MKLR) on the top layer of the network for classification. Given the performance obtained by fr-MKLR over MKLR, our method constitutes a good alternative to softmax functions in classification problems using DNNs.

5.4. Computational analysis

Since most of the time is taken by model training, we discuss the computational time induced by the training step. Having N_l data points, the distance calculation for each kernel matrix requires N_l(N_l − 1)/2 steps, which can be performed in parallel. The calculation of the gradient and Hessian terms using Eqs. (16) and (19) has a linear computational complexity ∼ O(mN_l). The NLL minimization is performed iteratively using Eq. (20). Therefore, the computational complexity induced by a single iteration is approximately ∼ O([m(d + N_l)]^{2.8}), since it involves a matrix inversion.
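As a rough illustration of where these terms come from, the NumPy sketch below builds a feature-weighted Gaussian kernel matrix, whose computation involves the N_l(N_l − 1)/2 distinct pairwise distances, and counts the parameters updated at each Newton iteration. The kernel parameterization and the problem sizes (N_l = 200, d = 30, m = 6) are illustrative assumptions, not the exact formulation of Section 4.

```python
import numpy as np

def weighted_gaussian_kernel(X, psi):
    """Feature-weighted (anisotropic) Gaussian kernel matrix.

    K[r, s] = exp(-0.5 * sum_k psi[k]**2 * (X[r, k] - X[s, k])**2).
    This parameterization is an assumption chosen to be consistent with the
    derivatives in Appendices B and C; the exact form is defined in the paper.
    """
    diff = X[:, None, :] - X[None, :, :]          # (N_l, N_l, d) pairwise differences
    return np.exp(-0.5 * ((psi ** 2) * diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
N_l, d, m = 200, 30, 6                            # assumed (hypothetical) problem sizes
X = rng.normal(size=(N_l, d))
psi = np.ones(d)                                  # initial feature weights

K_tilde = weighted_gaussian_kernel(X, psi)        # uses the N_l*(N_l-1)/2 distinct distances
n_params = (m - 1) * N_l + d                      # kernel coefficients + feature weights
print(K_tilde.shape, n_params)                    # the Newton update solves a system of
                                                  # roughly this size, hence the
                                                  # ~O([m(d + N_l)]^2.8) cost per iteration
```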

Knowing that we usually deal with scarce learning data, the computational time of the above steps can be significantly reduced using newly-developed computer hardware. For example, matrix inversion can be reduced to nearly linear complexity using the method proposed in [75]. We used the MATLAB platform on a PC with an Intel(R) Core(TM) i7-3920XM CPU at 2.9 GHz to run our experiments and compared the average execution times including both the learning and testing phases. The following values have been obtained: 1.68 s for fr-MKLR, 1.17 s for MKLR, 0.52 s for LASSO, 0.28 s for naive Bayes and 38.4 s for KSVM. We can note that fr-MKLR and MKLR have almost similar execution times even though fr-MKLR has an additional calculation step. LASSO and naive Bayes require less computation time since they use simpler probability calculations. KSVM is the slowest algorithm because of the one-versus-all process used to deal with the multi-class case. This allows us to conclude that our approach does not add much burden to classification in terms of computation time.

6. Conclusions and discussion

We have presented fr-MKLR, a method incorporating feature relevance in multinomial kernel logistic regression. It consists of using anisotropic kernels that embed weights controlling the contribution of each feature to classification. The obtained models are sparse and generalize better than several standard methods such as naive Bayes, MKLR, SMLR, KSVM and LASSO. We have applied our method to binary and multi-class classification using simulated and standard datasets, as well as to video action recognition. Experiments have shown that the proposed approach outperforms the compared methods in cases of non-linear class boundaries, scarce training data and the presence of redundant features. However, as for any kernel-based method, computational time becomes a limitation when the number of training samples is very large. Also, for linearly separable data (e.g., text classification), fr-MKLR and its competitors MKLR and KSVM can be less efficient than their linear counterparts.

Acknowledgment

This work has been completed with the support of the Natural

Sciences and Engineering Research Council of Canada (NSERC).

Appendix A. Derivation of the NLL in Eq. (6)

Using the posterior probabilities in Eq. (5), we have:

L(A) = -\sum_{i=1}^{n} \left( \sum_{j=1}^{m-1} y_i^{(j)} \ln\!\left[ \frac{\exp(f_j(x_i))}{1 + \sum_{h=1}^{m-1} \exp(f_h(x_i))} \right] + \Big( 1 - \sum_{j=1}^{m-1} y_i^{(j)} \Big) \ln\!\left[ \frac{1}{1 + \sum_{h=1}^{m-1} \exp(f_h(x_i))} \right] \right) + \frac{\lambda}{2} \sum_{j=1}^{m-1} a^{(j)T} K a^{(j)}   (22)

= \sum_{i=1}^{n} \left( -\sum_{j=1}^{m-1} y_i^{(j)} f_j(x_i) + \ln\!\left[ 1 + \sum_{h=1}^{m-1} \exp(f_h(x_i)) \right] \right) + \frac{\lambda}{2} \sum_{j=1}^{m-1} a^{(j)T} K a^{(j)}.   (23)

By defining y^{(j)} = [y_1^{(j)}, y_2^{(j)}, \ldots, y_n^{(j)}]^T and using the compact form

\ln\!\left[ 1 + \sum_{h=1}^{m-1} \exp(K a^{(h)}) \right] = \begin{pmatrix} \ln\big[ 1 + \sum_{h=1}^{m-1} \exp(K(1,.)\, a^{(h)}) \big] \\ \ln\big[ 1 + \sum_{h=1}^{m-1} \exp(K(2,.)\, a^{(h)}) \big] \\ \vdots \\ \ln\big[ 1 + \sum_{h=1}^{m-1} \exp(K(n,.)\, a^{(h)}) \big] \end{pmatrix},   (24)

where K(i,.), i \in \{1, \ldots, n\}, denotes the ith row of K, we obtain Eq. (6).

Appendix B. Derivation of the terms of Eq. (11)

Given the following derivation:

\frac{\partial \ln\big[ 1 + \exp(\tilde{K}(i,.)\, a) \big]}{\partial a} = \frac{\exp(\tilde{K}(i,.)\, a)}{1 + \exp(\tilde{K}(i,.)\, a)}\, \tilde{K}(i,.)^T = p_i\, \tilde{K}(.,i),   (25)

we have \partial L / \partial a = (-\tilde{K} y + \tilde{K} p + \lambda \tilde{K} a) = \tilde{K} c. Using Eq. (8) and matrix derivation properties, we have, \forall k \in \{1, \ldots, d\}:

Q_k = \frac{\partial \tilde{K}}{\partial \psi_k} = \left[ \frac{\partial \tilde{K}(x_r, x_s)}{\partial \psi_k} \right]_{r,s=1}^{n} = \tilde{K} \circ B_k.   (26)

Therefore, we have \partial L / \partial \psi_k = c^T Q_k a + \mu \beta \exp(-\beta \psi_k), where B_k is the matrix defined in Section 4.1. By gathering the elements \partial L / \partial \psi_k in one vector, we obtain the second line of Eq. (11).
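As a quick numerical sanity check of the expression \partial L / \partial a = \tilde{K} c, the following Python sketch compares it against a finite-difference approximation of the NLL in the binary case. The regularization constant, the data sizes and the simplified weighted-kernel parameterization used above are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 15, 4, 0.1                       # assumed toy sizes and regularization
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
psi = rng.uniform(0.5, 1.5, size=d)

# Simplified feature-weighted Gaussian kernel (assumed parameterization).
diff = X[:, None, :] - X[None, :, :]
K = np.exp(-0.5 * ((psi ** 2) * diff ** 2).sum(axis=-1))

def nll(a):
    # Binary NLL with kernel-norm regularization.
    f = K @ a
    return np.sum(-y * f + np.log1p(np.exp(f))) + 0.5 * lam * a @ K @ a

a = 0.1 * rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(K @ a)))
grad_analytic = K @ (p - y + lam * a)        # = K_tilde c, with c = -y + p + lambda*a

# Central finite differences of the NLL.
eps = 1e-6
grad_fd = np.array([(nll(a + eps * e) - nll(a - eps * e)) / (2 * eps)
                    for e in np.eye(n)])
print(np.max(np.abs(grad_analytic - grad_fd)))   # should be ~1e-6 or smaller
```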

Appendix C. Derivation of Eq. (12)

First, we have \partial p_i / \partial a = p_i (1 - p_i)\, \tilde{K}(i,.). Then, the Hessian of the NLL with respect to the elements of a is given by:

H = \tilde{K} W \tilde{K} + \lambda \tilde{K},   (27)

where W = \mathrm{diag}[\, p_1(1-p_1),\, p_2(1-p_2),\, \ldots,\, p_n(1-p_n) \,]. Also, note that \forall k, \ell \in \{1, \ldots, d\} and \forall i \in \{1, \ldots, n\}:



\frac{\partial^2 \tilde{K}}{\partial \psi_k\, \partial \psi_\ell} = \left[ \frac{\partial^2 \tilde{K}(x_r, x_s)}{\partial \psi_k\, \partial \psi_\ell} \right]_{r,s=1}^{n},   (28)

\frac{\partial^2 \tilde{K}}{\partial a_i\, \partial \psi_k} = \left[ \frac{\partial^2 \tilde{K}(x_r, x_s)}{\partial a_i\, \partial \psi_k} \right]_{r,s=1}^{n}.   (29)

Therefore, we can build the matrices T and M containing the mixed derivatives as follows:

T_{k\ell} = \frac{\partial^2 L}{\partial \psi_k\, \partial \psi_\ell} = \begin{cases} [\, Q_k W a \,]^T Q_k a + c^T S_k a - \mu \beta^2 \exp(-\beta \psi_k) & \text{if } k = \ell, \\ [\, Q_\ell W a \,]^T Q_k a + c^T (Q_k \circ B_\ell)\, a & \text{if } k \neq \ell, \end{cases}   (30)

M_{ik} = \frac{\partial^2 L}{\partial a_i\, \partial \psi_k} = Q_k(i,.)\, (-y + p + \lambda a) + \tilde{K}(i,.)\, Q_k W a,   (31)

where S_k = \tilde{K} \circ (D_k + B_k \circ B_k) and D_k is an n \times n matrix with D_k(r, s) = -(x_{r,k} - x_{s,k})^2. By putting together the elements of H, M and T, we obtain the Hessian of Eq. (12).
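For illustration, the short sketch below shows one possible assembly of these blocks into the full Hessian over the stacked parameters (a, ψ) in the binary case; the block contents are placeholders and the (a, ψ) ordering is an assumption, since the exact arrangement used in Eq. (12) is defined in the paper.

```python
import numpy as np

# Hypothetical precomputed blocks: H (n x n), M (n x d), T (d x d), as defined
# in Eqs. (27), (30) and (31); filled with placeholders for illustration only.
n, d = 15, 4
H = np.eye(n)            # second derivatives with respect to a
M = np.zeros((n, d))     # mixed derivatives with respect to a and psi
T = np.eye(d)            # second derivatives with respect to psi

# Full Hessian over the stacked parameter vector (a, psi), as used in the
# Newton-type update of the NLL minimization.
hessian = np.block([[H,   M],
                    [M.T, T]])
print(hessian.shape)     # (n + d, n + d)
```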

References

[1] J.K. Aggarwal, M.S. Ryoo, Human activity analysis: a review, ACM Comput. Surv.43 (3) (2011) . https://dl.acm.org/citation.cfm?id=1922653 . Article 16.

[2] M.S. Allili , S. Bacha , Feature Relevance in Bayesian Network Classifiers and Ap-

plication to Image Event Recognition, FLAIRS Conference (2017) 760–763 . [3] M.S. Allili , D. Ziou , Likelihood-based feature relevance for figure-ground seg-

mentation in images and videos, Neurocomputing 167 (2015) 658–670 . [4] F. Bach , Consistency of the group lasso and multiple kernel learning, J. Mach.

Learn. Res. 9 (2008) 1179–1225 . [5] B. Banerjee, V. Murino, Efficient pooling of image based CNN features for ac-

tion recognition in videos, Proceedings of IEEE International Conference on

Acoustics, Speech and Signal Processing (2017) 2637–2641.
[6] J. Bao, Y. Chen, L. Yu, C. Chen, A multi-scale kernel learning method and its application in image classification, Neurocomputing (2017), doi: 10.1016/j.neucom.2016.11.069.

[7] S. Belongie , J. Malik , J. Puzicha , Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. 24 (4) (2002) 509–522 .

[8] Y. Bengio , A. Courville , P. Vincent , Representation learning: a review and new

perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828 . [9] R. Blake , M. Shiffrar , Perception of human motion, Annu. Rev. Psychol. 58 (1)

(2007) 47–73 . [10] A. Blum , P. Langley , Selection of relevant features and examples in machine

learning, Artif. Intell. 97 (1-2) (1997) 245–271 . [11] A. Boulmerka, M.S. Allili, Background modeling in videos revisited using finite

mixtures of generalized Gaussians and spatial information, Proceedings of IEEE

International Conference on Image Processing (2017) pp. 3660–3664. [12] P. Bradley, O. Mangasarian, Features selection via concave minimization and

support vector machine, Proceedings of International Conference on Machine Learning (1998) 82–90.

[13] S.S. Bucak , R. Jin , A.K. Jain , Multiple kernel learning for visual object recogni-tion: a review, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1354–1369 .

[14] E.J. Candès , M.B. Wakin , An introduction to compressive sampling, IEEE Signal

Process. Mag. 25 (2) (2008) 21–30 . [15] B. Cao, D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Feature selection in a kernel space,

Proceedings of International Conference on Machine Learning (2007) 121–128.[16] O. Chapelle , V. Vapnik , O. Bousquet , S. Mukherjee , Choosing multiple parame-

ters for support vector machines, Mach. Learn. 46 (1–3) (2002) 131–159 . [17] K. Charalampous , A. Gasteratos , Online deep learning method for action recog-

nition, Pattern Anal. Appl. 19 (2) (2016) 337–354 . [18] J. Demsar , Statistical comparisons of classifiers over multiple data sets, J. Mach.

Learn. Res. 7 (2006) 1–30 .

[19] J. Demsar , Algorithms for subsetting attribute values with relief, Mach. Learn.78 (3) (2010) 421–428 .

[20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: a deep convolutional activation feature for generic visual recognition, Proceedings of International Conference on Machine Learning (2014) 647–655.
[21] D.L. Donoho, Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006) 1289–1306.
[22] T. Evgeniou, M. Pontil, T. Poggio, Regularization networks and support vector machines, Adv. Comput. Math. 13 (1) (2000) 1–50.
[23] S.R. Fanello, I. Gori, G. Meta, F. Odone, Keep it simple and sparse: real-time action recognition, J. Mach. Learn. Res. 14 (2013) 2617–2640.

[24] F. Friedrichs, C. Igel, Evolutionary tuning of multiple SVM parameters, Neurocomputing 64 (2005) 107–117.
[25] J. Gascón-Moreno, E.G. Ortiz-García, S. Salcedo-Sanz, A. Paniagua-Tineo, B. Saavedra-Moreno, J.A. Portilla-Figueras, Multi-parametric Gaussian kernel function optimization for ε-SVMr using a genetic algorithm, Proceedings of International Conference on Artificial Neural Networks (2011) 113–120.
[26] N. Ghalelis, H. Kim, A. Hilton, N. Nikolaidis, I. Pitas, The i3DPost multi-view and 3D human action/interaction database, Proceedings of IEEE International Conference for Visual Media Production (2009) 159–168.
[27] T. Glasmachers, Gradient Based Optimization of Support Vector Machines, Ph.D. Thesis, Ruhr University Bochum (2008).
[28] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier networks, Proceedings of International Conference on Artificial Intelligence and Statistics (2011) 315–323.
[29] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.

[30] Y. Grandvalet, S. Canu, Adaptive scaling for feature selection using SVMs, Neural Inf. Process. Syst. (2002) 569–576.
[31] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, Proceedings of International Conference on Machine Learning (2010) 399–406.
[32] Y. Gu, H. Liu, Sample-screening MKL method via boosting strategy for hyperspectral image classification, Neurocomputing 173 (1) (2016) 1630–1639.
[33] T. Guha, R.K. Ward, Learning sparse representations for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2012) 1576–1588.
[34] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[35] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[36] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh, Feature extraction: foundations and applications, Studies in Fuzziness and Soft Computing, Springer, 2006.
[37] D.A. Harville, Matrix Algebra from a Statistician's Perspective, Springer, 2008.

[38] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009.
[39] L. Hermes, J.M. Buhmann, Feature selection for support vector machines, Proceedings of IEEE International Conference on Pattern Recognition II (2000) 712–715.
[40] M.B. Holte, T.B. Moeslund, N. Nikolaidis, I. Pitas, 3D human action recognition for multi-view camera systems, Proceedings of IEEE International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (2011) 342–349.
[41] K. Huang, S. Aviyente, Sparse representation for signal classification, Neural Inf. Process. Syst. (2006) 609–616.
[42] K. Hwang, K. Lee, C. Lee, S. Park, Multi-class classification using a signomial function, J. Oper. Res. Soc. 66 (3) (2015) 434–449.
[43] C. Igel, T. Glasmachers, B. Mersch, N. Pfeifer, P. Meinicke, Gradient-based optimization of kernel-target alignment for sequence kernels applied to bacterial gene start detection, IEEE/ACM Trans. Computational Biology and Bioinformatics 4 (2) (2007) 216–226.

[44] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, Proceedings of ACM International Conference on Multimedia (2014) 675–678.
[46] C. Kai, D. Guiguang, H. Jungong, Attribute-based supervised deep learning model for action recognition, Front. Comput. Sci. 11 (2) (2017) 219–229.
[47] N. Kingsbury, D.B.H. Tay, M. Palaniswami, Multi-scale kernel methods for classification, Proceedings of IEEE Workshop on Machine Learning for Signal Processing (2005) 43–48.
[48] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012) 1097–1105.
[49] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1–2) (1997) 273–324.
[50] V. Koltchinskii, M. Yuan, Sparsity in multiple kernel learning, Ann. Stat. 38 (6) (2010) 3660–3695.
[51] B. Krishnapuram, L. Carin, M.T. Figueiredo, A.J. Hartemink, Sparse multinomial logistic regression: fast algorithms and generalization bounds, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 957–968.
[52] T.N. Lal, O. Chapelle, J. Weston, A. Elisseeff, Embedded methods, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Feature Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, Springer, 2006, pp. 137–165.
[53] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (2008) 23–28.
[54] K. Lee, N. Kim, M.-K. Jeong, The sparse signomial classification and regression model, Ann. Oper. Res. (2012) 1–30.
[55] L. Lefakis, F. Fleuret, Jointly informative feature selection, Proceedings of International Joint Conference on Artificial Intelligence and Statistics (2014) 567–575.


[56] Q. Li, H. Cheng, Y. Zhou, G. Huo, Human action recognition using improved salient dense trajectories, Comput. Intell. Neurosci. 2016 (2016). Article ID 6750459, 11 pages.
[57] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Sciences, Irvine, 2013.
[58] Y.Y. Lin, J.H. Hua, N.C. Tang, M.H. Chen, H.Y.M. Liao, Depth and skeleton associated action recognition without online accessible RGB-D cameras, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2014) 61–70.
[59] J. Liu, B. Kuipers, S. Savarese, Recognizing human actions by attributes, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011) 3337–3344.
[60] W. Liu, H. Liu, D. Tao, Y. Wang, K. Lu, Multiview Hessian regularized logistic regression for action recognition, Signal Process. 110 (2015) 101–107.
[61] L. Mancera, J. Portilla, L0-norm-based sparse representation through alternate projections, Proceedings of IEEE Conference on Image Processing (2006) 2089–2092.
[62] K.-P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2013.
[63] J.C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, Int. J. Comput. Vis. 79 (3) (2008) 299–318.
[64] J. Paul, R. D'Ambrosio, P. Dupon, Kernel methods for heterogeneous feature selection, Neurocomputing 169 (2015) 187–195.
[65] S. Perkins, K. Lacker, J. Theiler, Grafting: Fast, incremental feature selection by

gradient descent in function space, J. Mach. Learn. Res. 3 (2003) 1333–1356 .

[66] O. Ouyed, M.S. Allili, Feature relevance for kernel logistic regression and appli-cation to action classification, Proceedings of IEEE International Conference on

Pattern Recognition (2014) 1325–1329. [67] A. Rakotomamonjy , Variable selection using SVM-based criteria, J. Mach. Learn.

Res. 3 (2003) 1357–1370 . [68] M. Ravanbakhsh, H. Mousavi, M. Rastegari, V. Murino, L.S. Davis, Action Recog-

nition with Image Based CNN Features, 2015, Arxiv: CoRR abs/1512.03980

(2015). [69] F. Rinaldi , F. Schoen , M. Sciandrone , Concave programming for minimizing the

zero-norm over polyhedral sets, Comput. Optim. Appl. 46 (3) (2010) 467–486.
[70] M. Pérez-Ortiz, P.A. Gutiérrez, J. Sánchez-Monedero, C. Hervás-Martínez, A study on multi-scale kernel optimisation via centered kernel-target alignment, Neural Process. Lett. 44 (2) (2016) 491–517.
[71] M. Ranzato, Y.-L. Boureau, Y. LeCun, Sparse feature learning for deep belief networks, Neural Inf. Process. Syst. (2008) 1185–1192.
[72] V. Roth, The generalized LASSO, IEEE Trans. Neural Netw. 15 (1) (2004) 16–28.
[73] R. Salakhutdinov, G.E. Hinton, Deep Boltzmann machines, Proceedings of International Conference on Artificial Intelligence and Statistics (2009) 448–455.
[74] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of IEEE International Conference on Pattern Recognition (2004) 32–36.
[75] G. Sharma, A. Agarwala, B. Bhattacharya, A fast parallel Gauss–Jordan algorithm for matrix inversion using CUDA, Comput. Struct. 128 (2013) 31–37.
[76] A. Shamsheyeva, A. Sowmya, The anisotropic Gaussian kernel for SVM classification of HRTC images of the lung, Proceedings of Intelligent Sensors, Sensor Networks and Information Processing Conference (2004) 439–444.
[77] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[78] S. Sonnenburg, G. Rätsch, C. Schäfer, et al., Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531–1565.

[79] S. Spurlock, H. Wu, R. Souvenir, Multi-view recognition using weighted view selections, Proceedings of Asian Conference on Computer Vision (2014) 538–552.

[80] Y. Sun , Iterative RELIEF for feature weighting: Algorithms, theories, and appli-cations, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1035–1051 .

[81] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.E. Reed, D. Anguelov, D. Erhan, V. Van-houcke, A. Rabinovich, Going deeper with convolutions, Proceedings of IEEE

Conference on Computer Vision and Pattern Recognition (2015) 1–9.

[82] K. Tarkkola , Feature extraction by non-parametric mutual information maxi-mization, J. Mach. Learn. Res. 3 (2003) 1415–1438 .

[83] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. B

[84] M.E. Tipping , Sparse bayesian learning and the relevance vector machine, J.Mach. Learn. Res. 1 (2001) 211–244 .

[85] H.-B. Tu , L.-M. Xia , Z.-W. Wang , The complex action recognition via the corre-lated topic model, Sci. World J. 2014 (2014) . Article ID 810185, 10 pages.

[86] V.N. Vapnik , Statistical Learning Theory, John Wiley & Sons, 1998 . [87] H. Wang , A. Klaser , C. Schmid , C.L. Liu , Dense trajectories and motion boundary

descriptors for action recognition, Int. J. Comput. Vis. 103 (1) (2013) 60–79 .

[88] J. Weston, A. Elisseff, B. Schölkopf, Use of the zero-norm with linear models and kernel models, J. Mach. Learn. Res. 3 (2003) 1439–1461.
[89] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, Feature selection for SVMs, Neural Inf. Process. Syst. (2000) 668–674.

[90] J. Wright , A.Y. Yang , A. Ganesh , S.S. Sastry , Y. Ma , Robust face recognition viasparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009)

210–227 .

[91] S.-H. Yang, Y.-J. Yang, B.-G. Hu, Sparse kernel-based feature weighting, Pro-ceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining,

PAKDD, (2008) 813–820. [92] W. Yang, Y. Wang, G. Mori, Human action recognition from a single clip per

action, Proceedings of ICCV Workshops (2009) 482–489.
[93] J. Yin, Z. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (1) (2012) 120–128.
[94] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B 68 (1) (2006) 49–67.

[95] Y. Yuan , X. Zheng , X. Lu , A discriminative representation for human action recognition, Pattern Recognit. 59 (2016) 88–97 .

[96] L. Zhang , W.-D. Zhou , P.-C. Chang , J. Liu , Z. Yan , T. Wang , F.-Z. Li , Kernel sparse representation-based classifier, IEEE Trans. Signal Process. 60 (4) (2012)

1684–1695 .

[97] J. Zhu , T. Hastie , Kernel logistic regression and import vector machine, J. Com-put. Graph. Stat. 14 (1) (2005) 185–205 .

Ouiza Ouyed received the B.Eng. and M.Sc. degrees in electrical engineering from Université Mouloud Mammeri de Tizi-Ouzou (Algeria) in 2004 and 2009, respectively. Since 2010, she has been pursuing Ph.D. studies at Université du Québec en Outaouais (Canada). Her primary research interests include statistical models and their application to image segmentation and action recognition.

Mohand Said Allili received the M.Sc. and Ph.D. degrees in computer science from the University of Sherbrooke, Sherbrooke, QC, Canada, in 2004 and 2008, respectively. Since June 2008, he has been an Assistant Professor of computer science with the Department of Computer Science and Engineering, Université du Québec en Outaouais, Canada. His main research interests include computer vision and graphics, image processing, pattern recognition, and machine learning. Dr. Allili was a recipient of the Best Ph.D. Thesis Award in engineering and natural sciences from the University of Sherbrooke for 2008 and the Best Student Paper and Best Vision Paper awards for two of his papers at the Canadian Conference on Computer and Robot Vision in 2007 and 2010, respectively.