Neurocomputing 275 (2018) 1752–1768
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Feature weighting for multinomial kernel logistic regression and
application to action recognition
Ouiza Ouyed a, Mohand Said Allili a,b,∗
a Department of Computer Science and Engineering, University of Quebec in Outaouais, Gatineau J8X 3X7, Quebec, Canada
b Département d’informatique et d’ingénierie, Université du Québec en Outaouais, 101, Rue St-Jean-Bosco, Local: B-2022, Gatineau J8X 3X7, Québec, Canada
Article info
Article history:
Received 13 January 2017
Revised 13 October 2017
Accepted 23 October 2017
Available online 7 November 2017
Communicated by Dr. Xin Luo
Keywords:
Multinomial kernel logistic regression
Feature relevance
Sparse models
Video action recognition
Abstract
Multinomial kernel logistic regression (MKLR) is a supervised classification method designed for separating classes with non-linear boundaries. However, it relies on the assumption that all features are equally important, which may decrease classification performance when dealing with high-dimensional and noisy data. We propose an approach for embedding feature relevance in multinomial kernel logistic regression. Our approach, coined fr-MKLR, generalizes MKLR by introducing a feature weighting scheme in the Gaussian kernel and using the so-called $\ell_0$-“norm” as sparsity-promoting regularization. Therefore, the contribution of each feature is tuned according to its relevance for classification, which leads to more generalizable and interpretable sparse models for classification. Application of our approach to several standard datasets and video action recognition has provided very promising results compared to other methods.
© 2017 Elsevier B.V. All rights reserved.
1. Introduction
Sparse models for classification have received increasing attention in supervised learning [41,59,60]. These include the least absolute shrinkage and selection operator (LASSO) [72,83,94], import vector machine (IVM) [97], sparse representation-based classification (SRC) [33,93,96] and many others. Generally speaking, model sparsity in classifiers can come in two different flavours. First (instance selection), sparsity can refer to selecting subsets (or instances) of learning data that are representative for the classification at hand [10,33,93,96]. This approach is used, for example, in support vector machines (SVM) [86] and IVM [97] to produce compact sets of data (i.e., support vectors) used for classification. In SRC [93,96], classification is based on minimizing the reconstruction error with regard to class data. Second (feature selection), sparsity refers to reducing the feature space to a small subset of dimensions carrying sufficient information for classification; this paper deals with this type of sparsity. Worth mentioning are also methods for learning representations using deep neural networks (DNNs), where relevant features are automatically extracted for recognition problems [8,29]. However, given the huge number of parameters required for representation learning in DNNs, a large amount of training data and computation time are required [48,81]. To reduce the number of parameters, sparse models for DNNs have been proposed in [28,71]. These methods, although gaining computation efficiency at the expense of some accuracy, still require a huge amount of data for representation learning.
∗ Corresponding author at: Department of Computer Science and Engineering, University of Quebec in Outaouais, Gatineau J8X 3X7, Quebec, Canada.
E-mail address: [email protected] (M.S. Allili).
https://doi.org/10.1016/j.neucom.2017.10.024
0925-2312/© 2017 Elsevier B.V. All rights reserved.
Recently, kernel-based methods have shown considerable success in the literature of supervised classification [77]. They are based on implicit non-linear embedding of data into high-dimensional spaces using the kernel trick to enable linear separation of classes in the target space, which translates into non-linear boundaries in the original space [86]. To reduce the effect of high-dimensionality, methods have been proposed to introduce sparsity in kernel-based classification. For instance, [97] proposed the import vector machine (IVM) method based on kernel logistic regression. In a similar way to SVM, IVM uses only a fraction of the training data to index the kernel basis functions. However, thanks to its probabilistic formulation, IVM is more easily extendable to multi-class classification. Following the success of the SRC approach applied for face recognition [90], other methods were proposed to obtain sparse classifiers by minimizing reconstruction errors with regard to class data [33,93,96]. These methods use mainly $\ell_1$-based regularization to achieve sparsity in model learning, but they are computationally intensive. Noteworthy are also methods using multiple kernel learning (MKL) for achieving better classification generalization [13,47,64]. Motivated by multi-resolution wavelet theory, these methods make it possible to learn complex class boundaries and induce sparsity accommodating fine details and large smooth class boundaries [4,47,50].
Most of the above methods embed sparsity in the kernel space spanned by the instances of the learning data. That is, sparsity comes in terms of data instances after implicit kernel transformation using the inner product. However, the effect of noisy features of the original space on classification is left unaddressed [33,64]. To alleviate this effect, some methods [15,91] have attempted to weight features in kernel spaces using the RELIEF algorithm [80]. Although these methods can efficiently rank features according to their class discrimination, they do not give optimal thresholds for assigning final class labels [19]. Other methods use anisotropic kernels in SVM-based classifiers by weighting features according to their relevance for classification. Feature weights are determined in these methods by direct optimization using evolutionary algorithms [24,25] or gradient descent methods [16,27,76]. Kernel alignment is also used for efficient gradient-based optimization in multi-scale kernel modeling [27,43,70]. However, since the same weights are assigned for all classes, these methods are more adapted to binary classification. Indeed, in multi-class problems, relevant features for discriminating one class may not be relevant for discriminating other classes [35,36].
In this paper, we propose the fr-MKLR approach (feature relevance in multinomial kernel logistic regression), which incorporates feature relevance in non-linear and multi-class logistic regression. Unlike SVM and MKLR, fr-MKLR produces sparse classification models by embedding feature relevance (FR) directly in the kernel function and using an appropriate penalization term in the likelihood function inducing feature sparsity in the final model. Besides formulating FR for each class, fr-MKLR can deal with arbitrary numbers of classes and dimensions of data and produce more interpretable classification models by selecting the most discriminative features for each class. Our contributions can be summarized in the following points:
• We propose an extension of multinomial kernel logistic regression (MKLR) by including a feature-based sparsity scheme alleviating the effect of noisy features and enhancing the capability of classification generalization. To the best of our knowledge, sparse MKLR in terms of features has not been investigated in the past. Similarly to MKLR, fr-MKLR produces an estimate of the posterior probability for class data through the logit regression, which makes it easy to extend to multi-class data [86,97]. However, it yields better classification generalization than MKLR since the effect of irrelevant features is reduced.
• In contrast to kernel methods ranking features based on explicit criteria such as the RELIEF algorithm [15,91], fr-MKLR is an embedded method allowing a seamless integration of feature weighting in the logistic regression model. In addition, it assigns different weights for each class, which allows selecting the most discriminative features for each class. Finally, as in methods using anisotropic kernels in SVM classifiers, either through multi-scale kernel learning (MKL) [13,47,64] or direct feature weighting [16,43,76], fr-MKLR makes it possible to fit smooth and finely detailed class boundaries for better class discrimination. However, having a probabilistic formulation which is readily extendable to multi-class problems, fr-MKLR is more flexible than these methods [16,24,65,76,89].
• When comparing fr-MKLR to several methods such as LASSO, KSVM and SMLR, fr-MKLR provides better generalization in case of scarce and overlapping class data with non-linear boundaries. Our approach has been first validated on several synthetic and real-world standard datasets and has demonstrated better generalization performance than the compared methods. We applied fr-MKLR to human action recognition in videos. Action recognition is a very challenging task due to the high dimensionality and variability of video data, the diversity of human actions and inter-class overlapping [1]. Our application is based on a new action representation using shape context analysis of human silhouettes. Validation of our approach on standard action datasets (e.g., KTH, UIUC and I3d) has yielded better performance than recent methods based on naive Bayes, SMLR, KSVM and LASSO.
Early results of this work have been published in [66]. Herein, we give an in-depth theoretical analysis and literature review for our approach. Moreover, extensive validation results are presented with comparison to other classification methods.
The rest of this paper is organized as follows. Section 2 describes some related work. Section 3 describes MKLR and our model background. Section 4 describes our approach for feature relevance using MKLR. Section 5 provides a comparative evaluation of our method to some existing approaches. We end the paper with a conclusion and future work perspectives.
2. Related works
Sparse modeling (SM) has received considerable attention in supervised and unsupervised learning. For unsupervised learning, SM has been used mainly for signal reconstruction [61], compressed sensing [14,21] and sparse coding [31], whereas for supervised learning, it has been used mainly for regression [83,84] and classification [62,88]. Since our work deals with classification, we limit the scope of our related work to methods promoting sparsity in supervised classification.
The literature of sparse models for classification starts with early approaches performing feature selection (FS), which can be mainly grouped into three categories. Filters select subsets of variables based on measures such as information gain [55,82] and statistical correlation analysis [35]. They are usually fast but can be biased since variables are chosen independently of the classification model. Wrappers choose subsets of features using the classification model as a black box [49]. Although they are less biased than filters, wrappers are computationally intensive. Embedded methods bridge the gap between filters and wrappers by embedding FS in the structure of the classifiers [36,42,54]. Embedded methods for FS can be roughly categorized into three main groups [52]: 1) forward-backward methods iteratively add/remove variables according to specific criteria such as the rate of change of an objective function [34,39,54,65] or the sensitivity to the leave-one-out error [67], 2) scaling factor methods use hyper-parameters that are adjusted by model selection [2,3,30] or by minimizing a generalization error bound [89], and 3) direct optimization methods incorporate sparsity-promoting terms based on the $\ell_0$ and $\ell_1$ norms that cause irrelevant feature weights to vanish [38]. Among methods in this category we can find the $\ell_1$-regularized SVM [65,89] and sparse multinomial logistic regression (SMLR) [51] proposed for linear and kernel-based classification.
Sparsity in linear classifiers has been investigated in SVM [12,89] and sparse multinomial logistic regression (SMLR) [51] using $\ell_1$-based regularization. With the advent of kernel methods, new possibilities have arisen for improving classification algorithms in high-dimensional spaces [77]. Methods have attempted to combine sparsity and kernel methods to achieve two types of data reductions (instance and feature selection). In a similar way to SVM [86], Zhu and Hastie [97] have proposed the IVM algorithm based on the MKLR formulation. The IVM uses a greedy search to select instances of data to build support vectors that provide better classification generalization. Though a better performance than MKLR and SVM is reported, the algorithm is computationally intensive. In a similar way, Weston et al. [88] have used an approximate $\ell_0$-“norm” to yield sparsity in SVM classification. This approach has shown better performance than using SVM with $\ell_1$- or $\ell_2$-based regularization.
Sparse representations for recognition can be obtained using
dictionary learning [23,33,93,96] and deep learning [5,29,44,46] .
Approaches based on dictionary learning approximate an input sig-
nal using sparse linear combinations of data instances after pro-
jection into kernel spaces. Class assignment is then performed by
minimizing the reconstruction error of newly observed data with
regard to labeled ones for each class [23] . This approach has been
used in [33] for action recognition in videos. Recently, representa-
tion learning using deep neural networks (DNN) has received much
interest for recognition problems [29]. Convolutional neural networks (CNNs)
are one type of DNN that has shown good promise for object recognition in
still images [20,48]. CNNs have also been extended to 3D configurations for
action recognition in videos [5,17,44,46,68]. However, to succeed, these
methods, like dictionary learning, require a huge amount of training data.
In addition, DNNs are computationally intensive given the number
of layers and parameters involved in deep architectures [45,48].
To ensure good spatial fitting of class boundaries, multiple ker-
nel learning (MKL) has been recently proposed for supervised clas-
sification [4,6,47] . MKL is one particular type of multiple ker-
nel methods [13,78] which combines kernels with different scales
through a multi-kernel learning framework. This enables better
classification generalization in spatially scattered and/or dense
regions in the learning data. For MKL, a critical issue is to determine
the different kernel coefficients weighting the contribution of each
kernel function in the final decision boundary [6] . Methods us-
ing direct optimization have been proposed for MKL in SVM-based
classification [6,32,47] . However, increasing the number of kernel
functions can greatly increase the complexity of the optimization.
Another way to improve classification generalization is to use
anisotropic kernels to reduce the influence of irrelevant features
[16,24,25,43,76]. These methods assign different kernel weights
to the different dimensions of the data so that each feature
contributes according to its discrimination capability. Optimization
techniques are used to set each dimension weight based on
the learning data. However, most methods using this approach
are limited to binary classification [43,70].
Most of the sparse methods have been designed for binary
classification, and their formulation remains not easily extend-
able to multi-class problems. To overcome this limitation, we pro-
pose a sparse model (fr-MKLR) based on multinomial kernel logis-
tic regression (MKLR) which embeds feature relevance in kernel-
based classification. Contrarily to methods implementing spar-
sity in terms of kernel functions (i.e., MKL [6,32,47] ) or data in-
stances (i.e., dictionary learning [23,33,93,96] ), our method pro-
motes feature sparsity using anisotropic kernels fitting different
types of class boundaries. Given its probabilistic formulation, fr-
MKLR performs soft classification while remaining easily extend-
able to multi-class problems.
In contrast to SMLR [51], which explicitly weights features in the
original or kernel spaces, our method embeds feature weighting
directly in kernel construction and separately for each class.
This makes it possible to fit non-linear class boundaries according to
the discriminative features of each class. Unlike sparse representation
learning using dictionary or deep learning, our model does not require a
huge amount of data to be effective. Moreover, by
performing direct optimization using a gradient-descent method,
our model is computationally more efficient. Several experiments
have demonstrated that our method compares favorably with
other sparse methods on several classification problems.
3. Multinomial kernel logistic regression
Multinomial kernel logistic regression (MKLR) is a supervised learning method that produces non-linear classification boundaries by transforming an input variable space into another space using a positive-definite kernel $K(\cdot,\cdot)$. The relationship between SVM and regularized function estimation in the reproducing kernel Hilbert spaces (RKHS) has been established in [22]. By replacing the hinge loss function of SVM with the negative log-likelihood (NLL) of the binomial distribution, the same relation can be established with MKLR [38,97].
More specifically, let us have $n$ instances of training data $\mathbf{x}_i \in \mathbb{R}^d$, $i \in \{1,\ldots,n\}$, with $d$ measured features for each instance, which are generated from $m$ classes ($m \geq 2$). We associate an encoding vector $\mathbf{y}_i = [y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(m)}]^T$ for each data point $\mathbf{x}_i$, such that $y_i^{(j)} = 1$ if $\mathbf{x}_i$ belongs to the class $j$ and $y_i^{(j)} = 0$ otherwise. The symbol $[\cdot]^T$ denotes transpose of vectors/matrices. For binary classification ($m = 2$), we have $y_i \in \{0, 1\}$ and fitting a decision boundary is equivalent to searching a function $f$ minimizing the NLL [97]:
$$\mathcal{L}(f) = \sum_{i=1}^{n} \Big[ -y_i f(\mathbf{x}_i) + \ln\big[1 + \exp(f(\mathbf{x}_i))\big] \Big] + \frac{\lambda}{2}\,\|f\|^2_{\mathcal{H}_K}, \tag{1}$$
where $\mathcal{H}_K$ is the RKHS generated by the kernel $K(\cdot,\cdot)$ and $\lambda$ controls the contribution of the regularization term $\|f\|^2_{\mathcal{H}_K}$ that smoothes $f$. Note that the NLL (1) is obtained by setting $p(y_i = 1 \mid \mathbf{x}_i) = \exp(f(\mathbf{x}_i)) / [1 + \exp(f(\mathbf{x}_i))]$ and $p(y_i = 0 \mid \mathbf{x}_i) = 1 / [1 + \exp(f(\mathbf{x}_i))]$. The optimal solution $f(\mathbf{x})$ for minimizing (1) has the form:
$$f(\mathbf{x}) = \sum_{i=1}^{n} a_i K(\mathbf{x}, \mathbf{x}_i), \tag{2}$$
where $a_i$, $i \in \{1,\ldots,n\}$, are real-valued coefficients. By defining the two vectors $\mathbf{a} = [a_1, \ldots, a_n]^T$ and $\mathbf{y} = [y_1, \ldots, y_n]^T$, function (1) can be re-written in a compact form as follows:
$$\mathcal{L}(\mathbf{a}) = -\mathbf{y}^T \mathbf{K}\mathbf{a} + \mathbf{1}^T \ln\big[1 + \exp(\mathbf{K}\mathbf{a})\big] + \frac{\lambda}{2}\,\mathbf{a}^T \mathbf{K}\mathbf{a}, \tag{3}$$
where $\mathbf{1} = [1, 1, \ldots, 1]^T$ is an $n$-dimensional vector of ones and $\mathbf{K}$ is a kernel matrix of dimension $n \times n$ with entries given by $K_{r,s} = K(\mathbf{x}_r, \mathbf{x}_s)$, $r, s \in \{1,\ldots,n\}$. Also, we have adopted the compact form $\ln[1 + \exp(\mathbf{K}\mathbf{a})]$ for $[\ln(1 + \exp(\mathbf{K}(1,\cdot)\mathbf{a})), \ldots, \ln(1 + \exp(\mathbf{K}(n,\cdot)\mathbf{a}))]^T$, with $\mathbf{K}(i,\cdot)$, $i \in \{1,\ldots,n\}$, designating the $i$th row of $\mathbf{K}$.
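As an illustration, the compact binary NLL of Eq. (3) can be evaluated with a few lines of NumPy. This is only a minimal sketch and not the authors' implementation: the helper names `gaussian_kernel` and `binary_nll`, the toy data and the choice of a Gaussian kernel are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Kernel matrix with K[r, s] = exp(-||x_r - x_s||^2 / (2 sigma^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def binary_nll(a, K, y, lam):
    """Eq. (3): -y^T K a + 1^T ln(1 + exp(K a)) + (lam / 2) a^T K a."""
    f = K @ a
    # np.logaddexp(0, f) evaluates ln(1 + exp(f)) without overflow
    return -y @ f + np.sum(np.logaddexp(0.0, f)) + 0.5 * lam * a @ f

# Toy data (hypothetical): 6 points, 2 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([0, 1, 0, 1, 1, 0], dtype=float)
K = gaussian_kernel(X)
nll_at_zero = binary_nll(np.zeros(6), K, y, lam=0.1)
```

With $\mathbf{a} = \mathbf{0}$ every point receives probability 1/2, so the value reduces to $n \ln 2$, which gives a quick sanity check of the implementation.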
In a multi-class setting ($m > 2$), MKLR gives the posterior probability of class $j$ given an observation $\mathbf{x}_i$, which is written as $p_i^{(j)} = p(y_i^{(j)} = 1 \mid \mathbf{x}_i)$. By defining a separate function $f_j(\mathbf{x})$ for each class $j$, the posterior probability of class $j$ given $\mathbf{x}_i$ can be written as follows:
$$p(y_i^{(j)} = 1 \mid \mathbf{x}_i) = \frac{\exp(f_j(\mathbf{x}_i))}{\sum_{h=1}^{m} \exp(f_h(\mathbf{x}_i))}, \quad j = 1, \ldots, m, \tag{4}$$
where $f_j \in \mathcal{H}_K$ is defined as $f_j(\mathbf{x}) = \sum_{i=1}^{n} a_{ij} K(\mathbf{x}, \mathbf{x}_i)$.
We put the coefficients of each function $f_j$ into a separate vector $\mathbf{a}_j = [a_{1j}, \ldots, a_{nj}]^T$. Because $\sum_{j=1}^{m} p_i^{(j)} = 1$, we have $p_i^{(m)} = 1 - \sum_{j=1}^{m-1} p_i^{(j)}$. Thus, by setting $\mathbf{a}_m = \mathbf{0}$ as for linear logistic regression [51], only the set of parameters $A = \{\mathbf{a}_1, \ldots, \mathbf{a}_{m-1}\}$ is to be learned. For each data point $\mathbf{x}_i$, therefore, we associate a vector containing the class posterior probabilities $\mathbf{p}_i = [p_i^{(1)}, p_i^{(2)}, \ldots, p_i^{(m-1)}]^T$ defined as follows:
$$p_i^{(j)} = p(y_i^{(j)} = 1 \mid \mathbf{x}_i) = \frac{\exp(f_j(\mathbf{x}_i))}{1 + \sum_{h=1}^{m-1} \exp(f_h(\mathbf{x}_i))}, \quad j = 1, \ldots, m-1. \tag{5}$$
Note that $p_i^{(m)} = 1 - \sum_{j=1}^{m-1} p_i^{(j)} = \big[1 + \sum_{h=1}^{m-1} \exp(f_h(\mathbf{x}_i))\big]^{-1}$. Therefore, the multi-class penalized NLL can be formulated as follows (see Appendix A):
$$\mathcal{L}(A) = \sum_{j=1}^{m-1} -\mathbf{y}^{(j)T} \mathbf{K}\mathbf{a}_j + \mathbf{1}^T \ln\Big[1 + \sum_{h=1}^{m-1} \exp(\mathbf{K}\mathbf{a}_h)\Big] + \frac{\lambda}{2} \sum_{j=1}^{m-1} \mathbf{a}_j^T \mathbf{K}\mathbf{a}_j, \tag{6}$$
where $\mathbf{y}^{(j)} = [y_1^{(j)}, y_2^{(j)}, \ldots, y_n^{(j)}]^T$.
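The multi-class posteriors of Eq. (5) and the NLL of Eq. (6) can be sketched in the same way. The code below is an illustrative sketch under stated assumptions: the function names, the storage of the coefficient vectors $\mathbf{a}_j$ as columns of an array `A`, and the toy inputs are hypothetical choices, not part of the original method description.

```python
import numpy as np

def posteriors(F):
    """F has shape (n, m-1) with F[i, j] = f_j(x_i); class m is the reference (f_m = 0).
    Returns an (n, m) array: Eq. (5) in the first m-1 columns, p_i^(m) in the last."""
    F_full = np.hstack([F, np.zeros((F.shape[0], 1))])
    F_full -= F_full.max(axis=1, keepdims=True)   # stabilise the exponentials
    E = np.exp(F_full)
    return E / E.sum(axis=1, keepdims=True)

def multiclass_nll(A, K, Y, lam):
    """Eq. (6). A: (n, m-1) coefficients (column j is a_j);
    Y: (n, m-1) class indicators for the first m-1 classes."""
    F = K @ A
    # ln[1 + sum_h exp(F[:, h])] computed as a log-sum-exp over [F, 0]
    log_norm = np.logaddexp.reduce(np.hstack([F, np.zeros((F.shape[0], 1))]), axis=1)
    return -np.sum(Y * F) + np.sum(log_norm) + 0.5 * lam * np.sum(A * F)

# Toy check (hypothetical inputs): n = 4 points, m = 3 classes
K = np.eye(4)
A = np.zeros((4, 2))
Y = np.array([[1, 0], [0, 1], [0, 0], [1, 0]], dtype=float)
P = posteriors(K @ A)
nll_value = multiclass_nll(A, K, Y, lam=0.5)
```

With all coefficients at zero the posteriors are uniform ($1/m$) and the NLL equals $n \ln m$.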
4. Feature relevance for MKLR
Feature relevance is motivated by the fact that a good combination of features usually leads to better classification than using each feature individually [34]. To deal with a large number of features and improve the predictive performance of MKLR, we propose to directly weight features in the kernel of the radial basis function (RBF) of the MKLR. In other words, we use a weighted distance in the kernel function where each feature is scaled according to its relevance to classification. Note that in contrast with [97], where selection of data instances is performed for better classification, our approach aims at achieving sparsity in terms of features. In what follows, without loss of generality, we base our analysis on the Gaussian kernel defined as:
$$K(\mathbf{x}_r, \mathbf{x}_s) = \exp\big(-\|\mathbf{x}_r - \mathbf{x}_s\|^2 / (2\sigma^2)\big), \tag{7}$$
with $r, s \in \{1,\ldots,n\}$, where $\sigma > 0$ controls the width of the kernel. In what follows, we give the formulation of fr-MKLR for binary classification and then generalize it to multi-class data.
4.1. Feature relevance in case ($m = 2$)
We use a weighting vector $\boldsymbol{\Psi} = [\psi_1, \ldots, \psi_d]^T$ having the same dimension as our feature space and we plug it into the RBF (7) as follows:
$$\tilde{K}(\mathbf{x}_r, \mathbf{x}_s) = \exp\Big(-\frac{1}{2}\,(\mathbf{x}_r - \mathbf{x}_s)^T \mathrm{diag}(\boldsymbol{\Psi})^2 (\mathbf{x}_r - \mathbf{x}_s)\Big), \tag{8}$$
where $\mathrm{diag}(\boldsymbol{\Psi})$ designates a diagonal matrix with diagonal entries containing the elements of $\boldsymbol{\Psi}$. Note that in case all entries of $\boldsymbol{\Psi}$ are equal, $\tilde{K}$ is isotropic (it coincides with Eq. (7) for $\sigma = 1/\psi_k$) and the model boils down to the standard MKLR. When the entries of $\boldsymbol{\Psi}$ are different, $\tilde{K}$ is anisotropic and the contribution of each individual feature is weighted differently. As a feature weight decreases to zero, its contribution to distance calculation in (8), and therefore to classification, will become less important.
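The effect of the weights in Eq. (8) can be seen on a small example. The following is a minimal sketch, assuming a NumPy array of row-wise data points; the function name `weighted_gaussian_kernel` and the toy values are hypothetical.

```python
import numpy as np

def weighted_gaussian_kernel(X, psi):
    """K~[r, s] = exp(-0.5 * sum_k psi_k^2 (x_rk - x_sk)^2), as in Eq. (8)."""
    Z = X * psi   # scaling feature k by psi_k realises diag(psi)^2 in the squared distance
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq)

# Three points that differ strongly in the second feature only (rows 1 and 2)
X = np.array([[0.0, 0.0], [1.0, 5.0], [1.0, -3.0]])
K_iso = weighted_gaussian_kernel(X, np.array([1.0, 1.0]))     # isotropic, sigma = 1
K_sparse = weighted_gaussian_kernel(X, np.array([1.0, 0.0]))  # second feature switched off
```

With $\psi_2 = 0$ the second and third points become indistinguishable to the kernel (`K_sparse[1, 2]` equals 1), which is how a vanishing weight removes a feature from the classifier.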
To encourage model sparsity, we propose to add a regularization on the weight vector $\boldsymbol{\Psi}$ to the negative log-likelihood function (3) using the $\ell_0$-“norm” [88]. The $\ell_0$-“norm” of $\boldsymbol{\Psi}$, defined as $\|\boldsymbol{\Psi}\|_0 = \mathrm{card}\{k \mid \psi_k \neq 0,\ k = 1, \ldots, d\}$, gives the number of non-zero entries of the vector $\boldsymbol{\Psi}$. Note that, unlike $\ell_q$ norms with $q \geq 1$, $\|\cdot\|_0$ is not a norm because it is not absolutely homogeneous: scaling $\boldsymbol{\Psi}$ by a non-zero constant leaves $\|\boldsymbol{\Psi}\|_0$ unchanged. Since the $\ell_0$-“norm” is not smooth, it is usually approximated using the following function [69] (see Fig. 1 for illustration):
$$\|\boldsymbol{\Psi}\|_0 \approx \sum_{k=1}^{d} \big[1 - \exp(-\beta\psi_k)\big], \tag{9}$$
where $\beta$ is an approximation parameter. It has been shown in [12,88] that for sufficiently high values of $\beta$, classifiers can lead to better generalization while maintaining a good model sparsity. Our aim in using this penalty term is to decrease the weights and the contribution of noisy features to classification. This can be achieved by substituting the kernel $\tilde{K}$ for $K$ in function (3) and minimizing the following penalized NLL:
$$\mathcal{L}(\mathbf{a}, \boldsymbol{\Psi}) = -\mathbf{y}^T \tilde{\mathbf{K}}\mathbf{a} + \mathbf{1}^T \ln\big[1 + \exp(\tilde{\mathbf{K}}\mathbf{a})\big] + \frac{\lambda}{2}\,\mathbf{a}^T \tilde{\mathbf{K}}\mathbf{a} + \mu \sum_{k=1}^{d} \big[1 - \exp(-\beta\psi_k)\big], \tag{10}$$
where $\tilde{\mathbf{K}}$ is a kernel matrix of dimension $n \times n$ with entries $\tilde{K}(\mathbf{x}_r, \mathbf{x}_s)$ given by Eq. (8) and $\mu$ is a regularization parameter. This minimization can be performed through an iterative process that alternates between two steps until convergence. In the first step, we estimate the entries of the vector $\mathbf{a}$ (for binary classification, only one vector $\mathbf{a}$ is estimated). In the second step, for a given solution $\mathbf{a}$, we minimize function (10) according to the weighting vector $\boldsymbol{\Psi}$.
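To give a feel for the role of $\beta$ in the smooth surrogate of Eq. (9), the short sketch below evaluates it on a fixed weight vector for increasing values of $\beta$. The function name `l0_surrogate` and the numbers are illustrative assumptions, and non-negative weights are assumed.

```python
import numpy as np

def l0_surrogate(psi, beta):
    """Smooth approximation of ||psi||_0 from Eq. (9); assumes psi >= 0."""
    return np.sum(1.0 - np.exp(-beta * psi))

psi = np.array([0.0, 0.0, 0.8, 2.0])   # two non-zero entries, so ||psi||_0 = 2
values = [l0_surrogate(psi, b) for b in (1.0, 5.0, 50.0)]
```

As $\beta$ grows, the surrogate approaches the exact count of non-zero weights (here 2), at the price of a less smooth objective.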
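We use the Newton–Raphson (N–R) method to estimate the entries of $\mathbf{a}$ and $\boldsymbol{\Psi}$. Using matrix differentiation rules [37], the first order differential of (10) with respect to $\mathbf{a}$ and $\boldsymbol{\Psi}$ is given as follows (see Appendix B):
$$\mathbf{g} = \partial\mathcal{L}/\partial\mathbf{a} = \tilde{\mathbf{K}}\mathbf{c}, \qquad \partial\mathcal{L}/\partial\boldsymbol{\Psi} = [\mathbf{c}^T \mathbf{Q}_1 \mathbf{a}, \ldots, \mathbf{c}^T \mathbf{Q}_d \mathbf{a}]^T + \mu\beta\exp(-\beta\boldsymbol{\Psi}), \tag{11}$$
where $\mathbf{c} = \big(-\mathbf{y} + \mathbf{p} + \frac{\lambda}{2}\mathbf{a}\big)$ and $\mathbf{p} = [p_1, \ldots, p_n]^T$ with $p_i = p(y_i = 1 \mid \mathbf{x}_i)$, and where $\exp(-\beta\boldsymbol{\Psi})$ is taken element-wise. We define the matrix $\mathbf{Q}_k$, $k \in \{1,\ldots,d\}$, as the following Hadamard product $\mathbf{Q}_k = \tilde{\mathbf{K}} \circ \mathbf{B}_k$, where $\mathbf{B}_k$ is an $n \times n$ matrix with entries defined by $B_k(r, s) = -\psi_k (x_{r,k} - x_{s,k})^2$, $r, s \in \{1,\ldots,n\}$. The final gradient vector of (10) is obtained by concatenating the terms of (11), which gives $\tilde{\mathbf{g}} = \big[\mathbf{g}^T, (\partial\mathcal{L}/\partial\boldsymbol{\Psi})^T\big]^T$.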
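The derivative with respect to the feature weights in Eq. (11) can be checked numerically against finite differences of Eq. (10). The sketch below is an illustration only; all function names, the random toy data and the hyper-parameter values are assumptions chosen for the check.

```python
import numpy as np

def weighted_kernel(X, psi):
    Z = X * psi
    return np.exp(-0.5 * np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1))

def penalized_nll(a, psi, X, y, lam, mu, beta):
    """Eq. (10): NLL with the weighted kernel plus the l0-surrogate penalty."""
    f = weighted_kernel(X, psi) @ a
    penalty = mu * np.sum(1.0 - np.exp(-beta * psi))
    return -y @ f + np.sum(np.logaddexp(0.0, f)) + 0.5 * lam * a @ f + penalty

def grad_psi(a, psi, X, y, lam, mu, beta):
    """Weight part of Eq. (11): c^T Q_k a + mu * beta * exp(-beta * psi_k)."""
    K = weighted_kernel(X, psi)
    p = 1.0 / (1.0 + np.exp(-(K @ a)))
    c = -y + p + 0.5 * lam * a
    grad = np.empty_like(psi)
    for k in range(psi.size):
        diff2 = (X[:, None, k] - X[None, :, k]) ** 2
        Q_k = K * (-psi[k] * diff2)        # Hadamard product K~ o B_k
        grad[k] = c @ Q_k @ a
    return grad + mu * beta * np.exp(-beta * psi)

# Hypothetical toy problem: 8 points, 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=8).astype(float)
a = rng.normal(size=8)
psi = np.array([0.7, 0.3, 1.2])
args = (X, y, 0.1, 0.05, 5.0)              # lam, mu, beta

analytic = grad_psi(a, psi, *args)
eps = 1e-6
numeric = np.array([
    (penalized_nll(a, psi + eps * e, *args) - penalized_nll(a, psi - eps * e, *args)) / (2 * eps)
    for e in np.eye(3)
])
```

Agreement between `analytic` and `numeric` confirms that $\mathbf{B}_k$ is simply the derivative of the exponent of Eq. (8) with respect to $\psi_k$.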
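To calculate the Hessian of function (10), note that the Hessian of (10) with respect to the entries of $\mathbf{a}$ is $\tilde{\mathbf{K}}\mathbf{W}\tilde{\mathbf{K}} + \lambda\tilde{\mathbf{K}}$, with $\mathbf{W} = \mathrm{diag}[p_1(1 - p_1), p_2(1 - p_2), \ldots, p_n(1 - p_n)]$. We define also the second derivatives of the NLL with respect to the parameters $\mathbf{a}$ and $\boldsymbol{\Psi}$ as $\mathbf{T} = \frac{\partial^2 \mathcal{L}(\mathbf{a}, \boldsymbol{\Psi})}{\partial\boldsymbol{\Psi}\,\partial\boldsymbol{\Psi}^T}$ and $\mathbf{M} = \frac{\partial^2 \mathcal{L}(\mathbf{a}, \boldsymbol{\Psi})}{\partial\boldsymbol{\Psi}\,\partial\mathbf{a}^T}$ (see Appendix C for the details of calculation of $\mathbf{T}$ and $\mathbf{M}$). Therefore, the full Hessian matrix of the NLL is given by:
$$\tilde{\mathbf{H}} = \begin{pmatrix} \tilde{\mathbf{K}}\mathbf{W}\tilde{\mathbf{K}} + \lambda\tilde{\mathbf{K}} & \mathbf{M} \\ \mathbf{M}^T & \mathbf{T} \end{pmatrix}. \tag{12}$$
Finally, the N–R update is done using the following iterative scheme:
$$\begin{pmatrix} \mathbf{a}^{(t+1)} \\ \boldsymbol{\Psi}^{(t+1)} \end{pmatrix} = \begin{pmatrix} \mathbf{a}^{(t)} \\ \boldsymbol{\Psi}^{(t)} \end{pmatrix} - \tilde{\mathbf{H}}^{-1}\tilde{\mathbf{g}}. \tag{13}$$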
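The first of the two alternating steps (updating $\mathbf{a}$ with the kernel held fixed) can be sketched as a damped Newton iteration. This is a simplified illustration, not the full update of Eq. (13): the blocks $\mathbf{T}$ and $\mathbf{M}$ are left out, a backtracking safeguard is added as an assumption, and the gradient with respect to $\mathbf{a}$ is derived directly from Eq. (3) (differentiating $\frac{\lambda}{2}\mathbf{a}^T\mathbf{K}\mathbf{a}$ gives the factor $\lambda$). Function names and data are hypothetical.

```python
import numpy as np

def nll(a, K, y, lam):
    f = K @ a
    return -y @ f + np.sum(np.logaddexp(0.0, f)) + 0.5 * lam * a @ f

def newton_step_a(a, K, y, lam):
    """One damped N-R step on a for a fixed kernel matrix K.
    The gradient K (p - y + lam a) and the Hessian K W K + lam K share the left
    factor K, so for invertible K the step solves (W K + lam I) delta = p - y + lam a."""
    p = 1.0 / (1.0 + np.exp(-(K @ a)))
    w = p * (1.0 - p)
    delta = np.linalg.solve(w[:, None] * K + lam * np.eye(len(a)), p - y + lam * a)
    current, step = nll(a, K, y, lam), 1.0
    while step > 1e-8:                      # backtracking: never accept an increase
        candidate = a - step * delta
        if nll(candidate, K, y, lam) <= current:
            return candidate
        step *= 0.5
    return a

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0], dtype=float)
K = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))  # sigma = 1

a = np.zeros(10)
nll_start = nll(a, K, y, lam=0.1)
for _ in range(10):
    a = newton_step_a(a, K, y, lam=0.1)
nll_end = nll(a, K, y, lam=0.1)
```

Solving the reduced system avoids forming and inverting $\tilde{\mathbf{K}}\mathbf{W}\tilde{\mathbf{K}} + \lambda\tilde{\mathbf{K}}$ explicitly, which can be poorly conditioned when the kernel matrix is nearly singular.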
4.2. Feature relevance in case ($m > 2$)
We generalize (10) to the multi-class case by associating a feature relevance vector $\boldsymbol{\Psi}^{(j)} = [\psi_1^{(j)}, \psi_2^{(j)}, \ldots, \psi_d^{(j)}]^T$ for each class $j \in \{1, \ldots, m-1\}$. Thus, we associate a separate symmetric kernel $\tilde{\mathbf{K}}^{(j)}$ for each class $j$ encoding the class feature relevance. The kernel entries for a class $j$ are calculated as follows:
$$\tilde{K}^{(j)}(\mathbf{x}_r, \mathbf{x}_s) = \exp\Big(-\frac{1}{2}\,(\mathbf{x}_r - \mathbf{x}_s)^T \mathrm{diag}(\boldsymbol{\Psi}^{(j)})^2 (\mathbf{x}_r - \mathbf{x}_s)\Big). \tag{14}$$
The new posterior probabilities of the classes given an observation $\mathbf{x}_i$ will be similar to those given in Eq. (5) by substituting the kernel $\tilde{\mathbf{K}}^{(j)}$ for $\mathbf{K}$ for each class $j$. Using the $\ell_0$-“norm” penalization, the new NLL is given as follows:
$$\mathcal{L}(A, \boldsymbol{\Psi}) = \sum_{j=1}^{m-1} -\mathbf{y}^{(j)T} \tilde{\mathbf{K}}^{(j)} \mathbf{a}^{(j)} + \mathbf{1}^T \ln\Big[1 + \sum_{h=1}^{m-1} \exp(\tilde{\mathbf{K}}^{(h)} \mathbf{a}^{(h)})\Big] + \sum_{j=1}^{m-1} \Big[\frac{\lambda}{2}\,\mathbf{a}^{(j)T} \tilde{\mathbf{K}}^{(j)} \mathbf{a}^{(j)} + \mu \sum_{k=1}^{d} \big[1 - \exp(-\beta\psi_k^{(j)})\big]\Big], \tag{15}$$
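where $A = \{\mathbf{a}^{(1)}, \ldots, \mathbf{a}^{(m-1)}\}$ and $\boldsymbol{\Psi} = \{\boldsymbol{\Psi}^{(1)}, \ldots, \boldsymbol{\Psi}^{(m-1)}\}$. Similarly to Eq. (11), we have $\forall j \in \{1, \ldots, m-1\}$, $\forall k \in \{1, \ldots, d\}$:
$$\mathbf{g}^{(j)} = \partial\mathcal{L}/\partial\mathbf{a}^{(j)} = \tilde{\mathbf{K}}^{(j)} \mathbf{c}^{(j)}, \qquad \partial\mathcal{L}/\partial\boldsymbol{\Psi}^{(j)} = [\mathbf{c}^{(j)T} \mathbf{Q}_1^{(j)} \mathbf{a}^{(j)}, \ldots, \mathbf{c}^{(j)T} \mathbf{Q}_d^{(j)} \mathbf{a}^{(j)}]^T + \mu\beta\exp(-\beta\boldsymbol{\Psi}^{(j)}), \tag{16}$$
where we define $\mathbf{c}^{(j)} = \big(-\mathbf{y}^{(j)} + \mathbf{p}^{(j)} + \frac{\lambda}{2}\mathbf{a}^{(j)}\big)$ and $\mathbf{Q}_k^{(j)} = \tilde{\mathbf{K}}^{(j)} \circ \mathbf{B}_k^{(j)}$, where $\mathbf{B}_k^{(j)}$ is an $n \times n$ matrix having entries defined by $B_k^{(j)}(r, s) = -\psi_k^{(j)} (x_{r,k} - x_{s,k})^2$. It follows that the gradient of $\mathcal{L}$ with respect to the vectors $\mathbf{a}^{(j)}$'s and $\boldsymbol{\Psi}^{(j)}$'s will be given by $\tilde{\mathbf{g}} = \big[\mathbf{g}^{(1)T}, \ldots, \mathbf{g}^{(m-1)T}, (\partial\mathcal{L}/\partial\boldsymbol{\Psi}^{(1)})^T, \ldots, (\partial\mathcal{L}/\partial\boldsymbol{\Psi}^{(m-1)})^T\big]^T$.

Fig. 1. Iso-contour plots of: (first row) the $\ell_q$-norm with different values of $q$, (second row) the $\ell_0$-norm approximation using Eq. (9) with different values of $\beta$.

Algorithm 1 Parameter estimation for the fr-MKLR method.
Inputs: Data set $D = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$.
Output: Parameter vectors $(\mathbf{a}^{(j)}, \boldsymbol{\Psi}^{(j)})$, $j \in \{1, \ldots, m-1\}$.
  $\boldsymbol{\Psi}^{(j)} \leftarrow \boldsymbol{\Psi}^{(j)}_{(0)}$; $\mathbf{a}^{(j)} \leftarrow \mathbf{a}^{(j)}_{(0)}$; $t \leftarrow 1$;
  repeat
    $E \leftarrow 0$;
    for $j = 1 \rightarrow m-1$ do
      Compute the gradient terms $\partial\mathcal{L}/\partial\mathbf{a}^{(j)}$ and $\partial\mathcal{L}/\partial\boldsymbol{\Psi}^{(j)}$ using Eq. (16);
      Compute the Hessian using Eq. (19);
      Update the entries of $\mathbf{a}^{(j)}_{(t)}$ and $\boldsymbol{\Psi}^{(j)}_{(t)}$ using Eq. (20);
      $E \leftarrow E + \|\mathbf{a}^{(j)}_{(t+1)} - \mathbf{a}^{(j)}_{(t)}\| + \|\boldsymbol{\Psi}^{(j)}_{(t+1)} - \boldsymbol{\Psi}^{(j)}_{(t)}\|$;
    end for
    $t \leftarrow t + 1$;
  until ($E < \varepsilon$ OR $t >$ MAXITER)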
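The control flow of Algorithm 1 (per-class Newton updates, accumulated change $E$, and the two stopping conditions) can be sketched generically as follows. This is a skeleton under stated assumptions rather than the authors' code: `grad_fn` and `hess_fn` are hypothetical callables standing in for Eqs. (16) and (19), each class's $\mathbf{a}^{(j)}$ and $\boldsymbol{\Psi}^{(j)}$ are assumed to be stacked into one vector, and $E$ is accumulated on the stacked vector rather than on the two parts separately.

```python
import numpy as np

def fit_blockwise_newton(theta0, grad_fn, hess_fn, eps=1e-6, max_iter=100):
    """theta0: list with one stacked parameter vector per class j.
    grad_fn(j, theta) and hess_fn(j, theta) are hypothetical callables returning
    the gradient vector and Hessian matrix for the parameters of class j."""
    theta = [np.asarray(t, dtype=float).copy() for t in theta0]
    for t in range(1, max_iter + 1):
        E = 0.0
        for j in range(len(theta)):
            # Newton-Raphson update for class j
            new = theta[j] - np.linalg.solve(hess_fn(j, theta), grad_fn(j, theta))
            E += np.linalg.norm(new - theta[j])   # accumulated parameter change
            theta[j] = new
        if E < eps:                               # precision reached
            break
    return theta, t

# Toy usage on a separable quadratic (minimum at zero), for illustration only
theta_hat, n_iter = fit_blockwise_newton(
    [np.array([3.0, -2.0]), np.array([1.0])],
    grad_fn=lambda j, th: 2.0 * th[j],
    hess_fn=lambda j, th: 2.0 * np.eye(len(th[j])),
)
```

On the quadratic toy problem a single Newton step reaches the minimum, so the loop stops at the second pass, when the accumulated change falls below $\varepsilon$.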
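To calculate the Hessian of function (15), note that the Hessian with respect to the elements of $A$ is given by the matrix $\tilde{\mathbf{K}}^{*}\mathbf{W}^{*}\tilde{\mathbf{K}}^{*} + \lambda\tilde{\mathbf{K}}^{*}$, where we define $\tilde{\mathbf{K}}^{*} = \mathrm{diag}[\tilde{\mathbf{K}}^{(1)}, \ldots, \tilde{\mathbf{K}}^{(m-1)}]$. The operator $\mathrm{diag}[\cdot]$ builds a block-diagonal matrix whose diagonal blocks are its arguments. We define also the matrix $\mathbf{W}^{*}$ as follows:
$$\mathbf{W}^{*} = \begin{pmatrix} \mathbf{W}_{1,1} & \mathbf{W}_{1,2} & \cdots & \mathbf{W}_{1,m-1} \\ \mathbf{W}_{2,1} & \mathbf{W}_{2,2} & \cdots & \mathbf{W}_{2,m-1} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{W}_{m-1,1} & \mathbf{W}_{m-1,2} & \cdots & \mathbf{W}_{m-1,m-1} \end{pmatrix}, \tag{17}$$
with:
$$\mathbf{W}_{j,\ell} = \begin{cases} \mathrm{diag}\big[p_1^{(j)}(1 - p_1^{(j)}), \ldots, p_n^{(j)}(1 - p_n^{(j)})\big] & \text{if } j = \ell, \\ \mathrm{diag}\big[-p_1^{(j)} p_1^{(\ell)}, \ldots, -p_n^{(j)} p_n^{(\ell)}\big] & \text{if } j \neq \ell. \end{cases} \tag{18}$$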
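The block structure of Eqs. (17) and (18) is straightforward to assemble from the matrix of posteriors. The sketch below assumes the posteriors of the first $m-1$ classes are stored column-wise in an array `P`; the function name `build_W_star` and the example values are hypothetical.

```python
import numpy as np

def build_W_star(P):
    """P: (n, m-1) array with P[i, j] = p_i^(j) for the first m-1 classes.
    Returns the (n(m-1), n(m-1)) block matrix of Eqs. (17)-(18)."""
    n, q = P.shape
    W = np.zeros((n * q, n * q))
    for j in range(q):
        for l in range(q):
            if j == l:
                block = np.diag(P[:, j] * (1.0 - P[:, j]))   # diagonal blocks
            else:
                block = np.diag(-P[:, j] * P[:, l])          # off-diagonal blocks
            W[j * n:(j + 1) * n, l * n:(l + 1) * n] = block
    return W

# Hypothetical posteriors for n = 3 points and m = 3 classes (last class implicit)
P = np.array([[0.2, 0.5], [0.6, 0.1], [0.3, 0.3]])
W_star = build_W_star(P)
```

Each block is diagonal, so $\mathbf{W}^{*}$ is sparse and symmetric; a practical implementation would likely store only the diagonals.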
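Similarly to the case of binary classification, we need also to calculate matrices $\mathbf{T}^{(j)}$ and $\mathbf{M}^{(j)}$ for each class $j$, $j \in \{1, \ldots, m-1\}$, defined as follows: $\mathbf{T}^{(j)} = \frac{\partial^2 \mathcal{L}}{\partial\boldsymbol{\Psi}^{(j)}\,\partial\boldsymbol{\Psi}^{(j)T}}$ and $\mathbf{M}^{(j)} = \frac{\partial^2 \mathcal{L}}{\partial\mathbf{a}^{(j)}\,\partial\boldsymbol{\Psi}^{(j)T}}$. The full Hessian matrix with respect to all the parameters is given as follows:
$$\tilde{\mathbf{H}} = \begin{pmatrix} \tilde{\mathbf{K}}^{*}\mathbf{W}^{*}\tilde{\mathbf{K}}^{*} + \lambda\tilde{\mathbf{K}}^{*} & \mathbf{M}^{*} \\ \mathbf{M}^{*T} & \mathbf{T}^{*} \end{pmatrix}, \tag{19}$$
where we have $\mathbf{M}^{*} = \mathrm{diag}[\mathbf{M}^{(1)}, \ldots, \mathbf{M}^{(m-1)}]^T$ and $\mathbf{T}^{*} = \mathrm{diag}[\mathbf{T}^{(1)}, \ldots, \mathbf{T}^{(m-1)}]^T$. Finally, the N–R update consists of the following iterative formula:
$$\begin{pmatrix} \tilde{\mathbf{a}}^{(t+1)} \\ \tilde{\boldsymbol{\Psi}}^{(t+1)} \end{pmatrix} = \begin{pmatrix} \tilde{\mathbf{a}}^{(t)} \\ \tilde{\boldsymbol{\Psi}}^{(t)} \end{pmatrix} - \tilde{\mathbf{H}}^{-1}\tilde{\mathbf{g}}, \tag{20}$$
where $\tilde{\mathbf{a}} = [\mathbf{a}^{(1)T}, \mathbf{a}^{(2)T}, \ldots, \mathbf{a}^{(m-1)T}]^T$ and $\tilde{\boldsymbol{\Psi}} = [\boldsymbol{\Psi}^{(1)T}, \boldsymbol{\Psi}^{(2)T}, \ldots, \boldsymbol{\Psi}^{(m-1)T}]^T$. Algorithm 1 shows the steps for estimating the parameters of our model. The algorithm ends when the estimation reaches a certain precision $\varepsilon$ or a maximum number of iterations MAXITER.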
5. Experiments
We have evaluated our method using simulated and standard datasets as well as for human action recognition in videos. In each experiment, we used cross validation (CV) to determine the values of the hyper-parameters $\lambda$, $\beta$ and $\mu$ in function (15) and to measure the performance of our method. More specifically, we randomly generated five groups for learning and five groups for validation (or testing) for each dataset. Classification accuracy (CA) is measured by averaging its values among the testing groups.
Let $N_l$ and $N_t$ be the sizes of the learning and testing data in a set containing $N$ data points ($N = N_l + N_t$). The CA is calculated as $1 - \Big[\frac{1}{5}\sum_{i=1}^{5} \frac{n_l^{(i)}}{N_l}\Big]$ for training and $1 - \Big[\frac{1}{5}\sum_{i=1}^{5} \frac{n_t^{(i)}}{N_t}\Big]$ for testing, where $n_l^{(i)}$ and $n_t^{(i)}$ are the numbers of misclassified points in the learning and testing sets generated in the $i$th validation split, respectively. Results obtained using our method fr-MKLR are compared with MKLR, SMLR, KSVM, LASSO and the naive Bayes classifiers.
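The CA measure defined above amounts to one minus the average error rate over the validation splits. A minimal sketch follows; the function name and the error counts are made-up values used only to illustrate the arithmetic.

```python
import numpy as np

def classification_accuracy(n_errors, n_points):
    """CA = 1 - mean over splits of (misclassified points / set size)."""
    return 1.0 - np.mean(np.asarray(n_errors, dtype=float) / n_points)

# Hypothetical misclassification counts in five testing splits of size N_t = 200
ca_test = classification_accuracy([12, 9, 15, 10, 14], n_points=200)
```

Here the average error count is 12 out of 200 points, so the sketch yields a CA of 0.94.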
Table 1
GMM parameters used for generating the datasets of Tests I–VII, respectively.

Test | Data | Class | $L_j$ | GMM parameters
I | $N_l = 160$, $N_t = 2000$ | $j = 1$ | 1 | $\mu_{1,1} = [1, 2]^T$, $\Sigma_{1,1} = \mathrm{diag}([0.08, 0.1])$.
 | | $j = 2$ | 1 | $\mu_{2,1} = [1.75 + \tau\delta, 2]^T$, $\Sigma_{2,1} = \mathrm{diag}([0.2, 0.08])$.
II | $N_l = 20$–$200$, $N_t = 2000$ | $j = 1$ | 4 | $\mu_{1,1} = [1.2, 4]^T$, $\mu_{1,2} = [5.2, 4]^T$, $\mu_{1,3} = [5.2, 1.5]^T$, $\mu_{1,4} = [0.8, 1.5]^T$, $\Sigma_{1,1} = \mathrm{diag}([0.15, 0.23])$, $\Sigma_{1,2} = \Sigma_{1,3} = \Sigma_{1,4} = \mathrm{diag}([0.28, 0.28])$, $\pi_{1,1} = \pi_{1,2} = \pi_{1,3} = \pi_{1,4} = 0.25$.
 | | $j = 2$ | 2 | $\mu_{2,1} = [3, 2]^T$, $\mu_{2,2} = [3, 4.2]^T$, $\pi_{2,1} = 0.90$, $\pi_{2,2} = 0.10$, $\Sigma_{2,1} = \mathrm{diag}([0.18, 0.33])$, $\Sigma_{2,2} = \mathrm{diag}([0.11, 0.15])$.
III | $N_l = 80$, $N_t = 1200$ | $j = 1$ | 2 | $\mu_{1,1} = [1, 5.5]^T$, $\mu_{1,2} = [3, 5.5]^T$, $\pi_{1,1} = \pi_{1,2} = 0.5$, $\Sigma_{1,1} = \begin{pmatrix} 1 & 0.75 \\ 0.75 & 1 \end{pmatrix}$, $\Sigma_{1,2} = \begin{pmatrix} 1 & -0.75 \\ -0.75 & 1 \end{pmatrix}$.
 | | $j = 2$ | 2 | $\mu_{2,1} = [1, 8]^T$, $\mu_{2,2} = [3, 8]^T$, $\pi_{2,1} = \pi_{2,2} = 0.5$, $\Sigma_{2,1} = \Sigma_{1,1}$, $\Sigma_{2,2} = \Sigma_{1,2}$.
IV | $N_l = 75$, $N_t = 3000$ | $j = 1$ | 1 | $\mu_{1,1} = [1, 2]^T$, $\Sigma_{1,1} = \mathrm{diag}([0.5, 0.5])$.
 | | $j = 2$ | 1 | $\mu_{2,1} = [1, 4]^T$, $\Sigma_{2,1} = \mathrm{diag}([0.5, 0.5])$.
 | | $j = 3$ | 1 | $\mu_{3,1} = [4, 1]^T$, $\Sigma_{3,1} = \mathrm{diag}([0.5, 0.5])$.
V | $N_l = 240$, $N_t = 4800$ | $j = 1$ | 1 | $\mu_{1,1} = [3.5, 0.25]^T$, $\Sigma_{1,1} = \mathrm{diag}([0.98, 0.10])$.
 | | $j = 2$ | 3 | $\mu_{2,1} = [1.2, 2]^T$, $\mu_{2,2} = [2.5, 2]^T$, $\mu_{2,3} = [5, 2]^T$, $\Sigma_{2,1} = \mathrm{diag}([0.2, 0.18])$, $\Sigma_{2,2} = \Sigma_{2,3} = \mathrm{diag}([0.18, 0.10])$.
 | | $j = 3$ | 1 | $\mu_{3,1} = [4, 4.15]^T$, $\Sigma_{3,1} = \mathrm{diag}([0.58, 0.20])$.
 | | $j = 4$ | 1 | $\mu_{4,1} = [1.5, 4.15]^T$, $\Sigma_{4,1} = \Sigma_{3,1}$.
VI | $N_l = 50$–$500$, $N_t = 1250$ | $j = 1$ | 1 | $\mu_{1,1} = [6, 1, 1, 1]^T$, $\Sigma_{1,1} = \mathrm{diag}([0.5, 0.5, 0.5, 0.5])$.
 | | $j = 2$ | 1 | $\mu_{2,1} = [6, 1.5, 3.5, 1]^T$, $\Sigma_{2,1} = \Sigma_{1,1}$.
 | | $j = 3$ | 1 | $\mu_{3,1} = [6, 1.5, 1, 3.5]^T$, $\Sigma_{3,1} = \Sigma_{1,1}$.
 | | $j = 4$ | 2 | $\mu_{4,1} = [1, 1, 1, 1]^T$, $\mu_{4,2} = [1, 3.25, 1, 1]^T$, $\Sigma_{4,1} = \Sigma_{4,2} = \Sigma_{1,1}, \Sigma_{1,2}$, $\pi_{4,1} = \pi_{4,2} = 0.5$.
VII | $N_l = 100$, $N_t = 100$ | $j = 1$ | 1 | $\mu_{1,1} = (0, 4, -5, 10, 0, \ldots, 0)^T$; $\sigma_{1,1} = (1, 1, 1, 2, 3, 1, 1, 5, 5, \ldots, 5)^T$ (dots mean the same as the previous dimension value).
 | | $j = 2$ | 1 | $\mu_{2,1} = (5, 11, -2, 20, 25, 10, 7, 3, \ldots, 3)^T$; $\sigma_{2,1} = \sigma_{1,1}$ (dots mean the same as the previous dimension value).
5.1. Data description and presentation
1) Simulated data: We have used finite Gaussian mixture models (GMMs) to generate data for 7 tests (Tests I–VII), in which some features are purposefully set to provide no discrimination between classes. For each test, each class distribution has the following general form (the parameter values are given in Table 1):
$$p(x \mid y^{(j)} = 1) = \sum_{k=1}^{L_j} \pi_{j,k}\, p(x \mid \mu_{j,k}, \Sigma_{j,k}), \quad j \in \{1, \ldots, m\}, \qquad (21)$$
where $L_j$ is the number of components of the GMM generating class $j$ of the data, and $\pi_{j,k}$, $\mu_{j,k}$ and $\Sigma_{j,k}$ are the a priori probability, the mean vector and the covariance matrix of the $k$-th component of the mixture, $k \in \{1, \ldots, L_j\}$. Tests I–III are conducted for binary classification and show the generalization capability of our method under different scenarios of class data:
• Test I (overlapping classes): the data contain two overlapping classes where only one dimension is relevant for classification. The data of each class have been generated using a bivariate Gaussian. We vary the amount of class overlap by shifting the mean of one class (see $\mu_{2,1}$ in Table 1) along one dimension by $\tau\delta$, with increments $\tau \in \{1, \ldots, 10\}$ of a step $\delta = 0.25$.
• Test II ( scarce learning data ): shows the generalization capa-
bility of our algorithm when learning data are scarce. Each
class has been generated using a mixture of bivariate Gaus-
sians. We vary the number of learning data per Gaussian N g
from 10 to 100.
• Test III ( multi-modal classes ): classes are multimodal and
separated by non-linear boundaries. Each class has been
generated using a mixture of bivariate Gaussians.
We then conducted three other tests (Tests IV–VI) using multi-
class data (the parameters of the tests are given in Table 1 ).
Tests IV and V use data with number of classes m = 3 and
m = 4 , respectively. The data of each class have been generated
using mixtures of bivariate Gaussians. Test VI uses data with
m = 4 and d = 4 where d is the dimensionality of the data. Pro-
jection of the data on two dimensions is shown in Fig. 2 .
Finally, to demonstrate the effectiveness of using the $\ell_0$ norm, we conducted an illustrative binary classification test (Test VII) using a simulated dataset of 25 features and $N = 200$ data points, where only 6 features clearly discriminate between the two classes (see Fig. 3).
2) UCI real-world data: The datasets used in these experiments are
taken from the UCI machine learning repository [57] . Tested
datasets include Banana (BN), Breast cancer (BC), Diabetes (DB),
Heart (HR), Flare-solar (FS), German (GR), Ringnorm (RG), Thyroid
(TH) and Twonorm (TW) for binary classification, and EMG phys-
ical action (EMG), wine quality (WQ), Ecoli (EC) and Image seg-
mentation (SEG) for multi-class classification. The description of
the UCI datasets is given in Table 2 . Note that, unlike our syn-
thetic data, the class data of the UCI datasets are not necessarily
Gaussian.
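To make the data-generation model of Eq. (21) concrete, the following Python sketch draws points from a class-conditional GMM and instantiates it with the Test I parameters of Table 1. The helper `sample_gmm`, the equal split of $N_l = 160$ points between the two classes, the choice $\tau = 4$ and the random seed are assumptions made for illustration only.

```python
import numpy as np


def sample_gmm(n, weights, means, covariances, rng):
    """Draw n points from a Gaussian mixture with the given priors, means and covariances."""
    weights = np.asarray(weights, dtype=float)
    # Pick a mixture component for every point according to the priors pi_{j,k}.
    components = rng.choice(len(weights), size=n, p=weights)
    samples = np.empty((n, len(means[0])))
    for k in range(len(weights)):
        mask = components == k
        samples[mask] = rng.multivariate_normal(
            means[k], covariances[k], size=int(mask.sum())
        )
    return samples


rng = np.random.default_rng(0)
tau, delta = 4, 0.25  # one of the overlap settings of Test I

# Test I: one bivariate Gaussian per class (L_1 = L_2 = 1).
class1 = sample_gmm(80, [1.0], [[1.0, 2.0]], [np.diag([0.08, 0.1])], rng)
class2 = sample_gmm(80, [1.0], [[1.75 + tau * delta, 2.0]], [np.diag([0.2, 0.08])], rng)

X = np.vstack([class1, class2])
y = np.repeat([1, 2], 80)
```

In this setting the two classes share the same mean along the second dimension, so only the first dimension carries class information.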
5.2. Numerical results and comparisons
1) Case of binary classification ($m = 2$): In Figs. 4 and 5, the first to last rows show the class boundaries obtained in Tests I–III, respectively, using the MKLR, fr-MKLR and KSVM methods. We can see that fr-MKLR has clearly succeeded in these tests in selecting the best separating feature, which led to better generalization than the other methods. Fig. 6 shows the CA values obtained for Tests I and II (by varying the class overlap in Test I and the number of training data in Test II). Table 3 (first row) shows the results for Test III. We can observe that fr-MKLR outperformed the MKLR and KSVM methods in the three tests.
2) Case of multi-class classification ($m > 2$): The classification boundaries obtained using MKLR, fr-MKLR and KSVM are shown in Figs. 7 and 8 for Tests IV and V, and the corresponding CA values are given in Table 3. Clearly, fr-MKLR has succeeded in selecting the best separating features for each class, which led to better generalization than the other methods. For Test VI, CA values
Fig. 2. 2D projection of data of Test VI.
Fig. 3. Histograms of the 25 features used in Test VII: red (continuous) and blue (dashed) lines show feature distributions in classes j = 1 and j = 2 , respectively. (For
interpretation of the references to color in this figure legend, the reader is referred to the web version of this article).
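Table 2
Description of the UCI datasets used in our experiments.

| Datasets | BN | BC | DB | HR | FS | GR | RG | TH | TW | EMG | WQ | EC | SEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # of instances ($N$) | 5300 | 699 | 768 | 270 | 144 | 1000 | 7400 | 250 | 7400 | $10^4$ | 178 | 336 | 2310 |
| # of attributes ($d$) | 2 | 10 | 8 | 13 | 9 | 24 | 20 | 5 | 20 | 8 | 13 | 8 | 19 |
| # of classes ($m$) | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 20 | 3 | 8 | 7 |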
Table 3
Average classification accuracy obtained by the compared methods (standard deviation in brackets).

| | MKLR | fr-MKLR | fr-MKLR ($\ell_1$) | fr-MKLR ($\ell_2$) | SMLR | KSVM | Naive Bayes | LASSO |
|---|---|---|---|---|---|---|---|---|
| Test III | 95.10 (3.2) | 95.60 (1.4) | 93.75 (1.2) | 93.50 (1.5) | 94.16 | 93.33 (1.2) | 89.19 (2.1) | 83.33 (8) |
| Test IV | 94.89 (2.7) | 97.26 (2.1) | 93.40 (2.3) | 92.87 (6.5) | 94.21 (2.6) | 91.19 (2.6) | 94.30 (3.2) | 66.56 (0.1) |
| Test V | 91.73 (3.1) | 94.33 (2) | 91.21 (2.7) | 89.77 (4.2) | 85.93 (3.9) | 94 (1.4) | 96.39 (1.2) | 77.17 (0.6) |
using learning and testing data are given in Fig. 6 by varying $N_l$ from 50 to 500. We can note that, when evaluating on the learning data, MKLR and KSVM had better performance than the other methods. When evaluating on the testing data, fr-MKLR and naive Bayes lead to better performance than the other methods. Moreover, for $N_l \geq 20$, fr-MKLR has performed better than naive Bayes. This shows that in the presence of high-dimensional and scarce learning data, fr-MKLR tends to provide better generalization. Finally, the results obtained for Test VI are shown in Fig. 6 and Table 3, respectively, where, again, fr-MKLR has yielded better performance than the compared methods.
To show the effectiveness of using the $\ell_0$ norm, we created two other versions of fr-MKLR by replacing the sparsity-inducing $\ell_0$ norm with the $\ell_1$ and $\ell_2$ norms, respectively. Comparison results between the fr-MKLR versions using the different norms are shown in Fig. 6 for Tests I, II and VI and in Table 3 for Tests III–V. For all tests, using the $\ell_0$ norm has yielded better results than the other norms. Finally, Fig. 9 presents a graph
Fig. 4. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test I data (first row), Test II data (second row) and Test III data (third row),
respectively. For each method, a 2D scatter is shown using learning data.
Table 4
Comparison of LOGREG-LASSO, fr-MKLR and SMLR on datasets from the UCI repository (standard deviation in brackets).

| Dataset | LOGREG-LASSO [72] (%) | fr-MKLR (%) | SMLR [51] (%) |
|---|---|---|---|
| Banana | 89.3 (0.5) | 91.52 (0.3) | 81.66 (5.6) |
| Breast cancer | 73.9 (4.6) | 91.02 (3.7) | 91 (1.2) |
| Diabetes | 76.5 (1.9) | 78.51 (1.3) | 76 (3.6) |
| Heart | 84 (3.1) | 86.07 (2.4) | 81.1 (8.2) |
| Flare-solar | 66.73 (1.6) | 70.15 (1) | 55.43 (8) |
| German | 76.4 (2.3) | 85 (1.6) | 74.5 (5.4) |
| Ringnorm | 98.2 (0.3) | 98.9 (0.2) | 96.3 (3.4) |
| Thyroid | 95.2 (2.3) | 94.23 (1.4) | 88.89 (6.5) |
| Twonorm | 97.4 (0.2) | 98.6 (0.2) | 94.2 (3) |
showing the number of retained relevant features (i.e., with weight $\psi_k > 0$) as a function of the number of iterations in Test VII. Clearly, the $\ell_0$ norm has made it possible to isolate the exact number of relevant features, and in fewer iterations than $\ell_1$ and $\ell_2$. All these tests demonstrate the effectiveness of using the $\ell_0$ norm for obtaining good sparse models for classification.
3) Results for the UCI datasets: In Tables 4 and 5, we present comparative results obtained for the different UCI datasets. We first compared our method with the LOGREG-LASSO [72] and SMLR [51] methods on several datasets. The results are shown in Table 4, where we used the classification accuracy averaged over 100 training/test splits, as suggested in the compared paper. We note that the results of [72] are taken directly as reported by the authors of the paper. We can see that, except for the Thyroid dataset, fr-MKLR gives noticeably better results than LOGREG-LASSO and SMLR. The prediction differences can be explained by the nature of the datasets and the structure of the classification models. In addition to promoting feature sparsity, fr-MKLR has the flexibility to better fit the non-linear class boundaries that characterize the majority of the datasets used here. The results of LOGREG-LASSO and SMLR can be explained by the linear structure of LOGREG-LASSO and, in the case of SMLR, by the use of a component-wise procedure to select features, which is less effective than fr-MKLR on these datasets.
In Table 5, a comparison of our method with MKLR, SMLR, KSVM, NB and LASSO is presented for other UCI datasets. We also show results for different versions of fr-MKLR using different sparsity-inducing norms and different values of the sparsity coefficient μ. For the sparsity coefficient, three values are tested, μ ∈ {0.05, 2.5, 4.5}, among which the value μ = 2.5 has been obtained by cross-validation. Note that when μ = 0.05, almost the same results as MKLR are obtained for all datasets. When μ = 4.5, the performance of fr-MKLR decreases, as shown in the table. Our model fr-MKLR, with μ = 2.5 obtained by cross-validation, has yielded the best results among the compared methods on all datasets except the multi-class EMG data, where naive Bayes is more accurate. These results also demonstrate the ability of our method to perform well when dealing with non-Gaussian data.

Fig. 5. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test I data (first row), Test II data (second row) and Test III data (third row), respectively. For each method, a 2D scatter is shown using testing data.

Table 5
Average classification accuracy obtained by the compared methods on datasets from the UCI repository (standard deviation in brackets).

| Method | EMG action (M > 2) | EMG action (M = 2) | Wine quality | Ecoli | Image segmentation |
|---|---|---|---|---|---|
| MKLR | 68.65 (2.1) | 65.74 (2.2) | 94.44 (1.6) | 82.94 (2.6) | 84.35 (2.2) |
| fr-MKLR (μ = 0.05) | 68.38 (2.5) | 65.36 (2.2) | 94.44 (1.6) | 83 (2.9) | 84.35 (2.3) |
| fr-MKLR (μ = 2.5) | 72.52 (0.8) | 96.23 (1.6) | 96.83 (0.9) | 84.71 (2.5) | 86.39 (1.6) |
| fr-MKLR (μ = 4.5) | 68.20 (2.5) | 91.35 (2.1) | 93.52 (1.8) | 82.9 (3.3) | 83.67 (2.8) |
| fr-MKLR ($\ell_1$) | 63.50 (8.8) | 95.62 (1.4) | 94.44 (2.4) | 71.55 (3.4) | 85.71 (1.9) |
| fr-MKLR ($\ell_2$) | 62.77 (7.5) | 94.53 (5.8) | 92.86 (2) | 70.71 (20) | 83.67 (3.8) |
| SMLR | 68.45 (0.9) | 90.03 (0.9) | 96.56 (4) | 83.52 (8.6) | 83.54 (8.4) |
| KSVM | 68.98 (2.5) | 95.50 (0.7) | 94.44 (1.8) | 80.33 (2.5) | 72.79 (3.8) |
| Naive Bayes | 85.90 (3.3) | 93.67 (1.2) | 93.65 (2.4) | 66.10 (7.6) | 78.91 (3.7) |
| LASSO | 62.64 (1.6) | – | 62.17 (8.6) | 54.81 (5.9) | 64.24 (3.5) |

Fig. 6. Classification accuracy (CA) obtained in Tests I, II and VI, respectively. For each test, the left figure shows a comparison of fr-MKLR with MKLR, NB, LASSO, SMLR and KSVM, and the right figure shows a comparison of fr-MKLR implementations using the $\ell_0$, $\ell_1$ and $\ell_2$ norms, respectively.
Finally, to show the significance of the performance improvement of our method with regard to the compared ones, we used the Wilcoxon test [18] with significance level α = 0.05 and different numbers of datasets $N \in \{7, 8, 9, 10, 17\}$ taken from our UCI and synthetic datasets. In what follows, we give the critical values $T_N$ of the statistic $T$ for each number of datasets, as
Fig. 7. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test IV data (first row) and Test V data (second row) respectively. For each
method, a 2D scatter is shown using learning data.
well as the observed values of $T$ when comparing algorithms on these datasets: $T_{N=7} = 2$ ($T_{LASSO} = 0$), $T_{N=8} = 4$ ($T_{KSVM} = 0$, $T_{fr\text{-}MKLR\text{-}\ell_1} = 0$, $T_{fr\text{-}MKLR\text{-}\ell_2} = 0$ and $T_{MKLR} = 0$), $T_{N=9} = 6$ ($T_{Logreg} = 2$), $T_{N=10} = 8$ ($T_{NB} = 8$) and $T_{N=17} = 35$ ($T_{SMLR} = 0$). We can note that the observed values of $T$ are always less than or equal to the critical values. Therefore, we can conclude that fr-MKLR achieves a significant performance improvement over the compared methods.
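As a worked example of the statistic used above, the following Python sketch computes the Wilcoxon signed-rank statistic $T$ (the smaller of the positive and negative rank sums) from paired accuracies. It is applied here to the fr-MKLR and LOGREG-LASSO columns of Table 4 ($N = 9$ datasets). That this pairing is the one behind the reported $T_{Logreg}$ is an assumption, and zero differences are simply discarded, which is one of several common conventions.

```python
def wilcoxon_T(scores_a, scores_b):
    """Wilcoxon signed-rank statistic T = min(R+, R-) for paired scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        # Group tied absolute differences and give them their average rank.
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(r_plus, r_minus)


# Accuracies from Table 4 (Banana, Breast cancer, Diabetes, Heart, Flare-solar,
# German, Ringnorm, Thyroid, Twonorm).
fr_mklr = [91.52, 91.02, 78.51, 86.07, 70.15, 85.0, 98.9, 94.23, 98.6]
logreg_lasso = [89.3, 73.9, 76.5, 84.0, 66.73, 76.4, 98.2, 95.2, 97.4]

T = wilcoxon_T(fr_mklr, logreg_lasso)
print(T)
```

Only the Thyroid dataset favours LOGREG-LASSO, and its absolute difference has rank 2, so the sketch yields $T = 2$, which is below the critical value $T_{N=9} = 6$ quoted in the text.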
5.3. Application to video action recognition
Several methods in the past have cast human action recognition as a classification problem in high-dimensional spaces [1]. These methods have applied classifiers to video descriptors, such as KSVM on interest points [53], KNN on optical flow (HOF) [92] and SVM on dense trajectories [56]. Most of these descriptors are very high-dimensional, whereas the information relevant for discrimination may lie in only a handful of dimensions.
Motivated by the performance obtained with fr-MKLR, we propose to use our model for action recognition in videos by exploiting sparsity to identify useful dimensions while operating in a multi-class setting. To capture the local and global information of actions, which is necessary for better action recognition [9], we propose a representation based on the shape context descriptor (SCD) [7,66]. To calculate the SCD, we first extract video foregrounds using the algorithm proposed in [11]. Then, a shape histogram of 128 entries (i.e., bins) is calculated for each silhouette throughout the sequence. Contrary to [7], we use a single reference point, which is the gravity center of the silhouette, and each bin is quantized to the value 1 or 0, indicating the presence or absence of the object contour in the bin (see Fig. 10 for an illustration). The final descriptor of each action consists of the mean and standard deviation of the SCD computed over the sequence. Finally, to ensure a high-level representation of actions, we first perform dimensionality reduction using restricted Boltzmann machines (RBM) [73]. The RBM uses 64 hidden units and the 128 shape context entries as input data.
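The binary shape context representation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the 8 × 16 log-polar layout of the 128 bins, the log-spaced ring edges, and the use of the mean of the contour points as a stand-in for the gravity center of the silhouette are hypothetical choices, not the configuration used in the experiments.

```python
import numpy as np


def binary_shape_context(contour, n_radial=8, n_angular=16):
    """Binary log-polar histogram (n_radial * n_angular bins) around the contour centroid."""
    points = np.asarray(contour, dtype=float)
    rel = points - points.mean(axis=0)            # single reference point
    radius = np.hypot(rel[:, 0], rel[:, 1])
    angle = np.arctan2(rel[:, 1], rel[:, 0]) % (2 * np.pi)
    scale = radius.max()
    if scale > 0:
        radius = radius / scale                   # normalize radii to [0, 1]
    # Log-spaced upper edges of the radial rings (hypothetical choice).
    edges = np.logspace(np.log10(0.05), 0.0, n_radial)
    r_bin = np.clip(np.searchsorted(edges, radius), 0, n_radial - 1)
    a_bin = np.clip((angle / (2 * np.pi) * n_angular).astype(int), 0, n_angular - 1)
    hist = np.zeros(n_radial * n_angular)
    hist[r_bin * n_angular + a_bin] = 1.0         # 1 = contour present in the bin
    return hist


def sequence_descriptor(contours):
    """Mean and standard deviation of the per-frame binary histograms."""
    H = np.stack([binary_shape_context(c) for c in contours])
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])


# Two synthetic "silhouette contours" standing in for video frames.
t = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])
ellipse = np.column_stack([2.0 * np.cos(t), np.sin(t)])
descriptor = sequence_descriptor([circle, ellipse])
```

Because each bin only records presence or absence, the descriptor does not depend on how densely the contour is sampled, as long as every occupied bin receives at least one point.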
We have evaluated our method for action recognition using three datasets: KTH [53], UIUC [59] and the i3DPost multi-view database [26]. Each dataset contains a single person performing basic actions in a controlled environment; examples of actions are given in Fig. 11.
• KTH contains 6 types of actions (walking, jogging, running, boxing, hand waving, hand clapping) performed several times by 25 subjects in 4 different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors, for a total of 600 videos.
• UIUC consists of 14 actions (walking, running, jumping, waving, jumping jacks, clapping, jumping from sit up, raising one hand, stretching out, turning, sitting to standing, crawling, pushing up, standing to sitting) performed by 8 subjects; in total, it contains 532 videos.
• i3DPost contains 768 videos of multi-view actions captured with 8 cameras and performed by 8 subjects (2 females and 6 males) for 12 actions (bend, hand wave, jump, jump in place, run, walk, run-fall, run-jump-walk, sit-stand-up, walk-sit, handshake, pull). Note that since we aim to classify only single-person actions, we removed the last two actions (handshake and pull) from our tests.
Table 6 gives the average values of CA obtained on the three datasets for the compared methods. These include fr-MKLR, MKLR, KSVM and LASSO using our action description based on SCD and
Fig. 8. Examples illustrating classification boundaries obtained by MKLR, fr-MKLR and KSVM for Test IV data (first row) and Test V data (second row) respectively. For each
method, a 2D scatter is shown using testing data.
Fig. 9. Graph showing the number of retained relevant features for Test VII as a
function of number of iterations.
Fig. 10. Distribution of bins for the SCD used for action representation.
the baseline and recent methods for each dataset. For KTH, these are interest points + KSVM [53], interest points + pLSA and LDA [63], space-time features + SVM [74], optical flow + KNN [92], local space-time features + part-based model + multi-task learning [60], parameterized representation + discriminative classifiers [95], dense trajectories + SVM [56] and recent results using convolutional neural networks [17,46,68]. For UIUC, these are optical flow + Correlated Topic Model (CTM) [85], spatiotemporal volumes + KNN [59], dense trajectories + motion boundary histogram + SVM [87] and depth map + skeleton structure + multi-kernel learning [58]. For the i3DPost multi-view dataset, comparison is made with the results reported in [40] using 3D motion context, in [79] using motion features and SVM for action recognition, and in [26] using only 5 of the 12 actions of the dataset.
Results of the experiments are shown in Table 6. We can note that, when combined with the RBM, fr-MKLR outperforms all methods on the KTH and UIUC datasets. For KTH, the authors of [56,60,95] have improved the video descriptors to boost the performance of action recognition and have therefore obtained better performance than fr-MKLR alone. However, by applying the RBM, we obtained the best performance among all methods. The best performance has likewise been obtained for the UIUC dataset. For the i3DPost dataset, [79] has obtained higher performance than our method. This is partly due to the quality of the features used in [79], which take into account the multi-view setting of the dataset. Indeed, 3D motion context (MC) using the information of all views in [40] and MC with discriminative views in [79] perform better than using the same features for all the views. Nonetheless, the application of the RBM has considerably improved our results. Finally, we must emphasize the performance gap between RBM+MKLR and RBM+fr-MKLR. Indeed, the majority of deep learning methods apply softmax functions (e.g., MKLR) on the top layer of the network for classification. Given the performance obtained by fr-MKLR over MKLR, our method constitutes a good alternative to softmax functions in classification problems using DNNs.

Fig. 11. Examples of actions of the UIUC, i3DPost and KTH datasets.

Table 6
Average classification accuracy obtained by the compared methods for three databases: KTH (6 classes), UIUC (14 classes) and i3DPost (10 classes). (-) means values not reported in the original papers.

| Datasets | Methods | Testing accuracy (%) |
|---|---|---|
| KTH | Banerjee et al. [5] | 90 |
| | Charalampous et al. [17] | 91.9 |
| | Ji et al. [44] | 90.2 |
| | Kai et al. [46] | 88.7 |
| | Laptev et al. [53] | 91.8 |
| | Li et al. [56] | 97.6 |
| | Liu et al. [60] | 94.3 |
| | Niebles et al. [63] | 81.5 |
| | Ravanbakhsh et al. [68] | 94.1 |
| | Schuldt et al. [74] | 71.7 |
| | Yang et al. [92] | 75 |
| | Yuan et al. [95] | 96.3 |
| | LASSO | 40 |
| | KSVM | 76.6 |
| | MKLR | 87.7 |
| | RBM + MKLR | 98.5 |
| | fr-MKLR | 93.3 |
| | RBM + fr-MKLR | 98.6 |
| UIUC | Hong et al. [85] | 93.3 |
| | Lin et al. [58] | 98.7 |
| | Liu et al. [59] | 93.5 |
| | Wang et al. [87] | 97.1 |
| | LASSO | 61 |
| | KSVM | 92.2 |
| | MKLR | 95.6 |
| | RBM + MKLR | 96.2 |
| | fr-MKLR | 98.8 |
| | RBM + fr-MKLR | 99.5 |
| i3DPost | Gkalelis et al. [26] (5 actions) | 90 |
| | Holte et al. [40], 3D-MC | 80 |
| | Holte et al. [40], 3D-MC-mean | 77.5 |
| | Holte et al. [40], HMC | 76.2 |
| | Holte et al. [40], HMC-mean | 68.7 |
| | Spurlock et al. [79], RT | 73.7 |
| | Spurlock et al. [79], MC | 96.2 |
| | LASSO | 67.2 |
| | KSVM | 72 |
| | MKLR | 68.8 |
| | RBM + MKLR | 88.6 |
| | fr-MKLR | 77.6 |
| | RBM + fr-MKLR | 90 |
5.4. Computational analysis
Since most of the time is taken by model training, we discuss the computational time induced by the training step. Given $N_l$ data points, the distance calculation for each kernel matrix requires $N_l(N_l - 1)/2$ steps, which can be performed in parallel. The calculation of the gradient and Hessian terms using Eqs. (16) and (19) has a linear computational complexity of $O(mN_l)$. The NLL minimization is performed iteratively using Eq. (20). Therefore, the computational complexity induced by a single iteration is approximately $O([m(d + N_l)]^{2.8})$, since it involves matrix inversion.
Since we usually deal with scarce learning data, the computational time of the above steps can be significantly reduced using recent computer hardware. For example, matrix inversion can be reduced to nearly linear complexity using the method proposed in [75]. We ran our experiments on the MATLAB platform on a PC with an Intel(R) Core(TM) i7-3920XM CPU at 2.9 GHz and compared the average execution time, including both the learning and testing phases. The following values have been obtained: 1.68 s for fr-MKLR, 1.17 s for MKLR, 0.52 s for LASSO, 0.28 s for naive Bayes and 38.4 s for KSVM. We can note that fr-MKLR and MKLR have similar execution times even though fr-MKLR has an additional calculation step. LASSO and naive Bayes require less computation time because they rely on simpler calculations. KSVM is the slowest algorithm because of the one-versus-all process used to deal with the multi-class case. This allows us to conclude that our approach does not add much burden to classification in terms of computation time.
6. Conclusions and discussion
We have presented the fr-MKLR method, which incorporates feature relevance in multinomial kernel logistic regression. It consists of using anisotropic kernels that embed weights controlling the contribution of each feature to classification. The obtained models are sparse and enable better classification generalization than several standard methods such as naive Bayes, MKLR, SMLR, KSVM and LASSO. We have applied our method to binary and multi-class data classification using simulated and standard datasets, as well as to video action recognition. Experiments have shown that the proposed approach outperforms the compared methods in cases of non-linear class boundaries, scarce training data and the presence of redundant features. However, as for any kernel-based method, when the number of training data is very large, computational time becomes a limitation. Also, for linearly separable data (e.g., text classification), fr-MKLR and its competitors MKLR and KSVM can be less efficient than their linear counterparts.
Acknowledgment
This work has been completed with the support of the Natural
Sciences and Engineering Research Council of Canada (NSERC).
Appendix A. Derivation of the NLL in Eq. (6)
Using the posterior probabilities in Eq. (5), we have:
$$L(A) = -\sum_{i=1}^{n}\left(\sum_{j=1}^{m-1} y_i^{(j)} \ln\left[\frac{\exp(f_j(x_i))}{1 + \sum_{h=1}^{m-1}\exp(f_h(x_i))}\right] + \left(1 - \sum_{j=1}^{m-1} y_i^{(j)}\right)\ln\left[\frac{1}{1 + \sum_{h=1}^{m-1}\exp(f_h(x_i))}\right]\right) + \frac{\lambda}{2}\sum_{j=1}^{m-1} a^{(j)T} K a^{(j)} \qquad (22)$$
$$= \sum_{i=1}^{n}\left(\sum_{j=1}^{m-1} -y_i^{(j)} f_j(x_i) + \ln\left[1 + \sum_{h=1}^{m-1}\exp(f_h(x_i))\right]\right) + \frac{\lambda}{2}\sum_{j=1}^{m-1} a^{(j)T} K a^{(j)}, \qquad (23)$$
By defining $y^{(j)} = [y_1^{(j)}, y_2^{(j)}, \ldots, y_n^{(j)}]^T$ and using the compact form:
$$\ln\left[1 + \sum_{h=1}^{m-1}\exp(Ka^{(h)})\right] = \begin{pmatrix} \ln\left[1 + \sum_{h=1}^{m-1}\exp(K(1,.)\,a^{(h)})\right] \\ \ln\left[1 + \sum_{h=1}^{m-1}\exp(K(2,.)\,a^{(h)})\right] \\ \vdots \\ \ln\left[1 + \sum_{h=1}^{m-1}\exp(K(n,.)\,a^{(h)})\right] \end{pmatrix}, \qquad (24)$$
where $K(i,.)$, $i \in \{1, \ldots, n\}$, denotes the $i$th row of $K$, we obtain Eq. (6).
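The equivalence between Eqs. (22) and (23) can be checked numerically. The Python sketch below evaluates the NLL once from the posterior probabilities and once in the simplified form, assuming $f_j(x_i) = K(i,.)\,a^{(j)}$ as in the compact form of Eq. (24). The plain Gaussian kernel, the random data, the value of λ and the function names are hypothetical.

```python
import numpy as np


def nll_from_posteriors(K, A, Y, lam):
    """NLL written with the log posteriors of the m classes, as in Eq. (22)."""
    F = K @ A                                    # F[i, j] = f_j(x_i), shape (n, m-1)
    log_Z = np.log1p(np.exp(F).sum(axis=1))      # ln[1 + sum_h exp(f_h(x_i))]
    log_p = F - log_Z[:, None]                   # classes 1..m-1
    log_p_ref = -log_Z                           # reference class m
    data_term = -np.sum(Y * log_p) - np.sum((1 - Y.sum(axis=1)) * log_p_ref)
    return data_term + 0.5 * lam * np.trace(A.T @ K @ A)


def nll_compact(K, A, Y, lam):
    """Simplified form of the NLL, as in Eq. (23)."""
    F = K @ A
    data_term = np.sum(-Y * F) + np.sum(np.log1p(np.exp(F).sum(axis=1)))
    return data_term + 0.5 * lam * np.trace(A.T @ K @ A)


rng = np.random.default_rng(0)
n, m = 8, 3
X = rng.normal(size=(n, 2))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.5 * sq_dist)                       # illustrative Gaussian kernel
labels = rng.integers(0, m, size=n)
Y = np.eye(m)[labels][:, : m - 1]                # indicators y_i^(j), j = 1..m-1
A = rng.normal(size=(n, m - 1))                  # columns are a^(1), ..., a^(m-1)

value_22 = nll_from_posteriors(K, A, Y, lam=0.1)
value_23 = nll_compact(K, A, Y, lam=0.1)
```

Here `np.trace(A.T @ K @ A)` equals $\sum_j a^{(j)T} K a^{(j)}$, so both functions share the same regularization term and differ only in how the data term is written.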
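Appendix B. Derivation of the terms of Eq. (11)
Given the following derivation:
$$\frac{\partial \ln\left[1 + \exp(\tilde{K}(i,.)\,a)\right]}{\partial a} = \frac{\exp(\tilde{K}(i,.)\,a)}{1 + \exp(\tilde{K}(i,.)\,a)}\,\tilde{K}(i,.)^T = p_i\,\tilde{K}(.,i), \qquad (25)$$
we have: $\partial L/\partial a = (-\tilde{K}y + \tilde{K}p + \lambda\tilde{K}a) = \tilde{K}c$, where $c = -y + p + \lambda a$. Using Eq. (8) and matrix derivation properties, we have $\forall k \in \{1, \ldots, d\}$:
$$Q_k = \frac{\partial \tilde{K}}{\partial \psi_k} = \begin{pmatrix} \frac{\partial \tilde{K}(x_1, x_1)}{\partial \psi_k} & \cdots & \frac{\partial \tilde{K}(x_1, x_n)}{\partial \psi_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial \tilde{K}(x_n, x_1)}{\partial \psi_k} & \cdots & \frac{\partial \tilde{K}(x_n, x_n)}{\partial \psi_k} \end{pmatrix} = \tilde{K} \circ B_k. \qquad (26)$$
Therefore, we have: $\partial L/\partial \psi_k = c^T Q_k a + \mu\beta\exp(-\beta\psi_k)$, where $B_k$ is the matrix defined in Section 4.1. By gathering the elements $\partial L/\partial \psi_k$ in one vector, we obtain the second line of Eq. (11).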
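The identity $Q_k = \tilde{K} \circ B_k$ of Eq. (26) can be illustrated with a finite-difference check. Since Eq. (8) and the definition of $B_k$ are given in earlier sections, the sketch below assumes one kernel form that is consistent with the expressions of Appendix C, namely $\tilde{K}(x_r, x_s) = \exp\left(-\frac{1}{2}\sum_k \psi_k^2 (x_{r,k} - x_{s,k})^2\right)$, for which $B_k = \psi_k D_k$ with $D_k(r,s) = -(x_{r,k} - x_{s,k})^2$. This kernel form, the function names and the test data are assumptions, not the definition used in the paper.

```python
import numpy as np


def weighted_gaussian_kernel(X, psi):
    """Assumed anisotropic kernel: exp(-0.5 * sum_k psi_k^2 (x_rk - x_sk)^2)."""
    sq_diff = (X[:, None, :] - X[None, :, :]) ** 2          # shape (n, n, d)
    return np.exp(-0.5 * np.einsum("rsk,k->rs", sq_diff, psi ** 2))


def kernel_derivative(X, psi, k):
    """Q_k = K~ o B_k, with B_k = psi_k * D_k under the assumed kernel."""
    D_k = -(X[:, None, k] - X[None, :, k]) ** 2
    return weighted_gaussian_kernel(X, psi) * (psi[k] * D_k)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
psi = np.array([1.0, 0.5, 0.0])      # the third feature has zero weight

K_tilde = weighted_gaussian_kernel(X, psi)
Q_1 = kernel_derivative(X, psi, k=1)

# Central finite difference of the kernel with respect to psi_1.
h = 1e-6
step = np.zeros(3)
step[1] = h
Q_1_numeric = (weighted_gaussian_kernel(X, psi + step)
               - weighted_gaussian_kernel(X, psi - step)) / (2 * h)
```

With a zero weight, a feature no longer influences the kernel at all, which is how sparse weights remove irrelevant dimensions from the classifier.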
Appendix C. Derivation of Eq. (12)
First, we have $\partial p_i/\partial a = p_i(1 - p_i)\,\tilde{K}(i,.)$. Then, the Hessian of the NLL with respect to the elements of $a$ is given by:
$$H = (\tilde{K} W \tilde{K} + \lambda\tilde{K}), \qquad (27)$$
where $W = \mathrm{diag}[p_1(1 - p_1), p_2(1 - p_2), \ldots, p_n(1 - p_n)]$. Also, note that $\forall k, \ell \in \{1, \ldots, d\}$ and $\forall i \in \{1, \ldots, n\}$:
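$$\frac{\partial^2 \tilde{K}}{\partial \psi_k \partial \psi_\ell} = \begin{pmatrix} \frac{\partial^2 \tilde{K}(x_1, x_1)}{\partial \psi_k \partial \psi_\ell} & \cdots & \frac{\partial^2 \tilde{K}(x_1, x_n)}{\partial \psi_k \partial \psi_\ell} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \tilde{K}(x_n, x_1)}{\partial \psi_k \partial \psi_\ell} & \cdots & \frac{\partial^2 \tilde{K}(x_n, x_n)}{\partial \psi_k \partial \psi_\ell} \end{pmatrix}, \qquad (28)$$
$$\frac{\partial^2 \tilde{K}}{\partial a_i \partial \psi_k} = \begin{pmatrix} \frac{\partial^2 \tilde{K}(x_1, x_1)}{\partial a_i \partial \psi_k} & \cdots & \frac{\partial^2 \tilde{K}(x_1, x_n)}{\partial a_i \partial \psi_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 \tilde{K}(x_n, x_1)}{\partial a_i \partial \psi_k} & \cdots & \frac{\partial^2 \tilde{K}(x_n, x_n)}{\partial a_i \partial \psi_k} \end{pmatrix}. \qquad (29)$$
Therefore, we can build the matrices $T$ and $M$ containing the mixed derivatives as follows:
$$T_{k\ell} = \frac{\partial^2 L}{\partial \psi_k \partial \psi_\ell} = \begin{cases} [Q_k W a]^T Q_k a + c^T S_k a - \mu\beta^2\exp(-\beta\psi_k) & \text{if } k = \ell \\ [Q_\ell W a]^T Q_k a + c^T (Q_k \circ B_\ell)\,a & \text{if } k \neq \ell \end{cases} \qquad (30)$$
$$M_{ik} = \frac{\partial^2 L}{\partial a_i \partial \psi_k} = Q_k(i,.)(-y + p + \lambda a) + \tilde{K}(i,.)\,Q_k W a \qquad (31)$$
where $S_k = \tilde{K} \circ (D_k + B_k \circ B_k)$ and $D_k$ is an $n \times n$ matrix where $D_k(r,s) = -(x_{r,k} - x_{s,k})^2$. By putting together the elements of $H$, $M$ and $T$, we obtain the Hessian of Eq. (12).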
eferences
[1] J.K. Aggarwal, M.S. Ryoo, Human activity analysis: a review, ACM Comput. Surv.43 (3) (2011) . https://dl.acm.org/citation.cfm?id=1922653 . Article 16.
[2] M.S. Allili , S. Bacha , Feature Relevance in Bayesian Network Classifiers and Ap-
plication to Image Event Recognition, FLAIRS Conference (2017) 760–763 . [3] M.S. Allili , D. Ziou , Likelihood-based feature relevance for figure-ground seg-
mentation in images and videos, Neurocomputing 167 (2015) 658–670 . [4] F. Bach , Consistency of the group lasso and multiple kernel learning, J. Mach.
Learn. Res. 9 (2008) 1179–1225 . [5] B. Banerjee, V. Murino, Efficient pooling of image based CNN features for ac-
tion recognition in videos, Proceedings of IEEE International Conference on
Acoustics, Speech and Signal Processing (2017) 2637–2641. [6] J. Baoa, Y. Chena, L. Yub, C. Chena, A multi-scale kernek learning method and
its application in image classfication, Neurocomputing (2017), doi: 10.1016/j.neucom.2016.11.069 .
[7] S. Belongie , J. Malik , J. Puzicha , Shape matching and object recognition usingshape contexts, IEEE Trans. Pattern Anal. Mach. Intell. 24 (4) (2002) 509–522 .
[8] Y. Bengio , A. Courville , P. Vincent , Representation learning: a review and new
perspectives, IEEE Trans. Pattern Anal. Mach. Intell. 35 (8) (2013) 1798–1828 . [9] R. Blake , M. Shiffrar , Perception of human motion, Annu. Rev. Psychol. 58 (1)
(2007) 47–73 . [10] A. Blum , P. Langley , Selection of relevant features and examples in machine
learning, Artif. Intell. 97 (1-2) (1997) 245–271 . [11] A. Boulmerka, M.S. Allili, Background modeling in videos revisited using finite
mixtures of generalized Gaussians and spatial information, Proceedings of IEEE
International Conference on Image Processing (2017) pp. 3660–3664. [12] P. Bradley, O. Mangasarian, Features selection via concave minimization and
support vector machine, Proceedings of International Conference on MachineLearning (1998) 82–90.
[13] S.S. Bucak , R. Jin , A.K. Jain , Multiple kernel learning for visual object recogni-tion: a review, IEEE Trans. Pattern Anal. Mach. Intell. 36 (7) (2014) 1354–1369 .
[14] E.J. Candès , M.B. Wakin , An introduction to compressive sampling, IEEE Signal
Process. Mag. 25 (2) (2008) 21–30 . [15] B. Cao, D. Shen, J.-T. Sun, Q. Yang, Z. Chen, Feature selection in a kernel space,
Proceedings of International Conference on Machine Learning (2007) 121–128.[16] O. Chapelle , V. Vapnik , O. Bousquet , S. Mukherjee , Choosing multiple parame-
ters for support vector machines, Mach. Learn. 46 (1–3) (2002) 131–159 . [17] K. Charalampous , A. Gasteratos , Online deep learning method for action recog-
nition, Pattern Anal. Appl. 19 (2) (2016) 337–354 . [18] J. Demsar , Statistical comparisons of classifiers over multiple data sets, J. Mach.
Learn. Res. 7 (2006) 1–30 .
[19] J. Demsar , Algorithms for subsetting attribute values with relief, Mach. Learn.78 (3) (2010) 421–428 .
20] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF:a deep convolutional activation feature for generic visual recognition, Proceed-
ings of International Conference on Machine Learning (2014) 647–655.
[21] D.L. Donoho , Compressed sensing, IEEE Trans. Inf. Theory 52 (4) (2006)1289–1306 .
22] T. Evgeniou , M. Pontil , T. Poggio , Regularization networks and support vectormachines, Adv. Comput. Math. 13 (1) (20 0 0) 1–50 .
23] S.R. Fanello , I. Gori , G. Meta , F. Odone , J. Machine , Keep it simple and sparse:real-time action recognition, Learn. Res. 14 (2013) 2617–2640 .
[24] F. Friedrichs , C. Igel , Evolutionary tuning of multiple SVM parameters, Neuro-computing 64 (2005) 107–117 .
25] J. Gascón-Moreno, E.G. Ortiz-García, S. Salcedo-Sanz, A. Paniagua-Tineo, B.
Saavedra-Moreno, J.A. Portilla-Figueras, Multi-parametric gaussian kernel func-tion optimization for ε-SVMr using a genetic algorithm, Proceedings of Inter-
national Conference on Artificial Neural Networks(2011) 113–120. 26] N. Ghalelis, H. Kim, A. Hilton, N. Nikolaidis, I. Pitas, The i3DPost multi-view
and 3d human action/interaction database, Proceedings of IEEE InternationalConference for Visual Media Production (2009) 159–168.
[27] T. Glasmachers, Gradient Based Optimization of Support Vector Machines,
Ph.D. Thesis, Ruhr University Bochum (2008). 28] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier networks, Proceedings of
International Conference on Artificial Intelligence and Statistics 315323 (2011).29] I. Goodfellow , Y. Bengio , A. Courville , Deep Learning, MIT Press, 2016 .
30] Y. Grandvalet , S. Canu , Adaptive scaling for feature selection using SVMs, Neu-ral Inf. Process. Syst. (2002) 569–576 .
[31] K. Gregor, Y. LeCun, Learning fast approximations of sparse coding, Proceedings of International Conference on Machine Learning (2010) 399–406.
[32] Y. Gu, H. Liu, Sample-screening MKL method via boosting strategy for hyperspectral image classification, Neurocomputing 173 (1) (2016) 1630–1639.
[33] T. Guha, R.K. Ward, Learning sparse representations for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2) (2012) 1576–1588.
[34] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Mach. Learn. 46 (1–3) (2002) 389–422.
[35] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[36] I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh, Feature extraction: foundations and applications, Studies in Fuzziness and Soft Computing, Springer, 2006.
[37] D.A. Harville, Matrix Algebra from a Statistician’s Perspective, Springer, 2008.
[38] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2009.
[39] L. Hermes, J.M. Buhmann, Feature selection for support vector machines, Proceedings of IEEE International Conference on Pattern Recognition II (2000) 712–715.
[40] M.B. Holte, T.B. Moeslund, N. Nikolaidis, I. Pitas, 3D human action recognition for multi-view camera systems, Proceedings of IEEE International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (2011) 342–349.
[41] K. Huang, S. Aviyente, Sparse representation for signal classification, Neural Inf. Process. Syst. (2006) 609–616.
[42] K. Hwang, K. Lee, C. Lee, S. Park, Multi-class classification using a signomial function, J. Oper. Res. Soc. 66 (3) (2015) 434–449.
[43] C. Igel, T. Glasmachers, B. Mersch, N. Pfeifer, P. Meinicke, Gradient-based optimization of kernel-target alignment for sequence kernels applied to bacterial gene start detection, IEEE/ACM Trans. Computational Biology and Bioinformatics 4 (2) (2007) 216–226.
[44] S. Ji, W. Xu, M. Yang, K. Yu, 3D convolutional neural networks for human action recognition, IEEE Trans. Pattern Anal. Mach. Intell. 35 (1) (2013) 221–231.
[45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: convolutional architecture for fast feature embedding, Proceedings of ACM International Conference on Multimedia (2014) 675–678.
[46] C. Kai, D. Guiguang, H. Jungong, Attribute-based supervised deep learning model for action recognition, Front. Comput. Sci. 11 (2) (2017) 219–229.
[47] N. Kingsbury, D.B.H. Tay, M. Palaniswami, Multi-scale kernel methods for classification, Proceedings of IEEE Workshop on Machine Learning for Signal Processing (2005) 43–48.
[48] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Neural Inf. Process. Syst. (2012) 1097–1105.
[49] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1–2) (1997) 273–324.
[50] V. Koltchinskii, M. Yuan, Sparsity in multiple kernel learning, Ann. Stat. 38 (6) (2010) 3660–3695.
[51] B. Krishnapuram, L. Carin, M.T. Figueiredo, A.J. Hartemink, Sparse multinomial logistic regression: fast algorithms and generalization bounds, IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 957–968.
[52] T.N. Lal, O. Chapelle, J. Weston, A. Elisseeff, Embedded methods, in: I. Guyon, S. Gunn, M. Nikravesh, L. Zadeh (Eds.), Feature Extraction: Foundations and Applications, Studies in Fuzziness and Soft Computing, Springer, 2006, pp. 137–165.
[53] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition (2008) 23–28.
[54] K. Lee, N. Kim, M.-K. Jeong, The sparse signomial classification and regression model, Ann. Oper. Res. (2012) 1–30.
[55] L. Lefakis, F. Fleuret, Jointly informative feature selection, Proceedings of International Joint Conference on Artificial Intelligence and Statistics (2014) 567–575.
[56] Q. Li, H. Cheng, Y. Zhou, G. Huo, Human action recognition using improved salient dense trajectories, Comput. Intell. Neurosci. 2016 (2016). Article ID 6750459, 11 pages.
[57] M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Sciences, Irvine, 2013.
[58] Y.Y. Lin, J.H. Hua, N.C. Tang, M.H. Chen, H.Y.M. Liao, Depth and skeleton associated action recognition without online accessible RGB-D cameras, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2014) 61–70.
[59] J. Liu, B. Kuipers, S. Savarese, Recognizing human actions by attributes, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2011) 3337–3344.
[60] W. Liu, H. Liu, D. Tao, Y. Wang, K. Lu, Multiview Hessian regularized logistic regression for action recognition, Signal Process. 110 (2015) 101–107.
[61] L. Mancera, J. Portilla, L0-norm-based sparse representation through alternate projections, Proceedings of IEEE Conference on Image Processing (2006) 2089–2092.
[62] K.-P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2013.
[63] J.C. Niebles, H. Wang, L. Fei-Fei, Unsupervised learning of human action categories using spatial-temporal words, Int. J. Comput. Vis. 79 (3) (2008) 299–318.
[64] J. Paul, R. D’Ambrosio, P. Dupont, Kernel methods for heterogeneous feature selection, Neurocomputing 169 (2015) 187–195.
[65] S. Perkins, K. Lacker, J. Theiler, Grafting: fast, incremental feature selection by gradient descent in function space, J. Mach. Learn. Res. 3 (2003) 1333–1356.
[66] O. Ouyed, M.S. Allili, Feature relevance for kernel logistic regression and application to action classification, Proceedings of IEEE International Conference on Pattern Recognition (2014) 1325–1329.
[67] A. Rakotomamonjy, Variable selection using SVM-based criteria, J. Mach. Learn. Res. 3 (2003) 1357–1370.
[68] M. Ravanbakhsh, H. Mousavi, M. Rastegari, V. Murino, L.S. Davis, Action Recognition with Image Based CNN Features, arXiv: CoRR abs/1512.03980 (2015).
[69] F. Rinaldi, F. Schoen, M. Sciandrone, Concave programming for minimizing the zero-norm over polyhedral sets, Comput. Optim. Appl. 46 (3) (2010) 467–486.
[70] M. Pérez-Ortiz, P.A. Gutiérrez, J. Sánchez-Monedero, C. Hervás-Martínez, A study on multi-scale kernel optimisation via centered kernel-target alignment, Neural Process. Lett. 44 (2) (2016) 491–517.
[71] M. Ranzato, Y.-L. Boureau, Y. LeCun, Sparse feature learning for deep belief networks, Neural Inf. Process. Syst. (2008) 1185–1192.
[72] V. Roth, The generalized LASSO, IEEE Trans. Neural Netw. 15 (1) (2004) 16–28.
[73] R. Salakhutdinov, G.E. Hinton, Deep Boltzmann machines, Proceedings of International Conference on Artificial Intelligence and Statistics (2009) 448–455.
[74] C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, Proceedings of IEEE International Conference on Pattern Recognition (2004) 32–36.
[75] G. Sharma, A. Agarwala, B. Bhattacharya, A fast parallel Gauss–Jordan algorithm for matrix inversion using CUDA, Comput. Struct. 128 (2013) 31–37.
[76] A. Shamsheyeva, A. Sowmya, The anisotropic Gaussian kernel for SVM classification of HRCT images of the lung, Proceedings of Intelligent Sensors, Sensor Networks and Information Processing Conference (2004) 439–444.
[77] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[78] S. Sonnenburg, G. Rätsch, C. Schäfer, et al., Large scale multiple kernel learning, J. Mach. Learn. Res. 7 (2006) 1531–1565.
[79] S. Spurlock, H. Wu, R. Souvenir, Multi-view recognition using weighted view selections, Proceedings of Asian Conference on Computer Vision (2014) 538–552.
[80] Y. Sun, Iterative RELIEF for feature weighting: algorithms, theories, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 29 (6) (2007) 1035–1051.
[81] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (2015) 1–9.
[82] K. Torkkola, Feature extraction by non-parametric mutual information maximization, J. Mach. Learn. Res. 3 (2003) 1415–1438.
[83] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. B 58 (1) (1996) 267–288.
[84] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res. 1 (2001) 211–244.
[85] H.-B. Tu, L.-M. Xia, Z.-W. Wang, The complex action recognition via the correlated topic model, Sci. World J. 2014 (2014). Article ID 810185, 10 pages.
[86] V.N. Vapnik, Statistical Learning Theory, John Wiley & Sons, 1998.
[87] H. Wang, A. Klaser, C. Schmid, C.L. Liu, Dense trajectories and motion boundary descriptors for action recognition, Int. J. Comput. Vis. 103 (1) (2013) 60–79.
[88] J. Weston, A. Elisseeff, B. Schölkopf, Use of the zero-norm with linear models and kernel models, J. Mach. Learn. Res. 3 (2003) 1439–1461.
[89] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, Feature selection for SVMs, Neural Inf. Process. Syst. (2000) 668–674.
[90] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227.
[91] S.-H. Yang, Y.-J. Yang, B.-G. Hu, Sparse kernel-based feature weighting, Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD (2008) 813–820.
[92] W. Yang, Y. Wang, G. Mori, Human action recognition from a single clip per action, Proceedings of ICCV Workshops (2009) 482–489.
[93] J. Yin, Z. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (1) (2012) 120–128.
[94] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B 68 (1) (2006) 49–67.
[95] Y. Yuan, X. Zheng, X. Lu, A discriminative representation for human action recognition, Pattern Recognit. 59 (2016) 88–97.
[96] L. Zhang, W.-D. Zhou, P.-C. Chang, J. Liu, Z. Yan, T. Wang, F.-Z. Li, Kernel sparse representation-based classifier, IEEE Trans. Signal Process. 60 (4) (2012) 1684–1695.
[97] J. Zhu, T. Hastie, Kernel logistic regression and import vector machine, J. Comput. Graph. Stat. 14 (1) (2005) 185–205.
Ouiza Ouyed received the B.Eng. and M.Sc. degrees in electrical engineering from Université Mouloud Mammeri de Tizi-Ouzou (Algeria) in 2004 and 2009, respectively. Since 2010, she has been pursuing Ph.D. studies at Université du Québec en Outaouais (Canada). Her primary research interests include statistical models and their application to image segmentation and action recognition.
Mohand Said Allili received the M.Sc. and Ph.D. degrees in computer science from the University of Sherbrooke, Sherbrooke, QC, Canada, in 2004 and 2008, respectively. Since June 2008, he has been an Assistant Professor of computer science with the Department of Computer Science and Engineering, Université du Québec en Outaouais, Canada. His main research interests include computer vision and graphics, image processing, pattern recognition, and machine learning. Dr. Allili was a recipient of the Best Ph.D. Thesis Award in engineering and natural sciences from the University of Sherbrooke for 2008 and the Best Student Paper and Best Vision Paper awards for two of his papers at the Canadian Conference on Computer and Robot Vision 2007 and 2010, respectively.