a reconstruction error based framework for multi …davidson/publications/tkde...male, bearded,...

1

A Reconstruction Error Based Framework forMulti-label and Multi-view Learning

Buyue Qian, Xiang Wang, Jieping Ye, and Ian Davidson

Abstract—A significant challenge to make learning techniques more suitable for general purpose use is to move beyond i) completesupervision, ii) low dimensional data, iii) a single label and single view per instance. Solving these challenges allows working withcomplex learning problems that are typically high dimensional with multiple (but possibly incomplete) labelings and views. While otherwork has addressed each of these problems separately, in this paper we show how to address them together, namely semi-superviseddimension reduction for multi-label and multi-view learning (SSDR-MML), which performs optimization for dimension reduction andlabel inference in semi-supervised setting. The proposed framework is designed to handle both multi-label and multi-view learningsettings, and can be easily extended to many useful applications. Our formulation has a number of advantages. We explicitly modelthe information combining mechanism as a data structure (a weight/nearest-neighbor matrix) which allows investigating fundamentalquestions in multi-label and multi-view learning. We address one such question by presenting a general measure to quantify thesuccess of simultaneous learning of multiple labels or views. We empirically demonstrate the usefulness of our SSDR-MML approach,and show that it can outperform many state-of-the-art baseline methods.

Index Terms—Semi-Supervised Learning; Multi-label Learning; Multi-view Learning; Dimension Reduction; Reconstruction Error.

F

1 INTRODUCTION

Four core challenges to making data analysis bettersuited to real world problems is learning from: i) par-tially labeled data, ii) high dimensional data, iii) multi-label, and iv) multi-view data. Whereas existing workoften tackles each of these problems separately, givingrise to the fields of semi-supervised learning, dimensionreduction, multi-label and multi-view learning respec-tively, we propose and show the benefits of addressingthe four challenges together. However, this requires aproblem formulation that is efficiently solvable and eas-ily interpretable. We propose such a framework whichwe refer to as semi-supervised dimension reduction formulti-label and multi-view learning (SSDR-MML).

Consider this simple experiment to illustrate the weak-ness of solving each problem independently. We collect50 frontal well-aligned face images of five people in tendifferent expressions, each of which are associated withthree labels (besides name): gender, bearded, glasses(see Fig. 1). We shall project the face images into a2D space using different techniques that perform un-supervised dimension reduction, supervised dimensionreduction and finally our approach that simultaneouslyperforms semi-supervised learning and dimension re-duction for multi-label data. Fig. 2 shows the result,where each symbol denotes a different person and eachcolor indicates a combination of labels. “Red” stands

• B. Qian, X. Wang, and I. Davidson is with the Department of ComputerScience, University Of California, Davis, CA, 95616.E-mail: {byqian, xiang, indavidson}@ucdavis.edu

• Jieping Ye is with Computer Science and Engineering, Arizona StateUniversity, Tempe, AZ 85287.E-mail: [email protected]

for female, unbearded, and non-glasses; “green” denotesmale, unbearded, and non-glasses; and “blue” indicatesmale, bearded, glasses. For Principal Component Anal-ysis (PCA) [1], an unsupervised dimension reductiontechnique, we see that it finds a mapping of the imagesinto a 2D space (Fig. 2(a)) where people with differentlabels are not well separated. This result is not surprisinggiven PCA does not make use of the labeled data. Usingthe labels for only 30% of images for supervision, we seethat PCA+LDA [2] performs only marginally better inFig. 2(b) because the missing labels can not be inferred.Our approach simultaneously infers the missing labelsand performs dimension reduction for this multi-labeldata and, as shown in Fig. 2(c) to 2(e), produces accuratepredictions and monotonic improvement. During theiterative process, images sharing similar labels graduallyaggregate while dissimilar images move further apart.

Fig. 1. Sample face images

A subsequent risk in applying learning techniques tothese challenging environments is that it is difficult toknow if the learning approach was successful. Elegantframeworks such as structural and empirical risk min-imization, though useful, have not been well extendedand applied to multi-label and multi-view settings. Moreempirical evaluation approaches such as cross-fold vali-dation do not work well in sparse label settings. In thiswork we explore explicitly modeling the mechanism tocombine multiple labels and views as a data structuresuch that we can more clearly see how the labels are

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.The final version of record is available at http://dx.doi.org/10.1109/TKDE.2014.2339860

Copyright (c) 2014 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

2

−12 −10 −8 −6 −4 −2 0 2 4 6 8−5

0

5

10

(a) PCA−8 −6 −4 −2 0 2 4 6 8

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

(b) PCA+LDA−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 3

−2

−1.5

−1

−0.5

0

0.5

1

1.5

(c) SSDR-MML Iteration 1−2 −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

−1.5

−1

−0.5

0

0.5

1

1.5

2

(d) SSDR-MML Iteration 2−2 −1.5 −1 −0.5 0 0.5 1 1.5

−1.5

−1

−0.5

0

0.5

1

1.5

2

(e) SSDR-MML Iteration 3

Fig. 2. Project faces to 2D using different methods. Symbols denote different people, and colors denote attributes.

propagated and the relationship between labels andviews. By examining properties of this structure we candetermine the success of the learning approach.

Our proposed work makes several contributions. Wecreate a reconstruction error based framework that canmodel and create classifiers for complex datasets that aremulti-label and multi-view, and contain missing labels.Multi-view learning involves multiple feature sets for thesame set of objects. Previous work [3], [4] shows thatsimply concatenating all feature sets into one single viewis suboptimal and raises difficult engineering issues if theviews are fundamentally different (such as binary in oneview and real values in another, or dimension and nor-malization issues). Our reconstruction error frameworkmakes the following contributions to the field:• Simultaneously performs dimension reduction,

multi-label propagation in a multi-view setting.• Explicitly models and constructs a sparse graph

showing which data points are related.• Allow a quantification of how successful the learn-

ing process was by examining the properties of thegraph mentioned above (see Section 7).

• Allow the domain experts to clearly understandwhere/how the labels were propagated and how theviews are complimentary using the graph.

• Allow multi-view learning without assuming con-ditional independence of views and that each view(by itself) is sufficient to build a weak classifier. Thisis achieved as we do not serialize the learning prob-lem, instead learning from all views simultaneously.

We begin our paper with a brief review of related stud-ies in Section 2, and then present the general frameworkin Section 3. Sections 4 and 5 show specific formulationsfor multi-label and multi-view learning along with theoptimization algorithm we use for each. Section 6 showshow to perform dimension reduction using our ap-proach, and Section 7 presents new work that describesa method to determine how successful our approachwas in a multi-label/view problem. Section 8 discussesimplementation issues and the corresponding solutions.Our experimental section (Section 9) shows the resultsthat compare our work against existing competing tech-niques. The new experiments include comparing againsta larger set of competing algorithms and verifying theusefulness of our success measure of how well the modelperforms. We conclude our work in Section 10.

Differences to Conference Version. The additionalwork that is in this paper and not the conference version[5] is: (1) extension to handle both multi-label and multi-view learning, (2) a node regularizer to facilitate thelearning with imbalanced labeling, (3) a measure toquantify the success of multi-label and multi-view learn-ing, and (4) extensive new discussions and experiments.

2 RELATED WORK

Our work is related to four machine learning and datamining topics: multi-label learning, multi-view learning,multi-label dimension reduction, and semi-supervised learn-ing. Here we review some related work in the four areas.

Multi-label Learning. Multi-label learning (MLL) ismotivated by the fact that a real world object natu-rally involves multiple related attributes, and therebyinvestigating them together could improve the overalllearning performance. MLL learns a problem togetherwith other related problems at the same time [6], thatallows the learner to use the commonality amongstthe labels. The hope is that by learning multiple labelssimultaneously one can improve performance over the“no transfer” case. MLL has been studied from manydifferent perspectives, such as neural networks amongsimilar tasks [7], kernel methods and regularization net-works [8], modeling task relatedness [9], [10], label setpropagation [11], and probabilistic models in Gaussianprocess [12], [13] and Dirichlet process [14]. AlthoughMLL techniques have been successfully applied to manyreal world applications, their usefulness are significantlyweakened by the underlying relatedness assumption,while in practice some labels are indeed unrelated andcould induce destructive information to the learner. Inthis work, we propose a measure to quantify the successof learning, as to benefit from related labels and rejectthe combining of unrelated (detrimental) labels.

Multi-view Learning. Practical learning problems of-ten involves datasets that are naturally comprised ofmultiple views [15]. MVL learns a problem togetherwith multiple feature spaces at the same time [16], thatallows the learner to perceive different perspectives ofthe data in order to enrich the total information about thelearning task at hand [17], [18]. [19] has shown that theerror rate on unseen test samples can be upper boundedby the disagreement between the classification-decisionsobtained from the independent characterizations of the



3

data. Therefore, as a branch of MVL, co-training [3],[20] aims at minimizing the misclassification rate in-directly by reducing the rate of disagreement amongthe base classifiers. Multiple kernel learning was re-cently introduced by [21], where the kernels are linearlycombined in a SVM framework and the optimizationis performed as an semidefinite program or quadrati-cally constrained quadratic program. [22] reformed theproblems as a block l1 formulation in conjunction withMoreau-Yosida regularization, so that efficient gradientbased optimization could be performed using sequentialminimal optimization techniques while still generating asparse solution. [23] preserved the block l1 regularizationbut reformulated the problem as a semi-infinite linearproblem, which can be efficiently solved by recycling thestandard SVM implementations and made it applicableto large scale problems. Although the successes of MVL,many existing approaches suffer from their own limita-tions: conditional independent assumption is importantfor co-training both theoretically and empirically [24] butit rarely holds in real-world applications; multi-kernelmachines are limited to combining multiple kernels inlinear manners, and such linear scheme sometimes in-duces poor data representations. In contrast, our SSDR-MML framework does not require the conditional inde-pendence assumption, and the multiple views are fusedin a nonlinear fashion.

Multi-label Dimension Reduction. Various dimen-sion reduction methods have been proposed to simplifylearning problems, which generally fall into three cate-gories: unsupervised, supervised, and semi-supervised.In contrast to traditional classification tasks whereclasses are mutually exclusive, the classes in multi-task/label learning are actually overlapped and corre-lated. Thus, two specialized multi-label dimension re-duction algorithms have been proposed in [25] and [26],both of which try to capture the correlations betweenmultiple labels. However, the usefulness of such meth-ods is dramatically limited by requiring complete labelknowledge, which is very expensive to obtain and evenimpossible for those extremely large dataset, e.g. webimages annotation. In order to utilize unlabeled data,there are many semi-supervised multi-label learningalgorithms have been proposed [27] [28], which solvelearning problem by optimizing the objective functionover graph or hypergraph. However, the performance ofsuch approach is weakened by the lack of the concate-nation of dimension reduction and learning algorithm.To the best of our knowledge, [29] is the first attemptto connect dimension reduction and multi-task/labellearning, but it suffers from the inability of utilizingunlabeled data.

Semi-supervised learning. The study of semi-supervised learning is motivated by the fact that whilelabeled data are often scarce and expensive to obtain,unlabeled data are usually abundant and easy to obtain.It mainly aims to address the problem where the labeleddata are too few to build a good classifier by using the

large amount of unlabeled data. Among various semi-supervised learning approaches, graph propagation hasattracted an increasing amount of interest [30]. [31] in-troduces an approach based on a random field modeldefined on a weighted graph over both the unlabeledand labeled data; [32] proposes a classifying functionwhich is sufficiently smooth with respect to the in-trinsic structure collectively revealed by known labeledand unlabeled points. [33] extend the formulation byinducing spectral kernel learning to semi-supervisedlearning, as to allow the graph adaptation during thelabel diffusion process. Another interesting direction forsemi-supervised learning is proposed in [34], where thelearning with unlabeled data is performed in the contextof Gaussian process. The encouraging results of manyproposed algorithms demonstrate the effectiveness ofusing unlabeled data. A comprehensive survey on semi-supervised learning can be found in [35].

3 THE FORMULATION

Notation. Given a set of n instances containing bothlabeled and unlabeled data points, we define a generallearning problem on multiple labels and multiple featurespaces. In the learning problem, there are p relatedlabels T = {t1, t2, · · · , tp}, each of which can be amulti-class learning task with a given finite label set.For the k-th label tk, we define a binary classifyingfunction Fk ∈ Bn×ck on its corresponding label setCk = {1, 2, · · · , ck}. Note that each instance will havep label vectors {f1 . . . fp}, containing the label sets foreach multi-class label. Then fkij = 1 iff for the k-thlabel, the i-th instance belongs to the j-th class (the i-th instance ∈ class Ckj ) and fkij = 0 otherwise. Withoutloss of generality, we assume that the points have beenreordered so that for label k, the first lk points are labeledand the remaining uk points being unlabeled (wheren = lk + uk and typically lk � n), and construct aprior label matrix Yk ∈ Blk×ck using the given labels inlabel tk. Similarly, for each instance, there are q featuredescriptions from different views {x1

i ,x2i , · · · ,x

qi } with

xki being the k-th feature description/view of the i-th instance. Our aim is to create an asymmetric graphG(V,E) over the n instances represented by the weightmatrix W ∈ Rn×n, where we set wii = 0 to avoidself-reinforcement. The diagonal node degree matrix isdenoted by D, where djj =

∑ni=1 wij . To make this paper

more readable, the notations used in this paper obeythe following rule: the superscript is used to denote theindex of a label or view, and the subscript is used todenote the index of an instance, or an entry in a vectoror matrix. This article will often refer to row and columnvectors of matrices, for example, the i-th row and j-th column vectors of W are denoted as wi• and w•j ,respectively. The notations are summarized in Table 1.

The Framework. The key question in multi-label ormulti-view learning is how to incorporate the informa-tion carried by related labels or feature spaces into the



4

TABLE 1Notation Table

Notation Description

ti The ith labelFk, fki• Binary classifying matrix, vector (ith instance) for label kYk,yki• Prior label matrix, vector (ith instance) for label k

xki Feature vector of ith instance at kth viewW,wi• Graph weight matrix, and its ith row vector

D Graph degree matrixVk Node regularizer for label kCk The label set for label kL(z) Local covariance matrix of a vector zn, p, q Number of instances, labels, and views, respectivelylk, uk Number of labeled, unlabeled instances for label kI Identity matrix

1, 0 Vector with all ones (zeros) entriesαk, βk, λ Tuning parameters for view, label, and regularizer

same learning problem. In graph transduction, such kindof information encoding can be expressed in terms ofpartially fitting the graph to all available labels or viewsbased on their importance or relatedness. As before,we solve the label inference problem by following theintuition that “nearby” points tend to have similar labels,and adopt reconstruction error [5], [36]. These nearbypoints are learnt using our algorithm and encoded in W,and we can then use W to propagate the labels of labeledpoints to their neighbors. To ensure that there is no givenlabel is overwritten, we constrain Fkl = Yk. Finally, toensure that the labels are not excessively propagated, weadd a regularization term ‖W‖2F . The general learningframework can then be formulated as:

Q(W,F) =

q∑k=1

αkn∑i=1

‖xki −n∑j=1

wijxkj ‖2F +

p∑k=1

βkn∑i=1

‖fki• −n∑j=1

wijfkj•‖2F + λ‖W‖2F

s.t. ∀i, wi•1 = 1; Fkl = VkYk. (1)

where ‖ • ‖F denotes the Frobenius norm. The first termin Eq. 1 is the reconstruction error over the multiplefeature descriptions of instances, the second term is thereconstruction error over the classifying functions for themultiple labels, and the third term is the L2 regulariza-tion term. Note that the two reconstruction errors shareexactly the same weight matrix W, which enables themultiple labels and views to help each other. The tuningparameter 0 ≤ αk ≤ 1 is determined by the “importance”of feature descriptions at the k-th view, and 0 ≤ βk ≤ 1is decided by the “relatedness” between the label tk toother labels. λ is a tuning parameter of the regularizationterm. We will in Section 7 provide a success measure toguide the selection of these parameters. The objectivefunction consists of the reconstruction errors of featurespaces at different views and multiple related labels, asit allows the graph weight partially fits to each of them.

Overcoming Class Imbalance. If one class is morepopular than another, there is a chance that even thoughthe labels of the less frequent class are propagated, theyare ignored in favor of the popular class. To overcome

this, we introduce the matrix Vk which is a node regu-larizer [37] to balance the influence of different classesin label tk. The matrix Vk = diag(vk) is a function of Yk

(given labels for label k), and Dl is the degree matrix ofthe labeled points calculated from W:

vk =

ck∑i=1

yk•i �Dl1(yk•i)T

Dl1(2)

where � denotes Hadamard product, and 1 =[1, 1, · · · , 1]T . By definition, VkYk is a normalized ver-sion of the label matrix Yk satisfying

∑i(V

kYk)ij = 1for ∀j. The normalized label matrix VkYk enables thehighly connected instances to contribute more duringthe graph diffusion/label propagation process. Since thetotal diffusion of each class is normalized to one, theinfluence of different classes is balanced even if the givenlabels are imbalanced. Equally balanced class generallyleads to more reliable solutions. In Section 4 and 5, wewill explicitly apply the general framework to multi-label and multi-view learning problems, respectively.

We shall now describe specific solutions for the multi-label and multi-view settings separately. The generalizedalgorithm for both settings is shown in Table 2.

4 MULTI-LABEL LEARNING

4.1 FormulationWe start with multi-label learning, where there are mul-tiple labels over the same set of feature descriptions ofinstances. Our motivation is to improve the learningperformance on the multiple labels by making use ofthe commonality among them. Intuitively, multi-labellearning could greatly improve the learning performanceif the multiple labels are highly correlated. On the con-trary, if there is no relatedness between the multiplelabels, multi-label learning cannot be beneficial and evencould be detrimental. We can better understand thispremise by interpreting our work as label propagation.For multi-label learning to be successful, points withthe first label being labeled, should have this label trans-ferred/propagated to points with other labels being labeledand vice-versa. If this propagation is extensive then theapproach will be successful. This is the idea behindthe math of identifying the success of multi-label/viewlearning in section 7. Under multi-label setting, thegeneric framework shown in Eq. (1) is reduced to:

Q(W,F) = α

n∑i=1

‖xi −n∑j=1

wijxj‖2F

+

p∑k=1

βkn∑i=1

‖fki• −n∑j=1

wijfkj•‖2F + λ‖W‖2F

s.t. ∀i, wi•1 = 1; Fkl = VkYk. (3)

4.2 Alternating OptimizationThe formulation shown in Eq. (3) is a minimizationproblem involving two variables to optimize. Since this



5

objective is not convex it is difficult to simultaneouslyrecover both unknowns. However, if we hold one un-known constant and solve the objective for the other,we have two convex problems that can be optimallysolved in closed form. In the rest of this section, wepropose an alternating optimization for the SSDR-MMLframework, which iterates between the updates of Wand Fk until Fk stabilized. The experimental results(see Table 4 and 6) indicate that converging to the localoptima still provides good results and is better than lesscomplex objective functions that are solved exactly.

4.2.1 Update for WIf the classifying function Fk is a constant, then theweight matrix W can be recovered in closed form asa constrained least square problem. Since the optimalweights for reconstructing a particular point is onlydependent on other points, each row of the weightmatrix W can be obtained independently. The problemreduces to minimize the following function:

minwi•

Q(wi•) = α‖xi −wi•X′i‖2F

+

p∑k=1

βk‖fki• −wi•Fki

′‖2F + λ‖wi•‖2F

s.t. wi•1 = 1 (4)

where X′i and Fki′ denote the set difference {X\xi} and

{Fk \ fki•} respectively, i.e. the set of all instances andtheir labels except the ith instance and its labels, and asbefore ‖ • ‖F denotes Frobenius norm. The derivative ofthe cost function with respect to wi• can be written as:

∇wi•Q(wi•) = wi•(α(1xi −X′i

) (1xi −X′i

)T+

p∑k=1

βk(1fki• − Fk

i

′)(1fki• − Fk

i

′)T+ λI) (5)

To provide the solution for W, we first introducethe local covariance matrix. Let L(xi) denote the localcovariance matrix of the feature description xi of theith instance. The term “local” refers to the fact that theinstance is used as the mean of the calculation.

L(xi) = (1xi −X′i) (1xi −X′i)T (6)

where as before xi is a row vector, and 1 is a columnvector with all one entries. Using a Lagrange multiplierto enforce the sum-to-one constraint, the update of wi•(the weights for the ith instance) can be expressed interms of the inverse local covariance matrices.

wi• =1T(αL(xi) +

∑pk=1 β

kL(fki•) + λI)−1

1T(αL(xi) +

∑pk=1 β

kL(fki•) + λI)−1

1(7)

where L(fki•) is defined in the same manner as L(xi)shown in Eq. (6), and I represents the identity matrix. Aspreviously defined, α and β are the tuning parametersfor the views and labels respectively, and λ controlsthe penalty of the L2 norm. The optimal weight matrix

consists of all the row vectors wi• for i = 1, · · · , n. TheL2 norm slightly improves the sparsity of the recon-struction weights, and further sparsity can be obtainedby discarding the neighbors with very small weights,since they are barely effective in the graph diffusion.Note that Eq. (7) can not guarantee the weights are non-negative. We empirically found that the negative weightsare infrequent and generally have small values, and thusonly have little effect to the learning performance. In ourexperiment, we keep both positive and negative weights,as a positive weight indicates two points are similar, andthen a negative weight indicates the opposite.

4.2.2 Update for Fk

In this step, we assume the weight matrix W is constant,then the goal is to fill in the missing labels in Fku. For alabel tk, we relax the binary classifying function Fk tobe real-valued, so that the optimal Fku can be recoveredin closed form. Since the feature reconstruction error(first term of Eq. (3)) is a constant, we can rewrite theformulation in matrix format:

minFk

Q(Fk) = 1

2tr{(

Fk)T

(I −W)T(I −W)Fk

}s.t. Fkl = VkYk (8)

where Yk carries the given labels of the kth label,and Vk is the corresponding the node regularizer. Toexpress the solution in terms of matrix operations, weassume the instances have been ordered so that the firstl are labeled and the remaining u are the unlabeledinstances. We can then split the weight matrix W andclassification function Fk after the lkth row and column,

i.e. W =

[Wll Wlu

Wul Wuu

]and Fk =

[FklFku

]. Note that we

do not attempt to overwrite the labeled instances. Thecost function is convex, thereby allowing us to recoverthe optimal Fk by setting the derivative ∇FkQ(Fk) = 0.

(I −

[Wll Wlu

Wul Wuu

])T (I −

[Wll Wlu

Wul Wuu

])[FklFku

]k= 0 (9)

s.t. Fkl = VkYk.

where 0 is a matrix with all zeros. The optimizationabove yields a large sparse system of linear equationsthat can be solved by a number of standard methods. Themost straightforward one is the closed-form solution viamatrix inversion. The predictions for unlabeled instancesare obtained in closed form via matrix inversion:

Fku =

(WT

luWlu + (Iu −Wuu)T (Iu −Wuu)

)−1

(WT

lu (Il −Wll) + (Iu −Wuu)T Wul

)VkYk (10)

where Il and Iu denote identity matrices with dimensionl and u, respectively. The predictions for an unlabeledinstances Ii can be obtained by setting fkij = 1 wherej = argmaxj f

kij , and other elements in fki• to zeros.



6

4.2.3 Progressive Update for Yk

Since there is no theoretical guarantee that our proposedalternating optimization will converge, it is possible thatthe prediction of the current iteration oscillates andbacktracks from the predicted labellings in previousiterations. A straightforward solution to address thisproblem is to set up a small tolerance, but it is difficultfor a practitioner to set the value of tolerance. Alterna-tively, we propose a progressive method to update Yk

incrementally instead of updating Fk to remove suchunstable oscillations. In each iteration, we only make themost confident prediction, and treat this as the groundtruth in future training. To do this, we consider the priorlabel matrix Yk as an unknown in the cost functionEq. (8). The Choosing of the most confident predictionis guided by the direction with largest negative gradientin the partial derivative ∂Q(Fk,Yk)

∂Fk.

∂Q∂Fk

=

(∂Q∂Fk

)l(

∂Q∂Fk

)u

= (I −W)T(I −W)

[VkYk

0

](11)

For the label k, let (i∗, j∗)k denotes the position of themost confident prediction, which can be decided by find-ing the largest negative value in the partial derivative.

(i∗, j∗)k = argmini,j

(∂Q∂Fk

)u

(12)

In each iteration, we locate the position (i∗, j∗)k in thematrix Fku, and reset the entries in fk(lk+i∗)•. In particular,we set the entry fk(lk+i∗),j∗ to 1 and other entries in the(lk + i∗)-th row to 0, and then update Yk by:

Yk ⇐=[

Yk

fk(lk+i∗)•

](13)

After each prediction, we update lk with lk ⇐ lk + 1,recompute the weight matrix W using Eq. (7) based onthe newly obtained Y, and update the node regularizerV. The whole procedure repeats until all the missinglabels in the p labels are inferred.

5 MULTI-VIEW LEARNING

5.1 FormulationWe now apply our framework to multi-view learning,where there are multiple feature descriptions obtainedfrom different views for the same set of instances.Our goal is to improve learning performance by takingadvantage of the complementary information carriedby the multiple views of data. Under this setting, thegeneral framework in Eq. (1) can be simplified to:

Q(W,F) =

q∑k=1

αkn∑i=1

‖xki −n∑j=1

wijxkj ‖2F

+β

n∑i=1

‖fi• −n∑j=1

wijfj•‖2F + λ‖W‖2F

s.t. ∀i, wi•1 = 1; Fl = VY. (14)

5.2 Alternating Optimization

The optimization problem can be solved using a similarapproach we proposed for multi-label learning.

Update for W. While assuming the classification func-tion F is fixed, the weight matrix W can be recoveredas now a set of constrained least square problems. Giventhe definition of local covariance matrix shown in Eq. (6),the weights wi• for the ith instance can be solved inde-pendently by applying a Lagrange multiplier.

wi• =1T(∑q

k=1 αkL(xki ) + βL(fi•) + λI

)−11T(∑q

k=1 αkL(xki ) + βL(fi•) + λI

)−11

(15)

Update for F. The classifying function F can berecovered using exactly the same method proposed inSection 4.2.2 or 4.2.3. Specifically, the update for F can becalculated using the closed form solution in Eq. (10), orthe progressive solution shown from Eq. (11) to Eq. (13).

6 SPECTRAL EMBEDDING FOR DIMENSIONREDUCTION STEP

In this section we describe an extension to our frame-work which allows dimension reduction to easily beperformed. It can be used as an additional step in theoptimization or a post-processing step after the opti-mization converged. It is useful as a method to moreeasily visualize the results of our algorithm (as done inFig. 2(a)) or when working with high dimensional data(as done for the results shown in Fig. 4(c)).

Since the weight matrix W captures the intrinsic geo-metric relations between data points, dimension reduc-tion can be performed using W. Again that the spectralembedding step is unnecessary for learning purpose, itis only used to dimension reduction. Let d denote thedesired dimension, the dimension reduced instance ximinimizes the embedding cost function:

Q(X) =

n∑i=1

‖xi −n∑j=1

Wijxj‖2 (16)

where X ∈ Rn×d is the dimension reduced data matrix.The embedding cost in Eq. (16) defines a quadratic formin the vector xi. Since we want the problem well-posedand also to avoid trivial solutions, the minimization canbe solved as a sparse eigen decomposition problem:

min Q(X) = tr(XTMX

)(17)

where M = (I −W)T (I −W). The optimal embeddingcan be recovered by computing the smallest d+1 eigen-vectors of the matrix M, and then discard the smallesteigenvector which is an unit vector. The remaining deigenvectors are the optimal embedding coordinates thatminimize equation (16).



7

7 QUANTIFYING THE SUCCESS OF MULTI-LABEL AND MULTI-VIEW LEARNING

Multi-label and Multi-view learning have been success-fully applied to many real-world applications, however,in many data sets the performance is no better than andsometimes even worse than if each classification problemwere solved independently [38]. Consequently, avoidingdestructive fusing of information is an essential elementin multi-label and multi-view learning. However, veryfew approaches produce a measure to determine if thetransfer of knowledge between labels or views wassuccessful. In this section we outline such a measure andlater in Section 9.4 empirically verify its usefulness.

Multi-label Learning. Our measure makes use of ourinterpretation of the SSDR-MML framework as perform-ing label propagation and the mechanism for informa-tion combining W. Let F denote a binary label matrixF ∈ Bn×p, where p is the number of labels defined on aset of n instances. fij = 1 if instance i can be categorizedinto class j, and fij = −1 otherwise. After the trainingstep we have a weight matrix W available built upon theset of instances, where W carries the information learntfrom the multiple labels and views. Since W row-wisesums to one, then W can be viewed as a random walktransition matrix. To quantify the success of multi-labellearning, we define a measure of cross propagation (CP):

CP (W) = FT (W)zF (18)

where z is a positive integer, indicating how many stepsthe labels have to propagate. The resulting CP (W) is ap × p matrix where the entry at i, j can be interpretedas how well label i is propagated to the instances labeled withlabel j. Therefore, the values on the diagonal measure thesuccess of intra-label reconstruction, and the off-diagonalvalues measure the success of inter-label reconstruction. Ithas been widely reported [7], [8] that multi-label learningperforms better when labels are correlated, then the sumof off-diagonal entries denotes how well the knowledgetransfer among multiple labels has occurred.

Multi-vew Learning. In multi-view learning, CP (W)measures how well the class labels are reconstructed usingthe knowledge carried by multiple views. For multi-viewlearning, a relatively large value in CP (W) (compared tothe “single” case) implies that the joint learning of viewswas successful, while a smaller value indicates thatdetrimental view combining has occurred. The proposedsuccess measure not only offers a way to quantify theperformance of joint learning of multiple labels/views,but also can be used to guide the selection of parametersin many existing multi-labels/view learning algorithms.

8 IMPLEMENTATION AND PRAGMATIC ISSUES

In this section, we outline issues that we believe makeimplementation and using of our work easier.

8.1 Uses for Multi-label and Multi-view LearningThe proposed framework is motivated by the fact thatmulti-label and multi-view learning may improve thelearning performance over the “single” case by exploit-ing the complementary knowledge contained in themultiple labels or views. In multi-label learning, higherlearning performance could be obtained if the multiplelabels are highly correlated, while worse performancecould happen if the multiple labels are irrelevant. Inmulti-view learning, intuitively, the multiple views aresupposed to be neither too different nor too similar toeach other. There would not be much gain if the multipleviews are too similar. On the other hand, if the multipleviews are too different, multi-view learning could evenbe harmful. Since we do not want to over-constrainthe graph to any label or view while still absorbingknowledge from each of them, the proposed approach isin fact a moderate solution that partially fits the graphto each label or view in a weighted fashion.

In some practical cases, we may want to regard one ofthe multiple labels or views as the target label or view,and consider the others as the source label or view tohelp it. In that case, we set the βk or αk of the target labelor view to 1, and set the weights of source labels or viewsto lie between 0 and 1. Then, the weights of these sourceslabels and views can be used to encode the relative“relatedness” and “importance” to the main label andview, respectively. For a source label or view, a weight ofvalue 0 indicates that it is “irrelevant” or “contradictory”to the main label or view, while a weight of value 1implies that it is “highly correlated” or “complementary”to the target label or view, respectively.

8.2 Parameter Selection

TABLE 2SSDR-MML Algorithm (progressive)

Input:feature descriptions xki , for i ∈ {1, · · · , n} and k ∈ {1, · · · , q}prior labels matrix Yk, for labeled instances and k ∈ {1, · · · , p}regularization parameters αk, βk and λ.

Training Stage:

initialize: count m = 0, (Yk)0

= Yk, Fkl = (Yk)0.

do{

compute wmi• =1T

(∑qk=1

αkL(xki ) +∑pk=1

βkL(fki•) + λI)−1

1T(∑q

k=1αkL(xk

i) +

∑pk=1

βkL(fki•) + λI

)−11;

update Dm, dmjj =∑ni=1 w

mij ;

update (vk)m

=∑cki=1

(yk•i

)m �Dm1((yk•i

)m)TDm1

;

locate (i∗, j∗)k = arg mini,j

(∂Q

∂Fk

)u

;

set fk(lk+i∗)j∗

= 1, and fk(lk+i∗)j

= 0 for j 6= j∗ ;

update(Yk

)m+1=

(Yk

)mfk(lk+i∗)•

;update Fkl =

(Yk

)m+1, and remove the (lk + i∗)th instance from Fku;

update m = m + 1, lk = lk + 1;

}while(Fku! = φ)

Output:weight matrix W, label predictions Yku for k ∈ {1, · · · , p}.

The general solution for the proposed algorithm canbe implemented as shown in Table 2. There are three



8

parameters in our proposed framework αi (the weightof view i), βj (the weight of label j), and λ (the regu-larization parameter). The regularization term has twoadvantages, one is to simplify the learning model, andthe other one is to avoid excessive label propagation.Empirically, we found that the learning performance ofour approach is not sensitive to λ, which also impliesthat the excessive label propagation can be avoided ifλ is not too small. For the parameters αk and βk, wewish to show that the performance of the algorithmdoes not fluctuate greatly with respect to minor changesin the parameter values. We shall empirically show thestability of the performance of our framework withrespect to the parameters (αk and βk) in Section 9.3.Though the algorithm’s performance is stable, how toset the parameters is still an open question as is thecase for most learning algorithms. Two alternatives areto use the knowledge from domain experts to guideor constrain the selection of parameters, e.g. it is wellknown that the color feature should be more emphasizedin the image classification problem “tomato v.s. washingmachine”. Alternatively, our previously defined measurefor success of learning CP(W) in section 7 could also beused to guide the selection of parameters which is theapproach we use in our experimental section.

8.3 Computational Complexity

The standard implementation of the algorithm shownin Table 2 takes O(mn3) time, where m denotes thenumber of missing labels, and n denotes the numberof instances. Since the fact m < n , the computationcomplexity of our algorithm is dominated by the matrixinversion, where standard methods require O(n3) time.Fig. 3 reports the empirical runtime of our method onSIAM TMC 2007 dataset. The reported time is calculatedon a laptop equipped with Intel i7 2.60 GHz CPU, and 16GB memory. The details of the dataset and experimentalsettings can be found in Section 9. We see from the resultthat the runtime decreases significantly along with thedecrease of the number of missing labels.

0 5 10 15 20 25 30 35 40 45 500

10

20

30

40

50

60

70

Percentage of labeled instances per category

Ru

ntim

e (

in s

eco

nd

s)

Runtime on SIAM TMC 2007

Fig. 3. Runtime w.r.t. varying number of missing labels.

To speed up the computation, the weight matrix Wcan be recovered by solving a linear system of equationsas it is much more efficient. In Eq. (7) or Eq. (15)we see that the denominator of the fraction is simply

a constant which rescales the sum of i-th row of Wto 1. Therefore, in practice, a more efficient way torecover the optimal wi• is to solve a linear systemof equations, and then rescale the sum of weights toone. Let Li denote the mixed local covariance matrix(∑q

k=1 αkL(xki ) +

∑pk=1 β

kL(fki•) + λI), wi• can be re-

covered by solving Liwi• = 1, and then rescale the sumof wi• to 1. When a local covariance matrix is singular,the linear system of equations can be conditioned byadding a small multiple of the identity matrix Li ←Li +

ξtr(Li)k I , where k denotes the number of neighbors

for each instance, and ξ is a small value (ξ � 0.1).

8.4 Extensions for Noisy Prior LabelsWe propose an extension to our method that allows someincorrect labels to be ignored. Since we previously as-sumed that all the initial labels are accurate, the solutionof Fk provided in Eq. (10) or Eq. (11) suffers from theproblem that there may be considerable noise scatteredin labeled data. A reasonable solution to address thisis to relax the inference objective function by replacingthe constraint on the given labels with an inconsistentpenalty term, namely local fitting penalty [32] allowingpartial neglect of the given labels. We now expand theprior label matrix Yk to be a n×ck matrix, and fill in themissing label locations with zeros. To relax the inferenceproblem shown in Eq. (8), we add in the inconsistentpenalty term and rewrite the cost function as

minFk

Q(Fk,Y

k) =

1

2tr{(Fk)T

(I −W)T(I −W)F

k

+γ(Fk −V

kYk)T (

Fk −V

kYk)} (19)

where γ (> 0) is a tuning parameter to balance theinfluence of label reconstruction error and local fittingpenalty. If we set γ =∞, the cost function will reduce toEq. (8). The cost function is convex and unconstrained,then the update for Fk (Eq. (10)) can be rewritten as

∂Q∂Fk

= (I −W)T (I −W)Fk + γ(Fk −Yk

)= 0 =⇒

Fk =(

1γ (I −W)T (I −W) + I

)−1VkYk (20)

Accordingly, in this relaxed version of label infer-ence, the progressive update for Yk will also change.Since the prior label matrix Yk is included in thecost function Eq. (19), the optimization problem is nowover both the classifying function Fk and the priorlabel matrix Yk, mathematically minFk,Yk Q(Fk,Yk).Therefore, we replace Fk in the cost function Eq. (19)with its optimal solution shown in Eq. (20), let A =(

1γ (I −W)

T(I −W) + I

)−1, the optimization problem

over Yk can be formulated as

Q(Yk) =

1

2tr{(AV

kYk)T

(I −W)T(I −W)

(AV

kYk)

+γ(AV

kYk −V

kYk)T (

AVkYk −V

kYk)}

=1

2tr{(VkYk)T

[AT(I −W)

T(I −W)A

+γ(A− I)T (A− I)](VkYk)} (21)



9

Let B = AT (I −W)T(I −W)A+ γ(A− I)T (A− I),

we write the gradient ofQwith respect to VkYk as showin Eq. (22), which substitutes Eq. (11).

∂Q∂VkYk

=(BTB

)VkYk (22)

Base on ∂Q∂Yk = ∂VkYk

∂Yk∂Q

∂VkYk and basics in algebra,we know that the optimization of Q with respect toYk is equivalent to the optimization over VkYk. Conse-quently, the most confident prediction is located at theposition shown in Eq. (23), which substitutes Eq. (12).

(i∗, j∗)k = argmini,j

(∂Q

∂VkYk

)u

(23)

9 EMPIRICAL STUDY

In this section, we empirically evaluate our frameworkSSDR-MML and its success measure on several real-world applications under multi-label/view settings.

9.1 Multi-label Learning

9.1.1 Experimental SettingsWe compare the performance of our SSDR-MML ap-proach against five baseline methods: (1) RankSVM[39], a state-of-the-art supervised multi-label classifica-tion algorithm based on ranking the results of supportvector machine (SVM); (2) ML-GFHF, the multi-labelversion (two-dimensional optimization) of the harmonicfunction [27]; (3) Regularized MTL [8], a regularizedmulti-task learning method, which assumes all predict-ing functions come from a Gaussian distribution; (4)AdaBoost-MH [40], which is an extension of AdaBoostfor multi-label data that tries to minimize hammingloss; (5) BR-RDT [41], Binary Relevance based Ran-dom Decision Tree, which is an ensemble method. InRankSVM, we choose RBF kernel function (σ is selectedusing cross validation), and fix the penalty coefficientC = 1000. For ML-GFHF, we construct a k-NN (k = 15)graph similarity via RBF kernel function with lengthscale σ selected using cross validation. For AdaBoost-MH, the number of boosting rounds parameter is setto 100. For BR-RDT, there are 100 trees constructed intotal, the maximum depth allowed to be the half ofthe number of features, and the minimal instances ona leaf node is 10. In our SSDR-MML approach, we setthe regularization parameter λ = 1 and determine theimportance of each label by maximizing the learningsuccess measure, βi = argmaxβi CP(W). For fairness,the parameters (λi) in Regularized MTL are set to thevalues that are equivalent to the parameter setting ofSSDR-MML. We adopt micro-averaged F1 measure (F1

Micro) [42] to evaluate the relative performance, whichis a standard evaluation method for multi-label learning.Dataset. We evaluate the performance of our multi-labelmethod on three different types of real world datasets.

• Yeast [43]: consists of 2, 417 gene samples, each ofwhich belongs to one or several of 14 distinct func-tional categories, such as transcription, cell com-munication, protein synthesis, Ionic Homeostasis,and etc. The feature descriptions of Yeast datasetare extracted by different sequence recognition al-gorithms, and each gene sample is represented in a103 dimensional space. The tasks on Yeast dataset areto predict the localization site of protein, where eachsample is associated with 4.24 labels on average.

• Scene: image dataset consists of 2, 407 natural sceneimages, each of which is represented as a 294-dimensional feature vector and belongs to one ormore (1.07 in average) of 6 categories (beach, sunset,fall foliage, field, mountain, and urban).

• SIAM TMC 2007: text dataset for SIAM Text MiningCompetition 2007 consisting of 28, 596 text samples,each of which belongs to one or more (2.21 in aver-age) of 22 categories. To simplify the problem, in ourexperiments we randomly select a subset containing3, 000 samples from the original dataset, then usebinary term frequencies and normalize each instanceto unit length (30, 438-dimensional vector).

We chose these three multi-label datasets because thatrepresent a range of situations, most instances in Yeasthave more than one label, while most instances in Scenehave only one label, and SIAM TMC 2007 data is in avery high dimensional space.

9.1.2 Empirical ResultTo comprehensively compare our proposed algorithmwith the five baseline methods, we train the algorithmson data with varying numbers of labeled instances. Ineach trial, for each label we randomly select a portionof instances from the dataset as the training data, whilethe remaining unlabeled ones are used for testing. Theportion of labeled data gradually increases from 2.5%to 50% with a step size of 2.5%. In Fig. 4, we reportthe average F1 Micro scores and standard deviations, allof which are based on the average over 50 trials. Forthe three real-world datasets that we explored in theexperiments, when all experimental results (regardless ofstep size) are pooled together, we see that the proposedapproach SSDR-MML performs significantly better thanthe competing algorithms in terms of both lower errorrate and standard deviation. Compared to the result inFig. 4(a) and 4(b), we observe from Fig. 4(c) that theperformance of our method is especially good on thehigh-dimensional data SIAM TMC 2007. This impliesthat on high-dimensional data typical multi-label meth-ods would fail, due to the lack of the connection betweendimension reduction and learning algorithm. By thepromising performance of the proposed method shownin Fig. 4(c), where we iteratively perform learning andspectral embedding on the SIAM TMC 2007 dataset, wedemonstrate the effectiveness of connecting dimensionreduction and learning. This is especially useful whenthe data are in the high-dimensional space.



10

0 5 10 15 20 25 30 35 40 45 500.57

0.58

0.59

0.6

0.61

0.62

0.63

0.64

0.65

0.66

0.67


F1 M

icro

Multi−label learning on Yeast dataset

RankSVM

ML−GFHF

Regularized MTL

AdaBoost−MH

BR−RDT

SSDR−MML

(a) Yeast

0 5 10 15 20 25 30 35 40 45 50

0.35

0.4

0.45

0.5

0.55

0.6


F1 M

icro

Multi−label learning on Scene dataset

RankSVM

ML−GFHF

Regularized MTL

AdaBoost−MH

BR−RDT

SSDR−MML

(b) Scene

0 5 10 15 20 25 30 35 40 45 500.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65


F1 M

icro

Multi−label learning on TMC 2007 dataset

RankSVM

ML−GFHF

Regularized MTL

AdaBoost−MH

BR−RDT

SSDR−MML

(c) SIAM TMC 2007

Fig. 4. Learning performance measured by F1 Micro score w.r.t. different numbers of labeled instances

9.2 Multi-view Learning9.2.1 Experimental SettingsWe evaluate our framework for multi-view learning ontwo applications: (1) image classification on Caltech [44]dataset; and (2) a set of UCI benchmarks. We comparethe performance of our SSDR-MML approach againstthree baseline models: (1) SVM, standard support vectormachine [45] on each view of the data separately. (2)SKMsmo, multiple kernel learning [22] solved by sequen-tial minimal optimization; (3) Bayesian Co-Training (BCT)[20], a Gaussian process consensus learner. To further un-derstand our framework, we also compare SSDR-MMLagainst two of its variates, i.e. (i) SSDR-MML Simple:SSDR-MML on single-viewed data which is obtained bysimply joining the two view features into a single view;and (ii) SSDR-MML nonsparse, our SSDR-MML imple-mentation without sparsity enforcement, which meansexcessive label propagation could happen. For both SVMand SKMsmo, the kernel is constructed using RBF withlength scale σ = 1

n(n−1)∑ni=1

∑j 6=i ‖ xi − xj ‖, and the

penalty coefficient C0 is fixed to 100. For Bayesian Co-Training (BCT), we choose a standard Gaussian process(GP) setting: zeros mean function, Gaussian likelihoodfunction, and isotropic squared exponential covariancefunction without latent scale. The hyperparameters inGP models can be adapted to the training data, thusthe predictions of Bayesian Co-Training can be read-ily obtained by consenting the predictions of GP fromdifferent views using the learned noise variances. Toavoid overfitting, the number of function evaluations ingradient-based optimization is limited to a maximum of100. In our SSDR-MML approach, we set λ = 1, andthe importance of each view is decided by finding themaximal success measure, αi = argmaxαi CP(W).

9.2.2 Color-aided SIFT MatchingCaltech-256 image dataset [44] consists of 30, 608 imagesin 256 categories, where each category contains around100 images on average. We conduct the experimentson ten binary labels that are randomly selected fromCaltech-256, as summarized in Table 3. Since images are

difficult to describe, there are several standard meth-ods to extract features from an image, such as edgedirection, and color or visual word histogram. However,there is no straightforward answer to tell what kindof features can outperform others, as the performanceof each type of features highly depends on the specificapplications. Thus, it could be desirable if we can makeuse of multiple image descriptors to help the learner.In the experiment, we exploit two image descriptionsfor each image: (1) color histogram, a representation ofthe distribution of colors in an image; and (2) visualword histogram, which is based on the successful imagedescriptor SIFT [46]. Although SIFT can accurately detectand describe interesting points in an image, it suffersfrom the limitation that information carried by color isignored. Consequently, higher learning accuracy can beexpected if we are able to perform learning on both SIFTand color simultaneously. In the preprocessing of thedata, we construct the color histogram (80 bins in HSVcolor space) by counting the number of pixels that havecolors in each of a fixed list of color ranges that spanthe color space. To produce the visual word histogram,we first build a visual vocabulary (800 visual words) byclustering all the SIFT descriptions collected from theimage dataset, and then the histograms can be obtainedby mapping images to the visual codebook, which isoften called bag-of-word.

TABLE 3Ten binary labels selected from Caltech-256

Label Binary Label Data Size1 binocular vs killer-whale 216 vs 912 breadmaker vs telephone-box 141 vs 843 eyeglasses vs skyscraper 82 vs 954 fireworks vs self-propelled lawnmower 100 vs 1205 mars vs saturn 155 vs 926 pci-card vs sunflower 105 vs 807 swiss-army-knife vs telephone-box 109 vs 848 tennis-ball vs zebra 98 vs 969 tomato vs washing machine 103 vs 8410 video-projector vs watermelon 97 vs 93

We compare the performance of our approach against



11

TABLE 4Mean error rates and standard deviations (in percentage)of the ten binary labels from Caltech-256. The best result

is shown in bold.Methods Label 1 Label 2 Label 3

SVM-SIFT 14.05±3.75 13.77±3.97 23.80±8.00SVM-color 11.45±3.16 15.44±3.86 19.81±3.76SKMsmo 11.26±3.20 15.20±3.89 19.84±3.58

BCT 11.63±3.32 14.78±3.88 19.85±3.28SSDR-MML Simple 11.73±3.63 13.95±3.91 21.27±5.21

SSDR-MML nonsparse 15.09±5.37 16.31±4.07 23.83±8.22SSDR-MML 9.14±2.55 12.08±3.15 19.87±3.26

Methods Label 4 Label 5 Label 6SVM-SIFT 7.11±2.47 24.54±6.02 11.77±4.77SVM-color 5.55±1.91 29.72±4.96 15.04±3.98SKMsmo 5.52±1.89 25.17±3.13 14.55±3.93



Methods Label 7 Label 8 Label 9SVM-SIFT 16.39±2.72 8.91±1.81 25.27±6.70SVM-color 13.45±3.48 23.50±5.22 17.80±4.54SKMsmo 13.40±3.50 22.56±5.06 17.76±4.72



Methods Label 10SVM-SIFT 18.95±6.68SVM-color 12.94±4.38SKMsmo 12.57±4.17

BCT 15.24±6.14SSDR-MML Simple 16.43±5.02

SSDR-MML nonsparse 16.27±6.18SSDR-MML 12.14±4.90

the three baseline techniques on the ten predefinedbinary labels. In each trial, we randomly select 10% ofthe images as the training set, and the rest of the imagesare used as the test set. The experiment are repeated 50times, and the mean error rates and standard deviationsare reported in Table 4. It can be seen that in generalthe proposed SSDR-MML approach outperforms all thecompeting techniques. Moreover, it performs statisticallysignificantly better than baseline models with both lowermean error rate and standard deviation. In addition, aswe can observe from the result, the multi-view learningalgorithms generally perform better than any of thesingle-view learners. This confirms the motivation ofmulti-view transfer that using the knowledge carried byall available feature descriptions to improve the perfor-mance of the learner. As expected, SIFT is more discrimi-native on some labels, while color is more discriminativeon the other labels. Sometimes, the performances of SIFTand color feature are dramatically different. For example,in the binary label “tennis-ball vs zebra” color featureis much reliable than SIFT, while in the binary label“video-projector vs watermelon” SIFT performs signif-icantly better than color. This demonstrate the necessity

of multi-view transfer, as it is generally difficult for theusers to tell which kind of features can outperform oth-ers. By comparing the performance of SSDR-MML withits two variates, we see that (1) simply placing featuresfrom multiple views into a single view does not workwell due to the increased dimension and normalizationissues and (2) Sparsity need to be enforced in semi-supervised settings since excessive label propagation isgenerally detrimental and unreliable.

9.2.3 UCI Benchmarks

The second evaluation of multi-view learning is carriedout on a set of UCI benchmarks [47], namely (1)Adult(subset): extracted from the census bureau database; (2)Diabetes: contains the distribution for 70 sets of datarecorded on diabetes patients; (3) Ionosphere: radar datawas collected by a system that consists of a phasedarray of 16 high-frequency antennas with a total trans-mitted power on the order of 6.4 kilowatts; (4) LiverDisorders: attributes are collected from blood tests whichare thought to be sensitive to liver disorders; (5) Sonar:contains signals obtained by bouncing sonar signals offa metal cylinder at various angles and under variousconditions.; and (6) SPECT Heart: describes diagnosing ofcardiac Single Proton Emission Computed Tomography(SPECT) images. We chose these six datasets since theyhave been widely used in the evaluations of variouslearning algorithms. To create two views for each of thedatasets, we equally divide the features of each datasetinto two disjoint subsets such that they are related butdifferent, and thus each subset can be considered asone view. The details of the six UCI benchmarks aresummarized in Table 5.

TABLE 5Statistics of UCI benchmarks

Dataset Name Instance No. View 1 Feature No. View 2 Feature No.

Adult (subset) 1, 605 60 59Diabetes 768 4 4

Ionosphere 351 17 17Liver Disorders 345 3 3

Sonar 208 30 30SPECT Heart 270 7 6

We follow the previously stated methodology, where10% of the samples are used for training and the re-maining 90% for testing, to evaluate the performance ofour SSDR-MML framework on multi-view setting. Theresulting mean error rates and standard deviations onthe six UCI benchmarks, which are based on 50 randomtrials, are reported in Table 6. It can be observed thatthe multiple view learning techniques generally out-perform the single view classifiers, which substantiatesthe benefits of learning with multiple views. Amongall techniques evaluated in the experiment, our SSDR-MML approach performs statistically significantly betterthan all competitors with both lower misclassificationrates and standard deviations. The performance of the



12

proposed framework on multi-view learning tasks notonly demonstrates the effectiveness of our approach, butalso validates the advantage of simultaneous learning ofmultiple feature descriptions. On the two variates of ourapproach, we also see that excessive label propagationand simply joining the multiple views into one could beharmful to the learning performance, which confirms theconclusion made in Section 9.2.2.

TABLE 6Mean error rates and standard deviations (in percentage)

on UCI benchmarks. The best result is shown in bold.Methods Adult(subset) Diabetes Ionosphere

SVM-SIFT 22.37±1.50 35.22±2.04 10.70±4.73SVM-color 24.23±1.55 35.84±3.14 19.57±5.33SKMsmo 21.24±1.32 32.71±2.88 12.55±4.86



Methods Liver Disorders Sonar SPECT HeartSVM-SIFT 45.87±3.43 32.91±4.97 36.47±6.66SVM-color 45.05±3.83 34.91±3.64 26.77±4.75SKMsmo 45.47±3.66 33.18±4.90 29.07±4.22



Based on the promising experiment results, we con-clude that the proposed SSDR-MML framework is ad-vantageous when (1) there are multiple feature descrip-tions available for each instance that are neither toodifferent nor too similar; or (2) the multiple labels arehighly related; or (3) the data is sparsely labeled and inhigh dimensional space.

9.3 Stability to Parameters

We empirically show the stability of the performance ofour framework with respect to the parameters (αk andβk) in Fig. 5. We first pick two highly related labels,Cell Growth, Cell Division, DNA synthesis v.s. Tran-scription, from Yeast dataset (gene dataset with multiplelabels), on which we show the learning performance ofour framework in Fig. 5(a) using F1 Micro score w.r.t.different settings of βk (0.1 ≤ βk ≤ 1), where β1 and β2

are two parameters used to balance the influence of thetwo labels. We then choose a subset (swiss-army-knifev.s. telephone-box) from Caltech-256 image dataset, andconstruct two views from the images: (1) visual word his-togram using SIFT visual feature where color is ignoredby definition; and (2) color histogram. Fig. 5(b) shows theerror rate of our framework on the binary label learningw.r.t. different settings of αk (0.1 ≤ αk ≤ 1), where α1

and α2 are two parameters used to balance the influenceof the two views. In both Fig. 5(a) and 5(b), we see thatlearning performance surface is relatively smooth.

0

0.2

0.4

0.6

0.8

1 0

0.2

0.4

0.6

0.8

10.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

0.75

β2

Percentage of labeled instances = 30 %

β1

F1 M

icro

(a) Multi-label learning on Yeast

0

0.2

0.4

0.6

0.8

1 00.2

0.40.6

0.81

0.05

0.1

0.15

0.2

0.25

0.3

0.35

α2

Percentage of labeled instances = 30 %

α1

Err

or

Rate

(b) Multi-view learning on Caltech-256

Fig. 5. Learning performance of SSDR-MML w.r.t. differ-ent parameter settings

9.4 Empirical Evaluation of Our Success Measure

We shall now illustrate the use of the proposed suc-cess measure. We apply our approach to two real-world datasets: Yeast gene dataset and Caltech-256 im-age dataset. Since our algorithms are deterministic, wecan obtain a variety of experiments by changing the pa-rameters αk or βk. We collect a set of learning accuraciesand the corresponding values of CP (W) under thesedifferent parameterizations, and normalize all of themto lie between 0 and 1. In this experiment we adopttwo labels (DNA synthesis and transcription) for multi-label learning on Yeast, and two views (visual word andcolor) for multi-view learning on Caltech-256. Fig. 6(a)shows the learning accuracy v.s. the value of FTWzF(sum of off-diagonal values) under multi-label settingon Yeast dataset with respect to different βk’s. Fig. 6(b)shows the normalized learning accuracy v.s. the valueof FTWzF (naturally a single value) under multi-viewsetting on Caltech-256 dataset with respect to differentparameterizations of αk. The results shown in the twofigures are quantified and averaged from 100 randomtrials (per parameterization). The red solid curve showsthe results obtained using our SSDR-MML algorithm,while the black dash line marks the perfect proportionalrelation. Observing the results, in both settings we seethat the value of FTWzF is approximately linear tothe learning accuracy. Therefore, we conclude that a



13

parameterization with relatively higher value of FTWzFwill generally lead to a better learning result, and thusαk or βk should be set to the one which corresponds tothe highest value of FTWzF.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Normalized Learning Accuracy

No

rma

lize

d F

TW

zF

Normalized Learning Accuracy VS. FTW

zF on Yeast dataset

(a) Multi-label learning on Yeast

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Normalized Learning Accuracy

No

rma

lize

d F

TW

zF

Normalized Learning Accuracy VS. FTW

zF on Caltech−256 dataset

(b) Multi-view learning on Caltech-256

Fig. 6. Two examples showing the approximately pro-portional relation between the learning accuracy and thevalue of CP (W).

10 CONCLUSIONAs applications in machine learning and data miningmove towards demanding domains, they must movebeyond the restrictions of complete supervision, single-label, single-view and low-dimensional data. In thispaper we present a joint learning framework basedon reconstruction error, which is designed to handleboth multi-label and multi-view learning settings. Itcan be viewed as simultaneously solving for two setsof unknowns: filling in missing labels and identifyingprojection vectors that makes points with similar labelsclose together and points with different labels far apart.As to improve the learning performance, the underlyingobjective is to enable the learner to combine knowl-edge from multiple labels and views by partially fittingthe graph to each of them. Empirically, our proposedapproach was shown to give more reliable results onreal world applications with both lower error rate andstandard deviation compared to the baseline models, asshown in Section 9. Perhaps the most useful part ofour approach is that since the mechanism for combiningknowledge is explicit it also offers the ability to measurethe success of learning.

ACKNOWLEDGMENTS

The authors gratefully acknowledge support of this re-search from ONR grants N00014-09-1-0712, N00014-11-1-0108 and NSF Grant NSF IIS-0801528.

REFERENCES[1] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York:

Springer-Verlag, 2002.[2] P. N. Belhumeur, J. P. Hespanha, P. Hespanha, and D. J. Kriegman,

“Eigenfaces vs. fisherfaces: Recognition using class specific linearprojection,” IEEE Transactions on Pattern Analysis and MachineIntelligence, vol. 19, pp. 711–720, 1997.

[3] A. Blum and T. Mitchell, “Combining labeled and unlabeleddata with co-training,” in Proceedings of the eleventh annualconference on Computational learning theory, ser. COLT’ 98. NewYork, NY, USA: ACM, 1998, pp. 92–100. [Online]. Available:http://doi.acm.org/10.1145/279943.279962

[4] D. Yarowsky, “Unsupervised word sense disambiguationrivaling supervised methods,” in Proceedings of the 33rdannual meeting on Association for Computational Linguistics,ser. ACL ’95. Stroudsburg, PA, USA: Association forComputational Linguistics, 1995, pp. 189–196. [Online]. Available:http://dx.doi.org/10.3115/981658.981684

[5] B. Qian and I. Davidson, “Semi-supervised dimension reductionfor multi-label classification,” in AAAI’10: In Proceedings of the24rd National Conference on Artificial Intelligence. Menlo Park,California, USA: The AAAI Press, 2010, pp. 569–574.

[6] J. Chen, J. Zhou, and J. Ye, “Integrating low-rank and group-sparse structures for robust multi-task learning,” in KDD, 2011,pp. 42–50.

[7] R. Caruana, “Multitask learning: A knowledge-based source ofinductive bias,” in Proceedings of the Tenth International Conferenceon Machine Learning. Morgan Kaufmann, 1997, pp. 41–48.

[8] T. Evgeniou and M. Pontil, “Regularized multi–task learning,”in KDD ’04: Proceedings of the tenth ACM SIGKDD internationalconference on Knowledge discovery and data mining, New York, NY,USA, 2004, pp. 109–117.

[9] S. Ben-David and R. Schuller, “Exploiting task relatedness formultiple task learning,” in COLT, 2003, pp. 567–580.

[10] H. Fei and J. Huan, “Structured feature selection and task rela-tionship inference for multi-task learning,” in ICDM, 2011, pp.171–180.

[11] X. Kong, M. K. Ng, and Z.-H. Zhou, “Transductive multilabellearning via label set propagation,” IEEE Trans. Knowl. Data Eng.,vol. 25, no. 3, pp. 704–719, 2013.

[12] K. Yu, V. Tresp, and A. Schwaighofer, “Learning gaussian pro-cesses from multiple tasks,” in ICML, 2005, pp. 1012–1019.

[13] E. Bonilla, K. M. Chai, and C. Williams, “Multi-task gaussianprocess prediction,” in Advances in Neural Information ProcessingSystems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds.Cambridge, MA: MIT Press, 2008, pp. 153–160.

[14] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-tasklearning for classification with dirichlet process priors,” J. Mach.Learn. Res., vol. 8, pp. 35–63, May 2007. [Online]. Available:http://dl.acm.org/citation.cfm?id=1248659.1248661

[15] B. Qian, X. Wang, J. Ye, and I. Davidson, “Multi-objective multi-view spectral clustering via pareto optimization,” in SDM, 2013,pp. 234–242.

[16] X. Liu, S. Ji, W. Glanzel, and B. D. Moor, “Multiview partitioningvia tensor methods,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 5,pp. 1056–1069, 2013.

[17] G. Li, K. Chang, and S. C. H. Hoi, “Multiview semi-supervisedlearning with consensus,” IEEE Trans. Knowl. Data Eng., vol. 24,no. 11, pp. 2040–2051, 2012.

[18] X. Wang, B. Qian, and I. Davidson, “Improving document clus-tering using automated machine translation,” in CIKM ’12: Pro-ceedings of the 21st ACM International Conference on Information andKnowledge Management, 2012, pp. 645–653.

[19] S. Dasgupta and M. Littman, “Pac generalization bounds for co-training,” in Neural Information Processing Systems, 2002.

[20] S. Yu, B. Krishnapuram, R. Rosales, H. Steck, and R. B. Rao,“Bayesian co-training,” in Advances in Neural Information Process-ing Systems 20, J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds.Cambridge, MA: MIT Press, 2008, pp. 1665–1672.



14

[21] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui,and M. I. Jordan, “Learning the kernel matrix with semidefiniteprogramming,” Journal of Machine Learning Research (JMLR),vol. 5, pp. 27–72, December 2004. [Online]. Available:http://portal.acm.org/citation.cfm?id=1005332.1005334

[22] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan, “Multiple kernellearning, conic duality, and the smo algorithm,” in Proceedingsof the twenty-first international conference on Machine learning, ser.ICML ’04. New York, NY, USA: ACM, 2004, pp. 6–. [Online].Available: http://doi.acm.org/10.1145/1015330.1015424

[23] S. Sonnenburg, G. Ratsch, and C. Schafer, “A General and EfficientMultiple Kernel Learning Algorithm,” in Advances in Neural Infor-mation Processing Systems 18, Y. Weiss, B. Scholkopf, and J. Platt,Eds. Cambridge, MA: MIT Press, 2006, pp. 1273–1280.

[24] C. X. Ling, J. Du, and Z.-H. Zhou, “When does co-training workin real data?” in Proceedings of the 13th Pacific-Asia Conference onAdvances in Knowledge Discovery and Data Mining, ser. PAKDD ’09.Berlin, Heidelberg: Springer-Verlag, 2009, pp. 596–603.

[25] Y. Zhang and Z.-H. Zhou, “Multi-label dimensionality reductionvia dependence maximization,” in AAAI’08: Proceedings of the 23rdNational Conference on Artificial Intelligence, 2008, pp. 1503–1505.

[26] K. Yu, S. Yu, and V. Tresp, “Multi-label informed latent semanticindexing,” in Proceedings of the 28th annual international ACMSIGIR conference on Research and development in information retrieval,2005, pp. 258–265.

[27] G. Chen, Y. Song, F. Wang, and C. Zhang, “Semi-supervised multi-label learning by solving a sylvester equation,” in Proceedings ofthe SIAM International Conference on Data Mining, 2008, pp. 410–419.

[28] L. Sun, S. Ji, and J. Ye, “Hypergraph spectral learning for multi-label classification,” in Proceedings of the 14th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining,2008, pp. 668–676.

[29] S. Ji and J. Ye, “Linear dimensionality reduction for multi-labelclassification,” in Proceedings of the 21st international jont conferenceon Artifical intelligence, 2009, pp. 1077–1082.

[30] B. Qian, X. Wang, F. Wang, H. Li, J. Ye, and I. Davidson, “Activelearning from relative queries,” in IJCAI, 2013.

[31] X. Zhu, Z. Ghahramani, and J. D. Lafferty, “Semi-supervisedlearning using gaussian fields and harmonic functions,” in ICML,2003, pp. 912–919.

[32] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schlkopf,“Learning with local and global consistency,” in Advances inNeural Information Processing Systems, 2004, pp. 321–328.

[33] W. Liu, B. Qian, J. Cui, and J. Liu, “Spectral kernel learning forsemi-supervised classification,” in IJCAI’09: Proceedings of the 21stinternational jont conference on Artifical intelligence. San Francisco,CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 1150–1155.

[34] N. D. Lawrence and M. I. Jordan, “Semi-supervised learning viagaussian processes,” in Advances in Neural Information ProcessingSystems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds. Cambridge,MA: MIT Press, 2005, pp. 753–760.

[35] X. Zhu, “Semi-Supervised Learning Literature Survey,” ComputerSciences, University of Wisconsin-Madison, Tech. Rep., 2005.

[36] S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reductionby locally linear embedding,” Science, vol. 290, pp. 2323–2326,2000.

[37] J. Wang, T. Jebara, and S. fu Chang, “Graph transduction viaalternating minimization,” in Proc. 25th ICML, 2008.

[38] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, and T. G. Dietterich,“To transfer or not to transfer,” in In NIPS05 Workshop, InductiveTransfer: 10 Years Later, 2005.

[39] A. E. Elisseeff and J. Weston, “A kernel method for multi-labelledclassification,” in In Advances in Neural Information ProcessingSystems 14, 2001, pp. 681–687.

[40] R. E. Schapire and Y. Singer, “Boostexter: A boosting-based sys-temfor text categorization,” Mach. Learn., vol. 39, no. 2-3, pp. 135–168, May 2000.

[41] X. Zhang, Q. Yuan, S. Zhao, W. Fan, W. Zheng, and Z. Wang,“Multi-label Classification without the Multi-label cost,” in Pro-ceedings of the Tenth SIAM International Conference on Data Mining,Apr. 2010.

[42] Y. Yang, “An evaluation of statistical approaches to text catego-rization,” Journal of Information Retrieval, vol. 1, pp. 67–88, 1999.

[43] A. Elisseeff and J. Weston, “A Kernel Method for Multi-LabelledClassification,” in Annual ACM Conference on Research and Devel-opment in Information Retrieval, 2005, pp. 274–281.

[44] G. Griffin, A. Holub, and P. Perona, “Caltech-256 object categorydataset,” California Institute of Technology, Tech. Rep. 7694,2007. [Online]. Available: http://authors.library.caltech.edu/7694

[45] C.-C. Chang and C.-J. Lin, LIBSVM: a libraryfor support vector machines, 2001. [Online]. Available:http://www.csie.ntu.edu.tw/ cjlin/libsvm

[46] D. G. Lowe, “Distinctive image features from scale-invariantkeypoints,” Int. J. Comput. Vision, vol. 60, pp. 91–110, November2004.

[47] A. Frank and A. Asuncion, “UCI machine learning repository,”2010. [Online]. Available: http://archive.ics.uci.edu/ml

Buyue Qian is currently a research scientist atIBM T. J. Watson Research. He received hisPhD in 2013 from Computer Science Depart-ment, University of California at Davis. Beforethat, he received Master of Science (2009) fromColumbia University, and BS in Information En-gineering (2007) from Xi’an Jiaotong University.The major awards he received include Yahoo!Research Award, IBM Eminence & ExcellenceAward, and SIAM Data Mining’13 Best ResearchPaper Runner Up.

Xiang Wang is currently a research scientistat IBM T. J. Watson Research. He receivedhis PhD in 2013 from Computer Science De-partment, University of California at Davis. Be-fore that, he received Master of Software Engi-neering (2008) and BS in Mathematics (2004)from Tsinghua University. The major awardshe received include SIAM Data Mining’13 BestResearch Paper Runner Up, IBM InventionAchievement Award, and UC Davis Best Grad-uate Researcher Award.

Jieping Ye is an Associate Professor of Com-puter Science and Engineering at Arizona StateUniversity. He received his Ph.D. in ComputerScience from University of Minnesota, TwinCities in 2005. His research interests includemachine learning, data mining, and biomedi-cal informatics. He won the outstanding studentpaper award at ICML in 2004, the SCI YoungInvestigator of the Year Award at ASU in 2007,the SCI Researcher of the Year Award at ASU in2009, the NSF CAREER Award in 2010, and the

KDD best research paper award honorable mention in 2010.

Ian Davidson grew up in Australia and obtainedhis Ph.D. at Monash University, Melbourne. Hejoined the University of California Davis in 2007and is now a Professor of Computer Science.He directs the Knowledge Discovery and DataMining Lab and his work is supported by grantsfrom NSF, ONR, OSD Google and Yahoo!. He isa senior member of the IEEE.



a reconstruction error based framework for multi …davidson/publications/tkde...male, bearded,...

Documents