semi-supervised object detector learning from minimal...

Semi-supervised Object Detector Learning from Minimal Labels

Sudeep [email protected]

Abstract

While traditional machine learning approaches to classi-fication involve a substantial training phase with significantnumber of training examples, in a semi-supervised setting,the focus is on learning the trends in the data from a limitedtraining set and simultaneously using the trends learnedto label unlabeled data. This report focuses on a particu-lar semi-supervised learning technique called Graph-basedRegularized Least Squares (LapRLS) that can learn fromboth labeled and unlabeled data to build a robust objectdetector capable of learning purely from a single trainingexample and a reasonably large set of unlabeled exam-ples. The algorithm employs multiple feature descriptorssuch as GIST and Pyramid Histogram of Oriented Gradi-ents (PHOG) for learning these trends. Performance com-parisons against traditional supervised learning algorithmsdemonstrate that LapRLS outperforms the supervised clas-sifiers on several datasets especially when the number oftraining examples are minimal. As a particular application,the LapRLS performs considerably well on Caltech-101, anobject recognition dataset.

1. Introduction

In a setting where labeled data is hard to find or expen-sive to attain, we can formulate the notion of learning fromthe vast amounts of unlabeled instances in the data givena few minimal labels per class instance. Formally, semi-supervised learning addresses this problem by using a largeamount of unlabeled data along with labeled data to makebetter predictions of the class of the unlabeled data. Fun-damentally, the goal of semi-supervised classification is totrain a classifier f from both the labeled and unlabeled data,such that it is better than the supervised classifier trained onthe original labeled data alone. Several object learning tech-niques [2], [7] based on semi-supervised frameworks haveshown convincing results for classification with minimal la-bels.

Semi-supervised learning has tremendous practical valuein several domains [11] including speech recognition, pro-tein 3D structure prediction, video surveillance etc. In this

report, the primary focus is to learn trends from labeled im-age data that may be readily available from human annota-tion, or some external source, and using the learned knowl-edge to label the evergrowing data deluge of unlabeled im-ages on the internet. Particularly, the focus is on utilizingthese semi-supervised techniques on Caltech-101 [4], [5],an object recognition dataset while also providing convinc-ing results on toy datasets.

2. Background and related workBefore we delve into the details of the motivation and

implementation behind semi-supervised learning, it is im-portant to differentiate two distict forms of semi-supervisedlearning settings. In semi-supervised classification, thetraining dataset contains some unlabeled data, unlike inthe supervised setting. Therefore, there are two distinctgoals; one is to predict the labels on future test data, andthe other goal is to predict the labels on the unlabeled in-stances in the training dataset. The former is called in-ductive semi-supervised learning and the latter transductivelearning [12].

While it is reasonable that semi-supervised learning canuse additional unlabeled data to learn a better predictor f ,the key lies in the model assumptions about the relation be-tween the marginal distribution P (x) and the conditionaldistribution P (y|x).

(a) (b)

Figure 1. (a). Plots visualizing the cluster assumption, (b). and thecluster/manifold assumption

In this semi-supervised scenario, the idea is to use a reg-ularizer that prefers functions which vary smoothly alongthe manifold and do not vary in high density regions as

Figure 2. The plots depict the motivation for graph-based learn-ing where one can leverage the information from both labeled andunlabeled data

depicted in figure 1 (b). These assumptions known as thecluster/manifold assumption [12], predominantly ensure thedata from each class lies on a different continuous manifoldand that the similarity is measured along the manifold.

3. Graph-based Semi-Supervised LearningVia the cluster/manifold assumption, the learning en-

sures that we pick a regularizer that prefers functions thatare differentiable and vary smoothly along the manifold andthat does not vary in high density regions. In order to labelpoints that are similar to each other with the same label,we create a graph where the nodes represent the data pointsL ∪ U (labeled and unlabeled, this is discussed in furtherdetail in later sub-sections) and the edges represent the sim-ilarity measure between data points.

One can think of using the graph as a way to approximatethe manifold (and density) on which the data lies. With thismotivation in mind, we formalize the graph representationin our semi-supervised learning framework in the followingsections.

3.1. Graph representation

Let L = (x1, y1) . . . (xl, yl) be the labeled data withy ∈ 1, . . . , C, and U = xl+1, . . . , xl+u the unlabeled data,usually l � u, and n = l + u. The formulation of semi-supervised learning in a graph setting can be defined asfollows; Given training data (xi, yi)

li=1, xl+u

j=l+1, the ver-tices represent the labeled and unlabeled instances L ∪ U .Obviously, this graph turns out be fairly large. The semi-supervised classification setting requires learning what yvalues each of the nodes take in the graph. This is ac-complished by first connecting edges between each of thelabeled instance nodes in the graph with the unlabeled in-stances. As expected, these edges represent the similaritybetween each of the nodes xi, xj involved in the edge eij .Let wij represent the edge weight, in other words, a simi-larity measure between nodes xi, xj .

3.2. The Graph Laplacian

The above mentioned graph can be represented by ann×n weight matrix W , with wij . As expected, the weightsare non-negative and symmetric. An important quantity

from graph theory, the combinatorial Graph Laplacian, isthe matrix representation of a graph. The Graph Laplacianis defined as

L = D −W (1)

where Dii =∑

j Wij is the degree of node i and W is theweight matrix as defined above. An important point to noteis that f is constrained to take values f(i) = yi, i ∈ L onthe labeled data. Using the laplacian, one can define theenergy function for a graph by the following:

E(f) =1

2

∑i,j

wij(f(i)− f(j))2 = fTLf (2)

Minimization of the smoothness term fTLf can beachieved by the trivial solution of f = 1, but in a semi-supervised setting such as this one, the minimization is acombination of the smoothness and the training loss(labelagreement). For the squared training loss, this is defined as

J(f) = fTLf︸︷︷︸smoothness

+ (f − y)T Λ(f − y)︸︷︷︸label agreement

(3)

where Λ is the diagonal matrix whose diagonal elements areΛii = 0 for i ∈ U (unlabeled points), and Λii = λ for i ∈ L(labeled points). The final solution has a closed form to thefollowing equation (L+ Λ)f = Λy given by:

f = (L+ Λ)−1Λy (4)

Although the solution can be given in closed form for thesquared error loss, it requires a solution to an n × n linearsystem which poses serious problems when n grows larger.While this report doesn’t focus on larger datasets, it is im-portant to note that various approximations to the final so-lution exists for larger n, specifically as suggested in [6].

4. Experiments and ResultsIn the following section, the performance of the Graph-

based Laplacian Regularized Least Squares (LapRLS) algo-rithm is evaluated and tested on a variety of toy datasets andCaltech-101 [4], [5], an object recognition dataset.

4.1. Toy Datasets: 2 Moons Dataset

In order to guage and visualize the performance of theseaforementioned algorithms, a few toy datasets were gener-ated. Since the data is represented in 2D, we use the dataas is and build the weight matrix, and laplacian matrix forthe data. The edge weights are determined using an RBFwith σ = 0.35 for the case of the SVM-RBF Kernel, andthe LapRLS.

A classical dataset for classification is the 2 moonsdataset, where the datapoints can be easily classified byvisualizing the data, but non separable with linear kernel

(linear)

rlsc

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

−0.5

0

0.5

1

(linear)

svm

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

−0.5

0

0.5

1

(a) (b)

(rbf) : σ =0.2

svm

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

−0.5

0

0.5

1

(rbf) : σ =0.2

laprlsc

−1.5 −1 −0.5 0 0.5 1 1.5 2 2.5

−0.5

0

0.5

1

(c) (d)Figure 3. Plots comparing the performance of RLS (a), Linear-SVM (b), RBF-SVM (c), and LapRLS (d) with a RBF kernel withincreasing number of training examples. The decision boundariesclearly indicate that the LapRLS outperforms in each of the scenar-ios and predicts decision boundaries even with a minimal trainingset (colored points are labeled training examples Nt).

SVMs, and RLS. The plots in figure 3 below show the per-formance of each of the classifiers discussed, with LapRLSoutperforming especially when the number of training ex-amples is minimal.

4.1.1 Performance compared to kNN classifiers

It is of particular interest to understand why the perfor-mance of the LapRLS with an RBF kernel is significantcompared to the other classifiers in this case. Since thereis continuity in the data manifold, one can propagate labelsaccordingly by minimizing the overall objective functionkeeping the regularity in manifold consistent. As comparedto a kNN classifier, the LapRLS avoids labeling the secondmanifold (data from a different class) with the same labelas itself since there are high derivatives near regions wheremanifolds come close to each other. Figure 4 compares atraditional kNN classifier with the graph-based LapRLS fordifferent values of k. The plots clearly show that the RBF-LapRLS outperform the kNN classifier in most cases (bothin cases where Nt is low and high). In the case of kNNclassifiers, the high misclassification error rate can be at-tributed to outliers in the data that tend to propagate labelsundesirably, leading to overall misclassification of the data.

4.1.2 Performance with Minimal Labels

When only a few training examples are available, bothlinear-SVMs and linear-RLS tend to find a half-plane thatbest classifies the training examples without any notion ofutilizing full information. The linear-SVM classifier was

0 5 10 15 200

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Misclassification rate (%) vs. number of training examples on 2moons

Number of training examples

Mis

cla

ssific

atio

n r

ate

(%

)

rlsc−linear

svm−linear

svm−rbf

laprlsc−rbf

Figure 4. Plots showing the decreasing misclassification error ratewith increasing number of training examples Nt. The figure tothe left compares various kNN classifiers with the LapRLS clas-sifier. The figure to the right compares the RLS, Linear-SVM,RBF-SVM and the LapRLS misclassification error rate. Both plotsshow how the LapRLS outperforms other classifiers in both cases.

modified to include a slack variable for datasets that werelinearly non-separable. Both these classifiers show simi-lar results, but overall poor performance on the 2 moonsdataset. While RLS and linear-SVM have relatively lowmisclassification error with fewer training examples Nt inthis case, as Nt increases, the misclassification rates tend toincrease.

In the case of the RBF-SVM (3rd column in figure 4),performance tends to increase with increasing Nt. This isexpected as the classifier tends to overfit to the training data,and as Nt approaches N (full test/train dataset size), theperformance should converge to optimal. As a side note, inorder to compare the performance between the RBF-SVMand the RBF-LapRLS, both variations were provided withthe same kernel parameters. The interesting performancemeasures are when the training size Nt is minimal (1st and2nd row). Here we see that the LapRLS outperforms theRBF-SVM by correctly classifying all the unlabeled pointsin the dataset, while the RBF-SVM uses only the trainingexamples to make classification decisions. This results inpoor performance and lack of generalization for the specificdataset as seen in rows 1, 2, and 3 in figure 4 (columns3 and 4 comparing RBF-SVM and LapRLS). The LapRLSperforms substantially better than any of the classifiers com-pared, even with minimal data, or with full training dataNt.

In short, utilizing unlabeled data does not necessarilyhurt the classification as long as the data belongs to either ofthe classes in question. If however, spurious data points thatcorrespond to a class that is not well represented or labeledin the training set tends to bias the classification, especiallywhen their manifolds overlap or approach each other.

Figure 5. The GIST descriptor for the corresponding image on theleft. The image on the right shows average filter energy in eachbin once oriented gabor filters are applied to the image.

4.2. Caltech-101

In order to test the true performance and utility ofthe Graph-based Laplacian Regularized Least Squares(LapRLS) approach, the algorithm was tested on Caltech-101, an object recognition dataset. The dataset consists of101 classes, each of which contains approximately 800 im-ages per object class. Since we limit ourselves to under-standing the trends in the data using minimal labels, only7 classes (Faces, Airplanes, Motorbikes, Cars, Sunflower,Chair, Emu) were considered in our experiments and tests.

4.2.1 Feature extraction and classification

To capture the main essence of the object in the image, boththe GIST descriptor [8] and the Pyramid Histogram of Ori-ented Gradients (PHOG) descriptor [1] was used separatelyto benchmark their capabilities. GIST, a 512-dimensionaldescriptor has a lot of mutual information between descrip-tors when extracted for different images. To this end, itsdimensionality is reduced to 64 dimensions via PCA, whilestill retaining over 98% of the variance in the data. PHOG,on the other hand is extracted by building a vocabulary ofwords/histograms (600 words in our case), to characterizethe objects in the image.

As expected, each features extracted via GIST andPHOG independently lie on a feature space that is specificto the feature extraction technique. As explained earlier,semi-supervised learning makes an assumption that thesefeatures extracted from images need to lie on a continousmanifold. [3], [8] suggests that euclidean distance is a suf-ficient metric to compute the similarity measure betweenGIST descriptors of images. [1] suggests that χ2 distanceas a good metric to measure the similarity between PHOGdescriptors of images. While other distance metrics maywork equally as well, the results look promising regardlessof the distance metric that was used.

A one-versus-all approach was taken to classify the 7classes from each other. Here each classifier fk, solution to

eqn. 4, would be responsible for predicting those examplesthat fall within the kth class as positive and classifying therest of the dataset as negative. In such a scenario, the one-versus-all formulation is given by f(x) = arg maxi fi(x),breaking ties arbitrarily.

4.2.2 Graph-based Laplacian RLS

Once the features are extracted and reduced in their dimen-sionality, the similarity measure is determined between eachpair of nodes in the dataset to build an n× n weight matrixas discussed in section 3. The LapRLS again takes an RBF-kernel to account for non-linear but continuous manifoldson which these feature descriptors lie on. Figures 6, 7 showthe performance of LapRLS (with GIST and PHOG inde-pendently) versus other kNN classifier (k = 2, 3, 6) withincreasing number of training examples Nt. Once again,LapRLS outperforms each of the kNN classifiers for mostof the cases for Nt = 20 (Average per-class misclassifica-tion error ∼10% and ∼13% for GIST and PHOG respec-tively), except for Nt = 1 (Average per-class misclassifi-cation error ∼43% and ∼27% for GIST and PHOG respec-tively). This may be attributed to the fact that the class-specific manifolds each of these GIST and PHOG featureslie on may not be entirely continous (or have high deriva-tives) causing the regularization of the label assignment forthe LapRLS to have poor performance. It could also be thecase that the specific training example that was provided forthe learning was not close to the edge of a manifold, makingthe discrimination between classes difficult as the labels arepropagated farther away from the original training set.

4.2.3 Performance with Minimal labels

The following figures 6, 7 show the performance of theLapRLS algorithm with increasing number of training ex-amples using GIST and PHOG descriptors. The figure tothe left shows the Receiver-Operating-Characteristic (ROC)curve for the performance of the object recognition task.The curves tend to imply that with increasing number oftraining examples, Nt, the performance continues to im-prove. More importantly, the LapRLS has considerable per-formance even with Nt = 1 (blue curve), and continues torise when Nt = 40 (orange curve). The figure on the rightshows the confusion matrix for the object recognition task.

The following figure (Figure 8 shows the results of theapplication of the Graph-based Laplacian Regularized LeastSquares framework with GIST features in an object recog-nition setting.

The LapRLS works as expected with promising perfor-mance even with limited training data samples. It is of par-ticular interest to realize the regions of poor performancefor the LapRLS case. In the case of faces or motorbikesin the figure 8, the faces and motorbikes that could not be

0 5 10 15 200

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Misclassification rate (%) vs. number of training examples


Mis

cla

ssific

atio

n r

ate

(%

)

LapRLSC

2−NN classifier

3−NN classifier

4−NN classifier

5−NN classifier

6−NN classifier

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

true negative rate

tru

e p

ositve

ra

te (

reca

ll)

ROC curve for increasing training examples using GIST

ROC n=1

ROC n=2

ROC n=4

ROC n=8

ROC n=10

ROC n=20

ROC rand.

Confusion matrix using GIST descriptor (66.43 % accuracy)

Faces airplanesMotorbikes car_side sunflower chair emu

Faces

airplanes

Motorbikes

car_side

sunflower

chair

emu

(a) (b) (c)Figure 6. Plot showing the performance of LapRLS on Caltech-101 using GIST descriptors. (a). Comparison against kNN classifiers withincreasing number of training examples. Once again, LapRLS seem to perform better for both small and larger Nt training samples. (b).ROC curve showing the performance of LapRLS on Caltech-101 with increasing number of training examples Nt. The figure shows thatLapRLS has considerable performance even with 5 training examples. (c). Confusion matrix of the multi-class classifcation on Caltech-101showing reasonably good performance on training with a global descriptor such as GIST, and a single training example

0 5 10 15 20 25 30 35 400

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1Misclassification rate (%) vs. number of training examples


Mis

cla

ssific

ation r

ate

(%

)

LapRLSC2−NN classifier

3−NN classifier

4−NN classifier

5−NN classifier6−NN classifier

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

true negative rate

true p

ositve r

ate

(re

call)

ROC curve for increasing training examples using PHOW

ROC n=1ROC n=2

ROC n=5

ROC n=20

ROC n=40ROC rand.

Confusion matrix using PHOW descriptor (61.14 % accuracy)

Faces airplanesMotorbikes car_side sunflower chair emu

Faces

airplanes

Motorbikes

car_side

sunflower

chair

emu

(a) (b) (c)Figure 7. Plot showing the performance of LapRLS on Caltech-101 using PHOW descriptors. (a). Comparison against kNN classifiers withincreasing number of training examples. Once again, LapRLS seem to perform better for both small and larger Nt training samples. (b).ROC curve showing the performance of LapRLS on Caltech-101 with increasing number of training examples Nt. The figure shows thatLapRLS has considerable performance even with 5 training examples. (c). Confusion matrix of the multi-class classifcation on Caltech-101showing reasonably good performance on training with a global descriptor such as GIST, and a single training example

classified correctly (column 4) seem to have a significantlydifferent background than the training set (column 1) or theimages that have been classified correctly (column 2). Thismay be due to the fact the feature descriptor that is usedto discriminate between images captures the backgroundinformation as well (or possibly looks at contrasting fore-ground/background scenes). That said, it would make sensethat the objects in the 4th column do not correlate too wellwith the original training example where the backgroundseems to be less cluttered. This may be accounted for in thetraining process where more optimal training labels can beactively suggested to the user for annotation while simula-teously optimizing the overall classification objective func-tion. This sub-domain is referred to as active learning [9]and has received quite a bit of reception from the computer

vision community in the recent years.Of particular interest is when one can combine the ca-

pabilities of multiple feature spaces to bootstrap an over-all better object classification system. In our case, whileGIST and PHOG were independently benchmarked, its per-formance together with a unified kernel [10] can prove tobe highly beneficial when one of the descriptors fail to clas-sify correctly in regions where the other descriptor performswell. While the kernelization of multiple descriptors wereattempted, much convincing results could not be producedbefore the conclusion of this report.

5. ConclusionTo summarize, it is demonstrated that a semi-supervised

framework can be very beneficial specifically in object

Figure 8. Final results on multi-class classification on Caltech-101 using a single training example with the Graph-based Laplacian Regu-larized Least Squares framework. Column 1 shows the original training datapoint per object class. Column 2 shows the set of images thathave been correctly classified via the LapRLS approach discussed. Column 3 shows the set of false positives obtained for the correspondingobject class. Column 4 shows the set of images within the object class that could not be classified correctly.

recognition, especially when the number of training datais limited. Furthermore, from the results it is evident thatthe aforementioned graph-based laplacian least squares canleverage the availability of unlabeled data in building a bet-ter classifier as compared to the traditional supervised clas-sifiers such as RLS, Linear-SVM and RBF-SVM. Whilethis report predominantly focuses on the performance ofsuch classifiers when the training data is limited, the resultsalso generalize quite well with increasing number of train-ing examples and still continue to be the lower bound ofmisclassification error rate from figures 4, 6, and, 7. As a fi-nal note, this report incorporates a novel Graph-based Reg-ularized Least Squares semi-supervised learning techniquefor multiclass-object recognition using GIST and PHOGfeatures, providing convincing recognition performance de-spite a minimal set of training examples.

References[1] A. Bosch, A. Zisserman, and X. Munoz. Representing shape

with a spatial pyramid kernel. In Proceedings of the ACM In-ternational Conference on Image and Video Retrieval, 2007.

[2] P. Carbonetto, G. Dork, C. Schmid, H. Kck, and N. D. Fre-

itas. A semi-supervised learning approach to object recogni-tion with spatial integration of local features and segmenta-tion cues. In In Toward Category-Level Object Recognition,pages 277–300, 2006.

[3] M. Douze, H. Jegou, H. Sandhawalia, L. Amsaleg, andC. Schmid. Evaluation of gist descriptors for web-scale im-age search. In International Conference on Image and VideoRetrieval. ACM, july 2009.

[4] L. Fei-Fei, R. Fergus, and P. Perona. A bayesian approachto unsupervised one-shot learning of object categories. vol-ume 2, page 1134, 2003.

[5] L. Fei-Fei, R. Fergus, and P. Perona. Learning generativevisual models from few training examples: An incrementalbayesian approach tested on 101 object categories. Comput.Vis. Image Underst., 106(1):59–70, Apr. 2007.

[6] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learn-ing in gigantic image collections. In Y. Bengio, D. Schu-urmans, J. Lafferty, C. K. I. Williams, and A. Culotta, ed-itors, Advances in Neural Information Processing Systems22, pages 522–530. 2009.

[7] C. Leistner, H. Grabner, and H. Bischof. Semi-supervisedboosting using visual similarity learning. In IN: PROC.CVPR, 2008.

[8] A. Oliva and A. Torralba. Modeling the shape of the scene: Aholistic representation of the spatial envelope. InternationalJournal of Computer Vision, 42:145–175, 2001.

[9] B. Settles. Active learning literature survey. Technical report,2010.

[10] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba.Sun database: Large-scale scene recognition from abbey tozoo. In CVPR, pages 3485–3492. IEEE, 2010.

[11] X. Zhu. Semi-supervised learning literature survey, 2006.[12] X. Zhu and A. B. Goldberg. Introduction to Semi-Supervised

Learning. Synthesis Lectures on Artificial Intelligence andMachine Learning. Morgan & Claypool Publishers, 2009.

semi-supervised object detector learning from minimal...

Documents