accurate object detection with joint classification-regression …€¦ · object detection is one...

8
Accurate Object Detection with Joint Classification-Regression Random Forests Samuel Schulter Christian Leistner Paul Wohlhart Peter M. Roth Horst Bischof Graz University of Technology Microsoft Photogrammetry Institute for Computer Graphics and Vision Austria {schulter,wohlhart,pmroth,bischof}@icg.tugraz.at [email protected] Abstract In this paper, we present a novel object detection ap- proach that is capable of regressing the aspect ratio of ob- jects. This results in accurately predicted bounding boxes having high overlap with the ground truth. In contrast to most recent works, we employ a Random Forest for learn- ing a template-based model but exploit the nature of this learning algorithm to predict arbitrary output spaces. In this way, we can simultaneously predict the object proba- bility of a window in a sliding window approach as well as regress its aspect ratio with a single model. Furthermore, we also exploit the additional information of the aspect ra- tio during the training of the Joint Classification-Regression Random Forest, resulting in better detection models. Our experiments demonstrate several benefits: (i) Our approach gives competitive results on standard detection benchmarks. (ii) The additional aspect ratio regression de- livers more accurate bounding boxes than standard object detection approaches in terms of overlap with ground truth, especially when tightening the evaluation criterion. (iii) The detector itself becomes better by only including the as- pect ratio information during training. 1. Introduction Object detection is one of the most important tasks in computer vision. Modern object detectors have to be both accurate and fast as they are often employed in time-critical applications, such as gaming or robotics. Moreover, they also act as a building block in various other applications like semantic segmentation [19, 28] or scene recognition [24]. In recent years, the progress in this field has been tremen- dous. Popular fast and accurate detection approaches can be roughly subdivided into three different strands: First, the works based on the rigid Dalal and Triggs [10] detec- tor using HOG features and SVMs. In particular, the De- formable Parts Model [15] extends the rigid detector with a part-based multi-component model and achieves state-of- the-art results on many benchmarks. Second, object detec- Figure 1: Illustrative output of the proposed detector (red) and a state-of-the-art method (blue). The given values show the overlap with the green ground truth bounding box, where our detector achieves a much higher overlap because it regresses the aspect ratio of the object. tors building on the work of Viola and Jones [29] using Boosting and various feature channels [5, 12, 13]. These rigid detectors are less flexible in terms of object outlines, but achieve state-of-the-art results on pedestrian detection benchmarks and run in real-time. Third, detection mod- els operating over small patches that vote for object cen- ters with the generalized Hough transform [16, 22]. These detectors are very flexible and powerful in detecting body parts [26] or fiducial points in faces [11] but are less suc- cessful on common object detection benchmarks. The quantitative evaluation of all these detection models typically focuses on the amount of true and false positive detections in a certain data set. True positives are commonly characterized by a predefined overlap (50% in most bench- marks) of the detection with the ground truth bounding box. However, the question arises if a criterion based on such an arbitrarily chosen threshold actually reflects the quality of an object detector. Consider for instance the blue bounding box in Figure 1, a detection given by a template-based de-

Upload: others

Post on 21-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

Accurate Object Detection with Joint Classification-Regression Random Forests

Samuel Schulter† Christian Leistner‡ Paul Wohlhart† Peter M. Roth† Horst Bischof†

†Graz University of Technology ‡Microsoft PhotogrammetryInstitute for Computer Graphics and Vision Austria

{schulter,wohlhart,pmroth,bischof}@icg.tugraz.at [email protected]

Abstract

In this paper, we present a novel object detection ap-proach that is capable of regressing the aspect ratio of ob-jects. This results in accurately predicted bounding boxeshaving high overlap with the ground truth. In contrast tomost recent works, we employ a Random Forest for learn-ing a template-based model but exploit the nature of thislearning algorithm to predict arbitrary output spaces. Inthis way, we can simultaneously predict the object proba-bility of a window in a sliding window approach as well asregress its aspect ratio with a single model. Furthermore,we also exploit the additional information of the aspect ra-tio during the training of the Joint Classification-RegressionRandom Forest, resulting in better detection models.

Our experiments demonstrate several benefits: (i) Ourapproach gives competitive results on standard detectionbenchmarks. (ii) The additional aspect ratio regression de-livers more accurate bounding boxes than standard objectdetection approaches in terms of overlap with ground truth,especially when tightening the evaluation criterion. (iii)The detector itself becomes better by only including the as-pect ratio information during training.

1. IntroductionObject detection is one of the most important tasks in

computer vision. Modern object detectors have to be bothaccurate and fast as they are often employed in time-criticalapplications, such as gaming or robotics. Moreover, theyalso act as a building block in various other applications likesemantic segmentation [19, 28] or scene recognition [24].

In recent years, the progress in this field has been tremen-dous. Popular fast and accurate detection approaches canbe roughly subdivided into three different strands: First,the works based on the rigid Dalal and Triggs [10] detec-tor using HOG features and SVMs. In particular, the De-formable Parts Model [15] extends the rigid detector witha part-based multi-component model and achieves state-of-the-art results on many benchmarks. Second, object detec-

Figure 1: Illustrative output of the proposed detector (red)and a state-of-the-art method (blue). The given valuesshow the overlap with the green ground truth bounding box,where our detector achieves a much higher overlap becauseit regresses the aspect ratio of the object.

tors building on the work of Viola and Jones [29] usingBoosting and various feature channels [5, 12, 13]. Theserigid detectors are less flexible in terms of object outlines,but achieve state-of-the-art results on pedestrian detectionbenchmarks and run in real-time. Third, detection mod-els operating over small patches that vote for object cen-ters with the generalized Hough transform [16, 22]. Thesedetectors are very flexible and powerful in detecting bodyparts [26] or fiducial points in faces [11] but are less suc-cessful on common object detection benchmarks.

The quantitative evaluation of all these detection modelstypically focuses on the amount of true and false positivedetections in a certain data set. True positives are commonlycharacterized by a predefined overlap (50% in most bench-marks) of the detection with the ground truth bounding box.However, the question arises if a criterion based on such anarbitrarily chosen threshold actually reflects the quality ofan object detector. Consider for instance the blue boundingbox in Figure 1, a detection given by a template-based de-

Page 2: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

tector similar to [5, 12, 13], which is a true positive but hasan overlap of only 59% with the green ground truth bound-ing box. Thus, tightening the overlap criterion would dras-tically decrease the overall performance of such a detector,as we show in our experiments.

In this work, we propose a novel object detection ap-proach capable of predicting more accurate bounding boxeswith a Joint Classification-Regression Random Forest for-mulation similar to [16, 17]. We exploit the fact that RFs,can predict arbitrary output spaces, cf. [14, 16, 18], and aug-ment the label space for object detection with the boundingbox size, which we exploit during both training and testing.In this way, we cannot only predict the foreground proba-bility of a detection but can also regress the extent of theobject with a single model, alleviating the need for learningmany mixture models [15]. An illustrative result is shownin Figure 1, where the red bounding box is the output ofour detection algorithm, which has a much higher overlapof 89% with the ground truth, thus identifying the extent ofthe object more accurately.

In our experiments, we demonstrate that our object de-tection approach yields state-of-the-art results on severaldata sets and compare it with related approaches like HoughForests [16], DPM [15] and a Boosting-based rigid templateapproach [5, 12, 13]. More importantly, we also show thatour approach can accurately regress the bounding box as-pect ratio of objects in unseen test images, which is alsoreflected in the results. Namely, the detection performanceof most typical detectors breaks down when increasing theoverlap criterion for true positives, while our approach stillgives comparably good results.

2. Related WorkThe most related approaches to ours are the works of

Dollar et al. [12, 13] and Benenson et al. [5], who build ona similar detection pipeline and employ the same features.The influential work on Integral Channel Features [13]computes several feature channels, including color, gradi-ent magnitude and orientation quantized gradients, whichis similar to [5]. Both works rely on an efficient Boost-ing framework for learning the object models. This lineof work almost exclusively focuses on pedestrian detectionand there are some extensions that make this frameworkextremely fast [4] or exploit application-specific knowl-edge [3]. However, only fixed size bounding boxes are pre-dicted, which is reasonable for pedestrians, but is a limita-tion if dealing with arbitrary objects.

The Deformable Parts Model (DPM) [15] extends therigid HOG template and SVM approach of [10] and in-cludes deformable parts and multiple components. Thismixture model captures the intra-class variability by sepa-rating the training data according to the aspect ratio (newerversions also include appearance), thus enabling the predic-

tion of a discrete set of aspect ratios. Furthermore, a linearregression on the inferred part locations refines the aspectratio prediction. Nevertheless, the DPM has some disad-vantages because (i) different models have to be trained foreach component and (ii) the aspect ratio prediction per com-ponent is limited on a linear model solely based on the partlocations. In particular, the first issue limits the DPM in twoways: First, the model becomes slower during both train-ing and testing, and second, the more components are em-ployed, the less training data is available for each of them.In contrast, we have a single model that can exploit all thetraining data and can predict a continuous aspect ratio.

Blaschko and Lampert [7] formulate the detection as aregression problem and train a structured output SVM forlearning. While yielding accurate results, a fast localiza-tion method is necessary to have reasonable runtimes dur-ing both training and detection. They use Efficient Sub-window Search, which requires computing an upper boundon the detection score, thus limiting the choice of featuresand learning method [20]. Our work is more flexible in thechoice of features, faster during training and also has a rea-sonable runtime within a sliding window scheme.

We also note that RFs, in general, have rarely been em-ployed for object detection. One exception is the HoughForest framework [16], which describes an object as a set ofsmall patches that are connected to a reference point, typi-cally the center of the object. However, this patch-based ap-proach is relatively slow compared to other detection mod-els. To overcome this issue, [27] proposes a two-level ap-proach for speeding up the detection process. Nevertheless,it still relies on the Hough voting scheme for the final pre-diction, where the non maximum suppression is a delicatetask, cf. [30]. While Hough Forests [16] typically predict afixed bounding box, it can also handle variable aspect ratios;either via back-projecting the voting elements, which thendefine the bounding box, or via voting in a third dimensionin Hough space. However, employing the back-projectionis rather slow or memory intensive, while increasing theHough space dimensionality hampers the maximum search.

Recently, [23] showed a holistic RF model that trainslocal experts (SVMs) in each node. However, this modelalso builds on a fixed bounding box prediction and was onlyevaluated on pedestrian detection benchmarks.

3. Joint Classification-Regression ForestsTo derive our Joint Classification-Regression Forest

(JCRF) formulation, we first give a brief review of standardRandom Forests (RF) in Section 3.1 and of the basic ob-ject detection model in Section 3.2. Then, we show how thelabel space is augmented for predicting variable boundingbox aspect ratios in Section 3.3. Finally, the training anddetection procedures of JCRF are illustrated in Sections 3.4and 3.5, respectively.

Page 3: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

3.1. Random Forests

Random Forests [1, 8, 9] (RF) are ensembles of T binarydecision trees ft(x) : X → Y , where X = Rn is the n-dimensional feature space and Y = {0, 1} describes thelabel space1.

During testing, each decision tree returns a class proba-bility distribution pt(y|x) for a given test sample x and thefinal class label y∗ is calculated via

y∗ = arg maxy

1

T

T∑t=1

pt(y|x) . (1)

During training, the decision trees are provided withtraining data T = (xi, yi)

Ni=1, where N is the number of

training examples, and all trees are trained independentlyfrom each other. For training a single decision tree, the pa-rameters Θ of a splitting function

Φ(x,Θ) =

{0 if rΘ(x) < 01 otherwise , (2)

which separates the data into two disjoint sets, have to be es-timated. In Equation (2), rΘ(x)→ R calculates a responseof the feature vector x. The quality of a given splitting func-tion Φ is typically defined as

I(Θ) =|L|

|L|+ |R|H(L) +

|R||L|+ |R|

H(R) , (3)

where L = {x : Φ(x,Θ) = 0}, R = {x : Φ(x,Θ) = 1},| · | denotes the size of a set, and H(·) measures the purityof a set of training examples in terms of class labels. Thepurity H(·) is typically calculated via the entropy or theGini index [8].

The standard procedure in Random Forests for findinga good splitting function in a single node is to randomlysample a set of parameters {Θj}kj=1 and simply choosingthe best one, Θ∗, by evaluating Equation (3). This splittingfunction is then fixed, the training data is separated accord-ingly and the tree growing continues until some stoppingcriteria, such as a maximum tree depth or a minimum num-ber of samples in the node, are reached.

3.2. Object Detection Model

The training data for our detection model is a set of rigidtemplates P with size h × w, which is the mean boundingbox size of all the objects in the training set, normalized to100 pixel width or height, depending on what is larger2. Asin [12], the size of this template also includes a padding of20% in order to capture some context around the objects.

1Please note that we only consider the binary case here as our applica-tion is binary object detection. However, RF are inherently multi-class.

2Throughout the paper, we will always refer to a fixed-width model.

Figure 2: Training samples capturing objects with differ-ent height. Features are computed from the whole template(blue box), but the regression targets zi are different.

We center such a template at each of the annotated train-ing objects and on randomly sampled bounding boxes fromnegative images. For each of the cropped training exam-ples, we calculate 10 different feature channels: We use the3 LUV color channels, the gradient magnitude, and 6 gradi-ent channels, quantized into equally sized orientation bins,similar to [5, 12, 13].

The training data T , i.e., positive and negative sam-ples, is thus given as a set of pairs (xi, yi)

Ni=1, where

xi ∈ Rh×w×10 and yi ∈ {0, 1}. Given this data, we traina Random Forest F(x) = 1

T ft(x) with the standard objec-tive given in Equation (3). As splitting functions, we usepixel-pair test, cf. [16], and thus define

rΘ(x) = x(Θ1l ,ΘC)− x(Θ2

l ,ΘC)−Θσ , (4)

where Θ1l and Θ2

l describe two 2D locations within the tem-plate P , ΘC ∈ {1, . . . , 10} defines a feature channel andΘσ ∈ R is a threshold. Please note that in the followingsections, we will extend the label space and also provide adifferent objective function for training the model.

We also include three rounds of bootstrapping after theinitial training of the Random Forest. In each round, a set ofhard negative windows is identified by applying the currentmodel on the negative images, which are then added to thepool of negative data. In each round we re-train the RandomForest from scratch.

3.3. Augmenting the Label Space

As described in the previous section, each training ex-ample is cropped and scaled such that it fits in the templateP of size h × w. The scaling factor is defined by the fixedmodel width. Thus, the actual height of the objects capturedin the training images most likely varies with the viewpointof the object. See Figure 2 for some examples. In order togive predictions about the correct height, and thus the aspectratio of an object, we additionally store the actual height zfor each of the training examples.

Therefore, we augment the label space with the groundtruth height of each positive training example, which ex-tends the label space to Y = {0, 1} × R. Our train-ing set now becomes a set of triplets (xi, yi, zi)

Ni=1, where

Page 4: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

yi ∈ {0, 1} still corresponds to the positive and negative la-bel, and zi ∈ R is the correct bounding box height. Pleasenote that for background training samples, i.e., yi = 0, zi isundefined.

3.4. Training the Random Forest Model

For training our Joint Classification-Regression Forest(JCRF), we have two objectives. First, we want to sepa-rate positive and negative classes and, second, we want toregress the bounding box height for positive samples. Sim-ilar to Hough Forests [16], we use two separate split nodetypes that optimize the different objective functions. Thefirst node type is a (binary) classification node and the sec-ond is a regression node.

For classification nodes, we use the recently proposedAlternating Decision Forests (ADF) [25] that show how awell defined loss function can be optimized over all trees,while still being able to parallelize the training. For em-ploying ADF, we assign each training sample xi a corre-sponding weightwi, similar to Boosting. Then, the trees aretrained breadth-first and after training each level of depth,the weights are updated according to the respective loss,which was calculated over all trees. The purity measureH(·) in Equation (3) is still the same, i.e., the entropy.

For regression nodes, we employ a standard reduction-in-variance approach, where the purity measureH(·) is thusdefined as the variance of the target values zi in the corre-sponding sets. Please note that the regression objective isonly evaluated with the positive training data.

In general, a single splitting node decides randomly withequal probability which of the two objectives are being op-timized. As in Hough Forests [16], we also assign certainlevels of depth in the tree a fixed type of evaluation objectivethat has to be optimized. We thus introduce the steering pa-rameter γ which has different interpretations: While settingγ = 0 ignores the regression objective at all, setting γ > 0indicates that starting with depth γ, only regression nodesare evaluated in all trees, similar to [16]. In this work, weadditionally allow setting γ < 0, where only the first levelsup to depth |γ| are optimizing the regression objective.

Tree growing stops as in standard Random Forests wheneither the maximum tree depth is reached or not enoughtraining examples are available for further splitting. In con-trast to standard Random Forests, where tree growing alsostops if a certain node becomes pure in terms of class labels,in this setting, we continue splitting nodes containing onlypositives but fix the splitting objective to be regression.

The resulting leaf nodes then calculate (i) the class his-togram based on the training data falling in that leaf and (ii)the mean of the regression targets z of all positive trainingexamples. As thus each tree can return two kinds of outputs,we denote by fCt (x) the classification output of tree t andby fRt (x) the regression output.

3.5. Detection with Aspect Ratio Regression

For detecting objects in unseen test images, we em-ploy a standard sliding window approach over the scale-space. The score of a window W at location (x, y) inthe image is given by the classification output of the RFs = FC(x) = 1

T

∑Tt=1 f

Ct (x). As we use an ensemble

method which consists of several independent weak classi-fiers (randomized trees in this case), we could parallelizetheir evaluation to achieve a higher detection speed. How-ever, given a certain detection threshold τ below which de-tections are discarded, we can also iteratively evaluate thetrees and employ an early-stopping scheme. Moreover, wecan still benefit from parallel processing, e.g., at the scalespace pyramid or for the simultaneous evaluation of multi-ple images.

In our early stopping approach, we examine whether ornot the trees in the RF not evaluated up to now can the-oretically achieve such a high foreground probability thatthe total score for that window W exceeds the detectionthreshold τ . Assume that we already evaluated the treesup to index t < T , the current unnormalized score thus issi≤t =

∑ti=1 f

Ci (x). The upper bound of the score from

the remaining trees is si>t =∑Ti=t+1 1.0 = T − t. There-

fore, if

si≤t + si>t < τ · T , (5)

we can already stop evaluating the feature vector x in thecurrent windowW . For instance, having T = 10 and a de-tection threshold τ = 0.95, the evaluation can already stopif the first tree has a foreground probability fC1 (x) < 0.5.Using this approach, we can reject clear negative windowsduring the evaluation process very quickly without reducingdetection performance.

For each of window with s ≥ τ we evaluate fRt (x) foreach tree in the JCRF to return the prediction about the re-gression target, i.e., the height of the object captured in thecurrent window. The final estimate z of the object height isgiven as the average over all independent trees:

z = FR(x) =1

T

T∑t=1

fRt (x) . (6)

Please note that a mode seeking approach like mean shiftcould also be employed, but averaging turned out to be agood choice in our setting. The resulting detection windowincluding the detection score in the original scale is thengiven by D = (xκ ,

yκ ,

w=100κ , zκ , s), where κ is the scale of

the detection.After having identified a set of potential detections for a

test image, we apply a standard greedy non-maximum sup-pression approach that removes detections having an over-lap greater than 50% with a higher scoring detection [15].

Page 5: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

4. ExperimentsIn our experiments, we demonstrate the performance of

our object detection approach. First, we compare with state-of-the-art methods on three different data sets with a stan-dard object detection evaluation criterion. Second, we alsoevaluate the detection performance of all methods when thisevaluation criterion is tightened. Finally, we analyze ourtrained model and the most relevant parameters.

4.1. Overall Performance Evaluation

We first evaluate the overall detection performance of theproposed approach on three standard benchmarks. We in-vestigate three different variants of our approach: StdRFimplements a standard Random Forest disregarding the re-gression information at all. StdRF-Regr trains a RF and in-cludes the regression information during both training andtesting. ADF-Regr employs the training scheme from [25]for classification nodes, see Section 3.4.

In addition, we give a comparison to state-of-the-art de-tection approaches. First, we evaluate Aggregate ChannelFeatures (ACF) [12], a Boosting-based approach similarto [5, 13] that builds on the same detection pipeline (i.e., thesame features and bootstrapping scheme) as our detector.Second, we also compare with Hough Forests [16] (Hough-Forest) that also rely on a Random Forest framework forlearning the object model, however, it works on the patch-level and employs the generalized Hough voting scheme fordetection. Please note that both approaches only predict asingle bounding box aspect ratio, which is averaged overthe training data. Finally, we also compare with the De-formable Parts Model [15] (DPMfull ), where we addition-ally evaluate a version that only uses the root filter DPM-root to have a fair comparison with the other approachesbuilding on a rigid template. However, more important forour scenario are the multiple components included in thismodel, which are defined by clustering the aspect ratio ofthe training bounding boxes. To denote the different ver-sions of [15] we add the number of components (1, 2 or 4)as postfix to DPMroot or DPMfull, respectively. For all ap-proaches building on randomization steps, i.e., [16] and ourvariants, we average the results over three independent runs.Data sets: We use three different data sets for evaluationpurpose, namely the ETHZcars [21], the TUDpedestrian [2]and the MITStreetsceneCars [6]. For the ETHZcars dataset, we use 420 training and 175 testing images that cap-ture cars from different viewing angles. Although the testset defines a rather easy detection task, because it showscars prominently with little background, it exactly fits theneeds for evaluating the quality of the bounding box detec-tions. Namely, the aspect ratio of the ground truth bound-ing boxes in the test set strongly varies. The TUDpedestriandata set contains 400 training images and 250 test images.Also in this scenario, the aspect ratio varies due to different

articulations. Finally, the MITStreetsceneCars data set is alarger data set capturing cars in different street scene sce-narios and under different illumination. We split this dataset in 2

3 training and 13 testing images, resulting in 2909 and

1020 images, respectively.Evaluation Criterion: The evaluation criterion in this ex-periment is the commonly used Pascal overlap. For eachdetection D it calculates the overlap with a ground truthbounding box G as the Intersection over Union:

IoU(D,G) =D ∩ GD ∪ G

. (7)

The IoU(D,G) overlap separates all detections D in true(IoU ≥ ξ) or false (IoU < ξ) positives, which are usedfor drawing precision-recall curves and calculating the AreaUnder Curve (AUC) measure. The parameter ξ is the suc-cess threshold that is typically, and also in this experiment,set to 0.5. Note that in the following section we evaluate thedetection results when setting ξ > 0.5.Results: We depict our results as precision-recall curvesfor all data sets in Figure 3 and report the AUC values inthe legend. As can be seen, the proposed ADF-Regr is al-ways en par with the best performing approach and clearlywins on one of the data sets. We also note that ADF-Regr istypically better than ACF, except for the ETHZcars wherethe difference is insignificant, and the results are rather satu-rated. On the TUDpedestrian and the MITStreetsceneCars,our approach is 11.7% and 7.7% better than ACF, respec-tively. Furthermore, our approach performs significantlybetter than Hough Forests, the only other Random Forestbased method. Finally, we also note that ADF-Regr outper-forms all versions of DPM, except for DPMfull2 and DPM-full4 on the ETHZcars data set, and that including the partsin the DPM does not always improve the results on thesedata sets.

Comparing the different proposed variants, i.e., StdRF,StdRF-Regr and ADF-Regr, we see that including the re-gression output typically gives significantly better perfor-mance and that the ADF learning scheme [25] further im-proves the results.

4.2. Tightening the Pascal Overlap Criterion

In this experiment, we directly evaluate the quality ofthe bounding box predictions of the different methodsand investigate the performance of our joint classification-regression prediction in more detail. To do so, we use adifferent type of evaluation curve. Following the standardPascal criterion, we draw precision-recall curves and mea-sure the AUC. However, we vary the success threshold ξ fora true positive detection, see Equation 7. We then plot theAUC value over ξ in the range between 0.5 to 1.0. Exceptfrom the evaluation criterion, the experimental setup is thesame as in the previous experiment.

Page 6: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cis

ion

ACF: auc=98.3StdRF: auc=94.9StdRF−Regr: auc=97.8ADF−Regr: auc=98.1HoughForest: auc=95.2DPMroot1: auc=95.4DPMroot2: auc=97.9DPMroot4: auc=95.7DPMfull1: auc=96.9DPMfull2: auc=99.5DPMfull4: auc=99.4

(a)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cis

ion

ACF: auc=77.1StdRF: auc=73.5StdRF−Regr: auc=84.1ADF−Regr: auc=88.7HoughForest: auc=84.5DPMroot1: auc=76.5DPMroot2: auc=83.1DPMroot4: auc=82.4DPMfull1: auc=82.2DPMfull2: auc=83.3DPMfull4: auc=65.7

(b)

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cis

ion

ACF: auc=56.4StdRF: auc=60.2StdRF−Regr: auc=59.2ADF−Regr: auc=64.1HoughForest: auc=41.6DPMroot1: auc=61.5DPMroot2: auc=63.2DPMroot4: auc=63.8DPMfull1: auc=60.9DPMfull2: auc=60.1DPMfull4: auc=57.8

(c)Figure 3: Precision recall curves for all evaluated approaches on three different data sets: (a) ETHZcars, (b) TUDpedestrianand (c) MITStreetsceneCars.

Results: We illustrate our results for all three data sets inFigure 4. As can be seen, the proposed ADF-Regr givesthe best results on two benchmarks and is en par with DPM(2 or 4 components) on one benchmark. The performancedifference between all methods is most pronounced on theETHZcars data set (Figure 4a), which shows most aspect ra-tio variations. We can see that HoughForest, ACF, StdRF,DPMroot1 and also DPMfull1 drastically lose performanceif the success threshold ξ gets increased. For instance, set-ting ξ = 0.7, none of these approaches achieve higher AUCthan 50%. Using DPM with 2 or 4 components (regard-less of the parts) increases the performance to 62% and80%, respectively. This is intuitive as increasing the num-ber of components also increases the number of aspect ra-tios that can be predicted. However, further adding com-ponents will likely decrease the performance as less train-ing examples will be available per component (can be ob-served on TUDped and MITStreetsceneCars). In contrast,our Joint Classification-Regression Random Forest formu-lations, StdRF-Regr and ADF-Regr, can predict the bound-ing boxes even more accurately, improving over DPMroot4by 6% for ξ = 0.7 and by 41% for ξ = 0.8. Our best vari-ant, ADF-Regr achieves an AUC value of 73% at the verytight success threshold of ξ = 0.8, while all methods pre-dicting a fixed bounding box do not exceed an AUC valueof 20%. We illustrate some qualitative results in Figure 5.

4.3. Analysis of the Random Forest Model

In this section, we analyze the most relevant parametersof our RF framework. We evaluate the number of trees T ,as well as the parameter γ that regulates the amount of re-gression nodes evaluated during training (for StdRF-Regrand ADF-Regr). Furthermore, we also investigate the fea-ture selection process of Random Forests when includingthe regression objective during the training phase.Number of trees: First, we evaluate the number of treesT when fixing the maximum tree depth to 12 on the TUD-pedestrian data set. In this setting, the additional regression

25 50 75 100 125 1500.7

0.8

0.9

1

Number of treesA

rea U

nder

Curv

e (

AU

C)

(a)

−4 −3 −2 0 7 8 9 100.9

0.92

0.94

0.96

0.98

γ

Are

a U

nd

er

Cu

rve

(A

UC

)

(b)

Figure 6: (a) Evaluation of the number of trees T on theTUD-pedestrian data set. (b) Evaluation of the parameterγ that influences the behavior of the regression objective inthe RF. We plot the accuracy as (AUC) for different valuesof γ for the ETHZcars data set.

is also turned on and the parameter γ is set to −2. The re-sults are depicted in Figure 6a. As expected, we can see aclear trend of increasing performance with the number oftrees T up to a certain limit T = 100. For more trees, theresults are saturated.Regression-Only Parameter: We further analyze the pa-rameter γ. For this evaluation, we use the ETHZcars dataset, which provides the most variation in aspect ratios. InFigure 6b, we see the mean and standard deviation of 3 in-dependent runs for different values of γ. We can observethat using too many or too few regression nodes is unfavor-able. Best performance can be observed by setting γ eitherto −2 or 9 (when using trees with maximum depth of 12).Feature Selection: We also visualize the different featureselection processes for StdRF-Regr and StdRF, i.e., with orwithout including the regression objective, on the car modelof the ETHZcars data set. To do so, we count how often acertain location in the rigid template was selected for per-forming a split in the Random Forest, regardless of the fea-ture channel. Again, we set γ = −2 for StdRF-Regr.

Figure 7 illustrates the behavior for both trainingschemes when summing up all selected feature locations; Inthis setting, we used 100 trees with a maximum tree depthof 12 over 2 independent runs. Interestingly, we can observe

Page 7: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

0.5 0.6 0.7 0.8 0.9 10

0.2

0.4

0.6

0.8

1

Pascal Overlap

Are

a U

nd

er

Pre

cis

ion

Re

ca

ll C

urv

e

ACFStdRFStdRF−RegrADF−RegrHoughForestDPMroot1DPMroot2DPMroot4DPMfull1DPMfull2DPMfull4

(a)

0.5 0.6 0.7 0.8 0.9 10

0.2

0.4

0.6

0.8

1

Pascal Overlap

Are

a U

nd

er

Pre

cis

ion

Re

ca

ll C

urv

e

ACFStdRFStdRF−RegrADF−RegrHoughForestDPMroot1DPMroot2DPMroot4DPMfull1DPMfull2DPMfull4

(b)

0.5 0.6 0.7 0.8 0.9 10

0.2

0.4

0.6

0.8

1

Pascal Overlap

Are

a U

nd

er

Pre

cis

ion

Re

ca

ll C

urv

e

ACFStdRFStdRF−RegrADF−RegrHoughForestDPMroot1DPMroot2DPMroot4DPMfull1DPMfull2DPMfull4

(c)Figure 4: Area under the Curve (AUC) for increasingly harder Intersection over Union success thresholds for different datasets: (a) ETHZcars, (b) TUDpedestrian and (c) MITStreetsceneCars.

Figure 5: Example results when comparing a fixed bounding box prediction and a flexible one. The green box is the groundtruth, the blue one is the fixed size bounding box, i.e., the mean over the training examples, and the red one is the flexiblebounding box predicted with our Joint Classification-Regression Forest.

that the feature selection is very different for the two learn-ing schemes. StdRF concentrates on positions on the leftand right side of the cars, which is clear as the width of themodel is fixed and, thus, the gradients at these locations arepresent in all training images. Furthermore, for this trainingscheme, also the bottom of the templates are frequently se-lected. This can be explained by the fact that cars typicallystand on the street and also have gradients on the bottom,which is typically not the case in negative images.

On the other hand, StdRF-Regr evaluates a much morediverse set of locations as it also tries to separate the differ-ent viewing angles, e.g., frontal-view from side-view cars.Thus, it selects the features at different y-coordinates in thecenter of the x-dimension. Of course, this training schemealso evaluates classification nodes, thus we can observe sim-ilarly selected locations from StdRF as well.Classification model: Finally, we also evaluate if the re-gression objective during training also improves the classi-fication performance of the RF. That is, we train two mod-els, one including the regression objective (γ = −2) andone only evaluating classification nodes (γ = 0). During

testing, however, we only use the fixed mean bounding boxin order to turn off the effect of regressing the boundingbox height on the final AUC performance. We observe thatincluding regression during training improves the perfor-mance regardless if the regression output is actually usedduring detection. On the TUDpedestrian data set, we get78.3 ± 1.2% AUC without using the regression objective(γ = 0) and 83.3 ± 3.1% when including it (γ = −2). In-cluding the regression output during the testing phase forpredicting variable bounding box aspect ratios further im-proves the results, cf., Section 4.1.

5. Conclusion

In this work, we proposed a Random Forest based ob-ject detection model that is capable of predicting vari-able bounding box aspect ratios. We augmented the stan-dard binary label space accordingly with a regression tar-get for predicting the bounding box aspect ratio. OurJoint Classification-Regression Forest (JCRF) formulationexploits the additional label information during both train-

Page 8: Accurate Object Detection with Joint Classification-Regression …€¦ · Object detection is one of the most important tasks in computer vision. Modern object detectors have to

Total sum feature selection

20 40 60 80 100 120

10

20

30

40

50

60

(a)

Total sum feature selection

20 40 60 80 100 120

10

20

30

40

50

60

(b)Figure 7: Feature selection of both training schemes: (a)Only classification and (b) including regression.

ing and testing. For training JCRF, we employ the ADFtraining scheme and include regression nodes for optimiz-ing the bounding box aspect ratio.

Our results on common object detection benchmarksshowed that our proposed detection model is better or enpar with related state-of-the-art approaches. Furthermore,our experiments revealed that most commonly used objectdetectors that predict a fixed bounding box size break downas soon as the evaluation criterion becomes tighter in termsof overlap with the ground truth bounding box. In contrast,our JCRF formulation can efficiently deal with variable as-pect ratios in a single model and achieves good results evenif the overlap criterion gets harder.

Acknowledgement: This work was supported by the AustrianScience Foundation (FWF) project Advanced Learning for Track-ing and Detection in Medical Workow Analysis (I535-N23)

References[1] Y. Amit and D. Geman. Shape Quantization and Recognition

with Randomized Trees. NECO, 9(7):1545–1588, 1997. 3[2] M. Andriluka, S. Roth, and B. Schiele. People-Tracking-

by-Detection and People-Detection-by-Tracking. In CVPR,2008. 5

[3] R. Benenson, M. Mathias, R. Timofte, and L. v. Gool. Faststixel computation for fast pedestrian detection. In ECCVWorkshops, 2012. 2

[4] R. Benenson, M. Mathias, R. Timofte, and L. v. Gool. Pedes-trian detection at 100 frames per second. In CVPR, 2012. 2

[5] R. Benenson, M. Mathias, T. Tuytelaars, and L. v. Gool.Seeking the strongest rigid detector. In CVPR, 2013. 1, 2, 3,5

[6] S. M. Bileschi. StreetScenes: Towards Scene Understandingin Still Images. PhD thesis, Massachusetts Institute of Tech-nology, Department of Electrical Engineering and ComputerScience, 2006. 5

[7] M. B. Blaschko and C. H. Lampert. Learning to LocalizeObjects with Structured Output Regression. In ECCV, 2008.2

[8] L. Breiman. Random Forests. ML, 45(1):5–32, 2001. 3[9] A. Criminsi and J. Shotton. Decision Forests for Computer

Vision and Medical Image Analysis. Springer, 2013. 3[10] N. Dalal and B. Triggs. Histograms of oriented gradients for

human detection. In CVPR, 2005. 1, 2

[11] M. Danone, J. Gall, G. Fanelli, and L. v. Gool. Real-time Fa-cial Feature Detection using Conditional Regression Forests.In CVPR, 2012. 1

[12] P. Dollar, R. Appel, S. Belongie, and P. Perona. Fast FeaturePyramids for Object Detection. PAMI, 2014. 1, 2, 3, 5

[13] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral ChannelFeatures. In BMVC, 2009. 1, 2, 3, 5

[14] P. Dollar and L. Zitnick. Structured Forests for Fast EdgeDetection. In ICCV, 2013. 2

[15] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ra-manan. Object detection with discriminatively trained partbased models. PAMI, 32(9):1627–1645, 2010. 1, 2, 4, 5

[16] J. Gall and V. Lempitsky. Class-Specific Hough Forests forObject Detection. In CVPR, 2009. 1, 2, 3, 4, 5

[17] B. Glocker, O. Pauly, E. Konukoglu, and A. Criminisi.Joint Classification-Regression Forests for Spatially Struc-tured Multi-Object Segmentation. In ECCV, 2012. 2

[18] P. Kontschieder, S. Rota Bulo, H. Bischof, and M. Pelillo.Structured Class-Labels in Random Forests for Semantic Im-age Labelling. In ICCV, 2011. 2

[19] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. Torr.What,Where & How Many? Combining Object Detectorsand CRFs. In ECCV, 2010. 1

[20] C. H. Lampert, M. B. Blaschko, and T. Hofmann. BeyondSliding Windows: Object Localization by Efficient Subwin-dow Search. In CVPR, 2008. 2

[21] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool. Dy-namic 3D Scene Analysis from a Moving Vehicle. In CVPR,2007. 5

[22] B. Leibe, A. Leonardis, and B. Schiele. Combined object cat-egorization and segmentation with an implicit shape model.In ECCV Workshops, 2004. 1

[23] J. Marın, D. Vazquez, A. M. Lopez, J. Amores, and B. Leibe.Random Forests of Local Experts for Pedestrian Detection.In ICCV, 2013. 2

[24] M. Pandey and S. Lazebnik. Scene Recognition and WeaklySupervised Object Localization with Deformable Part-BasedModels. In ICCV, 2011. 1

[25] S. Schulter, P. Wohlhart, C. Leistner, A. Saffari, P. M. Roth,and H. Bischof. Alternating Decision Forests. In CVPR,2013. 4, 5

[26] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio,R. Moore, A. Kipman, and A. Blake. Real-Time Human PoseRecognition in Parts from a Single Depth Image. In CVPR,2011. 1

[27] D. Tang, Y. Liu, and T.-K. Kim. Fast Pedestrian Detection byCascaded Random Forest with Dominant Orientation Tem-plates. In BMVC, 2012. 2

[28] V. Vineet, J. Warrell, L. Ladicky, and P. H. S. Torr. HumanInstance Segmentation from Video using Detector-basedConditional Random Fields. In BMVC, 2011. 1

[29] P. Viola and M. Jones. Robust Real-time Object Detection.IJCV, 57(2):137–154, 2004. 1

[30] P. Wohlhart, M. Donoser, P. M. Roth, and H. Bischof. Detect-ing Partially Occluded Objects with an Implicit Shape ModelRandom Field. In ACCV, 2012. 2