
Deformable part models

Ross Girshick, UC Berkeley

CS231B, Stanford University, Guest Lecture, April 16, 2013

Image understanding

photo by “thomas pix” http://www.flickr.com/photos/thomaspix/2591427106

Snack time in the lab

What objects are where?

I see twinkies!

robot: "I see a table with twinkies, pretzels, fruit, and some mysterious chocolate things..."

DPM lecture overview

Figure 6. Our HOG detectors cue mainly on silhouette contours (especially the head, shoulders and feet). The most active blocks are centred on the image background just outside the contour. (a) The average gradient image over the training examples. (b) Each "pixel" shows the maximum positive SVM weight in the block centred on the pixel. (c) Likewise for the negative SVM weights. (d) A test image. (e) Its computed R-HOG descriptor. (f, g) The R-HOG descriptor weighted by, respectively, the positive and the negative SVM weights.

...would help to improve the detection results in more general situations.

Acknowledgments. This work was supported by the European Union research projects ACEMEDIA and PASCAL. We thank Cordelia Schmid for many useful comments. SVM-Light [10] provided reliable training of large-scale SVMs.

References

[1] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. The 8th ICCV, Vancouver, Canada, pages 454–461, 2001.

[2] V. de Poortere, J. Cant, B. Van den Bosch, J. de Prins, F. Fransens, and L. Van Gool. Efficient pedestrian detection: a test case for SVM based categorization. Workshop on Cognitive Vision, 2002. Available online: http://www.vision.ethz.ch/cogvis02/.

[3] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. CVPR, Hilton Head Island, South Carolina, USA, pages 66–75, 2000.

[4] W. T. Freeman and M. Roth. Orientation histograms for hand gesture recognition. Intl. Workshop on Automatic Face and Gesture Recognition, IEEE Computer Society, Zurich, Switzerland, pages 296–301, June 1995.

[5] W. T. Freeman, K. Tanaka, J. Ohta, and K. Kyuma. Computer vision for computer games. 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, USA, pages 100–105, October 1996.

[6] D. M. Gavrila. The visual analysis of human movement: A survey. CVIU, 73(1):82–98, 1999.

[7] D. M. Gavrila, J. Giebel, and S. Munder. Vision-based pedestrian detection: the PROTECTOR+ system. Proc. of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004.

[8] D. M. Gavrila and V. Philomin. Real-time object detection for smart vehicles. CVPR, Fort Collins, Colorado, USA, pages 87–93, 1999.

[9] S. Ioffe and D. A. Forsyth. Probabilistic methods for finding people. IJCV, 43(1):45–68, 2001.

[10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning. The MIT Press, Cambridge, MA, USA, 1999.

[11] Y. Ke and R. Sukthankar. PCA-SIFT: A more distinctive representation for local image descriptors. CVPR, Washington, DC, USA, pages 66–75, 2004.

[12] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

[13] R. K. McConnell. Method of and apparatus for pattern recognition, January 1986. U.S. Patent No. 4,567,610.

[14] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2004. Accepted.

[15] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.

[16] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. The 8th ECCV, Prague, Czech Republic, volume I, pages 69–81, 2004.

[17] A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4):349–361, April 2001.

[18] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 38(1):15–33, 2000.

[19] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. The 7th ECCV, Copenhagen, Denmark, volume IV, pages 700–714, 2002.

[20] H. Schneiderman and T. Kanade. Object detection using the statistics of parts. IJCV, 56(3):151–177, 2004.

[21] E. L. Schwartz. Spatial mapping in the primate sensory projection: analytic structure and relevance to perception. Biological Cybernetics, 25(4):181–194, 1977.

[22] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. The 9th ICCV, Nice, France, volume 1, pages 734–741, 2003.

A Discriminatively Trained, Multiscale, Deformable Part Model

Pedro Felzenszwalb, University of Chicago, pff@cs.uchicago.edu
David McAllester, Toyota Technological Institute at Chicago, mcallester@tti-c.org
Deva Ramanan, UC Irvine, dramanan@ics.uci.edu

Abstract

This paper describes a discriminatively trained, multiscale, deformable part model for object detection. Our system achieves a two-fold improvement in average precision over the best performance in the 2006 PASCAL person detection challenge. It also outperforms the best results in the 2007 challenge in ten out of twenty categories. The system relies heavily on deformable parts. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL challenge. Our system also relies heavily on new methods for discriminative training. We combine a margin-sensitive approach for data mining hard negative examples with a formalism we call latent SVM. A latent SVM, like a hidden CRF, leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. We believe that our training methods will eventually make possible the effective use of more latent information such as hierarchical (grammar) models and models involving latent three dimensional pose.

1. Introduction

We consider the problem of detecting and localizing objects of a generic category, such as people or cars, in static images. We have developed a new multiscale deformable part model for solving this problem. The models are trained using a discriminative procedure that only requires bounding box labels for the positive examples. Using these models we implemented a detection system that is both highly efficient and accurate, processing an image in about 2 seconds and achieving recognition rates that are significantly better than previous systems.

Our system achieves a two-fold improvement in average precision over the winning system [5] in the 2006 PASCAL person detection challenge. The system also outperforms the best results in the 2007 challenge in ten out of twenty object categories. Figure 1 shows an example detection obtained with our person model.

This material is based upon work supported by the National Science Foundation under Grants No. 0534820 and 0535174.

Figure 1. Example detection obtained with the person model. The model is defined by a coarse template, several higher resolution part templates and a spatial model for the location of each part.

The notion that objects can be modeled by parts in a deformable configuration provides an elegant framework for representing object categories [1–3, 6, 10, 12, 13, 15, 16, 22]. While these models are appealing from a conceptual point of view, it has been difficult to establish their value in practice. On difficult datasets, deformable models are often outperformed by "conceptually weaker" models such as rigid templates [5] or bag-of-features [23]. One of our main goals is to address this performance gap.

Our models include both a coarse global template covering an entire object and higher resolution part templates. The templates represent histogram of gradient features [5]. As in [14, 19, 21], we train models discriminatively. However, our system is semi-supervised, trained with a max-margin framework, and does not rely on feature detection. We also describe a simple and effective strategy for learning parts from weakly-labeled data. In contrast to computationally demanding approaches such as [4], we can learn a model in 3 hours on a single CPU.

Another contribution of our work is a new methodology for discriminative training. We generalize SVMs for handling latent variables such as part positions, and introduce a new method for data mining "hard negative" examples during training. We believe that handling partially labeled data is a significant issue in machine learning for computer vision. For example, the PASCAL dataset only specifies a...

AP: 12% (2005) → 27% (2008) → 36% (2009) → 45% (2010) → 49% (2011)

Part 1: modeling

Part 2: learning

Formalizing the object detection task

Many possible ways; this one is popular:

Input: an image and a set of object classes (cat, dog, chair, cow, person, motorbike, car, ...)

Desired output: a labeled box for each object instance (e.g., person, motorbike)

Performance summary: Average Precision (AP); 0 is worst, 1 is perfect

Benchmark datasets

PASCAL VOC 2005–2012
– 54k objects in 22k images
– 20 object classes
– annual competition

Reduction to binary classification

Figure 2. Some sample images from our new human detection database. The subjects are always upright, but with some partial occlusions and a wide range of variations in pose, appearance, clothing, illumination and background.

probabilities to be distinguished more easily. We will often use miss rate at 10^-4 FPPW as a reference point for results. This is arbitrary but no more so than, e.g., Area Under ROC. In a multiscale detector it corresponds to a raw error rate of about 0.8 false positives per 640×480 image tested. (The full detector has an even lower false positive rate owing to non-maximum suppression.) Our DET curves are usually quite shallow so even very small improvements in miss rate are equivalent to large gains in FPPW at constant miss rate. For example, for our default detector at 1e-4 FPPW, every 1% absolute (9% relative) reduction in miss rate is equivalent to reducing the FPPW at constant miss rate by a factor of 1.57.

5 Overview of Results

Before presenting our detailed implementation and performance analysis, we compare the overall performance of our final HOG detectors with that of some other existing methods. Detectors based on rectangular (R-HOG) or circular log-polar (C-HOG) blocks and linear or kernel SVM are compared with our implementations of the Haar wavelet, PCA-SIFT, and shape context approaches. Briefly, these approaches are as follows:

Generalized Haar Wavelets. This is an extended set of oriented Haar-like wavelets similar to (but better than) that used in [17]. The features are rectified responses from 9×9 and 12×12 oriented 1st and 2nd derivative box filters at 45° intervals and the corresponding 2nd derivative xy filter.

PCA-SIFT. These descriptors are based on projecting gradient images onto a basis learned from training images using PCA [11]. Ke & Sukthankar found that they outperformed SIFT for key point based matching, but this is controversial [14]. Our implementation uses 16×16 blocks with the same derivative scale, overlap, etc., settings as our HOG descriptors. The PCA basis is calculated using positive training images.

Shape Contexts. The original Shape Contexts [1] used binary edge-presence voting into log-polar spaced bins, irrespective of edge orientation. We simulate this using our C-HOG descriptor (see below) with just 1 orientation bin. 16 angular and 3 radial intervals with inner radius 2 pixels and outer radius 8 pixels gave the best results. Both gradient-strength and edge-presence based voting were tested, with the edge threshold chosen automatically to maximize detection performance (the values selected were somewhat variable, in the region of 20–50 graylevels).

Results. Fig. 3 shows the performance of the various detectors on the MIT and INRIA data sets. The HOG-based detectors greatly outperform the wavelet, PCA-SIFT and Shape Context ones, giving near-perfect separation on the MIT test set and at least an order of magnitude reduction in FPPW on the INRIA one. Our Haar-like wavelets outperform MIT wavelets because we also use 2nd order derivatives and contrast normalize the output vector. Fig. 3(a) also shows MIT's best parts based and monolithic detectors (the points are interpolated from [17]), however beware that an exact comparison is not possible as we do not know how the database in [17] was divided into training and test parts and the negative images used are not available. The performances of the final rectangular (R-HOG) and circular (C-HOG) detectors are very similar, with C-HOG having the slight edge. Augmenting R-HOG with primitive bar detectors (oriented 2nd derivatives, 'R2-HOG') doubles the feature dimension but further improves the performance (by 2% at 10^-4 FPPW). Replacing the linear SVM with a Gaussian kernel one improves performance by about 3% at 10^-4 FPPW, at the cost of much higher run times¹. Using binary edge voting (EC-HOG) instead of gradient magnitude weighted voting (C-HOG) decreases performance by 5% at 10^-4 FPPW, while omitting orientation information decreases it by much more, even if additional spatial or radial bins are added (by 33% at 10^-4 FPPW, for both edges (E-ShapeC) and gradients (G-ShapeC)). PCA-SIFT also performs poorly. One reason is that, in comparison to [11], many more (80 of 512) principal vectors have to be retained to capture the same proportion of the variance. This may be because the spatial registration is weaker when there is no keypoint detector.

6 Implementation and Performance Study

We now give details of our HOG implementations and systematically study the effects of the various choices on detector performance.

¹We use the hard examples generated by linear R-HOG to train the kernel R-HOG detector, as kernel R-HOG generates so few false positives that its hard example set is too sparse to improve the generalization significantly.

pos = { ... ... }

neg = { ... background patches ... }
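As a concrete illustration of this reduction (not from the slides), here is a minimal sketch that builds such a pos/neg training set and fits a linear SVM. It assumes scikit-image's hog and scikit-learn's LinearSVC, grayscale images larger than the window, and positive crops already resized to the window size.

    import numpy as np
    from skimage.feature import hog      # assumed feature extractor
    from sklearn.svm import LinearSVC    # assumed linear SVM trainer

    def make_binary_training_set(pos_crops, neg_images, per_image=10,
                                 window=(128, 64), seed=0):
        # pos: HOG features of person crops; neg: HOG features of random
        # background patches sampled from person-free images
        rng = np.random.default_rng(seed)
        X, y = [], []
        for crop in pos_crops:                   # crops already resized to `window`
            X.append(hog(crop)); y.append(+1)
        for im in neg_images:
            H, W = im.shape
            for _ in range(per_image):
                r = rng.integers(0, H - window[0])
                c = rng.integers(0, W - window[1])
                X.append(hog(im[r:r + window[0], c:c + window[1]])); y.append(-1)
        return np.array(X), np.array(y)

    # X, y = make_binary_training_set(pos_crops, neg_images)
    # clf = LinearSVC(C=0.01).fit(X, y)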

Descriptor Cues

(figure: input image; average gradient; maximum positive and negative SVM weights per block; input weighted by positive and negative weights)

• The most important cues are head, shoulder, and leg silhouettes
• Vertical gradients inside the person count as negative
• Overlapping blocks just outside the contour are the most important

(from the slides "Histograms of Oriented Gradients for Human Detection")

SVM “Sliding window” detector

Dalal & Triggs (CVPR’05)

HOG

Sliding window detection

• Compute HOG of the whole image at multiple resolutions

• Score every subwindow of the feature pyramid

• Apply non-maxima suppression
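A minimal sketch of these three steps (one pyramid level of scoring, plus a greedy NMS pass); the window size, feature-map layout, and IoU threshold are illustrative assumptions, not the release code.

    import numpy as np

    def sliding_window_scores(feat, w, win=(16, 8)):
        # Score every subwindow of one pyramid level's HOG feature map.
        # feat: (rows, cols, d) array of HOG cells; w: flat filter of size
        # win[0] * win[1] * d.
        rows = feat.shape[0] - win[0] + 1
        cols = feat.shape[1] - win[1] + 1
        S = np.empty((rows, cols))
        for r in range(rows):
            for c in range(cols):
                S[r, c] = w @ feat[r:r + win[0], c:c + win[1]].ravel()
        return S

    def iou(a, b):
        # intersection-over-union of (x1, y1, x2, y2) boxes
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def nms(dets, thresh=0.5):
        # greedy non-maximum suppression on (score, x1, y1, x2, y2) tuples:
        # keep the highest-scoring box, drop boxes that overlap it too much
        keep = []
        for d in sorted(dets, key=lambda d: -d[0]):
            if all(iou(d[1:], k[1:]) < thresh for k in keep):
                keep.append(d)
        return keep

    # Full detector: compute feat at multiple image scales, threshold each
    # level's S, map cell coordinates back to image boxes, then call nms.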


Image pyramid → HOG feature pyramid

score(x, p) = w · φ(x, p)

Detection

number of locations: p ~ 250,000 per image

test set has ~ 5,000 images

>> 1.3 × 10^9 windows to classify

typically only ~ 1,000 true positive locations

Extremely unbalanced binary classification

Dalal & Triggs detector on INRIA

(figure: recall–precision curves on the INRIA static person database (a) and the INRIA static+moving person database (b); legends: Ker. R-HOG, Lin. R-HOG, Lin. R2-HOG, Wavelet, PCA-SIFT, Lin. E-ShapeC; and R-HOG + IMHmd, R-HOG, Wavelet)

Fig. 3.6. The performance of selected detectors on the INRIA static (left) and static+moving (right) person data sets. For both data sets, the plots show the substantial overall gains obtained by using HOG features rather than other state-of-the-art descriptors. (a) Compares static HOG descriptors with other state-of-the-art descriptors on the INRIA static person data set. (b) Compares the combined static and motion HOG, the static HOG, and the wavelet detectors on the combined INRIA static and moving person data set.

... [2001] but also includes both 1st and 2nd-order derivative filters at 45° intervals and the corresponding 2nd derivative xy filter. It yields an AP of 0.53. Shape contexts based on edges (E-ShapeC) perform considerably worse with an AP of 0.25. However, Chapter 4 will show that generalised shape contexts [Mori and Malik 2003], which like standard shape contexts compute circular blocks with cells shaped over a log-polar grid, but which use both image gradients and orientation histograms as in R-HOG, give similar performance. This highlights the fact that orientation histograms are very effective at capturing the information needed for object recognition.

For the video sequences we compare our combined static and motion HOG, static HOG, and Haar wavelet detectors. The detectors were trained and tested on training and test portions of the combined INRIA static and moving person data set. Details on how the descriptors and the data sets were combined are presented in Chapter 6. Figure 3.6(b) summarises the results. The HOG-based detectors again significantly outperform the wavelet based one, but surprisingly the combined static and motion HOG detector does not seem to offer a significant advantage over the static HOG one: the static detector gives an AP of 0.553 compared to 0.527 for the motion detector. These results are surprising and disappointing because Sect. 6.5.2, where we used DET curves (cf. Sect. B.1) for evaluations, shows that for exactly the same data set, the individual window classifier for the motion detector gives significantly better performance than the static HOG window classifier, with false positive rates about one order of magnitude lower than those for the static HOG classifier. We are not sure what is causing this anomaly and are currently investigating it. It seems to be linked to the threshold used for truncating the scores in the mean shift fusion stage (during non-maximum suppression) of the combined detector.

• AP = 75%

• (79% in my implementation)

• Very good

• Declare victory and go home?

Dalal & Triggs on PASCAL VOC 2007

AP = 12%

(using my implementation)


How can we do better?

Revisit an old idea: part-based models ("pictorial structures")
– Fischler & Elschlager '73, Felzenszwalb & Huttenlocher '00

Combine with modern features and machine learning

Part-based models

• Parts — local appearance templates

• “Springs” — spatial connections between parts (geom. prior)

Image: [Felzenszwalb and Huttenlocher 05]

Part-based models

• Local appearance is easier to model than global appearance

- Training data shared across deformations

- “part” can be local or global depending on resolution

• Generalizes to previously unseen configurations

General formulation

A model is a graph G = (V, E) with parts V = (v_1, ..., v_n).

A configuration (p_1, ..., p_n) ∈ P^n gives the part locations in the image (or feature pyramid).

(figure: parts v_1, v_2 connected by a spring, with p a location in the image)

Part configuration score function

score(p_1, ..., p_n) = Σ_{i=1}^{n} m_i(p_i) − Σ_{(i,j) ∈ E} d_{ij}(p_i, p_j)

(part match scores m_i; spring costs d_{ij})

Highest scoring configurations

Part configuration score function

score(p_1, ..., p_n) = Σ_{i=1}^{n} m_i(p_i) − Σ_{(i,j) ∈ E} d_{ij}(p_i, p_j)

(part match scores m_i; spring costs d_{ij})

• Objective: maximize the score over p_1, ..., p_n
• h^n configurations! (h = |P|, about 250,000)
• Dynamic programming
  – If G = (V, E) is a tree, O(nh²) general algorithm
  – O(nh) with some restrictions on d_{ij}
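Spelling out the tree case (standard max-sum dynamic programming, consistent with the score function above, though not written out on the slide): pass messages from the leaves toward the root,

μ_{i→j}(p_j) = max_{p_i} [ m_i(p_i) + Σ_{k ∈ children(i)} μ_{k→i}(p_i) − d_{ij}(p_i, p_j) ]

score*(p_r) = m_r(p_r) + Σ_{k ∈ children(r)} μ_{k→r}(p_r)

Each message takes a max over h candidate values of p_i for each of h values of p_j, so O(h²) per edge and O(nh²) overall; when d_{ij} is quadratic in p_i − p_j, the generalized distance transform computes each message in O(h).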

Star-structured deformable part models

(figure: a test image, a "star" model with a root and parts, and the resulting detection)

Recall the Dalal & Triggs detector

• HOG feature pyramid

• Linear filter / sliding-window detector

• SVM training to learn parameters w


Image pyramid → HOG feature pyramid

score(x, p) = w · φ(x, p)

D&T + parts

• Add parts to the Dalal & Triggs detector
  – HOG features
  – Linear filters / sliding-window detector
  – Discriminative training


[FMR CVPR'08] [FGMR PAMI'10]

(figure: image pyramid → HOG feature pyramid, with root location p_0 and part placements z)

Sliding window DPM score function


z = (p_0, ..., p_n)

score(x, p_0) = max_{p_1, ..., p_n} [ Σ_{i=0}^{n} m_i(p_i) − Σ_{(i,j) ∈ E} d_{ij}(p_i, p_j) ]

(filter scores m_i; spring costs d_{ij}; p_0 is the root location in the HOG feature pyramid, z the full configuration)

Detection in a slide

(figure: the matching pipeline: the test image's feature map, and the feature map at 2x resolution, are convolved with the root filter and the 1st through n-th part filters; the part-filter responses are transformed by spreading them with the deformation costs, and added to the root-filter response to give detection scores for each root location; colors encode low to high filter response values)

score(p_0) = m_0(p_0) + Σ_{i=1}^{n} max_{p_i} [ m_i(p_i) − d_i(p_0, p_i) ]
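A sketch of the "transformed responses" step (the bracketed max above) for a single part, assuming a quadratic spring cost d = (d_x, d_y) and a displacement radius s, both illustrative; the real system uses a generalized distance transform and 2x-resolution parts instead of this naive loop.

    import numpy as np

    def transform_response(part_resp, d, s=4):
        # out[y, x] = max over |dx|, |dy| <= s of
        #   part_resp[y + dy, x + dx] - (d[0]*dx**2 + d[1]*dy**2)
        # Naive O(h * s^2) loop; a generalized distance transform does O(h).
        H, W = part_resp.shape
        out = np.full((H, W), -np.inf)
        for dy in range(-s, s + 1):
            for dx in range(-s, s + 1):
                cost = d[0] * dx * dx + d[1] * dy * dy
                shifted = np.full((H, W), -np.inf)  # shift with -inf padding
                shifted[max(0, -dy):H - max(0, dy), max(0, -dx):W - max(0, dx)] = \
                    part_resp[max(0, dy):H - max(0, -dy), max(0, dx):W - max(0, -dx)]
                out = np.maximum(out, shifted - cost)
        return out

    # detection scores for each root location (2x bookkeeping omitted):
    # score = root_resp + sum(transform_response(r, d) for r, d in parts)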

What are the parts?

Aspect soup

General philosophy: enrich models to better represent the data

aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

Our rank: 3 1 2 1 1 2 2 4 1 1 1 4 2 2 1 1 2 1 4 1
Our score: .180 .411 .092 .098 .249 .349 .396 .110 .155 .165 .110 .062 .301 .337 .267 .140 .141 .156 .206 .336
Darmstadt: .301 (mbike only)
INRIA Normal: .092 .246 .012 .002 .068 .197 .265 .018 .097 .039 .017 .016 .225 .153 .121 .093 .002 .102 .157 .242
INRIA Plus: .136 .287 .041 .025 .077 .279 .294 .132 .106 .127 .067 .071 .335 .249 .092 .072 .011 .092 .242 .275
IRISA (tested on 10 classes): .281 .318 .026 .097 .119 .289 .227 .221 .175 .253
MPI Center: .060 .110 .028 .031 .000 .164 .172 .208 .002 .044 .049 .141 .198 .170 .091 .004 .091 .034 .237 .051
MPI ESSOL: .152 .157 .098 .016 .001 .186 .120 .240 .007 .061 .098 .162 .034 .208 .117 .002 .046 .147 .110 .054
Oxford (tested on 6 classes): .262 .409 .393 .432 .375 .334
TKK: .186 .078 .043 .072 .002 .116 .184 .050 .028 .100 .086 .126 .186 .135 .061 .019 .036 .058 .067 .090

Table 1. PASCAL VOC 2007 results. Average precision scores of our system and other systems that entered the competition [7]. Empty boxes indicate that a method was not tested in the corresponding class. The best score in each class is shown in bold. Our current system ranks first in 10 out of 20 classes. A preliminary version of our system ranked first in 6 classes in the official competition.

Bottle, Car, Bicycle, Sofa

Figure 4. Some models learned from the PASCAL VOC 2007 dataset. We show the total energy in each orientation of the HOG cells in the root and part filters, with the part filters placed at the center of the allowable displacements. We also show the spatial model for each part, where bright values represent "cheap" placements, and dark values represent "expensive" placements.

...in the PASCAL competition was .16, obtained using a rigid template model of HOG features [5]. The best previous result of .19 adds a segmentation-based verification step [20]. Figure 6 summarizes the performance of several models we trained. Our root-only model is equivalent to the model from [5] and it scores slightly higher at .18. Performance jumps to .24 when the model is trained with a LSVM that selects a latent position and scale for each positive example. This suggests LSVMs are useful even for rigid templates because they allow for self-adjustment of the detection window in the training examples. Adding deformable parts increases performance to .34 AP, a factor of two above the best previous score. Finally, we trained a model with parts but no root filter and obtained .29 AP. This illustrates the advantage of using a multiscale representation.

We also investigated the effect of the spatial model and allowable deformations on the 2006 person dataset. Recall that s_i is the allowable displacement of a part, measured in HOG cells. We trained a rigid model with high-resolution parts by setting s_i to 0. This model outperforms the root-only system by .27 to .24. If we increase the amount of allowable displacements without using a deformation cost, we start to approach a bag-of-features. Performance peaks at s_i = 1, suggesting it is useful to constrain the part displacements. The optimal strategy allows for larger displacements while using an explicit deformation cost. The follow-...


Mixture models

Data driven: aspect, occlusion modes, subclasses

FMR CVPR ’08: AP = 0.27 (person)

FGMR PAMI ’10: AP = 0.36 (person)

Figure 4.3: Car components 1–3 with parts initialized by interpolating the root filter to twice its resolution (a, c, e), and parts after training with LSVM or WL-SSVM (b, d, f).


Pushmi–pullyu?

Good generalization properties on Doctor Dolittle’s farm

This was supposed to detect horses

( + ) / 2 =

Latent orientation

Unsupervised left/right orientation discovery

FGMR PAMI ’10: AP = 0.36 (person)

voc-release5: AP = 0.45 (person)

Publicly available code for the whole system: current voc-release5

(figure: horse AP: 0.42, 0.47, 0.57)

Summary of results



[DT'05] AP 0.12 → [FMR'08] AP 0.27 → [FGMR'10] AP 0.36 → [GFM voc-release5] AP 0.45 → [GFM'11] AP 0.49

Part 2: DPM parameter learning

(figure: a fixed model structure: component 1 and component 2, with all filters and deformation costs shown as question marks)

Training images with labels y = +1 and y = −1

Parameters to learn:
– biases (per component)
– deformation costs (per part)
– filter weights

Linear parameterization

z = (p_0, ..., p_n)

score(x, p_0) = max_{p_1, ..., p_n} [ Σ_{i=0}^{n} m_i(p_i) − Σ_{(i,j) ∈ E} d_{ij}(p_i, p_j) ]

Filter scores: m_i(p_i) = w_i · φ(x, p_i)

Spring costs: d_{ij}(p_i, p_j) = d_{ij} · ψ(p_i, p_j), with ψ(p_i, p_j) = (dx, dy, dx², dy²)

Stacking all the filters and deformation parameters into one vector w:

score(x, p_0) = max_z w · Φ(x, z), over configurations z rooted at p_0
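To make the linearity concrete, here is one way to lay out Φ(x, z) as a concatenation so that w · Φ(x, z) equals the score above; hog_at, filter_dims, and anchors are assumed helpers and values, and the 2x-resolution bookkeeping for parts is ignored.

    import numpy as np

    def Phi(x, z, hog_at, filter_dims, anchors):
        # z = (p0, ..., pn), each p = (px, py) in the feature pyramid.
        # w concatenates the filter weights w_0..w_n, then the deformation
        # parameters d_1..d_n, so w . Phi(x, z) = sum of filter scores
        # minus sum of spring costs.
        blocks = []
        for p, dims in zip(z, filter_dims):
            blocks.append(hog_at(x, p, dims))   # flat HOG subwindow under filter i
        p0 = z[0]
        for p, a in zip(z[1:], anchors):        # parts relative to their anchors
            dx = p[0] - (p0[0] + a[0])
            dy = p[1] - (p0[1] + a[1])
            # negated so the d_i block of w *subtracts* the spring cost
            blocks.append(-np.array([dx, dy, dx * dx, dy * dy], dtype=float))
        return np.concatenate(blocks)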

Positive examples (y = +1)

x specifies an image and a bounding box (e.g., a person)

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≥ +1

Z(x) includes all z with more than 70% overlap with ground truth

Negative examples (y = −1)

x specifies an image and a HOG pyramid location p_0

We want f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) to score ≤ −1

Z(x) restricts the root to p_0 and allows any placement of the other filters

Typical dataset

300 – 8,000 positive examples

500 million to 1 billion negative examples (not including latent configurations!)

Large-scale*

*unless someone from Google is here

How we learn parameters: latent SVM

E(w) = (1/2)||w||² + C Σ_i max{0, 1 − y_i f_w(x_i)}

     = (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 − max_{z ∈ Z(x_i)} w · Φ(x_i, z)}
                   + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

(figure: plotted as a function of w, f_w(x) = max_{z ∈ Z(x)} w · Φ(x, z) is the upper envelope of linear functions, one per latent value z_1, z_2, z_3, z_4)

Since f_w(x) is a max of functions linear in w, it is convex in w. So the negative-example terms max{0, 1 + f_w(x)} are convex, while the positive-example terms involve 1 − f_w(x), which is concave :(

Observations

Latent SVM objective is convex in the negatives but not in the positives
>> "semi-convex"

Convex upper bound on loss

For each positive example, fix a latent value Z_{P_i} ∈ Z(x_i), e.g., the best z under the current w (in the figure, Z_{P_i} = z_2). Then

max{0, 1 − max_{z ∈ Z(x_i)} w · Φ(x_i, z)} ≤ max{0, 1 − w · Φ(x_i, Z_{P_i})}

and the right-hand side is convex in w, with the bound tight at the current w.

(figure: the concave positive-example loss and its convex upper bound, touching at the current w)

Auxiliary objective

Let Z_P = {Z_{P_1}, Z_{P_2}, ...}

E(w, Z_P) = (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 − w · Φ(x_i, Z_{P_i})}
            + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Note that E(w, Z_P) ≥ min_{Z_P} E(w, Z_P) = E(w),

and w* = argmin_{w, Z_P} E(w, Z_P) = argmin_w E(w).

Auxiliary objective

w* = argmin_{w, Z_P} E(w, Z_P) = argmin_w E(w)

This isn't any easier to optimize.

Find a stationary point by coordinate descent on E(w, Z_P):

Initialization: pick a w^(0) (or a Z_P)

Step 1: Z_{P_i} := argmax_{z ∈ Z(x_i)} w^(t) · Φ(x_i, z)   for all i ∈ pos

Step 2: w^(t+1) := argmin_w E(w, Z_P)
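A runnable toy version of this coordinate descent, with Step 2 done by subgradient descent on the convex auxiliary objective. phi, Z, and the step sizes are placeholders; real systems cannot enumerate Z(x) for the negatives and instead use the data mining described below.

    import numpy as np

    def lsvm_train(pos, neg, phi, Z, dim, C=0.01, rounds=5, iters=200, lr=1e-3):
        # Coordinate descent on the auxiliary objective E(w, ZP).
        w = np.zeros(dim)
        for _ in range(rounds):
            # Step 1 (just detection): best latent value for each positive
            ZP = [max(Z(x), key=lambda z: w @ phi(x, z)) for x in pos]
            # Step 2: subgradient descent on the convex objective, ZP fixed
            for _ in range(iters):
                g = w.copy()                            # grad of 0.5*||w||^2
                for x, z in zip(pos, ZP):
                    if 1 - w @ phi(x, z) > 0:           # active positive hinge
                        g -= C * phi(x, z)
                for x in neg:
                    zs = max(Z(x), key=lambda z: w @ phi(x, z))
                    if 1 + w @ phi(x, zs) > 0:          # active negative hinge
                        g += C * phi(x, zs)
                w -= lr * g
        return w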

Step 1

This is just detection:

Z_{P_i} := argmax_{z ∈ Z(x_i)} w^(t) · Φ(x_i, z)

(the same matching pipeline as before: root and part filter responses combined over the feature pyramid)

Step 2

min_w (1/2)||w||² + C Σ_{i ∈ pos} max{0, 1 − w · Φ(x_i, Z_{P_i})}
      + C Σ_{i ∈ neg} max{0, 1 + max_{z ∈ Z(x_i)} w · Φ(x_i, z)}

Convex

Similar to a structural SVM

But, recall 500 million to 1 billion negative examples!

Can be solved by a working set method
– "bootstrapping"
– "data mining"
– "constraint generation"
– requires a bit of engineering to make this fast
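A sketch of such a working set loop; neg_stream, score, and train_svm are assumed helpers, and the cache size and margin tests are illustrative.

    import itertools

    def train_with_hard_negatives(pos_feats, neg_stream, score, train_svm,
                                  cache_limit=50_000, rounds=10):
        # Working-set ("bootstrapping" / "data mining") sketch.
        # pos_feats: positive features with latent values fixed by Step 1.
        # neg_stream(): a fresh pass over all negative-window features.
        # score(w, f) -> float and train_svm(pos, neg) -> w are assumed helpers.
        cache = list(itertools.islice(neg_stream(), cache_limit))  # seed the cache
        w = None
        for _ in range(rounds):
            w = train_svm(pos_feats, cache)                   # solve on the cache
            cache = [f for f in cache if score(w, f) >= -1]   # shrink: drop easy ones
            for f in neg_stream():                            # grow: mine hard ones
                if len(cache) >= cache_limit:
                    break
                if score(w, f) > -1:                          # violates the margin
                    cache.append(f)
        return w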

Comments

Latent SVM is mathematically equivalent to MI-SVM (Andrews et al. NIPS 2003)

Latent SVM can be written as a latent structural SVM (Yu and Joachims ICML 2009)

– natural optimization algorithm is the concave-convex procedure
– similar to, but not exactly the same as, coordinate descent

(figure: MI-SVM view: example x_i is a bag of instances x_i1, x_i2, x_i3 with latent labels z_1, z_2, z_3)

What about the model structure?

(figure: the fixed model structure: component 1 and component 2, with all filters and deformation costs unknown; training images with labels y = +1 and −1)

Model structure
– # components
– # parts per component
– root and part filter shapes
– part anchor locations

Learning model structure

Split positives by aspect ratio

Warp to common size

Train Dalal & Triggs model for each aspect ratio on its own
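A toy version of the split-and-warp initialization; n_components, the equal-size split, and skimage.transform.resize are assumptions standing in for the release code's heuristics.

    import numpy as np
    from skimage.transform import resize   # assumed image-warping routine

    def split_and_warp(crops, n_components=2, out_area=8000):
        # Split positives into groups by bounding-box aspect ratio, then warp
        # each group's crops to that group's mean aspect ratio at a fixed area.
        aspects = np.array([c.shape[1] / c.shape[0] for c in crops])
        order = np.argsort(aspects)
        groups = np.array_split(order, n_components)   # equal-size aspect buckets
        warped = []
        for g in groups:
            a = aspects[g].mean()                      # target aspect (w/h)
            h = int(round(np.sqrt(out_area / a)))
            w = int(round(a * h))
            warped.append([resize(crops[i], (h, w)) for i in g])
        return warped   # one D&T-style training set per component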

Learning model structure

Use D&T filters as initial w for LSVM training

Merge components

Root filter placement and component choice are latent

Learning model structure

Add parts to cover high-energy areas of root filters

Continue training model with LSVM

Learning model structure

(figure: models trained without vs. with orientation clustering)

Learning model structure

In summary:
– repeated application of LSVM training to models of increasing complexity
– structure learning involves many heuristics (and vision insight!)

– structure learning involves many heuristics (and vision insight!)
