


Detecting and Segmenting Humans in Crowded Scenes

Mikel Rodriguez Sullivan
University of Central Florida
4000 Central Florida Blvd
Orlando, Florida, 32816
[email protected]

Mubarak Shah
University of Central Florida
4000 Central Florida Blvd
Orlando, Florida, 32816
[email protected]

ABSTRACT

We describe an approach for detecting and segmenting humans with extensive posture articulations in crowded video sequences. In our method we learn a set of posture clusters, and a codebook of local shape distributions for humans in various postures. Instances of the codebook entries cast votes for the locations of humans in the video and their respective postures. Subsequently, consistent hypotheses are found as maxima within a voting space. Finally, the segmentation of humans in the scene is initialized by the corresponding posture clusters and contours are evolved to obtain precise and consistent segmentations.

Our experimental results indicate that the framework provides a simple yet effective means for aggregating shape-based cues. The proposed method is capable of detecting and segmenting humans in crowded scenes as they perform a diverse set of activities and undergo a wide range of articulations within different contexts.

Categories and Subject Descriptors

I.4.8 [Image Processing and Computer Vision]: Scene Analysis—Object recognition; I.4.6 [Image Processing and Computer Vision]: Segmentation

General Terms

Algorithms

Keywords

Object Recognition, Human Detection, Segmentation.

1. INTRODUCTION

The ability to accurately detect and segment humans in video sequences represents an essential component in a wide range of application domains such as dynamic scene analysis, human-computer interface design, driver assistance systems, and the development of intelligent environments. Nevertheless, the problem of human detection has numerous challenges associated with it. Effective solutions must account not only for the nearly 250 degrees of freedom of which the human body is capable [14], but also for the variability introduced by factors such as different clothing styles and the presence of occluding accessories such as backpacks and briefcases. Furthermore, a significant percentage of scenes, such as urban environments, contain substantial amounts of clutter and occlusion.

Figure 1: A typical crowded real-world urban scene along with a series of detections from our method.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'06, October 23-27, 2006, Santa Barbara, California, USA. Copyright 2006 ACM 1-59593-447-2/06/0010 ...$5.00.

Despite these challenges, detecting humans within video sequences has constituted an active area of research for a number of years, resulting in the proposal of numerous approaches. Nevertheless, only a small subset of the existing methods has been demonstrated to be effective in the presence of considerable overlap and partial occlusion in video sequences, such as those seen in Figure 1.

The shape of the human silhouette is often very different from the shape of other objects in a scene. Therefore, shape-based detection of humans represents a powerful cue which can be integrated into existing lines of research. As opposed to appearance-based models, human shapes tend to be somewhat isotropic. Hence, shape-based methods coupled with other cues, such as motion, can provide a discriminating factor for recognition.

In this paper we address the challenge of detecting and segmenting humans in video sequences containing crowded real-world scenes. Due to the difficulty of this problem, reliance on any single model or feature alone would be ineffective. Successful approaches must therefore integrate global and local cues, drawing upon the strengths of several techniques that use shape as well as motion cues.

Figure 2: The main steps in human detection. (a) Input frame. (b) Foreground blobs. (c) Contours extracted from the foreground blobs are depicted at the top; the bottom row depicts the votes cast by codebook instances, representing hypotheses of human centroids in the scene. (d) Final detection and segmentation.

The organization of the paper is as follows: in the next section we present an overview of related work. In Section 3 we present our recognition approach and a class-specific segmentation method. We describe the experiments that we performed and their corresponding results in Section 4, and we conclude by discussing future work in Section 5.

2. RELATED WORK

Currently, the most prevalent class of methods present in the literature is the detector-style method, in which detectors are trained to search for humans within a video sequence over a range of scales. A number of these methods use global features such as edge templates [3], while others build classifiers based on local features such as SIFT-like descriptors [6], Haar wavelets [11], and the histogram of oriented gradients descriptor [2].

Another family of approaches models humans as a collection of parts [5], [7], [8], [9]. Typically this class of approaches relies on a set of low-level features which produce a series of part location hypotheses. Subsequently, inferences are made with respect to the best assembly of existing part hypotheses. Approaches such as AdaBoost have been used with some degree of success to learn body part detectors for the face [10], hands, arms, legs, and torso [5], [8]. While this class of approaches is attractive, detection of parts is itself a challenging task. This is particularly difficult in the class of scenes in which we are interested: crowded scenes containing significant occlusion among many parts.

A considerable amount of work has also focused on shape-based detection. Zhao and Thorpe [16] use a neural network that is trained on human silhouettes to verify whether extracted silhouettes correspond to a human subject. However, a potential disadvantage of the approach resides in the fact that they rely on depth data to extract the silhouettes. Our work is similar in principle to the framework presented in [4], in which a patch-based approach is used to learn an implicit shape model for walking humans. Others, such as Zhao and Davis [15], have also attempted to make use of shape-based cues by comparing edges to a series of learned models. Wu and Yu [12] have proposed learning human shape models and representing them via a Boltzmann distribution in a Markov field. Although a number of these methods have proved to be successful in detecting humans in still images, most of them assume isolated human subjects with a minimal presence of clutter and occlusion. We are interested in methods that enable effective use of shape-based cues in the presence of heavy occlusion of the kind present in most urban environments.

3. APPROACH

In this section we describe our approach for detecting and segmenting humans in cluttered video sequences. The method is based on a probabilistic formulation that captures the local shape distribution of humans in various postures.

The method begins by learning a set of posture clusters which are used to initialize segmentation. Additionally, we learn a codebook of local shape distributions based on humans in the training set. When the system is presented with a testing video sequence, it extracts contours from the foreground blobs in each frame, samples them using shape context, finds instances of the learned local shape codebook, and casts votes for human locations and their respective postures in the frame. Subsequently, the system searches for consistent hypotheses by finding maxima within the voting space. Given the locations and postures of humans in the scene, the method proceeds to segment each subject. This is achieved by placing the characteristic silhouette corresponding to the posture cluster of every consistent hypothesis around the centroid vote.

3.1 Learning

Given a set of training video sequences, we perform background subtraction and extract silhouettes by performing edge detection on foreground blobs corresponding to humans in the scene. Each silhouette is sampled at d points (in our experiments d ranges from 1000 to 3000). For a point pi on the silhouette, we identify a distribution over relative positions by computing a coarse histogram hi of the relative coordinates of the d − 1 remaining points. These histograms are concatenated into a single shape vector for every silhouette. Shape vectors are then clustered into n clusters using K-means, resulting in a set of posture clusters.
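The learning step above can be sketched as follows; this is a minimal illustration under stated assumptions, not the authors' implementation. The names `shape_vector`, `kmeans`, and `learn_posture_clusters`, the log-polar binning of the relative coordinates, and the tiny dependency-free K-means are all our own choices, and the sketch assumes every silhouette is sampled with the same number of points d so that the concatenated shape vectors have equal length.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's K-means (kept dependency-free for illustration)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest center, then recompute centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def shape_vector(points, n_r=4, n_theta=8):
    """For each of the d sampled silhouette points, build a coarse
    log-polar histogram of the relative coordinates of the remaining
    d - 1 points, then concatenate the per-point histograms."""
    pts = np.asarray(points, float)
    d = len(pts)
    hists = []
    for i in range(d):
        rel = np.delete(pts, i, axis=0) - pts[i]
        r = np.log1p(np.hypot(rel[:, 0], rel[:, 1]))
        theta = np.arctan2(rel[:, 1], rel[:, 0])
        h, _, _ = np.histogram2d(r, theta, bins=(n_r, n_theta),
                                 range=[[0.0, r.max() + 1e-9],
                                        [-np.pi, np.pi + 1e-9]])
        hists.append(h.ravel() / (d - 1))
    return np.concatenate(hists)

def learn_posture_clusters(silhouettes, n_clusters):
    """Cluster one shape vector per silhouette into posture clusters."""
    X = np.vstack([shape_vector(s) for s in silhouettes])
    return kmeans(X, n_clusters)
```

In practice d is large (1000-3000), so the concatenated vectors are very high-dimensional; the coarse binning keeps each per-point histogram small.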

Once the posture clusters are created, the second phase of the process consists of learning a codebook of local shapes and their spatial distribution for different poses and scales. We use the same silhouettes extracted in the previous step from the training set. These silhouettes are sampled using Shape Context descriptors [1]. Subsequently, all of the Shape Context descriptors are clustered into M clusters using K-means. The similarity of two shape descriptors is given by the χ² distance between their K-bin histograms g(k) and h(k):

    χ² = (1/2) Σ_{k=1..K} [g(k) − h(k)]² / (g(k) + h(k)).    (1)
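Equation (1) translates directly into code. The sketch below is ours, not the paper's; the helper names `chi2_distance` and `nearest_codebook_entry` are hypothetical, and the small epsilon guarding against empty bins is an added numerical safeguard.

```python
import numpy as np

def chi2_distance(g, h, eps=1e-12):
    """Eq. (1): chi^2 = 0.5 * sum_k (g_k - h_k)^2 / (g_k + h_k)."""
    g = np.asarray(g, float)
    h = np.asarray(h, float)
    # eps avoids division by zero when both histograms have an empty bin.
    return 0.5 * float(np.sum((g - h) ** 2 / np.maximum(g + h, eps)))

def nearest_codebook_entry(descriptor, codebook):
    """Return the index of, and chi^2 distance to, the closest center."""
    dists = [chi2_distance(descriptor, c) for c in codebook]
    return int(np.argmin(dists)), float(min(dists))
```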

For each of the M clusters we store the cluster center as a codebook entry. We then learn the spatial distribution of the codebook entries for different postures. This is done by iterating through all of the foreground blobs from the training set, sampling each silhouette via shape context descriptors, and matching these against codebook entries. For each instance of a codebook entry we record two pieces of information: the position with respect to the centroid of the human silhouette on which it occurred, and the closest posture cluster to which the silhouette belongs.

3.2 Detection and segmentation

Given a testing video sequence, we sample the silhouettes which are extracted from the foreground blobs produced by background subtraction. These samples are then compared to the learned codebook of local shapes. If a match is found, the corresponding codebook entry casts votes for the possible centroid of a human in the scene (Figure 3) and the posture cluster to which it belongs. Votes are aggregated in a continuous voting space and Mean-Shift is used to find maxima. The various steps associated with detection are depicted in Figure 2.

Let s be a local shape observed at location l. If s matches a set of codebook entries Ci, each activation Ci is weighted by the probability p(Ci|s, l). Codebook activations cast votes for the possible location x and posture hn of humans within the scene; each vote is weighted by p(hn, x|Ci, l). This can be expressed as:

    p(hn, x | s, l) = Σ_i p(hn, x | Ci, l) p(Ci | s).    (2)

Here the first term corresponds to the vote for a human in a given posture at a specific location, and the second term is derived from the χ² distance between the local shape and the codebook entry.
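The voting of Eq. (2) can be illustrated with a discrete accumulator. The data layout below (a list of `(entry_id, weight)` activations per observation, and a per-entry list of learned `(dy, dx, posture, w)` offsets) is an assumed representation of the codebook, not the paper's actual data structures.

```python
import numpy as np

def cast_votes(observations, offsets, n_postures, frame_shape):
    """Accumulate the votes of Eq. (2) in a (posture, y, x) voting space.

    observations: list of ((y, x) location, [(entry_id, p_ci_s), ...]),
        where p_ci_s plays the role of p(Ci | s).
    offsets: dict entry_id -> list of (dy, dx, posture, w), the learned
        spatial distribution p(hn, x | Ci, l) for that codebook entry.
    """
    height, width = frame_shape
    votes = np.zeros((n_postures, height, width))
    for (y, x), activations in observations:
        for entry_id, p_ci_s in activations:
            for dy, dx, posture, w in offsets.get(entry_id, []):
                yy, xx = y + dy, x + dx
                if 0 <= yy < height and 0 <= xx < width:
                    # Each vote is weighted by both factors of Eq. (2).
                    votes[posture, yy, xx] += w * p_ci_s
    return votes
```

A continuous voting space, as used in the paper, would keep the votes as weighted points instead of rasterizing them; the accumulator form is used here only for clarity.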

Given a set of hypotheses hi corresponding to (hn, x), we search for consistent votes by integrating hypotheses within a search window W(x) using Mean-Shift. Thus, the strength of a hypothesis is determined by:

    score(hi) = Σ_k Σ_{xj ∈ W(x)} p(hn, xj | sk, lk).    (3)
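A minimal weighted mean-shift over the vote points, in the spirit of Eq. (3), might look as follows. The flat-kernel window, the `merge_tol` de-duplication of converged modes, and both function names are our assumptions; the paper does not specify these implementation details.

```python
import numpy as np

def mean_shift_modes(points, weights, bandwidth, iters=30, merge_tol=1.0):
    """Weighted mean-shift with a flat kernel of radius `bandwidth`
    (playing the role of the search window W(x)); returns distinct modes."""
    pts = np.asarray(points, float)
    w = np.asarray(weights, float)
    modes = pts.copy()
    for _ in range(iters):
        for i in range(len(modes)):
            # Shift each mode to the weighted mean of votes in its window.
            d = np.linalg.norm(pts - modes[i], axis=1)
            inside = d <= bandwidth
            modes[i] = np.average(pts[inside], axis=0, weights=w[inside])
    # Merge modes that converged to (nearly) the same point.
    merged = []
    for m in modes:
        if all(np.linalg.norm(m - u) >= merge_tol for u in merged):
            merged.append(m)
    return np.array(merged)

def hypothesis_score(mode, points, weights, bandwidth):
    """Eq. (3): sum of vote weights falling inside the window of a mode."""
    d = np.linalg.norm(np.asarray(points, float) - mode, axis=1)
    return float(np.asarray(weights, float)[d <= bandwidth].sum())
```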

After selecting the strongest set of hypotheses for human locations and their postures, contours are initialized by placing the silhouettes associated with posture cluster hn at location x in the video frame. Subsequently, contours are evolved using the level-set based segmentation approach proposed in [13]. Both texture and color features are used in order to guide the segmentation of the humans present in the scene (Figure 4).

Figure 3: Codebook instances (depicted as small colored dots on the contour) vote for the location of humans within the scene.

Figure 4: Given a hypothesis for x and hn, segmentation is initialized with the stored posture cluster silhouette. (a) The posture cluster silhouette to which the subject on the right belongs. (b) Segmentation based only on foreground pixels results in an inaccurate segmentation. (c) When the segmentation is initialized by the posture cluster silhouette and contours are evolved, we obtain improved precision and consistency.

4. EXPERIMENTS AND RESULTS

We evaluated the performance of our system for detecting and segmenting humans on a set of challenging video sequences containing significant amounts of partial occlusion. Furthermore, the videos included in the testing procedure featured humans performing a diverse set of activities within different contexts, such as walking on a busy city street, running a marathon, playing soccer, and participating in a crowded festival.

Our training set ranged from 700 to 1,100 frames from the various video sequences containing human samples. Metadata in the training set included the centroid of each silhouette, along with the posture cluster to which it belonged. Our testing video sequences contained a total of 312 humans for which the torso is visible. The size of the humans across the video sequences averaged 22×52 pixels. Figure 5 shows some examples from the data set.

The quantitative analysis of the method centered around measuring correct detections as well as the reported locations of humans within the scene. We employed the evaluation framework proposed in [4], which consists of three criteria. The first criterion, relative distance, measures the distance between bounding box centers with respect to the size of the ground truth box. The second and third criteria (cover and overlap) measure the common area between the detected bounding box and the ground truth. In our experiments a detection is classified as correct if the deviation in relative distance is less than 25%, and less than 50% for cover and overlap.
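The three criteria can be encoded as a small checker. The (x, y, w, h) box representation, the per-axis normalization of the center distance, and our reading of the thresholds (a deviation of less than 50% for cover and overlap, i.e., both must exceed 50%) are our interpretation of the evaluation protocol of [4], not definitions taken from this paper.

```python
def relative_distance(det, gt):
    """Distance between box centers, normalized per axis by the
    ground-truth box size; boxes are (x, y, w, h)."""
    dx, dy, dw, dh = det
    gx, gy, gw, gh = gt
    rx = abs((dx + dw / 2) - (gx + gw / 2)) / gw
    ry = abs((dy + dh / 2) - (gy + gh / 2)) / gh
    return max(rx, ry)

def cover_and_overlap(det, gt):
    """cover = intersection / gt area; overlap = intersection / det area."""
    dx, dy, dw, dh = det
    gx, gy, gw, gh = gt
    iw = max(0.0, min(dx + dw, gx + gw) - max(dx, gx))
    ih = max(0.0, min(dy + dh, gy + gh) - max(dy, gy))
    inter = iw * ih
    return inter / (gw * gh), inter / (dw * dh)

def is_correct(det, gt):
    """Correct if relative distance < 25% and cover/overlap deviate
    by less than 50% (i.e., both exceed 50%)."""
    cover, overlap = cover_and_overlap(det, gt)
    return relative_distance(det, gt) < 0.25 and cover > 0.5 and overlap > 0.5
```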

In addition to evaluating the detection results of the method, we also assessed the effect of the training set size. This was achieved by varying the number of frames used to learn the spatial distribution of local shapes. Figure 6 depicts the receiver operating characteristic (ROC) curves for the different modes.

Figure 5: Example detections from the testing set are depicted by a colored bounding box and posture-based segmentations.

Our detection results show that the method achieves high recognition rates and performs reliable segmentations, despite the presence of significant occlusions in the scene. On the most challenging video sequence the system achieved an average recognition rate of 75.3%, whereas on the best video we achieved a recognition rate of 94%. The false positive rate was low; most of the erroneous detections were caused by moving vertical structures that resembled the human shape. Given the difficulty of the data set, these results are encouraging.

As demonstrated by Figure 6, the effect of the size of the training set on the performance of the method was minimal. Although an overall increase in performance is observed with a larger training set, a reduction of the training set size does not lead to a drastic decrease in the performance of the method.

5. CONCLUSIONS

We have presented a framework for detecting and segmenting humans in real-world crowded scenes. Detection is performed via a codebook of local shape distributions for humans in various postures. Additionally, a set of learned posture clusters aids the segmentation process. Our experiments indicate that the local shape distribution represents a powerful cue which can be integrated into existing lines of research, ranging from human detection to tracking algorithms.

As future work we plan to extend this framework to include motion information as a cue for inferring postures and thus improve our detection and segmentation results. By sampling the magnitude of a blurred optical-flow field, we hope to learn a distribution of local motion that will aid the process of detecting and segmenting humans.

Figure 6: Recognition performance based on a range of training set sizes.

6. REFERENCES

[1] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24(4):509–522, 2002.

[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. CVPR, 1, 2005.

[3] D. Gavrila. Pedestrian detection from a moving vehicle. ECCV, 2:37–49, 2000.

[4] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. CVPR, 1, 2005.

[5] A. Micilotta, E. Ong, and R. Bowden. Detection and tracking of humans by probabilistic body part assembly. Proc. of British Machine Vision Conference, September 2005.

[6] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. ECCV, 1:69–81, 2004.

[7] D. Ramanan, D. Forsyth, and A. Zisserman. Strike a pose: tracking people by finding stylized poses. CVPR, 1, 2005.

[8] T. Roberts, S. McKenna, and I. Ricketts. Human pose estimation using learnt probabilistic region similarities and partial configurations. ECCV, 4:291–303, 2004.

[9] R. Ronfard, C. Schmid, and B. Triggs. Learning to parse pictures of people. ECCV, 4:700–714, 2002.

[10] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR, 1:511–518, 2001.

[11] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.

[12] Y. Wu and T. Yu. A field model for human detection and tracking. CVPR, 2006.

[13] A. Yilmaz, X. Li, and M. Shah. Contour-based object tracking with occlusion handling in video acquired using mobile cameras. PAMI, 26(11):1531–1536, 2004.

[14] V. Zatsiorsky. Kinetics of Human Motion. Human Kinetics, 2002.

[15] L. Zhao and L. Davis. Closely coupled object detection and segmentation. ICCV, 1, 2005.

[16] L. Zhao and C. Thorpe. Stereo- and neural network-based pedestrian detection. IEEE Transactions on Intelligent Transportation Systems, 1(3):148–154, 2000.