
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 27, NO. 3, MARCH 2018

Body Structure Aware Deep Crowd Counting
Siyu Huang, Xi Li, Zhongfei Zhang, Fei Wu, Shenghua Gao, Rongrong Ji, and Junwei Han

Abstract— Crowd counting is a challenging task, mainly due to the severe occlusions among dense crowds. This paper takes a broader view and addresses crowd counting from the perspective of semantic modeling. In essence, crowd counting is a task of pedestrian semantic analysis involving three key factors: pedestrians, heads, and their context structure. The information of different body parts is an important cue for judging whether there is a person at a certain position. Existing methods usually perform crowd counting by directly modeling the visual properties of either the whole body or the heads only, without explicitly capturing the composite body-part semantic structure information that is crucial for crowd counting. In our approach, we first formulate the key factors of crowd counting as semantic scene models. Then, we convert the crowd counting problem into a multi-task learning problem, such that the semantic scene models are turned into different sub-tasks. Finally, deep convolutional neural networks are used to learn the sub-tasks in a unified scheme. Our approach encodes the semantic nature of crowd counting and provides a novel solution in terms of pedestrian semantic analysis. In experiments, our approach outperforms the state-of-the-art methods on four benchmark crowd counting data sets. The semantic structure information is demonstrated to be an effective cue in crowd counting scenes.

Index Terms— Crowd counting, pedestrian semantic analysis, visual context structure, convolutional neural networks.

Manuscript received December 14, 2016; revised May 15, 2017; accepted August 6, 2017. Date of publication August 15, 2017; date of current version December 5, 2017. This work was supported in part by NSFC under Grant 61672456, Grant U1509206, and Grant 61472353, in part by the Fundamental Research Funds for Central Universities in China, in part by the Zhejiang Provincial Engineering Research Center on media data cloud processing and analysis technologies, in part by the ZJU Converging Media Computing Laboratory, in part by the Key Program of Zhejiang Province under Grant 2015C01027, in part by the National Basic Research Program of China under Grant 2015CB352302, and in part by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Shuicheng Yan. (Corresponding author: Xi Li.)

S. Huang is with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

X. Li is with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China, and also with the Alibaba–Zhejiang University Joint Institute of Frontier Technologies, Hangzhou 310027, China (e-mail: [email protected]).

Z. Zhang is with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China, and also with the Computer Science Department, Watson School, The State University of New York at Binghamton, Binghamton, NY 13902 USA (e-mail: [email protected]).

F. Wu is with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

S. Gao is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail: [email protected]).

R. Ji is with the School of Information Science and Engineering, Xiamen University, Xiamen 361005, China (e-mail: [email protected]).

J. Han is with the School of Automation, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2017.2740160

I. INTRODUCTION

CROWD counting is a challenging task of accurately counting the crowds in dense scenes. It has drawn much attention from researchers because of a series of practical demands, including crowd control and public safety. The top left of Fig. 1 illustrates a common crowd scene. The occlusions among people are severe, and the perspective distortions vary significantly across different areas. In addition, the crowd distributions are visually diverse. These difficulties have restricted the performance of existing crowd counting methods.

In principle, crowd counting calls for pedestrian semantic analysis involving three key factors: pedestrians, heads, and their context structure. Most existing methods focus on modelling the visual properties of either the whole pedestrians or the heads only, while ignoring the context structure of the different body parts, which is also significant for counting crowds. For instance, when we humans count pedestrians, we naturally use the composite body-part semantic structure information as an auxiliary cue to judge whether a head we see really belongs to a pedestrian at that position or to something else. This indicates that the semantic structures of pedestrians can provide abundant information for recognizing pedestrians. However, many existing detection-based crowd counting methods [1], [2] model pedestrians by constructing pedestrian detectors or head-shoulder detectors, which are limited by the severe occlusions in dense crowds. In more recent literature [3], [4], researchers focus on modelling the density distributions of pedestrians, while ignoring the body-part semantic structure information that is essential to human cognition.

Motivated by the above observations, we address the crowd counting problem from the viewpoint of semantic modelling in this work. The three key factors of crowd counting, including pedestrians, heads, and their context structure, are formulated as two types of semantic scene models. The first semantic scene model is denoted as the body part map in this paper. It models the visual appearance and context structure of pedestrian body parts. In the body part map, different pedestrian body parts are formulated as different semantic categories, and the spatial context structure of the different parts is also encoded into the map. Fig. 1 provides an intuitive illustration of our approach. The body part map is created based on the single pedestrian parsing model [5], a pre-trained neural network that calculates the semantic segmentation mask of an input pedestrian image. We then merge the segmentation masks of all the pedestrians to create the body part map. The top right of Fig. 1 shows the body part map, with the areas of head, body, and legs highlighted respectively in light blue, yellow, and red.

1057-7149 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


Fig. 1. Brief illustration of our approach. We build semantic scene models, including the body part map and the structured density map, to encode the semantic nature of a crowd scene. The crowd count is estimated based on the two semantic scene models.

The second semantic scene model is denoted as the structured density map in this paper. The conventional density maps in existing works [3], [6] are proposed to model the density distributions of crowds, while the shapes of individual pedestrians are ignored. Motivated by this, we create the structured density map according to the specific shapes of individual pedestrians, which are provided by the body part map. As an improvement over the conventional density map, the structured density map aims to model more fine-grained semantic structure information and can thus provide more accurate pixel-wise labels. As illustrated in the bottom left of Fig. 1, the structured density map encodes the density information of crowds while preserving the shapes of specific pedestrians. In summary, the two semantic scene models, the body part map and the structured density map, are proposed to encode the semantic nature of crowd counting, and they recover rich semantic structure information from crowd images.

For the purpose of accurately estimating the count of pedestrians, we reformulate crowd counting as a multi-task learning problem. There are three sub-tasks: inferring the two types of semantic scene models and estimating the crowd count. We build deep convolutional neural networks (CNNs) to jointly learn these sub-tasks. The CNNs first model the mappings from the scene image to the semantic scene models, including the body part map and the structured density map, and then calculate the crowd count based on them. The CNNs are able to extract powerful visual representations from images. Feature extraction and the multi-task crowd counting problem are thus addressed in a unified scheme.

We summarize our main contributions as follows:

1) We provide a novel solution for crowd counting in terms of pedestrian semantic analysis. We formulate three key factors of crowd counting and model them as two types of semantic scene models. The models recover rich semantic structure information from images and are effective in learning our crowd counting framework.

2) We reformulate the crowd counting problem as a multi-task learning problem such that the semantic scene models are converted into its sub-tasks. We present a unified framework to jointly learn these sub-tasks based on CNNs. Experiments show that our method achieves better results compared to the state-of-the-art methods.

II. RELATED WORK

We introduce the literature related to our work in this section. We first discuss the crowd counting methods proposed in existing works. We then discuss several related works on pedestrian semantic analysis, as we address the crowd counting problem from the viewpoint of pedestrian semantic modelling in this paper. In addition, we introduce the background of CNNs, as our crowd counting framework is built upon deep neural networks.

A. Crowd Counting

In general, most methods for crowd counting can be grouped into three categories: (1) detection-based, (2) global regression, and (3) density estimation. The earlier crowd counting literature proposes detection-based methods [7]–[9] to model the semantic structure of pedestrians. Various kinds of detectors are employed to match individual pedestrians in images. Li et al. [10] use a HOG based head-shoulder detector to detect heads within foreground areas. Wu and Nevatia [1] detect local human body parts by part detectors and combine their responses to form people detections. Lin and Davis [2] learn a generic human detector by hierarchically matching a part-template tree to images. The detection-based methods perform better in relatively low-density scenes, but they are limited by the heavy occlusions in dense crowds. Different from them, we model the semantic structure of pedestrians as semantic scene models, which are more robust to learn in crowded scenes and more suitable for a deep learning based framework.

To overcome the difficulties of detection-based methods in highly dense scenes, researchers take a different route and propose global regression based methods that learn the mapping between low-level features and pedestrian counts. These methods are more suitable for crowded environments than the detection-based approaches. Diverse kinds of low-level features are employed, including textures [11]–[13], edge information [13], [14], and segment shape [12], [15], [16]. In addition, regression algorithms including linear regression [17], Bayesian regression [13], ridge regression [12], and Gaussian process regression [15], [18] are commonly used. The global regression based methods only utilize the information of pedestrian counts, while the spatial information and body structure information of pedestrians are ignored.

To model the spatial information of pedestrians, researchers [6], [19]–[21] formulate the latent density distributions of crowds as an intermediate ground truth, namely, density estimation based crowd counting. Lempitsky and Zisserman [6] first generate the density map based on the annotated points with a 2D Gaussian kernel and learn a linear regression function between the scene image and the density map. Following their work, other density estimation methods including random forests [22], [23] and deep neural networks [3], [4], [24], [25] are proposed. These methods demonstrate good performance on crowd counting. However, from the perspective of semantic modelling, the body-part structures of individual pedestrians are ignored in these approaches. In this work, we focus on analyzing the semantic nature of crowd counting. We build semantic scene models by recovering rich semantic structure information from images and take them as novel supervised labels for crowd counting.

B. Pedestrian Semantic Analysis

The semantic analysis of pedestrians is an important prerequisite for many practical applications of intelligent surveillance systems operating in real world environments, including several typical computer vision topics such as pedestrian detection [26]–[28], pedestrian parsing [5], [29], and crowd segmentation [30], [31]. In recent years, some high-level tasks of pedestrian analysis have also drawn much attention from researchers, including action recognition [32], [33], crowd attribute analysis [34], and pedestrian path prediction [35], [36]. What these approaches have in common is that they learn and model different aspects of the semantic structure prior of pedestrians.

In this work, we address the crowd counting problem by focusing on pedestrian semantic analysis, because the visual cues of pedestrian body-part appearance can provide abundant information for recognizing crowds. The success of parts-based methods [1], [2] on pedestrian detection also demonstrates this idea. Rather than directly detecting the holistic pedestrian, the parts-based methods utilize the information of pedestrian body structure and are able to handle occlusions more robustly. Different from the conventional parts-based methods, we formulate the body-part semantic structure of pedestrians as semantic scene models in our approach, which are more suitable for learning under a deep neural network framework and are more effective and robust in densely crowded scenes.

C. Convolutional Neural Networks

Our crowd counting framework is built upon CNNs. CNNs are a popular and leading visual representation technique, as they are able to learn powerful and interpretable visual representations [37]–[42]. Specifically, we use fully convolutional networks (FCNs) to learn the semantic scene models proposed in our approach. As a kind of CNN architecture, FCNs are end-to-end models for pixelwise problems. They have achieved state-of-the-art performance on many scene analysis tasks, including scene parsing [43]–[45], crowd segmentation [31], and actionness estimation [46]. For crowd counting, Zhang et al. [4] propose a multi-column FCN to map the crowd image to the density map. Their models are adaptive to the variations in pedestrian size and achieve state-of-the-art performance on the benchmark datasets.

III. OUR APPROACH

A. Problem Formulation

In this work, we aim to address the problem of single image crowd counting. Given a crowd image X, our goal is to estimate the pedestrian number C in the image. This can be formulated as a mapping $X \xrightarrow{F} C$. From the perspective of semantic modelling, we reformulate the original problem as a multi-task learning problem that contains three sub-tasks: the inference of two semantic scene models and the estimation of the pedestrian number. The first semantic scene model is the body part map B, which is built to model the body-part semantic structures of pedestrians. The second is the structured density map D, which is built to model the density distributions and shapes of pedestrians. Both models are data-dependent: they respectively encode different semantic attributes of a crowd image. To address the multi-task learning problem, we build CNNs to jointly learn the three sub-tasks in a unified framework. The learning process can be written as $X \xrightarrow{F_1} (B, D) \xrightarrow{F_2} C$, such that B and D are used as auxiliary ground truths to better estimate C. For reading convenience, we summarize the important notations used in this paper in Table I. We discuss our approach in more detail in the following subsections.

TABLE I: The detailed description of the variables.

B. Body Part Map

As one of the semantic scene models, the body part map B is proposed to model the body-part semantic structures of individual pedestrians, which can serve as an important cue for judging whether there is a person at a certain location. We introduce it into our framework as a novel supervised label to address the difficulties of the crowd counting problem.

The body part map B is generated from the given scene image X, the perspective map M, and the head point locations P_h^i. First, we obtain single pedestrian images. Due to perspective distortions, we use the perspective map M (given in the datasets) to normalize the scales of pedestrians. The pixel value M(p) denotes the number of pixels in the image representing one meter at location p in the actual scene. With the head location P_h = (P_h^x, P_h^y) of a person, the top left corner P_tl and bottom right corner P_br of the person's bounding box are estimated as

$$P_{tl} = \big(P_h^x - \alpha_1 M(P_h),\ P_h^y - \beta_1 M(P_h)\big),$$
$$P_{br} = \big(P_h^x + \alpha_2 M(P_h),\ P_h^y + \beta_2 M(P_h)\big), \tag{1}$$

where the parameters are manually set as α1 = 0.5, α2 = 0.5, β1 = 0.25, β2 = 1.75 in the experiments to best approximate the actual situations, such that the width of the bounding box corresponds to α1 + α2 = 1 meter and its height to β1 + β2 = 2 meters in the actual scene.

Fig. 2. Illustration of the semantic scene models. (a) is the scene image; the red points denote the annotations of pedestrian heads. (b) is the body part map. (c) is the conventional density map created by 2D Gaussian kernels only. (d) is the structured density map generated based on (b) and (c), modelling both the density information and the shape information of individual pedestrians.
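To make the bounding box computation of Eq. (1) concrete, here is a minimal sketch; the function name pedestrian_bbox and the array layout of the perspective map are illustrative assumptions, not part of the paper's released code.

```python
import numpy as np

# Hypothetical sketch of Eq. (1). `head` is an annotated head point (x, y);
# `M` is the perspective map, assumed indexed as M[y, x] and giving the
# number of pixels that represent one meter at that location.
ALPHA1, ALPHA2 = 0.5, 0.5    # box width: alpha1 + alpha2 = 1 meter
BETA1, BETA2 = 0.25, 1.75    # box height: beta1 + beta2 = 2 meters

def pedestrian_bbox(head, M):
    x, y = head
    m = M[int(y), int(x)]                  # pixels per meter at the head
    top_left = (x - ALPHA1 * m, y - BETA1 * m)
    bottom_right = (x + ALPHA2 * m, y + BETA2 * m)
    return top_left, bottom_right
```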

After obtaining the pedestrian images, we normalize them to the same size and input them into the single pedestrian parsing model [5] to calculate their semantic segmentation masks. The pedestrian parsing model uses a deep neural network to parse a single pedestrian image into several semantic regions, including hair, head, body, legs, and feet. The model is pre-trained; we only use it to generate the semantic segmentation of pedestrians. We merge the hair regions into head, and likewise merge the feet regions into legs. Finally, we resize the semantic masks of individual pedestrians to their original sizes to create the body part map B. Fig. 2(a) shows a crowded scene image, and Fig. 2(b) shows the corresponding body part map. The colors light blue, yellow, red, and dark blue respectively denote the areas of head, body, legs, and background. The body part map, containing labelled pixels of four categories, models the semantic structure of every individual pedestrian in the scene image. These maps are then used to learn our crowd counting framework, as discussed in subsection III-D.
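The assembly of the body part map can be sketched as below. Here parse_pedestrian stands in for the pre-trained single pedestrian parsing model of [5] and is assumed to return a label mask over {0: background, 1: head, 2: body, 3: legs} after the hair/feet merging described above; pedestrian_bbox is the hypothetical helper from the previous sketch.

```python
import numpy as np

def resize_nearest(mask, shape):
    # plain nearest-neighbor resize via index mapping
    h, w = shape
    ys = (np.arange(h) * mask.shape[0] / h).astype(int)
    xs = (np.arange(w) * mask.shape[1] / w).astype(int)
    return mask[ys][:, xs]

def build_body_part_map(image, heads, M, parse_pedestrian):
    # B holds labels 0 (background), 1 (head), 2 (body), 3 (legs)
    H, W = image.shape[:2]
    B = np.zeros((H, W), dtype=np.uint8)
    for head in heads:
        (x0, y0), (x1, y1) = pedestrian_bbox(head, M)
        x0, y0 = max(int(x0), 0), max(int(y0), 0)
        x1, y1 = min(int(x1), W), min(int(y1), H)
        if x1 <= x0 or y1 <= y0:
            continue
        mask = parse_pedestrian(image[y0:y1, x0:x1])     # normalized-size mask
        mask = resize_nearest(mask, (y1 - y0, x1 - x0))  # back to crop size
        region = B[y0:y1, x0:x1]
        region[mask > 0] = mask[mask > 0]                # paste foreground labels
    return B
```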

C. Structured Density Map

The structured density map is proposed to capture both the density distributions and the shapes of pedestrians. Different from the existing crowd counting methods [3], [4], [6], it is data-dependent in our approach: it is generated according to the specific shapes of individual pedestrians.

We first discuss the conventional density map D_N proposed in existing works [3]. It is usually created as a sum of 2D Gaussian kernels centered on the locations of pedestrians:

$$D_N(p) = \sum_{i=1}^{C} \frac{1}{\|Z\|}\Big(\mathcal{N}_h^i(p;\, P_h^i, \sigma_h^i) + \mathcal{N}_b^i(p;\, P_b^i, \sigma_b^i)\Big), \tag{2}$$

where N_h is a standard 2D Gaussian kernel modelling the head part of a pedestrian and N_b is a bivariate 2D Gaussian kernel modelling the body part. p is the location of a pixel on D_N, and P_h^i and P_b^i are respectively the head and body locations of the i-th person. In order to approximate the sizes of head and body in the actual scene, we manually set the variance σ_h of kernel N_h as σ_h = 0.25 M(P_h), and set the variances of kernel N_b as σ_bx = 0.25 M(P_b) and σ_by = M(P_b). The body location P_b is set as P_b = P_h + 0.8 M(P_h). ‖Z‖ is the normalization factor which normalizes the sum of density values for each person to 1, such that the sum of density values over all persons equals the count C. Fig. 2(c) shows the density map D_N corresponding to the scene image in Fig. 2(a).

Since D_N cannot well model the specific shapes of individual pedestrians, we further propose the structured density map D:

$$D(p) = \sum_{i=1}^{C} \frac{1}{\|Z\|}\Big(\mathcal{N}_h^i(p;\, P_h^i, \sigma_h^i) + \mathcal{N}_b^i(p;\, P_b^i, \sigma_b^i)\Big)\cdot B_m(p). \tag{3}$$

The pedestrian mask B_m characterizes the shape of each pedestrian. It is obtained by binarizing the body part map B, where the pixel values of foreground and background are respectively set to 1 and 0. The structured density map D is calculated by the element-wise multiplication of D_N and B_m, followed by normalization. Fig. 2(d) shows the structured density map generated by our approach. Compared to the conventional density map in Fig. 2(c), we can see that the structured density map not only captures the latent density distributions of crowds but also maintains the specific shape of every individual pedestrian.
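Eq. (3) then amounts to masking and re-normalizing. The sketch below uses a single global re-normalization that preserves the total count, whereas the paper normalizes each person's mass individually; the names are illustrative.

```python
import numpy as np

def structured_density_map(D_N, B, count):
    Bm = (B > 0).astype(np.float64)   # binarized body part map: foreground = 1
    D = D_N * Bm                      # element-wise product of Eq. (3)
    return D * (count / D.sum())      # re-normalize so the map sums to the count
```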

D. Multi-Task Crowd Counting Framework

To estimate the accurate pedestrian number C, we reformulate the original crowd counting problem as a multi-task learning problem including three sub-tasks: the inference of the semantic scene models B and D, and the estimation of the pedestrian number C. To jointly learn these three sub-tasks, we propose a unified multi-task learning framework based on CNNs. See Fig. 3 for an illustration of our framework. We take a patch-wise strategy in which the input of the networks is an image patch x cropped from the scene image X, where x is constrained to cover a 3-meter by 3-meter square in the actual scene according to the perspective map M. In the conventional neural network based density estimation method [3], there is only one stream mapping the scene patch x to the density map d: $x \xrightarrow{F_d} d$, as illustrated in the blue blocks of Fig. 3. In our multi-task learning framework, we add another stream to learn the body part map b of patch x, as illustrated in the orange blocks of Fig. 3. The outputs of the two streams are concatenated together and mapped to the pedestrian count c by fully connected layers, as illustrated in the green blocks. As a whole, our crowd counting framework can be written as $x \xrightarrow{F_1} (b, d) \xrightarrow{F_2} c$. Three supervised labels b, d, and c jointly train our model.

Fig. 3. Illustration of the proposed networks. The image patches are cropped from the scene image and input into the CNNs. The convolutional layers denoted in blue blocks are built to infer the density distributions; they are trained under the structured density map d with a Euclidean loss. The layers denoted in orange blocks are built to infer the pedestrian body-part structures; they are trained under the body part map b with a Softmax loss. Finally, the crowd count c is regressed from the two types of maps with fully connected layers, as denoted in green blocks. Best viewed in color.

We now discuss the architectures of our networks in more detail. The networks between patch x and density map d are fully convolutional networks (FCNs) containing 7 convolutional layers. The architecture is convd1(9, 32) − poold1(2) − LRN − convd2(5, 64) − poold2(2) − LRN − convd3(5, 64) − convd4(3, 128) − convd5(3, 128) − convd6(1, 256) − convd7(1, 1), where 'conv' denotes a convolutional layer, 'pool' a max-pooling layer, and 'LRN' a local response normalization layer. The numbers in parentheses are respectively the kernel size and the number of channels. Each convolutional layer is followed by a ReLU activation function and is padded so that its output keeps the same size as its input. The Euclidean loss is used to measure the difference between the estimated density map d̂ and the ground truth d:

$$L_d = \|\hat{d} - d\|_2^2. \tag{4}$$
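The density stream can be transcribed as follows. The paper's implementation is in Caffe; this is a PyTorch re-implementation sketch, and the LRN window size is an assumption since the text does not give it.

```python
import torch.nn as nn

def conv_relu(cin, cout, k):
    # 'same'-padded convolution followed by ReLU, as described in the text
    return nn.Sequential(nn.Conv2d(cin, cout, k, padding=k // 2),
                         nn.ReLU(inplace=True))

class DensityStream(nn.Module):
    # convd1(9,32)-pool(2)-LRN-convd2(5,64)-pool(2)-LRN-convd3(5,64)
    # -convd4(3,128)-convd5(3,128)-convd6(1,256)-convd7(1,1)
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_relu(3, 32, 9), nn.MaxPool2d(2), nn.LocalResponseNorm(5),
            conv_relu(32, 64, 5), nn.MaxPool2d(2), nn.LocalResponseNorm(5),
            conv_relu(64, 64, 5),
            conv_relu(64, 128, 3), conv_relu(128, 128, 3),
            conv_relu(128, 256, 1),
            nn.Conv2d(256, 1, 1))

    def forward(self, x):        # x: (N, 3, 128, 128) patches
        return self.net(x)       # estimated density map: (N, 1, 32, 32)
```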

The networks between patch x and body part map b are also FCNs, containing 9 convolutional layers. The architecture is convb1(9, 32) − poolb1(2) − LRN − convb2(9, 32) − convb3(5, 64) − poolb3(2) − LRN − convb4(5, 64) − convb5(5, 64) − convb6(3, 128) − convb7(3, 128) − convb8(1, 256) − convb9(1, 1). We use a sum of Softmax losses over all 32×32 positions to measure the difference between the estimated body part map b̂ and the ground truth b:

$$L_b = -\sum_{h=1}^{32}\sum_{w=1}^{32} \log \frac{\exp\big(\hat{b}(h, w)\big)}{\sum_{i=1}^{4}\exp\big(\hat{b}(h, w, i)\big)}, \tag{5}$$

where b̂(h, w) stands for the output of the convb9 layer at spatial position (h, w) in the channel of the ground truth category, and b̂(h, w, i) is the output of the convb9 layer at position (h, w) in the i-th channel.
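The body part stream and the per-position Softmax loss of Eq. (5) can be sketched in the same style, reusing the conv_relu helper from the previous sketch. We read convb9 as a 1×1 convolution producing the four class channels (head, body, legs, background), an assumption consistent with the 4-way softmax in Eq. (5).

```python
import torch.nn as nn
import torch.nn.functional as F

class BodyPartStream(nn.Module):
    # convb1(9,32)-pool(2)-LRN-convb2(9,32)-convb3(5,64)-pool(2)-LRN
    # -convb4(5,64)-convb5(5,64)-convb6(3,128)-convb7(3,128)-convb8(1,256)-convb9
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_relu(3, 32, 9), nn.MaxPool2d(2), nn.LocalResponseNorm(5),
            conv_relu(32, 32, 9),
            conv_relu(32, 64, 5), nn.MaxPool2d(2), nn.LocalResponseNorm(5),
            conv_relu(64, 64, 5), conv_relu(64, 64, 5),
            conv_relu(64, 128, 3), conv_relu(128, 128, 3),
            conv_relu(128, 256, 1),
            nn.Conv2d(256, 4, 1))

    def forward(self, x):        # x: (N, 3, 128, 128)
        return self.net(x)       # class scores: (N, 4, 32, 32)

def body_part_loss(scores, target):
    # Eq. (5): cross entropy summed over the 32x32 grid; `target` holds the
    # ground truth category index in {0..3} at every position
    return F.cross_entropy(scores, target, reduction='sum')
```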

The two outputs d̂ and b̂ are concatenated and fed to the fully connected layers fc1(512) − fc2(128) − fc3(1) for estimating the count c, where 'fc' denotes a fully connected layer. We use the Euclidean loss to measure the difference between the estimated count ĉ and the ground truth c:

$$L_c = (\hat{c} - c)^2. \tag{6}$$
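The fully connected regressor can be sketched as below; the flattened input size assumes the two 32×32 maps (one density channel plus four body-part channels) are concatenated directly, which the paper does not spell out.

```python
import torch
import torch.nn as nn

class CountHead(nn.Module):
    # fc1(512) - fc2(128) - fc3(1) on the concatenated, flattened maps
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear((1 + 4) * 32 * 32, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1))

    def forward(self, d_hat, b_hat):
        # d_hat: (N, 1, 32, 32), b_hat: (N, 4, 32, 32) -> count: (N, 1)
        return self.fc(torch.cat([d_hat, b_hat], dim=1))
```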

The three loss functions L_d, L_b, and L_c are combined into a joint multi-task loss L:

$$L = L_c + \lambda_d L_d + \lambda_b L_b, \tag{7}$$

where λd and λb are loss weights, respectively set to 10 and 1 in the experiments. The whole network is jointly trained under L by back propagation and stochastic gradient descent.
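Putting Eqs. (4)–(7) together, a joint loss sketch with the weights from the text might read:

```python
import torch.nn.functional as F

LAMBDA_D, LAMBDA_B = 10.0, 1.0   # loss weights from the text

def multitask_loss(d_hat, d, b_scores, b, c_hat, c):
    loss_d = ((d_hat - d) ** 2).sum()                       # Eq. (4)
    loss_b = F.cross_entropy(b_scores, b, reduction='sum')  # Eq. (5)
    loss_c = ((c_hat - c) ** 2).sum()                       # Eq. (6)
    return loss_c + LAMBDA_D * loss_d + LAMBDA_B * loss_b   # Eq. (7)
```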

Finally, the count Ĉ of an entire image is the sum of all the patch counts ĉ of the image. Because Ĉ is composed of local count information from different areas, we further use the ground truth image count C to fine-tune the count Ĉ predicted by our neural networks:

$$C = \hat{C}\omega + \varepsilon. \tag{8}$$

We employ linear regression for this fine-tuning. The regression coefficients ω and ε are estimated from the training data. In the testing phase, we use ω and ε to fine-tune the pedestrian count estimated by the neural networks.
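The fine-tuning step of Eq. (8) is an ordinary one-variable least squares fit; a sketch under illustrative names:

```python
import numpy as np

def fit_finetune(pred_counts, true_counts):
    # Fit C = C_hat * omega + eps over the training images by least squares
    A = np.stack([pred_counts, np.ones_like(pred_counts)], axis=1)
    sol, *_ = np.linalg.lstsq(A, true_counts, rcond=None)
    omega, eps = sol
    return omega, eps

def apply_finetune(pred_count, omega, eps):
    return pred_count * omega + eps   # Eq. (8), used at test time
```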

IV. EXPERIMENTS

A. Experimental Setup

We give the details of the networks, the datasets, and the evaluation metrics in the following.

1) Networks: We use the popular Caffe toolbox [47] to implement the proposed deep convolutional neural networks. Due to the effect of vanishing gradients in deep neural networks, it is not easy to learn all the parameters simultaneously. We therefore use a trick in the training phase: we first separately pre-train the two CNN streams mapping the patch to the two maps, and then use the pre-trained weights to initialize the entire network and fine-tune all the parameters simultaneously.
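Under the hypothetical modules sketched in Section III-D, this two-stage schedule could be outlined as follows; the paper's actual implementation is in Caffe, so this PyTorch outline is only illustrative and depends on the earlier sketches.

```python
import torch

# Reuses the hypothetical DensityStream, BodyPartStream, and CountHead
# sketches from Section III-D.
density, body, head = DensityStream(), BodyPartStream(), CountHead()

# Stage 1: pre-train the two streams separately on (x, d) and (x, b) pairs
# (the text uses 100K iterations each, lr 1e-4 for the density stream and
# lr 1e-5 for the body part stream).

# Stage 2: keep the pre-trained weights and jointly fine-tune all parameters
# under the multi-task loss L (300K iterations, lr 1e-5).
params = (list(density.parameters()) + list(body.parameters())
          + list(head.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-5)
```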


TABLE II: Summarization of the four datasets.

Fig. 4. Example frames of (a) UCSD dataset, (b) UCF_CC_50 dataset, (c) WorldExpo’10 dataset, and (d) Shanghaitech-B dataset.

TABLE III: Mean absolute errors (MAE) on the WorldExpo'10 dataset.

The network from patch to body part map is trained for 100K iterations with a batch size of 100 and a learning rate of 10^-5. The network from patch to structured density map is trained for 100K iterations with a batch size of 100 and a learning rate of 10^-4. Finally, the entire network is initialized with these pre-trained weights and trained for 300K iterations with a batch size of 40 and a learning rate of 10^-5. The input image patches are uniformly resized to 128×128. For a fair comparison with other crowd counting methods, we do not use pre-trained weights from other deep learning models.

2) Datasets: We evaluate our method on four benchmark datasets: the WorldExpo'10 dataset [3], the Shanghaitech-B dataset [4], the UCSD dataset [15], and the UCF_CC_50 dataset [48]. The details of the four datasets are summarized in Table II, where Scenes is the number of scenes; Frames is the number of frames; Resolution is the resolution of images; FPS is the number of frames per second; Counts is the minimum and maximum numbers of people in the ROI of a frame; Average is the average pedestrian count; and Total is the total number of labeled pedestrians. Fig. 4 shows example frames of the four datasets. The scenes, crowd densities, crowd distributions, and perspective distortions vary significantly among these datasets, such that they can be used to comprehensively evaluate crowd counting methods.

3) Evaluation Metric: Following the convention of existing works [3], [4] on crowd counting, we use the mean absolute error (MAE) and the mean squared error (MSE) to evaluate the performance of different crowd counting methods:

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\big|C_i - \hat{C}_i\big|, \tag{9}$$

$$\text{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(C_i - \hat{C}_i\big)^2}, \tag{10}$$

where N is the number of test images, and Ĉ_i and C_i are respectively the estimated and ground truth people counts in the i-th image.
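As evaluation helpers, Eqs. (9) and (10) translate directly (note that Eq. (10) takes the root of the mean squared error):

```python
import numpy as np

def mae(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return np.mean(np.abs(pred - gt))            # Eq. (9)

def mse(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return np.sqrt(np.mean((pred - gt) ** 2))    # Eq. (10)
```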

B. WorldExpo’10 Dataset

The WorldExpo'10 dataset [3] contains 1132 annotated video sequences captured by 108 surveillance cameras, all from the Shanghai 2010 WorldExpo. This dataset provides a total of 199,923 annotated pedestrians, labeled at the centers of their heads, in 3980 frames. The testing set includes five video sequences of different scenes, and each video sequence contains 120 labeled frames. The regions of interest (ROI) and perspective maps are provided for both the training and test scenes.

Fig. 5. Qualitative results of our method on the test scenes of the WorldExpo'10 dataset, including (a) scene images, (b) body part maps, (c) density maps, and (d) head density maps. These maps model the body-part structures and the density distributions of crowds.

Table III reports the MAE of different methods on the WorldExpo'10 dataset. The results of the LBP features based ridge regression (RR) method are listed in the top row. Zhang et al. [3] propose the Crowd CNN model to estimate the density maps and crowd counts of image patches based on deep neural networks; the results of their model are listed in the second row. The third row lists the results of the Crowd CNN model with the scene-specific fine-tuning technique, which utilizes the information of the test scenes. Zhang et al. [4] propose a Multi-column CNN (MCNN) model which uses filters of different sizes to estimate a geometry-adaptive density map; the results of their model are listed in the fourth row. The last row lists the results of our method. Our method achieves the best performance in terms of average MAE. In scenes 3, 4, and 5, our method achieves the best performance compared with the other methods, which indicates that the semantic structure information modelled by our method is effective in different scenes. In scene 2, the performance of our method is relatively worse, mainly because the crowds in this scene are extremely dense within small areas, so that there are very few body-part semantic cues. As a comparison, the fifth row lists the results of our method using the conventional density map instead of the structured density map. The introduction of the structured density map yields an 11% improvement over the conventional density map for our model.

Fig. 5 shows some qualitative results of our method on the test scenes of the WorldExpo'10 dataset. From left to right, the figures are respectively the (a) scene images, (b) body part maps, (c) density maps, and (d) head density maps. In the body part maps, the colors light blue, green, yellow, and dark blue respectively denote the regions of head, body, legs, and background. The body part maps show that our method can detect precise pedestrian body parts even under the heavy occlusions among crowds. The density maps model the density distributions of pedestrians. The head density maps are inferred from the body part maps and density maps, indicating that our method is also able to predict precise locations and densities of pedestrian heads. Accurate pedestrian counts are thus estimated based on these effective body part maps and density maps.

TABLE IV: Comparison of different methods on the Shanghaitech-B dataset.

C. Shanghaitech-B Dataset

The Shanghaitech-B dataset is part of the Shanghaitech dataset, which was first introduced by Zhang et al. [4]. It contains 716 annotated images taken by different cameras on the busy streets of metropolitan areas in Shanghai. This dataset has a total of 88,488 annotated pedestrians, labeled at the centers of their heads. 400 images are used for training and 316 images for testing. Because perspective maps are not provided and the perspective distortions are similar across scenes, we manually create a single perspective map which is used for all the images.

Table IV reports the MAE and MSE of different methods on the Shanghaitech-B dataset. Following the convention of Zhang et al. [4], we compare our method with the LBP+RR method, the Crowd CNN method [3], and the MCNN method [4]. MCNN-CCR is the MCNN model trained without the ground truth density map. Our method outperforms the state-of-the-art methods by large margins in terms of both MAE and MSE, indicating that it has a good generalization capability across many different scenes. Compared to MCNN-CCR, which is based on pedestrian count regression, the MCNN method performs much better because it preserves more density information of the image. Likewise, our method achieves better performance than MCNN because we further capture the pedestrian body-part structure information to help improve the counting accuracy.

D. UCSD Dataset

The UCSD dataset [15] contains 2000 frames of a single scene. The video in this dataset is recorded at 10 fps with a frame size of 158×238. The crowd density of this dataset is relatively low: there are only about 25 persons on average in each frame. The annotations of pedestrian head locations and the ROI are provided. Following the convention of the existing works [4], [15], we use frames 601-1400 as the training data and the remaining 1200 frames as the test data. Since the perspective map is not provided in this dataset and the perspective distortions of the scene are not severe, we fix the perspective values of all pixels to the same value.

TABLE V: Comparison of different methods on the UCSD dataset.

TABLE VI: Comparison of different methods on the UCF_CC_50 dataset.

Table V reports the MAE and MSE of our method and of other methods [3], [4], [12], [15], [49], [50]. Four regression methods based on hand-crafted features are compared in Table V, including kernel ridge regression [49], multi-output ridge regression [12], Gaussian process regression [15], and cumulative attribute regression [50]. Results of the CNN based density estimation methods [3], [4] are also listed in Table V. Our method outperforms both the regression based methods and the CNN based methods in terms of MAE. The MSE of our method is slightly larger than that of the MCNN method, mainly because the multi-size filters proposed in their method are more robust on datasets without annotated perspective values. Except for the MCNN method, the other methods, including ours, do not specifically handle the perspective distortions. Our method outperforms these methods by large margins in terms of both metrics, because there is abundant pedestrian body-part information in relatively low-density scenarios. This also demonstrates the effectiveness of the semantic scene models proposed in this paper.

E. UCF_CC_50 Dataset

The UCF_CC_50 dataset [48] contains 50 images of different scenes. It is very challenging because of not only the limited number of images but also the extremely dense crowds in the images. Following the conventional setting [48], we split the dataset randomly and perform 5-fold cross validation. Because of the limited number of training samples, we randomly crop 1000 patches from each image for training. The perspective values are fixed, as perspective maps are not provided.

Table VI reports the MAE and MSE of our method and of other methods [3], [4], [6], [19], [48]. Rodriguez et al. [19] propose density map estimation to improve head detection performance in crowd scenes. Lempitsky and Zisserman [6] learn a density regression model based on dense SIFT features and the MESA distance. Idrees et al. [48] estimate the crowd counts based on multi-source features. The deep learning based approach [3] is also evaluated on this dataset. Our method performs the best in terms of MAE, indicating that the semantic structure information is still effective in extremely dense scenes. In addition, we also evaluate the performance of our model without the global fine-tuning operation described by Eq. 8 in subsection III-D. The results show that the global fine-tuning operation helps improve the performance of patch-wise crowd counting models.

Fig. 6. Comparison of different ground truths on the WorldExpo'10 dataset.

F. The Effectiveness of Different Supervised Labels

There are three different ground truth supervised labels used in this work: the structured density map D, the body part map B, and the pedestrian number C. We compare their effectiveness for crowd counting, as shown in Fig. 6. We evenly group the training images and testing images of the WorldExpo'10 dataset into 10 groups according to increasing pedestrian number. The vertical axis denotes the MAE in each group. The black curve represents the direct regression network trained with C only. The purple curve represents the network trained with the conventional density map D_N and C. The blue curve represents the network trained with the structured density map D and C. The green curve represents the network trained with B and C. The red curve represents our whole model, which is trained with all three ground truths B, D, and C.

The whole model performs the best on both the training set and the test set. This indicates that jointly modelling different semantic attributes of crowd images provides an effective and robust solution for crowd counting. The direct regression network performs well on the training set but the worst on the test set. This is in line with our intuition that a crowd counting model trained with only the count information lacks generalization capability. On the test set, the body part map network performs better than the direct regression network, indicating that the body part map is an effective supervised label for crowd counting. However, it performs slightly worse than the density map networks, indicating that the density map may make a bigger contribution to our framework than the body part map. In addition, the structured density map performs better than the conventional density map in most cases, indicating that the structured density map can help improve the crowd counting performance.

In addition, Fig. 7 shows more qualitative results of our crowd counting models trained with different supervised labels. From left to right, the figures are respectively the scene images, the ground truth body part maps, the body part maps inferred by our model, and the pedestrian numbers estimated by the different models. The scene images are from the test set of the WorldExpo'10 dataset. The body part maps inferred by our model can well capture the body-part semantic structures of pedestrians in scenes of different crowd densities and diverse crowd distributions, thus helping estimate accurate pedestrian numbers.

Fig. 7. From left to right: scene images, ground truth body part maps, body part maps generated by our method, and estimated pedestrian numbers. Best viewed in color.

V. CONCLUSION

In this paper, we have presented a novel approach to accurately estimating the count of crowds in images. Our approach focuses on discovering the semantic nature of crowd counting. We have built two semantic scene models to recover rich semantic structure information from images. In addition, we have reformulated the crowd counting problem as a multi-task learning problem such that the semantic scene models are turned into different sub-tasks. We have built CNNs to jointly learn these sub-tasks in a unified scheme. In experiments, our approach has achieved better performance compared to the state-of-the-art methods on four benchmark datasets.

REFERENCES

[1] B. Wu and R. Nevatia, "Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors," in Proc. IEEE ICCV, vol. 1, Oct. 2005, pp. 90–97.

[2] Z. Lin and L. S. Davis, "Shape-based human detection and segmentation via hierarchical part-template matching," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 604–618, Apr. 2010.

[3] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in Proc. IEEE Conf. CVPR, Jun. 2015, pp. 833–841.

[4] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in Proc. IEEE Conf. CVPR, Jun. 2016, pp. 589–597.


[5] P. Luo, X. Wang, and X. Tang, "Pedestrian parsing via deep decompositional network," in Proc. IEEE ICCV, Dec. 2013, pp. 2648–2655.

[6] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in Proc. Adv. NIPS, 2010, pp. 1324–1332.

[7] W. Ge and R. T. Collins, "Marked point processes for crowd counting," in Proc. IEEE Conf. CVPR, Jun. 2009, pp. 2913–2920.

[8] M. Wang and X. Wang, "Automatic adaptation of a generic pedestrian detector to a specific traffic scene," in Proc. IEEE Conf. CVPR, Jun. 2011, pp. 3401–3408.

[9] V. B. Subburaman, A. Descamps, and C. Carincotte, "Counting people in the crowd using a generic head detector," in Proc. IEEE Conf. AVSS, Sep. 2012, pp. 470–475.

[10] M. Li, Z. Zhang, K. Huang, and T. Tan, "Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection," in Proc. IEEE ICPR, Dec. 2008, pp. 1–4.

[11] A. N. Marana, L. F. Costa, R. A. Lotufo, and S. A. Velastin, "On the efficacy of texture analysis for crowd monitoring," in Proc. IEEE SIBGRAPI, Oct. 1998, pp. 354–361.

[12] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in Proc. BMVC, 2012, vol. 1, no. 2, p. 3.

[13] A. B. Chan and N. Vasconcelos, "Counting people with low-level features and Bayesian regression," IEEE Trans. Image Process., vol. 21, no. 4, pp. 2160–2177, Apr. 2012.

[14] D. Kong, D. Gray, and H. Tao, "A viewpoint invariant approach for crowd counting," in Proc. IEEE ICPR, vol. 3, Aug. 2006, pp. 1187–1190.

[15] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in Proc. IEEE Conf. CVPR, Jun. 2008, pp. 1–7.

[16] D. Ryan, S. Denman, C. Fookes, and S. Sridharan, "Crowd counting using multiple local features," in Proc. IEEE DICTA, Dec. 2009, pp. 81–88.

[17] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proc. IEEE Conf. CVPR, vol. 1, Jun. 2001, pp. I-1034–I-1040.

[18] M. von Borstel, M. Kandemir, P. Schmidt, M. K. Rao, K. Rajamani, and F. A. Hamprecht, "Gaussian process density counting from weak supervision," in Proc. ECCV, 2016, pp. 365–380.

[19] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert, "Density-aware person detection and tracking in crowds," in Proc. IEEE ICCV, Nov. 2011, pp. 2423–2430.

[20] Y. Wang, Y. X. Zou, J. Chen, X. Huang, and C. Cai, "Example-based visual object counting with a sparsity constraint," in Proc. IEEE ICME, Jul. 2016, pp. 1–6.

[21] Y. Wang and Y. Zou, "Fast visual object counting via example-based density estimation," in Proc. IEEE ICIP, Sep. 2016, pp. 3653–3657.

[22] L. Fiaschi, U. Köthe, R. Nair, and F. A. Hamprecht, "Learning to count with regression forest and structured labels," in Proc. IEEE ICPR, Nov. 2012, pp. 2685–2688.

[23] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada, "COUNT forest: Co-voting uncertain number of targets using random forest for crowd density estimation," in Proc. IEEE ICCV, Dec. 2015, pp. 3253–3261.

[24] E. Walach and L. Wolf, "Learning to count with CNN boosting," in Proc. ECCV, 2016, pp. 660–676.

[25] C. Arteta, V. Lempitsky, and A. Zisserman, "Counting in the wild," in Proc. ECCV, 2016, pp. 483–498.

[26] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 743–761, Apr. 2012.

[27] Q. Ye, Z. Han, J. Jiao, and J. Liu, "Human detection in images via piecewise linear support vector machines," IEEE Trans. Image Process., vol. 22, no. 2, pp. 778–789, Feb. 2013.

[28] A. Satpathy, X. Jiang, and H.-L. Eng, "Human detection by quadratic classification on subspace of extended histogram of gradients," IEEE Trans. Image Process., vol. 23, no. 1, pp. 287–297, Jan. 2014.

[29] Y. Bo and C. C. Fowlkes, "Shape-based pedestrian parsing," in Proc. IEEE Conf. CVPR, Jun. 2011, pp. 2265–2272.

[30] T. Zhao, R. Nevatia, and B. Wu, "Segmentation and tracking of multiple humans in crowded environments," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 7, pp. 1198–1211, Jul. 2008.

[31] K. Kang and X. Wang, "Fully convolutional neural networks for crowd segmentation," 2014. [Online]. Available: https://arxiv.org/abs/1411.4464

[32] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. Adv. NIPS, 2014, pp. 568–576.

[33] Y. Yan, E. Ricci, S. Subramanian, G. Liu, and N. Sebe, "Multitask linear discriminant analysis for view invariant action recognition," IEEE Trans. Image Process., vol. 23, no. 12, pp. 5599–5611, Dec. 2014.

[34] J. Shao, K. Kang, C. C. Loy, and X. Wang, "Deeply learned attributes for crowded scene understanding," in Proc. IEEE Conf. CVPR, Jun. 2015, pp. 4657–4666.

[35] J. Walker, A. Gupta, and M. Hebert, "Patch to the future: Unsupervised visual prediction," in Proc. IEEE Conf. CVPR, Jun. 2014, pp. 3302–3309.

[36] S. Huang et al., "Deep learning driven visual path prediction from a single image," IEEE Trans. Image Process., vol. 25, no. 12, pp. 5892–5904, Dec. 2016.

[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. NIPS, 2012, pp. 1097–1105.

[38] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Proc. ECCV, 2014, pp. 818–833.

[39] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. CVPR, Jun. 2015, pp. 1–9.

[40] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang, "An exploration of parameter redundancy in deep networks with circulant projections," in Proc. IEEE Conf. CVPR, Jun. 2015, pp. 2857–2865.

[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. CVPR, Jun. 2016, pp. 770–778.

[42] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, "CNNpack: Packing convolutional neural networks in the frequency domain," in Proc. Adv. NIPS, 2016, pp. 253–261.

[43] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. CVPR, Jun. 2015, pp. 3431–3440.

[44] S. Zheng et al., "Conditional random fields as recurrent neural networks," in Proc. IEEE ICCV, Dec. 2015, pp. 1529–1537.

[45] J. Yang, B. Price, S. Cohen, H. Lee, and M.-H. Yang, "Object contour detection with a fully convolutional encoder-decoder network," in Proc. IEEE Conf. CVPR, Jun. 2016, pp. 193–202.

[46] L. Wang, Y. Qiao, X. Tang, and L. Van Gool, "Actionness estimation using hybrid fully convolutional networks," in Proc. IEEE Conf. CVPR, Jun. 2016, pp. 2708–2717.

[47] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. ACM Multimedia, 2014, pp. 675–678.

[48] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 2547–2554.

[49] S. An, W. Liu, and S. Venkatesh, "Face recognition using kernel ridge regression," in Proc. IEEE Conf. CVPR, Jun. 2007, pp. 1–7.

[50] K. Chen, S. Gong, T. Xiang, and C. C. Loy, "Cumulative attribute space for age and crowd density estimation," in Proc. IEEE Conf. CVPR, Jun. 2013, pp. 2467–2474.

Siyu Huang received the bachelor's degree in information and communication engineering from Zhejiang University, China, in 2014. He is currently pursuing the Ph.D. degree with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. His advisors are Prof. Z. Zhang and Prof. X. Li. His current research interests are primarily in computer vision, machine learning, and deep learning.


Xi Li received the Ph.D. degree from the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, Beijing, China, in 2009. He is currently a Full Professor with Zhejiang University, China. He was a Senior Researcher with the University of Adelaide, Australia. From 2009 to 2010, he was a Post-Doctoral Researcher with CNRS Telecom ParisTech, France. His research interests include visual tracking, motion analysis, face recognition, Web data mining, and image and video retrieval.

Zhongfei Zhang received the B.S. degree (Hons.) in electronics engineering and the M.S. degree in information sciences from Zhejiang University, China, and the Ph.D. degree in computer science from the University of Massachusetts Amherst, USA. He is currently a QiuShi Chaired Professor with Zhejiang University, China, where he directs the Data Science and Engineering Research Center. He is on leave from The State University of New York at Binghamton, USA, where he is a Professor with the Computer Science Department and directs the Multimedia Research Laboratory. His research interests include knowledge discovery from multimedia data and relational data, multimedia information indexing and retrieval, and computer vision and pattern recognition.

Fei Wu received the B.S. degree from Lanzhou University, Lanzhou, Gansu, China, the M.S. degree from Macao University, Taipa, Macau, and the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently a Full Professor with the College of Computer Science and Technology, Zhejiang University. He was a Visiting Scholar with Prof. B. Yu's group at the University of California, Berkeley, from 2009 to 2010. His current research interests include multimedia retrieval, sparse representation, and machine learning.

Shenghua Gao received the B.E. degree from the University of Science and Technology of China in 2008 and the Ph.D. degree from Nanyang Technological University in 2013. He is currently an Assistant Professor with ShanghaiTech University, Shanghai, China. He was awarded the Microsoft Research Fellowship in 2010. His research interests include computer vision and machine learning.

Rongrong Ji is currently a Professor and the Director of the Intelligent Multimedia Technology Laboratory, and Dean Assistant with the School of Information Science and Engineering, Xiamen University, Xiamen, China. His work mainly focuses on innovative technologies for multimedia signal processing, computer vision, and pattern recognition, with over 100 papers published in international journals and conferences. He is a member of the ACM. He also serves as a Program Committee Member for several tier-1 international conferences. He was a recipient of the ACM Multimedia Best Paper Award and the Best Thesis Award of the Harbin Institute of Technology. He serves as an Associate/Guest Editor for international journals and magazines such as Neurocomputing, Signal Processing, Multimedia Tools and Applications, IEEE MultiMedia, and Multimedia Systems.

Junwei Han received the Ph.D. degree in pattern recognition and intelligent systems from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2003. He is currently a Professor with Northwestern Polytechnical University. His current research interests include multimedia processing and brain imaging analysis. He is an Associate Editor of the IEEE Transactions on Human–Machine Systems, Neurocomputing, and Multidimensional Systems and Signal Processing.