
People Counter: Counting of Mostly Static People in Indoor Conditions

Amit Khemlani, Kester Duncan, and Sudeep Sarkar

Abstract. The ability to count people from video is a challenging problem. The scientific challenge arises from the fact that although the task is relatively well-defined, the imaging scenario is not well constrained. The background scene can be uncontrolled, along with the illumination being complex and varying. Additionally, the spatial and temporal image resolution is usually poor. The context of most work in people counting is counting pedestrians from single frames in outdoor settings, or moving subjects in indoor settings from standard frame rate video. There is little work on counting persons in varying poses, who are mostly static (sitting, lying down), in very low frame rate video (4 frames per minute), and under harsh illumination variations. In this chapter, we explore a design that handles illumination issues at the pixel level using photometry-based normalization, and pose and low-movement issues at the feature level by exploiting the spatio-temporal coherence that is present among small body part movements. The motion of each body part, such as the hands or the head, will be present even in mostly static poses. These short duration motions will occur spatially close together over the image location occupied by the subject. We accumulate these using a spatio-temporal autoregressive (AR) model to arrive at blob representations that are further grouped into people counts. We show quantitative performance on real datasets.

Amit Khemlani
Computer Science and Engineering, University of South Florida, 4202 E. Fowler Avenue, Tampa FL 33620; now at Healthesystems, Tampa, Florida

Kester Duncan
Computer Science and Engineering, University of South Florida, 4202 E. Fowler Avenue, Tampa FL 33620
e-mail: [email protected]

Sudeep Sarkar
Computer Science and Engineering, University of South Florida, 4202 E. Fowler Avenue, Tampa FL 33620
e-mail: [email protected]

C. Shan et al. (Eds.): Video Analytics for Business Intelligence, SCI 409, pp. 133–159. springerlink.com © Springer-Verlag Berlin Heidelberg 2012


1 Introduction

1.1 Motivation

Counting people is a challenging scientific problem that has many practical applications. These include monitoring the number of people sitting in front of a television set, counting the number of people in an elevator, counting people passing through security doors in malls, and counting the number of people working in a laboratory. In general, people counting is important in surveillance-based applications, market research, and people management. Manually counting people in a sequence of images is a very tedious process, and automation helps. There is much work on people detection that applies when the pose of the person can be constrained, e.g. in pedestrian detection. Another class of approaches counts people by detecting motion in a video sequence. These approaches assume people are in motion for a substantial amount of time and use standard frame rate video. These assumptions do not hold for mostly static scenes, such as people working on a computer or watching television, where we face an uncontrolled background scene, poor or complex illumination, poor image resolution (e.g. 320x240), and low frame rate (e.g. 4 frames per minute). In this chapter, we demonstrate methods for counting people under these challenging conditions.

1.2 State of the Art

Tables 1 and 2 categorize the work that has been done in people counting in terms of the application context, the degree of illumination variation it can handle, the computational methods employed, and the performance achieved. They also include work on people tracking that can be applied to counting. Most approaches use some form of background subtraction or frame to frame differencing, which makes them susceptible to illumination changes. They exploit only spatial information to count people, disregarding temporal information, which can be a very useful cue for dealing with illumination changes. Many approaches use a setup in which an overhead camera is installed [18, 19, 20, 26, 1, 12]. Some use skin color to detect people [22]. Some works use static images to count people and some use video. The frame rates used range from 1 frame per second to 30 frames per second (fps); the lowest frame rate used was 1 fps [19]. In some instances, multiple sensors have been used to facilitate people counting. Infra-red cameras have also been used to detect irides, resulting in people counts [25, 7].

For low frame rates, counting can be done by detecting people in each frame, as in [15], where limbs and faces are separately detected using quadratic Support Vector Machine (SVM) classifiers and then fused. There are also many pedestrian detection works in the literature based on machine learning. These approaches work well for upright people. We present an approach that does not make assumptions about the posture of the person.


The number of possible configurations is large. In some cases there are subjects sitting with their backs to the camera, in which case face and leg detection are not possible.

There is little work on people counting when people are not moving around, and little that can handle harsh variations in illumination conditions. We address counting relatively static people using a static camera, and we also deal with varying illumination conditions. We address these issues through pixel normalization and the use of spatiotemporal coherence information obtained from the video sequence.

Table 1 Prior work in people counting, categorized by application context, the nature of illumination variations handled, the algorithm employed, and the nature of performance achieved.

Work | Context | Illumination | Method | Performance
Conrad and Johnsonbaugh, 1994 [4] | Overhead view, moving people | Varying | Frame differencing | 95.6% accuracy for a sampling of 7491 people
Rossi and Bozzoli, 1994 [18] | Top-down view at a security gate, moving people | Fixed | Frame differencing, vertical projection, texture analysis within blob | 20 to 40 persons, 90% correct
Schofield et al., 1996 [19] | Top-down view of elevator | Fixed | Threshold based on complex background model learnt using neural networks | 1 fps, up to 7 people
Segen and Pingali, 1996 [20] | Elevated views at a mall, moving persons | Fixed | Track blobs, group short traces into larger tracks | 14 fps, couple of people
Kuno, Watanabe, Shimosakoda, and Nakagawa, 1996 [13] | Horizontal view of people walking and non-humans | Fixed | Background subtraction, threshold using extraction function, get 3 projection histograms and calculate 3 features for counting people and handling occlusion | 2500 images, 98% humans detected correctly
Stillman et al., 1998 [22] | Fixed and pan-tilt-zoom cameras | Fixed | Skin-tone detection, face blob detection based on vertical and horizontal projection | 2 persons
Kettnaker and Zabih, 1999 [11] | Multiple cameras along corridor, moving person | Fixed | Moving blobs found using grouping of optic flow, matching across cameras using bipartite matching | 4 cameras, 13 people
Terada, Yoshida, Oe, and Yamaguchi, 1999 [26] | Stereo camera hung from ceiling | Fixed | Template matching between space-time images from two cameras to generate 3D data; two pairs of space-time images are generated to detect direction | 22 people in and 21 people out
Beymer, 2000 [1] | Stereo camera on a door, pointing down | Fixed | Calibrate and specify VOI at setup, reconstruct the scene in 3D, select disparity pixels (occupancy map) in a VOI, Gaussian mixtures to model multiple people | 20-25 fps, 905 enter/exit events, 4 hrs of data, error rate 1/4%
Lee, 2000 [14] | Horizontal view of office scenes | Fixed | Frame differencing, vertical projection, upper silhouette enhancement by voting, shape analysis to detect heads | 3 people
Jabri et al., 2000 [10] | Horizontal views of walking people | Slow illumination change | Background (adaptive) subtraction, edge subtraction | 1 person in 14 sequences under two different illumination conditions


Table 2 (continued) Prior work in people counting, categorized by application context, the nature of illumination variations handled, the algorithm employed, and the nature of performance achieved.

Work | Context | Illumination | Method | Performance
Mohan, Papageorgiou, and Poggio, 2001 [15] | Horizontal view of walking people | Fixed | Haar wavelet transform, quadratic SVMs for identifying components and an SVM combination classifier | -
Haritaoglu and Flickner, 2001 [7] | Horizontal view of people waiting for cashier | Fixed | Silhouette-based and motion-based people detection; uses 2 IR light sources for iris detection | 39 people, 25-30 fps, 55 minutes of data
Huang, Chow, and Chau, 2002 [9] | View of a station | Fixed | Image segmentation (sub-block), feature extraction (sub-curve), RBF networks | Up to 92.5% accuracy
Kim et al., 2002 [12] | Top-down view of security gate, moving people | Fixed | Frame differencing and background subtraction, convex hulls of blobs, tracking and counting | 10 fps, 6 people
Yang, Gonzalez-Banos, and Guibas, 2003 [25] | 8 sensors | Fixed | Background subtraction, planar projection of silhouettes, intersection of projections, removal of small polygons and phantoms, new bounds on the number of objects | 15 fps
Ramoser et al., 2003 [17] | Horizontal-view surveillance system | Fixed | Standard adaptive background models, CC analysis, blob filtering and model fitting | 2 sequences of 500 frames each
Zhao and Nevatia, 2004 [28] | Top-down view of outdoor scene | Varying | Blob extraction; background/camera/human models | 99% detection accuracy
Celik et al., 2006 [3] | Top-down view of mall scene | Fixed | Foreground/background extraction, Gaussian mixture model | 25 fps and 80% accuracy
Ramanan and Forsyth, 2007 [16] | Indoor and outdoor horizontal views | Varying | Human appearance model | 90% average detection on many sequences
Yu et al., 2008 [27] | Overhead camera | Varying | Foreground/background edge model | 97.8% accuracy
Fehr et al., 2009 [5] | Top-down view of outdoor scene | Varying | Gaussian mixtures, shadow removal | 30 fps
Zhao et al., 2009 [29] | Horizontal view of indoor scenes | Small changes | Kalman filter, object tracking, angle histogram, face detection | 93% accuracy
Hou and Pang, 2011 [8] | Static camera overlooking a large area | Varying | Neural network, background estimation using Gaussian mixture models, Expectation-Maximization | 10 fps, 90% accuracy

1.3 Overview of the Approach

Our approach, depicted in Figure 1, relies on the fact that subjects tend to shift or move body parts and are rarely absolutely stationary for an extended period of time. It exploits this observation and first detects frame to frame motion blobs by a combination of frame to frame differencing and motion accumulation. These blobs are then grouped spatiotemporally (in XYT), treating the image sequence as a three-dimensional signal (two image dimensions plus time), to form counts of people for each frame. The approach comprises four levels of processing.


Fig. 1 Outline of the approach.

1. The first level makes the sequence of images robust to illumination changes. This is achieved by normalizing the images: each pixel is divided by the maximum in its neighborhood. The normalized images are, to an extent, less susceptible to illumination changes. After obtaining the normalized images, frame to frame differencing is performed and the resulting images are thresholded.

2. Autoregressive (AR) model-based accumulation is used as the second level of processing. The AR model accumulates the temporal and spatial motion to form good 3D blobs that can be clustered by subsequent levels. The output of the AR model is then thresholded.

3. Connected component analysis is the third level of processing and is performed on the thresholded AR images to identify the blobs. This is performed not only spatially but also temporally, by treating the sequence of images as a block of data. The connected component step distinguishes blobs from each other and serves as input to the blob grouping step. The first three levels form the Blob Detection phase of the approach.

4. Blob grouping is necessary because many blobs may correspond to a single person and need to be grouped together to generate an accurate people count.

The approach outlined here is quite robust. Our experiments, presented later, show that it is not easily affected by lighting changes. Given an empty scene with just a lighting change, it usually produces a count of zero. It can maintain the count under varying illumination conditions. It can count people even if they are only partially visible. It is worth mentioning that counts are generated for any moving object in the scene; the approach does not yet try to distinguish between humans and non-humans. Most counting errors are concentrated around frames with large motion events, such as a person moving out of the scene.

2 Blob Detection

Blob detection involves intensity image normalization and motion accumulation. The normalizing step is applied to the original images and is important for making the images robust to illumination variations to a certain extent. Once the images are robust to illumination changes, motion accumulation is applied. The motion of each body part is temporary. However, fleeting motion strands will occur spatially close together over the image location occupied by the subject. The motion accumulator should be able to integrate the motion cues found by frame to frame differencing, both spatially and temporally.

2.1 Image Intensity Normalization

The intensity normalization idea relies on an approximate model of the image intensity I(x,y,t) given by

I(x,y,t) = I_0(t)\,\rho(x,y)\,\mathbf{n}^T\mathbf{s}   (1)

where

• I_0(t) is the incident intensity,
• \rho(x,y) is the surface albedo at location (x,y),
• \mathbf{n} is the surface normal at location (x,y), and
• \mathbf{s} is the direction to the light source.

Assuming that we are observing mostly static scenes with some degree of motion, the variation in intensity is primarily due to the variation in the intensity of the overall illumination. The ratio of the intensities of two nearby pixels is invariant to the overall intensity. This ratio is expressed mathematically as

NI(x_1,y_1,t) = \frac{I(x_1,y_1,t)}{I(x_2,y_2,t)} = \frac{\rho(x_1,y_1)\,\mathbf{n}_1^T\mathbf{s}}{\rho(x_2,y_2)\,\mathbf{n}_2^T\mathbf{s}}.   (2)


This ratio will tend to pick up spatial changes in albedo and surface normal, but will be stable with respect to overall illumination changes. It will be one in regions where there is no change in albedo or surface normal. To impart some numerical stability to the computation, we take the neighboring pixel to be the one with the maximum intensity in a fixed-size neighborhood around the pixel under consideration. We experimented with 3x3, 5x5, and 7x7 fixed-size neighborhoods. Figure 2 shows some normalized image frames generated using a 7x7 window along with the corresponding original intensity frames. Note how the illumination-related shading gradient across the walls is removed. The features of the objects in the third shelf of the bookcase on the left, which are in the dark in the original image, gain prominence in the normalized images. Figure 3 shows that frame differencing of original intensities detects changes due to illumination variations, whereas the frame difference of normalized intensities tends to be robust with respect to illumination changes. The original image difference is not robust to illumination change, as shown by the red ellipse in the thresholded difference of the original image; in the corresponding region of the thresholded difference of the normalized image, the illumination change is not captured. Although the normalized image is robust to illumination change to a certain extent, the thresholded difference of the normalized image does introduce some noise, shown within the green ellipse in Figure 3. This noise is mostly a few pixels in size and can easily be removed by morphological operations.
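As a concrete illustration, the following sketch implements this normalization and differencing step, assuming grayscale frames stored as floating-point NumPy arrays; the window size and difference threshold are illustrative choices, not the exact settings used in the chapter.

```python
import numpy as np
from scipy.ndimage import binary_opening, maximum_filter

def normalize_frame(frame: np.ndarray, window: int = 7, eps: float = 1e-6) -> np.ndarray:
    """Divide each pixel by the maximum intensity in its window x window
    neighborhood, yielding a ratio image that is largely invariant to the
    overall illumination level (Eq. 2)."""
    local_max = maximum_filter(frame.astype(np.float64), size=window)
    return frame / (local_max + eps)

def motion_mask(prev_norm: np.ndarray, curr_norm: np.ndarray, thresh: float = 0.1) -> np.ndarray:
    """Threshold the frame-to-frame difference of normalized images to a
    binary motion mask, then remove pixel-sized speckle noise with a
    morphological opening."""
    mask = np.abs(curr_norm - prev_norm) > thresh
    return binary_opening(mask)
```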

Fig. 2 Sample frames showing the locally normalized image intensity representation: original images and the corresponding normalized images.


Fig. 3 Frame difference of original intensities versus frame difference of normalized intensities.

2.2 Motion Accumulation Using Autoregressive Models

A mechanism is required to accumulate the intermittent motion that occurs as parts of a subject move. The motion of each body part, such as the hands or the head, is present only temporarily. However, many short-duration motion strands will occur spatially close together over the image location occupied by the subject. The motion accumulator should be able to integrate, both spatially and temporally, the motion cues found by frame to frame differencing. It should possess sufficient “memory” to compensate for the lack of motion at a pixel for a few minutes, and it should “forget” short-duration motion events that occur only over a few frames. This can be achieved fairly easily using spatiotemporal autoregressive (AR) filtering of the normalized-image intensity differences. Let g(x,y,t) denote the corresponding AR-filtered output, expressed mathematically in Eq. (3) as


g(x,y,t) = C_{000}\,d(x,y,t) + \sum_{k=1}^{T} \sum_{(i,j)\in R} C_{ijk}\,g(x+i, y+j, t-k)   (3)

where

• C_{ijk} are the AR coefficients,
• T is the temporal order of the AR process,
• R represents the spatial order in terms of the local neighborhood considered, and
• d(x,y,t) denotes the pixel at location (x,y) in the t-th time frame of the normalized-image difference.

Fig. 4 Motion history accumulation using the autoregressive (AR) model.

It is important to note that the coefficients should sum to 1 so that the gray level of the pixels is maintained within 0 and 255: \sum C_{ijk} = 1. In our experiments, we choose all coefficients except C_{000} to be the same. Figure 4 depicts the operations pictorially for T = 1 and a 3x3 neighborhood R around a pixel. Note that the operations involve only local pixels, both spatially and temporally, making the method suitable for hardware implementation. Even though the operations are local, the effect of a pixel in d(x,y,t) persists, mediated through g(x,y,t). Using z-transforms, it can easily be shown that the fall-off in “memory” is exponential, both spatially and temporally. The rate of fall of the exponential function is controlled by the value of the coefficients.
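Under these choices (T = 1, a 3x3 neighborhood, equal coefficients for the nine past terms, and all coefficients summing to one), the spatial sum in Eq. (3) collapses to a box filter, so one AR update step can be sketched as below; the value c000 = 0.5 is one of the settings later varied in Table 6, and working in a [0, 1] gray-level range rather than [0, 255] is an illustrative simplification.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ar_update(g_prev: np.ndarray, d_curr: np.ndarray, c000: float = 0.5) -> np.ndarray:
    """One step of the spatiotemporal AR accumulator of Eq. (3) for T = 1
    and a 3x3 neighborhood R. With the nine past coefficients equal to
    (1 - c000)/9, the double sum reduces to (1 - c000) times the 3x3 mean
    of the previous output."""
    neighborhood_mean = uniform_filter(g_prev, size=3)
    return c000 * d_curr + (1.0 - c000) * neighborhood_mean
```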

Figure 5 shows a sample output of the AR filtering process. To fully appreciate the effect, one has to view the video sequence. The filtering de-emphasizes changes that happen only in a few frames, such as those caused by a walking person, and enhances changes that are repeated in a local vicinity, such as those caused by a stationary person in the scene.

The memory of the AR process can be a disadvantage when there are random intensity fluctuations, either due to a severe illumination change or due to a person walking past the camera. The effect of these events can persist for some frames. To handle such cases, the AR process is reset whenever such drastic changes in the scene are detected, as reflected in a change of the normalized-intensity histogram of the image.


Fig. 5 The left column shows the original frames, the middle column shows the output of the frame to frame differencing of the normalized images, and the right column shows the output of the autoregressive process.

The change in the entropy of the normalized-intensity histogram of the frames is monitored. Let P_t(n) represent the histogram of the t-th frame, normalized to sum to one. The dissimilarity between the distributions from two time instants is quantified by the decrement of entropy achieved by combining or merging them, as compared to considering them separately. If the probabilities governing the random variables are similar, the decrement of entropy is low, implying a minor dissimilarity between the random variables. Mathematically, it is given by

\Delta H(t, t+1) = H\left(\tfrac{1}{2}(P_t(n) + P_{t+1}(n))\right) - \tfrac{1}{2}\left(H(P_t(n)) + H(P_{t+1}(n))\right)   (4)

where

• P_t(n) is the histogram of intensities of the normalized image at time t,
• P_{t+1}(n) is the histogram of intensities of the normalized image at time t+1,
• H(P_t(n)) is the entropy of the intensities of the normalized image at time t: H(P_t(n)) = -\sum_n P_t(n) \log P_t(n),
• H(P_{t+1}(n)) is the entropy of the intensities of the normalized image at time t+1: H(P_{t+1}(n)) = -\sum_n P_{t+1}(n) \log P_{t+1}(n),
• H(\tfrac{1}{2}(P_t(n) + P_{t+1}(n))) is the entropy of the “merged” histogram of the intensities from the two image frames:

H\left(\tfrac{P_t(n)+P_{t+1}(n)}{2}\right) = -\sum_n \left(\tfrac{P_t(n)+P_{t+1}(n)}{2}\right) \log\left(\tfrac{P_t(n)+P_{t+1}(n)}{2}\right),

and
• n indexes the bins of the histogram.

If the intensities in the individual frames have not changed drastically, then the distribution of the merged frames will be the same as that of the individual frames, and the decrement of entropy will be zero. Wong and You [23] first used this entropic distance to measure the similarity between two random graphs. Note that other measures exist for determining the distance between probability distributions, such as the mutual information-based KL measure and the Bhattacharyya distance. However, both measures require that the support sets of the two probability distributions, i.e. the domains of non-zero probability, be the same. This is not always possible in our context, as the same set of intensity values might not exist in the two images.
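This decrement of entropy is the quantity commonly known as the Jensen-Shannon divergence, which needs no shared support. A minimal sketch, assuming the histograms over the normalized-intensity range are supplied as NumPy arrays:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy of a normalized histogram; empty bins contribute zero."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def entropy_decrement(hist_t: np.ndarray, hist_t1: np.ndarray) -> float:
    """Decrement of entropy of Eq. (4) between the normalized-intensity
    histograms of frames t and t+1."""
    p = hist_t / hist_t.sum()
    q = hist_t1 / hist_t1.sum()
    return entropy(0.5 * (p + q)) - 0.5 * (entropy(p) + entropy(q))
```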

The blob detection procedure is as follows. For each frame k in a sequence:

1. Normalize the image frame to obtain N_k.
2. Compute the difference image D_k = N_k - N_{k-1}.
3. Threshold D_k to produce d_k.
4. Compute the AR-filtered image g(x,y,t):
   if \Delta H(t, t+1) < H_t:
      g(x,y,t) = C_{000}\,d(x,y,t) + \sum_{k=1}^{T} \sum_{(i,j)\in R} C_{ijk}\,g(x+i, y+j, t-k)
   else:
      g(x,y,t) = 0

Whenever the change in entropy of the intensity distributions is large compared to a threshold (H_t), the AR process is reset. This ensures that the effect of drastic changes in a few frames does not persist. Conversely, resetting the AR process can result in the temporal disconnection of genuine motion blobs for people watching television or in similar settings. This problem is addressed by the subsequent blob grouping step, which consolidates them.
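Pulling the steps together, a per-frame loop might look like the sketch below, reusing normalize_frame, motion_mask, ar_update, and entropy_decrement from the earlier sketches; the entropy threshold, histogram binning, and final blob threshold are all illustrative assumptions.

```python
import numpy as np

H_T = 0.3  # entropy reset threshold; 0.3 is one of the values in Table 6

def detect_motion_blobs(frames):
    """Yield a binary motion-blob mask for each frame after the first,
    following steps 1-4 of the blob detection procedure."""
    g = np.zeros_like(frames[0], dtype=np.float64)
    prev = normalize_frame(frames[0])
    prev_hist, _ = np.histogram(prev, bins=64, range=(0.0, 1.0))
    for frame in frames[1:]:
        curr = normalize_frame(frame)
        curr_hist, _ = np.histogram(curr, bins=64, range=(0.0, 1.0))
        d = motion_mask(prev, curr).astype(np.float64)
        if entropy_decrement(prev_hist, curr_hist) < H_T:
            g = ar_update(g, d)          # accumulate motion history
        else:
            g = np.zeros_like(g)         # drastic scene change: reset AR
        yield g > 0.2                    # thresholded AR output
        prev, prev_hist = curr, curr_hist
```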

3 Spatiotemporal Grouping into Counts

Following the detection of the motion pixels, a mechanism is required to group the motion pixels corresponding to a single person into the blobs that form our count.


The first step is to label each XYT blob. This is done by performing connected component analysis on the image sequence, treating it as a block of data. After the blobs are labeled, spatiotemporal grouping is performed to group similar blobs together. Before describing the grouping algorithm, we discuss some concepts of perceptual organization. There are three strategies for the spatiotemporal grouping problem [2]:

• Sequential Approach: This is the most commonly used method in computer vision. The incoming frames of the video sequence are first processed to extract any spatial grouping. The spatial groupings are then grouped using the temporal information.

• Interactive Approach: The temporal grouping is used to adjust the spatial grouping.

• Multidimensional Approach: The image sequence is treated as a volume and the entire block is processed at one time. This is by far the best strategy. The computing power and memory required to work directly on a three-dimensional signal have been a major concern, but this has been alleviated by recent technological advances. We adopt this approach.

XYT Connected Component Analysis: After motion accumulation, connected component analysis [6] is performed to identify the 3D blobs (see the sketch below). We apply connected component analysis not only in 2D but also in the third dimension of time, i.e. we treat the entire image sequence as a block of data and apply connected component analysis to the entire block at once rather than to each image separately. The number of neighbors therefore increases from 8 to 26: 9 neighbors each from times t-1 and t+1, plus the 8 neighbors at time t.
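A minimal sketch of this XYT labeling, assuming the per-frame binary motion masks are stacked into a 3D array; scipy's label with a full 3x3x3 structuring element gives exactly the 26-connectivity described above.

```python
import numpy as np
from scipy.ndimage import generate_binary_structure, label

def label_xyt_blobs(masks):
    """Stack per-frame binary motion masks into a T x H x W volume and
    label its 3D connected components in a single pass; each voxel is
    linked to the 26 neighbors described above."""
    volume = np.stack(masks, axis=0)
    structure = generate_binary_structure(rank=3, connectivity=3)
    labels, num_blobs = label(volume, structure=structure)
    return labels, num_blobs
```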

XYT Blob Grouping: The XYT blobs found by connected component labeling are set up as the vertices of a graph, and links (edges) are established between blobs. The links carry weights based on a proximity measure: the smaller the distance between two blobs, the more similar they are, and vice versa. The proximity factor given below captures the spatiotemporal proximity of the blobs.

A_{ij} = \exp\left(-\frac{(\mathbf{i}-\mathbf{j})^T(\mathbf{i}-\mathbf{j})}{\sqrt[3]{\vartheta_i}\,\sqrt[3]{\vartheta_j}}\right)   (5)

where

• \mathbf{i} = [x_i, y_i, t_i] is the center of XYT blob i,
• \mathbf{j} = [x_j, y_j, t_j] is the center of XYT blob j, and
• \vartheta_i, \vartheta_j are the volumes of blobs i and j.

Note that the weights are high for blobs that are close together both spatially and temporally. The distance bridged is proportional to the size of the blobs: larger blobs will be associated with other blobs over a longer distance than smaller blobs. The second step involves partitioning this graph into node clusters that are strongly connected to each other.
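The affinity matrix of Eq. (5) can be computed directly from the blob centroids and volumes; a sketch, assuming centers is an N x 3 array of (x, y, t) centroids and volumes an N-vector of voxel counts:

```python
import numpy as np

def blob_affinity(centers: np.ndarray, volumes: np.ndarray) -> np.ndarray:
    """Pairwise proximity weights of Eq. (5). Dividing the squared XYT
    distance by the cube roots of the two blob volumes lets larger blobs
    bridge longer distances."""
    diff = centers[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff * diff, axis=-1)
    scale = np.cbrt(volumes)[:, None] * np.cbrt(volumes)[None, :]
    return np.exp(-sq_dist / scale)
```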


Graph Cuts: Once the links and vertices are established, we partition the graph into clusters based on a disassociation measure. Ideally the graph is cut along the links with low weight, forming a good set of connected components. There are various disassociation measures for cutting a graph: minimum cut, normalized cut, average cut, and ratio cut. Consider a graph G = (V,E) that is to be partitioned into two sets A and B, such that A \cup B = V and A \cap B = \emptyset, by removing the edges connecting the two sets. This is called the cut [24], and the cut measure is defined as

cut(A,B) = \sum_{u \in A, v \in B} w(u,v).   (6)

The optimal way to bipartition the graph is to minimize the cut value. There is a major drawback in using the minimum cut criterion: it has a tendency to cut off small sets of isolated nodes in the graph. To overcome this problem, the normalized cut [21] is used. The normalized cut computes the cut cost as a fraction of the total edge connections to all the nodes in the graph, as opposed to the minimum cut, which only computes the total edge weight connecting the two parts. The two most important properties of grouping, minimizing the disassociation between clusters and maximizing the association within clusters, are thus satisfied simultaneously. The normalized cut criterion can be approximated efficiently by the following equation [21]:

(D - W)\,\mathbf{y} = \lambda D\,\mathbf{y},   (7)

where D is an N x N diagonal matrix with d on its diagonal, d(i) = \sum_j w(i,j) being the total connection from node i to all other nodes, and W is an N x N symmetric matrix with W(i,j) equal to the weight between nodes i and j. This is a generalized eigenvalue equation.
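A minimal sketch of this bipartition step, assuming a dense affinity matrix W (e.g. from blob_affinity above) in which every node has at least one connection so that D is positive definite; the eigenvector for the second smallest eigenvalue is thresholded at zero to split the nodes:

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W: np.ndarray) -> np.ndarray:
    """Approximate the normalized cut by solving the generalized eigenvalue
    problem (D - W) y = lambda D y (Eq. 7) and splitting the nodes by the
    sign of the eigenvector with the second smallest eigenvalue."""
    d = W.sum(axis=1)
    D = np.diag(d)
    eigvals, eigvecs = eigh(D - W, D)  # eigenvalues in ascending order
    return eigvecs[:, 1] >= 0          # boolean partition of the nodes
```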

4 Experiments and Results

We show results on actual home datasets and laboratory datasets. The home data contain images in which subjects are watching television; in the laboratory data, subjects are working in front of computers. The frame rate of the video is 4 frames per minute and the images have 320x240 resolution. The images show people sitting, walking, talking, and entering or exiting. Since the sequences span several days, they also include varying illumination conditions as another factor. The home dataset includes children playing and also pets, which make counting people more challenging. We quantified the performance of the people counter using confusion matrices, with manually generated people counts as ground truth.
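For concreteness, such a confusion matrix can be tallied as in the sketch below, where rows are the algorithm's per-frame counts and columns the manual ground-truth counts (the layout used in Tables 3-5); the cap on counts is an illustrative assumption.

```python
import numpy as np

def count_confusion(auto_counts, manual_counts, max_count=3):
    """Confusion matrix of per-frame people counts: entry (a, m) is the
    number of frames where the algorithm counted a people and the manual
    ground truth was m. The diagonal holds correct counts, the lower
    triangle over-counts, and the upper triangle under-counts."""
    cm = np.zeros((max_count + 1, max_count + 1), dtype=int)
    for a, m in zip(auto_counts, manual_counts):
        cm[min(a, max_count), min(m, max_count)] += 1
    return cm
```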

4.1 Results on Home Datasets

The first set of results we present is on two Nielsen home datasets. These datasets have very poor quality images with significant illumination variations: lights turned on and off, night vs. day, curtains opened and closed, etc. In some cases there are individuals walking across the scene. People appear in different poses in these images, including reclining postures. For privacy reasons, we cannot show the images or visual results for this data.

In the first dataset (Home Dataset I) we have images from a single home. A single camera is installed on top of a TV set to get a view of the living room, along with the system that grabs and stores the frames. The sequences we have processed span 5 days. In the second dataset (Home Dataset II) we have images from 4 homes. The data was collected using a B/W CCTV high-resolution Ex-View cube video security camera, which has 600 lines of resolution and uses a 3.7 mm lens. Since this is an analog camera, a Sensoray Model 611 PCI-bus frame grabber was used to grab and store the images. The size of the images is 320x240 and the images are grabbed once every 15 seconds. The data covers entire days and hence various illumination conditions. It includes people sleeping, kids playing in the room, windows that let sunlight into the room, lamps, people watching TV in the dark, and lights turning on and off.

We selected 45 sequences spread over 5 days from Home Dataset I. Each sequence consisted of 50 frames, for a total of 2250 frames. For each frame, we manually noted the number of persons. From Home Dataset II, we selected 40 sequences obtained from the 4 homes to quantify performance. Each sequence consisted of 50 frames, for a total of 2000 frames. Table 3 presents the results of comparing the algorithm's counts with the manual counts in the form of a confusion matrix for each of these two datasets. Each row corresponds to a count produced by the algorithm and the columns correspond to the manually generated ground-truth count. A person was counted in the ground truth if any body part was visible. The entries of the matrix show the number of frames with the corresponding pairs of counts. The diagonal denotes correct performance. Over-counts are in the lower triangle and under-counts are in the upper triangle. The percentages are computed over the respective columns and represent the detection rates for scenes with different numbers of people. There are more under-counts than over-counts.

Table 3 Performance on Home Datasets I and II as captured by confusion matrices that compare automated counts with manually generated ground-truth counts.

(a) Home Dataset I (b) Home Dataset II


Table 4 Performance of the approach without the autoregressive accumulation process on Home Datasets I and II, as captured by confusion matrices that compare automated counts with manually generated ground-truth counts.

(a) Home Dataset I (b) Home Dataset II

In Table 4 we show the corresponding counts without the autoregressive history accumulation process. The percentage of correct detections of empty scenes increased; however, the detection rate for scenes with people is lower. This attests to the importance of the autoregressive motion history accumulation process.

4.2 Laboratory Dataset

To be able to show examples of images and intermediate processing, we also collected data of people sitting and working in the computer vision labs ENB 218, ENB 348, and ENB 221. The data collection in ENB221 and ENB218 was done using a Canon Optura camera, whereas the data collection in ENB348 was done using the twin cameras used for the home dataset. Although the Canon Optura data was collected in WMV format sampled at 30 fps, we dropped frames to achieve the same frame rate as the home datasets, i.e. 4 frames per minute. We collected data in each laboratory for a duration of 5 hours. From this, we used snippets of the data chosen according to the number of people, different illumination conditions, and different types of activities. The number of people in our dataset varies from none to three. We have subjects sitting, working, entering, and leaving, and some illumination changes as the day progresses.

The dataset consists of 25 sequences, chosen to capture different illumination conditions, different activities, and different numbers of subjects. Each sequence has 50 images, for a total of 1250 images. In the following sections we show the results of some sequences covering illumination changes, subjects leaving and entering the laboratory, issues regarding moving objects other than humans, and different numbers of subjects present in the image.


4.2.1 Illumination Changes

Figure 6 shows some images from a sequence that is affected by illumination changes; the change is prominent on the ceiling of the room. Figure 7 presents another example. The algorithm presented in this chapter is able to maintain an accurate people count even under varying illumination conditions. The number of distinct blob colors represents the number of people.

In Figure 8 a subject enters while there are illumination changes. Note that although the blob shape does not correspond exactly to the shape of the person at each time instant, the count is maintained as the shape changes.

Figure 9 shows a sequence in which the light is initially off. The light comes on while a subject is sitting and working. In such a case, had we used only frame to frame differencing, we would have incorrectly acquired the entire image as the difference image. In our case, we also monitor the entropy of each image, which helps determine whether to continue or reset the autoregressive process.

4.2.2 Different Activities

In Figure 10 a subject enters the room and later leaves. The subject's motion blob persists for a while due to the combined effect of frame to frame differencing and the autoregressive model. This causes over-counting for only a few frames after the subject leaves, an effect that can be reduced by adjusting the autoregressive model coefficients.

In Figure 11 one of the subjects is partially visible. We are able to include him in the count because we use only motion as the cue to detect people; we do not detect any specific feature such as the face or hands. Feature-specific approaches would have failed in this case.

Figure 12 is an example of three subjects sitting and working. Initially there are only two subjects in the sequence; the third subject enters towards the end. The entrance event of the subject is not captured by the low frame rate video. The subject sitting on the right side is partially visible for some frames in the sequence.

We quantify performance using a confusion matrix of counts on a per-frame basis. One axis corresponds to the ground-truth number of subjects and the other to the computed number of subjects. Table 5 shows the confusion matrices for the three laboratory datasets. The parameter values were the same across all datasets. The correct detection rates range from 74% to 100%.

5 Parameter Variations

We have experimented with parameter values and evaluated their effect on the people count. There are three major parameters: the window of the AR model used to accumulate motion, the entropy threshold used to reset the AR process on large changes (H_t), and the AR coefficient C_{000}. The rest of the AR coefficients are chosen to be the same, with the constraint that all coefficients sum to one.


Fig. 6 Two subjects sitting in the laboratory. On the left are the original images, and on the right are the grouped spatio-temporal blob outputs. The two colors signify the blobs resulting in the count of two.


Fig. 7 One subject is sitting and the other subject is standing. On the left are the original images, and on the right are the grouped blob outputs. The two colors signify the blobs resulting in the count of two.


Fig. 8 A subject enters the laboratory and sits down. On the left are the original images, and on the right are the blob outputs.


Fig. 9 Initially it is dark, then the light comes on and a subject enters. On the left are the original images, and on the right are the outputs.


Fig. 10 Two subjects are sitting; a third subject enters the laboratory in one frame and leaves in another. On the left are the original images, and on the right are the outputs. The three colors signify the blobs resulting in the count of three.


Fig. 11 Two subjects are talking and one is partially visible. On the left are the original images, and on the right are the outputs. The two colors signify the blobs resulting in the count of two.


Fig. 12 Three subjects are working in the laboratory. On the left are the original images, and on the right are the outputs. The three colors signify the blobs resulting in the count of three.


Table 5 Performance on the three laboratory datasets as captured by confusion matrices that compare automated counts with manually generated ground-truth counts.

(a) ENB 218 (b) ENB 221 (c) ENB 348

Table 6 Variation of the major parameters used in the people counting algorithm.

AR coefficient C000 | Entropy threshold (Ht) | AR window
0.5 | 0.3 | 7x7
0.5 | 0.3 | 5x5
0.5 | 0.3 | 3x3
0.5 | 0.25 | 7x7
0.4 | 0.3 | 7x7
0.6 | 0.3 | 7x7
0.6 | 0.3 | 5x5
0.4 | 0.3 | 5x5
0.4 | 0.3 | 3x3

On the three laboratory sequences, we varied these parameter values as listed in Table 6.

Figure 13 plots the different detection rates as the parameters are varied, bunched together for zero, one, and two subjects for each laboratory. The correct detection of an empty room ranges from 81.2% to 87.5% for ENB218, 75.7% to 86.4% for ENB221, and 89.6% to 94.7% for ENB348. The correct detection of one subject in the room ranges from 74% to 88% for ENB218, 61% to 81% for ENB221, and 87% to 94% for ENB348. The correct detection of two subjects in the room ranges from 79.1% to 85.3% for ENB218, 55.4% to 79.5% for ENB221, and 96.7% to 100% for ENB348.

The best detection rates for ENB218 are 88.4% for 1 subject and 84.4% for 2 subjects, obtained with the 5x5 window and a current-frame coefficient of 0.4. The best detection rates for ENB221 are 75.5% for 1 subject and 79.5% for 2 subjects, obtained with the 5x5 window and a current-frame coefficient of 0.5. The best detection rates for ENB348 are 94.8% for 1 subject and 100% for 2 subjects, obtained with the 3x3 window and a current-frame coefficient of 0.4.


(a) ENB 218 (b) ENB 221 (c) ENB 348

Fig. 13 Effects of the variation of parameter values on the correct detection rates for different numbers of persons in the ground truth.

In the ENB348 dataset we have only 62 frames with 2 subjects; hence the 100% detection rate might be lower if more frames with 2 subjects were included.

6 Conclusion

We presented a framework for counting mostly static people in very sparsely sampled video (4 frames per minute), spanning days and difficult illumination variations. This differs from most of the work that has been done in people counting, which has focused primarily on pedestrian counting or on counting people indoors under static illumination conditions. In the past, the majority of the work in people counting has used overhead cameras, but in our case the cameras are placed horizontally.

The issue of people counting was handled both at the pixel level, using photometry-based normalization, and at the feature level, by exploiting the spatio-temporal coherence present in the image sequence. The grouping was conducted over the XYT volume of data to exploit the temporal coherence present in motion. We worked with two types of indoor settings: home and laboratory. The datasets sometimes had drastically varying illumination conditions, under which the algorithm presented in this chapter generated the right counts. The results are very encouraging. The algorithm is robust with respect to illumination changes; counts can be calculated under severe illumination conditions. People can be counted even if they are partially visible. Counts are generated for any moving object in the scene; the approach does not try to distinguish between humans and non-humans. Counting errors are concentrated around frames with large motion events, such as a person moving out of the scene.

References

1. Beymer, D.J.: Person counting using stereo. In: Workshop on Human Motion, pp. 127–134 (2000)

2. Boyer, K., Sarkar, S.: Perceptual Organization for Artificial Vision Systems. Kluwer (March 2000)

3. Celik, H., Hanjalic, A., Hendriks, E.: Towards a robust solution to people counting. In: IEEE International Conference on Image Processing, pp. 2401–2404 (October 2006)

4. Conrad, G., Johnsonbaugh, R.: A real-time people counter. In: ACM Symposium on Applied Computing, pp. 20–24. ACM, New York (1994)

5. Fehr, D., Sivalingam, R., Morellas, V., Papanikolopoulos, N., Lotfallah, O., Park, Y.: Counting people in groups. In: IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 152–157 (September 2009)

6. Haralick, R.M., Shapiro, L.G.: Computer and Robot Vision. Addison-Wesley Longman Publishing Co., Inc. (1992)

7. Haritaoglu, I., Flickner, M.: Attentive billboards. In: Conference on Image Analysis and Processing, pp. 162–167 (2001)

8. Hou, Y.-L., Pang, G.: People counting and human detection in a challenging situation. IEEE Transactions on Systems, Man and Cybernetics, Part A 41(1), 24–33 (2011)

9. Huang, D., Chow, T.W.S., Chau, W.: Neural network based system for counting people. In: Industrial Electronics Society, pp. 2197–2201 (2002)

10. Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Detection and location of people in video images using adaptive fusion of color and edge information. In: International Conference on Pattern Recognition, pp. 627–630 (2000)

11. Kettnaker, V., Zabih, R.: Counting people from multiple cameras. In: IEEE International Conference on Multimedia Computing and Systems, pp. 267–271 (1999)

12. Kim, J., Choi, K., Choi, B., Lee, J., Ko, S.: Real-time system for counting the number of passing people using a single camera, pp. 466–473 (2003)

13. Kuno, Y., Watanabe, T., Shimosakoda, Y., Nakagawa, S.: Automated detection of human for visual surveillance system. In: International Conference on Pattern Recognition, p. C92.2 (1996)

14. Lee, M.: Detecting people in cluttered indoor scenes. In: Computer Vision and Pattern Recognition, pp. 804–809 (2000)

15. Mohan, A., Papageorgiou, C.P., Poggio, T.: Example-based object detection in images by components. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(4), 349–361 (2001)

16. Ramanan, D., Forsyth, D., Zisserman, A.: Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1), 65–81 (2007)

17. Ramoser, H., Schlogl, T., Beleznai, C., Winter, M., Bischof, H.: Shape-based detection of humans for video surveillance applications. In: International Conference on Image Processing, pp. III 1013–III 1016 (2003)

18. Rossi, M., Bozzoli, A.: Tracking and counting moving people. In: International Conference on Image Processing, pp. 212–216 (1994)

19. Schofield, A., Mehta, P., Stonham, T.: A system for counting people in video images using neural networks to identify the background scene. In: International Conference on Pattern Recognition, vol. 29(8), pp. 1421–1428 (1996)

20. Segen, J., Pingali, G.: A camera-based system for tracking people in real time. In: International Conference on Pattern Recognition, pp. 63–67 (1996)

21. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888–905 (2000)

22. Stillman, S., Tanawongsuwan, R., Essa, I.: Tracking multiple people with multiple cameras. In: Workshop on Perceptual User Interfaces (1998)

23. Wong, A., You, M.: Entropy and distance of random graphs with application to structural pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 7(5), 599–609 (1985)

24. Wu, Z., Leahy, R.: An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 15(11), 1101–1113 (1993)

25. Yang, D.B., Gonzalez-Banos, H.H.: Counting people in crowds with a real-time network of simple image sensors. In: International Conference on Computer Vision, pp. 122–129 (2003)

26. Yoshida, D., Terada, K., Oe, S., Yamaguchi, J.: A method of counting the passing people by using the stereo images. In: International Conference on Image Processing, p. 26AP3 (1999)

27. Yu, S., Chen, X., Sun, W., Xie, D.: A robust method for detecting and counting people. In: International Conference on Audio, Language and Image Processing, pp. 1545–1549 (July 2008)

28. Zhao, T., Nevatia, R.: Tracking multiple humans in complex situations. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(9), 1208–1221 (2004)

29. Zhao, X., Delleandrea, E., Chen, L.: A people counting system based on face detection and tracking in a video. In: IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 67–72 (September 2009)