
Multimed Tools Appl (2011) 51:913–933
DOI 10.1007/s11042-009-0423-4

Real-time multiple people tracking for automatic group-behavior evaluation in delivery simulation training

Jungong Han · Peter H. N. de With

Published online: 11 November 2009
© The Author(s) 2009. This article is published with open access at Springerlink.com

Abstract This paper aims at generating an automated way to evaluate the team behavior of trainees in a delivery simulation course using video-processing techniques, with emphasis on multiple people tracking. The paper is composed of two interacting, but clearly separated stages: moving-people segmentation and multiple-people tracking. At the people-segmentation stage, the combination of the Gaussian Mixture Model (GMM) and the Dynamic Markov Random Field (DMRF) technique helps to extract the foreground pixels. For a better extraction of the human silhouettes, the energy function of the DMRF is extended with texture information. At the multiple-people tracking stage, we concentrate on solving the human-occlusion problem caused by interacting persons, based on silhouette data and a non-linear regression model. Our model effectively transforms the problem of locating persons during occlusion into finding the local maximum points on a smooth curve, so that persons in partial or complete occlusion can still be precisely captured. We have compared our algorithm with two other popular tracking algorithms: mean-shift and particle-filter. Experimental results reveal that the correctness of our method is much higher than that of the mean-shift algorithm and slightly lower than that of a particle filter, however with the major benefit of being a factor of 10–15 faster in computation.

Keywords Multiple people tracking · MRF · Automatic assessment · Group behavior analysis · Real time

J. Han (B) · P. H. N. de With
Eindhoven University of Technology, Eindhoven, The Netherlands
e-mail: [email protected]

P. H. N. de WithCycloMedia Technology, Waardenburg, The Netherlands


1 Introduction

During the last 10 to 15 years, simulation-based training has emerged as a strategy that can be used to promote patient safety in maternity and newborn settings. After being exposed to both low- and high-risk scenarios in a realistic setting, trainees are better able to make nursing care decisions. Feedback from their instructors allows them to learn from their mistakes and further develop their clinical reasoning skills, ultimately leading to safer delivery of care. Performance assessment of trainees is an important and essential step in the entire simulation-training procedure. Basically, there are two types of performance assessment: on-line assessment and off-line assessment. In on-line assessment, the instructor gives remarks immediately upon noticing mistakes, while off-line assessment provides feedback after analyzing the videotape of the training course. Currently, both assessments are conducted manually, which becomes inefficient as the number of trainees and training courses increases. Consequently, there is a strong demand for an automatic performance-evaluation system for delivery simulation training. To this end, the motion and activity of trainees need to be detected and analyzed.

Video-analysis systems have proven their ability for human-behavior analysis [15], and such systems have been widely applied in the healthcare domain in recent years. Koile et al. [14] use a computer-vision system to monitor the location of a person in a room in order to determine which activities are being completed, and with which objects a user is interacting. In [21], a prototype system is designed to monitor activities of elderly persons at home to assist independent living. In this system, the authors construct an advanced silhouette-extraction, human-detection and tracking algorithm for indoor environments, and an adaptive learning method is used to estimate the physical location and moving speed of a person from a single camera view without calibration. Our previous work [11] also aims at monitoring the daily activity of elderly persons, but focuses on detecting fall incidents. The new idea proposed in that paper is to detect a fall event by analyzing the changes of the main axis of the human body. The work reported in [3] aims at detecting social interactions of the elderly in a skilled nursing facility using audio/visual records, because the authors believe that changes in interaction patterns can reflect changes in the mental or physical status of a patient. Although most of the above systems deal with human-behavior analysis in the healthcare domain, directly mapping their techniques to automatic evaluation of delivery simulation may not succeed, due to two problems. First, most of the above systems only need to track one or two people in the scene. However, in our scenario it is required to track small groups of people (up to 4 persons), who are always moving together or interacting with each other. In these cases, individual people are not visually isolated, but are partially or totally occluded by other people. Second, the computational cost of most of the above systems is far from satisfactory, so that they cannot be directly used in our application, where a (near) real-time system is expected.

This paper proposes a real-time system to detect, track and analyze a group of people, which is intended to be the basis of an automatic performance assessment of trainees in a delivery simulation course. Our system is original in three aspects:

– Unlike existing object-segmentation algorithms that only consider the spatial and temporal information of the object, our spatial-texture MRF (StMRF) model seamlessly incorporates texture information as a pixel feature into the MRF framework. Additionally, we design a dynamic texture detector to indicate the edges caused by foreground objects, in which the contrasts of both the original image and the background image are employed. With our method, the segmented human silhouettes are more complete and accurate than the silhouettes extracted by conventional algorithms, such as the spatial MRF (SMRF) model and the spatial-temporal MRF (STMRF) model.

– We propose an occlusion-handling approach that models the horizontal projection histograms of the human silhouettes using a nonlinear regression algorithm. Our model effectively transforms locating persons during occlusion into finding the relative maximum points on a smooth curve (function), so that persons in partial or complete occlusion can still be precisely captured. We also improve the template-matching algorithm used in our previous work [8]: the key idea is to adaptively assign different weights to non-overlapped and overlapped regions when generating the appearance model for an object during occlusion. This treatment improves the accuracy of template matching in object tracking, especially for the occlusion case.

– A camera-calibration algorithm is embedded into the system, so that we can compute the real speed of each person from his/her physical displacement in two successive frames. This visual feature in the real-world domain is invariant to the camera position. Different from existing camera-calibration techniques applied in human-behavior analysis systems, which manually label the initial feature points, such as [2], our algorithm is robust and fully automatic.

2 Prior work in moving object segmentation and tracking

2.1 Moving object segmentation

Frame differencing [12] is a straightforward approach, which thresholds the difference between two frames; large changes are considered to be the foreground. Another approach is to build a representation of the background that is compared against new images. The pixel-wise median filter [5] is a widely used background-modeling technique, where the background is defined to be the median at each pixel location over all the frames in the buffer. Instead of using the median value of a group of pixels, a more reasonable assumption is that the pixel value follows a Gaussian distribution in the temporal direction, and a model is used to compute the likelihood of background and foreground for a particular pixel. When a single Gaussian cannot adequately account for the variance, a Mixture of Gaussians [22] is used to improve the accuracy of the estimation. An alternative model is proposed by Elgammal et al. [6], which estimates the probability of observing pixel intensity values based on a sample of intensity values of each pixel. The model adapts quickly to changes in the scene, thereby enabling sensitive detection of moving targets. A completely different idea, proposed by Oliver et al. [17], investigates global statistics rather than local ones. Similar to eigenfaces, an eigenbackground is created to capture the dominant variability of the background. In [1, 18], the authors employ a Markov Random Field (MRF) model to extract the moving object, in which the idea is to determine the segmentation mask as a Maximum A-Posteriori (MAP) approximation. In this MRF model, both spatial and temporal coherency of the object are considered, so that the state of a pixel is also affected by its neighbors. Generally speaking, the MRF model strikes a good balance between the complexity and the accuracy of the algorithm.

2.2 Moving object tracking

We can classify existing human tracking systems according to their efficiency (real-time or non-real-time) and their functionality (tracking a single person or multiple persons, handling occlusion). Pfinder [19] solves the problem of real-time tracking of people in complex scenes in which there is a single, unoccluded person and a fixed camera. The system called W4 [9] is a real-time visual surveillance system for detecting and tracking people and their body parts, and monitoring their activities in an outdoor environment. However, this system has a limited capability to handle occlusions. Mean-shift [4] is a real-time non-parametric technique that searches along density gradients to find the peak of probability distributions. This approach is computationally effective, but it is susceptible to converging to a local maximum, and occlusion remains problematic. The particle-filter technique [16] performs a random search guided by a stochastic motion model to obtain an estimate of the posterior distribution describing the object's configuration. This method is robust in the sense that it can handle clutter in the background, and it recovers from temporal loss of tracking. Unfortunately, the approach has high computational demands, which is the bottleneck for applying it in real-time systems. In layer-tracking approaches, such as [20], shape, motion and appearance models of the different layers corresponding to foreground objects and the background are estimated along with the ordering of the layers. Foreground layers behind background layers are labeled as occlusion. The drawbacks of these methods are the high computational cost and the difficulty of maintaining layer visibility. The work of [10] is highly related to our research; it proposes a silhouette-based occlusion-handling algorithm. Both the shape of the silhouette boundary and the projection histograms of the silhouette are used to locate the people during the occlusion. This method assumes that the complete body silhouette of each person is detected and that the size of the people in the picture is sufficiently large (at least 75 × 50 pixels). Unfortunately, such assumptions do not always hold in real applications. Summarizing, none of the approaches simultaneously solves the two primary problems mentioned in the introduction.

3 System overview

Figure 1 shows the flowchart of the proposed system, which consists of five components.

1. Object segmentation. We perform an adaptive background subtraction [22] to produce the initial masks for the moving persons. These masks are input to an MRF-based segmentation algorithm that incorporates spatial coherence for robust foreground extraction. Here, the dynamic concept [13] helps to realize a real-time algorithm.


Fig. 1 Human detection, tracking and analysis system

2. Camera calibration. This component finds the relationship between the image coordinate system and the world coordinate system. By doing so, the visual features detected in the picture, such as the positions of moving objects, can be transferred to the real-world domain. These 3D visual features significantly facilitate the semantic-level analysis due to their invariance to the camera location.

3. Occlusion detection. This component distinguishes between the occlusion and the non-occlusion case, since we use different methods to treat these two cases. In our method, if the distances in both the x and y directions between several persons are smaller than a predefined threshold in the current frame, we consider that occlusion may happen in the next frame. Otherwise, we keep the object descriptions separated. Based on this scheme, we can even deduce the number of persons involved in the occlusion.

4. Object tracking. The mean-shift algorithm [4] is employed to track moving persons in the non-occlusion case, and our nonlinear regression algorithm is used to deal with occlusion.

5. Trajectory generation. We base the trajectory generation on our previous work [7], which adopts a Double Exponential Smoothing (DES) operator to model the positions of people collected from the tracking component. Using this method, we obtain more accurate and smoother moving trajectories; a sketch is given below.
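To make component 5 concrete, the following is a minimal sketch of a Holt-style double exponential smoother applied to tracked positions; the exact DES formulation of [7] may differ, and the gains alpha and gamma are illustrative assumptions.

```python
# Double exponential smoothing of a 2-D track: a level (position) and a
# trend (velocity) estimate are updated per frame; gains are illustrative.
import numpy as np

def des_smooth(positions, alpha=0.4, gamma=0.3):
    """Smooth a sequence of (x, y) ground-plane positions; returns the smoothed track."""
    positions = np.asarray(positions, dtype=float)
    level, trend = positions[0].copy(), np.zeros(2)
    track = [level.copy()]
    for p in positions[1:]:
        prev = level
        level = alpha * p + (1 - alpha) * (level + trend)     # smoothed position
        trend = gamma * (level - prev) + (1 - gamma) * trend  # smoothed velocity
        track.append(level.copy())
    return np.array(track)
```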

In this paper, the emphasis is on our contributions: camera calibration, texture-controlled object segmentation and human-occlusion handling. More details about the other algorithms, e.g., adaptive background subtraction, mean-shift tracking and DES-based trajectory generation, can be found in [4, 7, 22].

4 Camera calibration based on homography mapping

The task of the camera calibration is to provide a geometric transformation that maps points in the image domain to real-world coordinates. In our system, we base the human-behavior analysis on the trajectory of a person on the ground, so that the height information of the human is not required. Since both the ground and the displayed image are planar, the mapping between them is a homography, which can be written as a 3 × 3 transformation matrix H, transforming a point p = (x, y, w)^T in image coordinates to the real-world coordinates p' = (x', y', w')^T with p = Hp', which is equivalent to

\begin{pmatrix} x \\ y \\ w \end{pmatrix} =
\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}
\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix}.    (1)

The transformation matrix H can be calculated from four points whose positions are known both in the real world and in the image. In our previous work [7], we developed an automatic algorithm to establish the homography mapping for analyzing tennis video, where the court lines and their intersection points are identified in the picture. Such lines and points are related to the lines and points of a standard tennis court. By finding the correspondences, the homography mapping described in (1) can be established. In our current system, we still want to apply this method, though there are no court lines on the ground. Our basic idea is to manually put four white lines forming a rectangle on the ground, whose function is the same as that of the court lines in a tennis game. We then measure the length of each line in the real world, thereby generating a physical model in the real-world domain. This setup is accepted and will be adopted in the new training center of our hospital partner. The complete procedure comprises three steps, whose basic ideas are described below:

1. White Pixel Detection. This step identifies the pixels that belong to the white lines. Since the lines that we put on the ground are white, this step is essentially a white-pixel detector. The mandatory feature of this step is that white pixels that do not belong to the white lines (white table, white wall) should not be marked.

2. Line Parameter Estimation. Initialized with the detected white pixels, line parameters are extracted using a RANSAC-based line detector, which hypothesizes a line using two randomly selected points. If the hypothesis is verified, all points along the line are removed and the algorithm is repeated to extract the remaining dominant lines in the image.

3. Model Fitting. After a set of lines has been extracted from the image, we need to know which line in the image corresponds to which line in the physical model. It may also be the case that lines are detected other than those present in the model, or that some of the lines were not detected. This assignment is obtained with a combinatorial optimization, in which different configurations of lines are tried and evaluated. Finally, the configuration with the minimal cost is selected as the best one. Once these correspondences are known, the homography between the real-world coordinates and the image coordinates can be computed (Fig. 2).

Our camera calibration is only executed on the first several frames, because the camera in our scenario is supposed to be fixed. For more details concerning the camera-calibration algorithm, we refer to an earlier publication [7].
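To make the calibration step concrete, here is a minimal sketch, assuming OpenCV/NumPy, of computing the homography from the four detected rectangle corners and using it to obtain the real-world speed mentioned in Section 3. All point coordinates, the rectangle size and the frame rate are made-up values standing in for the detected and measured quantities.

```python
import cv2
import numpy as np

# Corners of the white-line rectangle in the image (pixels, illustrative)...
img_pts = np.float32([[120, 400], [520, 390], [560, 230], [150, 225]])
# ... and the same corners in the real world (meters, illustrative).
world_pts = np.float32([[0, 0], [4, 0], [4, 3], [0, 3]])

# This H maps image coordinates to ground-plane coordinates, i.e. the
# inverse of the H in (1), which maps world to image.
H = cv2.getPerspectiveTransform(img_pts, world_pts)

def to_world(p):
    """Map an image point (x, y) to ground-plane coordinates (x', y')."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return np.array([x / w, y / w])

# Real-world speed from the displacement of a person's foot point between
# two successive frames (assumed frame period dt in seconds).
dt = 1.0 / 25.0
speed = np.linalg.norm(to_world((300, 350)) - to_world((305, 350))) / dt
print(f"speed: {speed:.2f} m/s")
```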


Fig. 2 Illustration of camera calibration, which consists of white-point detection, line detection, model fitting and parameter computation

5 Texture-controlled object segmentation

5.1 Notation and basic model

Object segmentation can be treated as a Markovian-based Maximum A-Posteriori (MAP) approximation. Given the observation d and the configuration of labels f, the posterior probability of f is

P(f \mid d) = \frac{P(d \mid f)\, P(f)}{P(d)}.    (2)

Maximizing P(f|d) equals maximizing the product of the class-conditional probability P(d|f) and the prior probability P(f). Normally, we assume that P(d|f) and P(f) take the form of a Gibbs distribution; taking the negative logarithm of the posterior then turns the MAP problem into the minimization of a Gibbs energy E(f|d). For moving-object segmentation, only the foreground label F and the background label B are considered. The energy E(f|d) is composed of a node energy E_n and a smoothness energy E_s [1], such that

E(f \mid d) = E_n + E_s = \sum_i V_i(f, d) + \sum_{(i,j) \in C_i} V_{i,j}(f, d).    (3)

The node energy is simply the sum of a set of per-pixel node costs V_i(f, d), and the smoothness energy E_s is the sum of spatially varying horizontal and vertical nearest-neighbor smoothness costs. Here, C_i is the clique of location i. A key issue in the MRF model is the definition of the energy function. Two definitions are used in segmentation algorithms, each briefly explained below.


– Spatial MRF model (SMRF). SMRF only considers the spatial coherency of the object; the smoothness energy E_s in (3) is developed on a spatial network, which consists of nodes connected by edges that indicate conditional dependency.

– Spatial-Temporal MRF model (STMRF). In [18], a temporal continuity term is added to the energy function, thereby allowing the coherency of the segmentation to be maintained through time. Basically, if a region has been classified as foreground several times in the past, it is assumed to be classified as foreground again in the current frame. According to our experiments, the gain of this model over the SMRF model is very limited, because the above assumption may not hold in some situations, especially for nonrigid objects, such as the humans in our scenario.

5.2 Proposed texture-controlled MRF model

MRF-based algorithms achieve better object-segmentation performance by taking the object coherency in both spatial and temporal dimensions into account, in contrast to methods that consider the pixels independently [7, 22]. In the presently existing algorithms, color/intensity and the relation of pixel locations are important features for object representation, though they are insufficient to represent all types of objects. For this reason, in many cases such approaches fundamentally fail to precisely approximate the boundary of an object, as they do not directly consider texture, which is one of the most important perceptual attributes of any object. To address this problem, we propose a new MRF-based object-segmentation algorithm that seamlessly incorporates texture information as a pixel feature in the MRF framework. Our new system sequentially consists of four stages, which are addressed below.

5.2.1 GMM-based initialization

In our system, an adaptive Gaussian Mixture Model (GMM) for background subtraction [22] is employed to produce the initial foreground masks. This algorithm maintains a Gaussian-mixture probability function for each pixel separately, where the parameters of each Gaussian distribution are updated recursively. The algorithm also involves a shadow-detection step to remove shadow pixels.
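As an illustration, OpenCV's MOG2 background subtractor implements the adaptive GMM of [22], including shadow detection; the file name and parameter values below are illustrative assumptions.

```python
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

cap = cv2.VideoCapture("training_course.avi")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)  # 255 = foreground, 127 = shadow, 0 = background
    mask[mask == 127] = 0           # drop shadow pixels from the initial mask
cap.release()
```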

5.2.2 Dynamic edge detection

Edge information is an indication of an object boundary. Normally, there are two kinds of edges: static edges and dynamic edges. The former refers to edges in the background part of the image, while dynamic edges are caused by the moving object. Obviously, we are only interested in edges caused by the moving object, so the dynamic edges have to be detected. Since the background is known, a straightforward idea is to subtract the edges of the background image from the edges of the current image. However, a method based on this idea may accidentally remove dynamic edge pixels that coincide with static edges. This situation happens when dynamic and static edges overlap somewhere due to the motion of the object. To address this problem, we define a dynamic-edge pixel detector, which attenuates the edges in the background while preserving the edges across foreground/background boundaries, by considering the background model of the image as a reference. This detector is formulated as:

D_{\mathrm{texture}}(i) = \mathrm{tex}(i) \cdot \frac{1}{1 + \mathrm{tex}_B(i) \cdot \exp\left( -\frac{|I(i) - I_B(i)|}{c} \right)},    (4)

where i is the index of the pixel, and tex(i) is the texture value of pixel i in the current image, obtained by applying the Sobel operator. Similarly, tex_B(i) refers to the texture value of pixel i in the background image. Parameters I(i) and I_B(i) represent the intensity value of pixel i in the current image and in the background image, respectively. If |I(i) − I_B(i)| → 0, pixel i has a high probability of being a background pixel, and exp(−|I(i) − I_B(i)|/c) → 1. Otherwise, it may belong to an edge caused by the foreground boundary, and exp(−|I(i) − I_B(i)|/c) → 0. The parameter c is a constant, which is set to 50 in all our experiments. Figure 3 compares the simple subtraction method with our method. Obviously, most dynamic edges caused by foreground objects are well preserved, while most static edges are greatly attenuated by our method. In contrast, though the simple subtraction method removes the static edges successfully, it unfortunately also deletes some dynamic edges.
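A minimal NumPy/OpenCV sketch of the detector in (4), assuming Sobel gradient magnitudes stand in for tex(i) and tex_B(i):

```python
import cv2
import numpy as np

def dynamic_edges(frame_gray, bg_gray, c=50.0):
    """Attenuate static (background) edges, keep edges caused by the foreground."""
    def sobel_mag(img):
        gx = cv2.Sobel(img, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(img, cv2.CV_32F, 0, 1)
        return cv2.magnitude(gx, gy)

    tex = sobel_mag(frame_gray)    # texture of the current image
    tex_b = sobel_mag(bg_gray)     # texture of the background image
    diff = np.abs(frame_gray.astype(np.float32) - bg_gray.astype(np.float32))
    # Small |I - I_B| -> exp(.) near 1 -> background edges are damped;
    # large |I - I_B| -> exp(.) near 0 -> foreground edges pass through.
    return tex / (1.0 + tex_b * np.exp(-diff / c))
```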

5.2.3 Spatial-textural MRF model (StMRF)

We have found that the extracted object has no sharp/clear boundary (it is either under-estimated or over-estimated) if we employ the MRF models introduced above. The major reason is that they treat a boundary pixel as a normal pixel, whose energy is always affected by its neighboring pixels in the MRF framework. To solve this problem, we involve texture information in the design of the energy function, since texture is a good clue for detecting an object boundary. More specifically, if a pixel is labeled as a boundary pixel by our texture analyzer, the relational energy between this pixel and its neighborhood, covered by an edge energy term, is assumed to be smaller. This is because we assume that a boundary pixel relies more on its node energy and experiences a reduced influence from its neighborhood. Conversely, we can reduce the influence of the boundary pixel on its neighborhood as well, thereby avoiding over-estimation of the boundary.

Fig. 3 Dynamic edge detector. Left: original image. Middle: dynamic edges obtained by subtracting the edges in the background image from the edges of the current image. Right: our method


Mathematically, the energy function is defined as:

E(f \mid d) = E_n^s + E_n^t + E_s^s + E_s^t,    (5)

where E_n^s and E_n^t refer to the node energy in the spatial dimension and the node energy taking texture information into account, respectively. Since the temporal continuity term does not bring a large gain, we do not consider it in our energy-function design. The term E_n^s is defined as:

E_n^s = \sum_i V_i^s(f, d), \quad \text{and} \quad V_i^s(f, d) = \delta(f_i, F).    (6)

Here, \delta(\cdot, \cdot) denotes the Kronecker delta function. The new term E_n^t refers to the node energy based on texture information, which is specified as:

E_n^t = \sum_i V_i^t(f, d), \quad \text{and} \quad V_i^t(f, d) = D_{\mathrm{texture}}(i)/255.    (7)

The smoothness energy E_s^s in (5) is developed on a spatial network, which consists of nodes connected by edges that indicate conditional dependency. Normally, E_s^s is modeled by a Generalized Potts distribution, which is specified by:

E_s^s = \sum_{(i,j) \in C_i} V_{i,j}^s(f, d), \quad \text{and} \quad V_{i,j}^s(f, d) = \delta(f_i, f_j).    (8)

The term E_s^t in (5) is the smoothness energy based on texture information, defined by

E_s^t = \sum_{(i,j) \in C_i} V_{i,j}^t(f, d), \quad \text{and} \quad V_{i,j}^t(f, d) = -0.5 \times \delta(I_i, I_j) \times D_{\mathrm{texture}}(i)/255.    (9)
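As a worked example, the sketch below evaluates the four terms (6)-(9) literally, as written, for a candidate binary labeling on a 4-connected grid; it is an illustration of the energy definition only, not of the optimization step that follows.

```python
import numpy as np

def stmrf_energy(f, intensity, d_texture):
    """Evaluate (5) = (6) + (7) + (8) + (9) for a binary labeling f
    (1 = foreground F) using right/down neighbor cliques."""
    d = d_texture / 255.0
    e_node_s = float(np.sum(f == 1))   # (6): sum of delta(f_i, F)
    e_node_t = float(np.sum(d))        # (7): texture node energy

    e_smooth_s = 0.0
    e_smooth_t = 0.0
    # Horizontal cliques (pixel, right neighbor) and vertical cliques
    # (pixel, lower neighbor), built by array slicing.
    for fi, fj, Ii, Ij, di in (
        (f[:, :-1], f[:, 1:], intensity[:, :-1], intensity[:, 1:], d[:, :-1]),
        (f[:-1, :], f[1:, :], intensity[:-1, :], intensity[1:, :], d[:-1, :]),
    ):
        e_smooth_s += float(np.sum(fi == fj))                # (8): Potts term
        e_smooth_t += float(np.sum(-0.5 * (Ii == Ij) * di))  # (9): texture (edge)
                                                             # pixels loosen smoothness
    return e_node_s + e_node_t + e_smooth_s + e_smooth_t
```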

5.2.4 Energy function minimization by dynamic graph cuts

Dynamic graph cuts [13] can solve any energy function of this form. Consider two MRFs M_a and M_b, whose corresponding energy functions E_a and E_b differ by a few terms. Suppose the MAP solution of M_a is available by solving the max-flow problem on the graph G_a with energy E_a, and we now want to find the solution of M_b. The core idea of this dynamic concept is that we only compute the MAP solution for the regions that differ between G_a and G_b, on the residual graph, and reuse the max-flow solution of G_a for the unchanged regions. Since G_b differs only slightly from G_a, the whole MRF optimization procedure is significantly accelerated, thereby resulting in a real-time algorithm.
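As an illustration of the underlying graph-cut step, the sketch below minimizes a node-plus-smoothness energy with a single max-flow computation, assuming the PyMaxflow library; the flow reuse between frames that makes [13] dynamic is omitted, and the cost choices (GMM-based unaries, texture-modulated pairwise weights) are our own illustrative assumptions, not the paper's exact terms.

```python
import numpy as np
import maxflow  # PyMaxflow

def segment(p_fg, d_texture, lam=2.0):
    """p_fg: per-pixel foreground probability map; returns a binary mask."""
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes(p_fg.shape)
    # Edge-aware smoothness: weaker neighbor ties across texture (boundary)
    # pixels, mirroring the StMRF idea of (9).
    weights = lam * (1.0 - d_texture / 255.0)
    g.add_grid_edges(nodes, weights=weights, symmetric=True)
    eps = 1e-6
    g.add_grid_tedges(nodes, -np.log(p_fg + eps), -np.log(1.0 - p_fg + eps))
    g.maxflow()
    return g.get_grid_segments(nodes)  # True = foreground side under this convention
```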

6 Occlusion handling using nonlinear regression

In this paper, we also employ silhouette-related image features to locate people during occlusion. However, our technique differs from the method in [10] in two aspects: (1) we do not use natural vertices at the silhouette boundary of the people, since we have found that this image feature is too sensitive to noise and to the shape of the silhouettes; (2) unlike the method of [10], which directly treats the projection histograms of the silhouettes, we first model the contour of the silhouette by a nonlinear regression algorithm. Afterwards, we search for relative maximum points on a smooth curve that is automatically generated by the regression technique. In our model, each relative maximum point corresponds to one visible person in the occlusion. Our method is robust against cases where the silhouette of a person is fragmented, e.g., when only a part of the head is segmented.

6.1 Locating the people during the occlusion

Once the segmented binary map of the humans has been acquired, we can obtain the contour of the upper part of the human body using:

C(x) = \max \{\, y \mid (x, y) \in Q \,\},    (10)

where Q is the collection of foreground pixels (x, y) on the binary map. After obtaining the contour map, the next step is to find the relative maximum points on it. Instead of directly searching the map, we search for relative maximum points on a smooth curve that is automatically generated by the regression technique. Therefore, our method is robust against incomplete body silhouettes and noise. Figure 4 illustrates the basic components of our complete occlusion-handling scheme, where the two major components are people locating and people recognition.
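A small sketch of (10), assuming NumPy and an image row index that increases downwards, so that the upper contour is the first foreground row per column (equivalently the maximum if y points upwards, as in (10)):

```python
import numpy as np

def upper_contour(mask):
    """mask: binary foreground map; returns C(x) as height above the image bottom."""
    h = mask.shape[0]
    ys = np.argmax(mask > 0, axis=0)      # first foreground row per column
    has_fg = (mask > 0).any(axis=0)
    return np.where(has_fg, h - ys, 0)    # 0 where the column has no foreground
```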

Given the contour C(x), we intend to find a smooth curve (function) that models it. This problem can be solved by minimizing \chi^2, the sum of the squared residuals over m points for the model F(c, x), hence

\chi^2 = \sum_{i=1}^{m} \left( C(x_i) - F(c, x_i) \right)^2.    (11)

The parameters of the model are collected in the vector c = {c_0, c_1, ...}. The model F(c, x) used in this algorithm is a Gaussian model, whose number of terms equals the number of persons in the occlusion. The number of persons in the occlusion can easily be determined by the occlusion-detection component. For instance, if we know that there are n persons involved in the occlusion, the applied Gaussian model has n terms and is written as:

F(c, x) = \sum_{j=1}^{n} c_{j1} \exp\left( -\frac{(x - c_{j2})^2}{2 c_{j3}^2} \right).    (12)

Fig. 4 Illustration of the complete occlusion-handling scheme, where the two major components are people locating and people recognizing


The Levenberg-Marquardt optimization algorithm solves the minimization problem mentioned above, returning the best-fit vector c and the covariance matrix of its individual components. Once we have obtained all the parameters of F(c, x), the next step is to find the relative maximum (peak) points on this curve, each corresponding visually to the head of one person. Our model is fully automatic, e.g., it outputs two peak points when two persons are partially occluded, but only one peak point when they are completely occluded.
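A minimal sketch of this fitting step, assuming NumPy/SciPy: curve_fit performs the Levenberg-Marquardt optimization of (11) for the n-term model (12), and the initial guesses are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_sum(x, *c):
    """F(c, x) from (12); c holds (amplitude, mean, sigma) per person."""
    terms = np.reshape(c, (-1, 3))
    return sum(a * np.exp(-((x - mu) ** 2) / (2 * s ** 2)) for a, mu, s in terms)

def locate_people(contour, n_persons):
    x = np.arange(len(contour), dtype=float)
    # Spread the initial means over the blob width; amplitude and sigma are rough guesses.
    c0 = [[contour.max(), (j + 0.5) * len(contour) / n_persons, 10.0]
          for j in range(n_persons)]
    c, _ = curve_fit(gaussian_sum, x, contour, p0=np.ravel(c0), method="lm")
    # Each fitted mean c_{j2} approximates one relative maximum, i.e. one head position.
    return sorted(c[1::3])
```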

6.2 Recognizing the people during the occlusion

Apart from locating a person during occlusion, it is also required to recognize the person. People recognition is actually a procedure that finds the correspondences between detected candidates and known templates. In our previous work [8], we designed a template-matching scheme based on the mean-shift algorithm [4]. Assume that there are N templates for N persons, and that all the templates are generated and maintained by the mean-shift algorithm before the occlusion occurs. The probability of the feature {u_i}_{i=1...m} in the n-th template is T_n(u_i). Here, u_i represents the color-histogram distribution and i denotes the bin number of the histogram. Furthermore, we use the same method to model the appearance of the blobs obtained by the people-locating module, and represent them by W_j(u_i). The aim of template matching is to find the best match to the template. Mathematically, finding the best match means maximizing the correspondence between the templates and the blobs in the occlusion, so that we compute

C_j = \arg\max_n \, \rho\left( W_j(u_i), T_n(u_i) \right).    (13)

The term \rho(W_j(u_i), T_n(u_i)) is a metric that measures the match between the templates and the target candidate. For the metric \rho, the Bhattacharyya coefficient [4] is used because of its performance; it is defined by:

\rho\left( W_j(u_i), T_n(u_i) \right) = \sum_{i=1}^{m} \sqrt{W_j(u_i) \cdot T_n(u_i)}.    (14)

If we look at our color-distribution model more carefully, we find that each pixel within the blob contributes equally to the model. This is absolutely correct for the non-occlusion case. However, it is not optimal when there is occlusion among objects, because some pixels belonging to one object are then overlapped by pixels of another object. Apparently, if those overlapped pixels are still counted in the color model of this object, a large error is generated. For this reason, we need a specific color model for the occlusion case, which takes the object interaction into account. Our basic idea is to trust non-overlapped areas more, and to attenuate the contributions from overlapped areas when constructing the color-distribution model. More specifically, if a pixel belongs to a non-overlapped region, it is expected to have a larger weight in the template-matching procedure. Otherwise, we assign a small weight to it. This idea can be formulated as:

\rho(\cdot, \cdot) = \sum_{i=1}^{m} w_i \sqrt{W_j(u_i) \cdot T_n(u_i)}, \quad \text{and} \quad w_i = \frac{T_n(u_i)}{\sum_{j=1}^{N} T_j(u_i)}.    (15)


Fig. 5 Illustration of object interaction. We assign small weights to black areas

Here, w_i is a weight factor assigned to each pixel within the blob. If the pixel is not overlapped by others and its color does not appear in other objects, w_i → 1, which degenerates to (14). On the other hand, if the pixel is an overlapped pixel and its color appears in other objects, w_i → 0, which means this pixel contributes nothing to the matching metric. Our weight-based template-matching algorithm adequately considers the interactions between objects and adaptively assigns the weight to a pixel according to its probability of being overlapped. A geometrical explanation is illustrated in Fig. 5, in which three persons interact with each other. In this example, we assign small weights to pixels in the black areas, which can be computed by (15).
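A minimal NumPy sketch of the weighted matching of (13)-(15), assuming the templates and blob histograms are m-bin normalized color histograms:

```python
import numpy as np

def weighted_bhattacharyya(w_hist, templates, n):
    """rho(W_j, T_n) with per-bin weights w_i = T_n(u_i) / sum_j T_j(u_i), as in (15)."""
    t_sum = np.sum(templates, axis=0) + 1e-12  # denominator of (15)
    w = templates[n] / t_sum                   # near 1 if the color is unique to person n
    return np.sum(w * np.sqrt(w_hist * templates[n]))

def recognize(blobs, templates):
    """(13): assign each blob the template index with the highest weighted match."""
    templates = np.asarray(templates)
    return [int(np.argmax([weighted_bhattacharyya(b, templates, n)
                           for n in range(len(templates))]))
            for b in blobs]
```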

7 Experimental results

Our proposed system is implemented in C++ on the Windows XP platform, and the computation time is measured on a 3-GHz Pentium 4 computer with 1 GB RAM. We have tested the presented algorithms on a database containing 15 videos. These videos were selected based on suggestions from our hospital partner and represent most of the typical situations in a delivery training course. The length of each video varies from 2 to 25 min. Two test videos and their ground-truth data originate from public databases and are used to evaluate our object-segmentation and tracking algorithms. The remaining videos were captured with a TRV30E video camera from a leading manufacturer.

7.1 Evaluation of camera calibration algorithm

We have tested our camera-calibration algorithm using two video sequences, which were captured at our office (under slightly different lighting conditions and view angles). Our algorithm is correct for almost 100% of the sequences, and only fails on several image samples, where one of the white lines is largely occluded by moving persons (> 70% of its pixels are invisible). Fortunately, this false calibration can easily be detected by measuring the difference of the camera parameters between two consecutive frames, because the camera parameters in these two frames should be similar when the camera is fixed in the scene, which is indeed our case. Note that we only compute and refine the camera parameters using the first 10 frames of a sequence, since the camera in our application is supposed to be fixed. Figure 6 shows example pictures, in which the results of white-pixel detection, white-line detection and model fitting are illustrated. It is observed from the second example that our algorithm is robust against situations in which the moving person wears white clothing or


Fig. 6 Line detection-based camera calibration. Left: white-point detection. Middle: line detection. Right: model fitting

the white lines are partially occluded by moving persons. As we mentioned before, this camera-calibration algorithm reuses and simplifies the idea proposed in our previous work. For more extensive experimental results, we refer to our earlier publication [7].

7.2 Evaluation of StMRF-based object segmentation algorithm

Our proposed object-segmentation algorithm has been tested on two video sequences; the first sequence, with ground truth, was downloaded from VSSN'05 (foreground-detection competition) and the second one was captured in our office. Both sequences show an indoor environment with stable lighting conditions, similar to the scenario of a delivery simulation training course in the hospital. We have also compared our StMRF algorithm with the kernel-based algorithm [6], the GMM-based algorithm [22], and the SMRF-based algorithm. The results are reported in Fig. 7, where subfigure (a) gives several image samples of the different methods, showing the results achieved by the GMM algorithm, the SMRF model, and the StMRF model, respectively. It can be observed that the foreground segmented by the GMM algorithm includes noise, which is not part of the intended foreground objects. Though the SMRF model filters noisy areas out, the boundary of the moving person is considerably under- or over-estimated by this model. Conversely, our proposed StMRF model effectively reduces the influence of the noise, while keeping a sharp boundary at the object of interest (see the foot and head areas of the image). Subfigure (b) draws the ROC curves of the GMM algorithm, the kernel-based algorithm, the SMRF algorithm and our StMRF algorithm. Obviously, StMRF is much better than the first two methods and slightly better than the SMRF algorithm, which also reflects that

Fig. 7 Object segmentation results. a Image samples achieved by three algorithms. b ROC curves of four algorithms. c Efficiency comparison of three algorithms


our method is less sensitive to the thresholds used to determine the foreground pixels. For the ideal indoor scenario (fixed camera and stable lighting conditions), our method achieves > 97% accuracy. Finally, subfigure (c) compares the execution time of each algorithm tested on the second video (320×240), where we find that the GMM-based algorithm is the fastest. Both our algorithm and the kernel-based algorithm require an initialization step in the first frame. The kernel-based algorithm uses this initialization step to estimate the background model, while our algorithm uses it to initialize the graph-cut optimization. It can be seen that our method spends 120.4 ms on initialization and 45.1 ms per frame for the remaining frames. The graph in Fig. 7c indicates that our dynamic-concept-based system is close to a real-time system, with small fluctuations between successive frames.

7.3 Evaluation of regression-based occlusion handling algorithm

The proposed occlusion-handling algorithm has been tested on 11 video clips (each having more than 1,000 frames), of which 8 videos were captured at our office and include 2–3 persons, 2 videos were captured in the hospital during a real delivery training course, and 1 video was downloaded from a public database (PETS2004). In our dataset, we have divided the occlusion scenarios into two categories. The first category describes the situation in which persons walk towards each other and split immediately after the occlusion. In the second category, several persons join, and then walk circlewise after they meet. Obviously, the second one is more challenging, as occlusion occupies more than 70% of the frames.

We compared our system with the mean-shift algorithm [4], the particle-filter algorithm (200 particles per object) [16] and the algorithm proposed in our previous work [8], where we did not consider the interactions among objects when generating the appearance model for each object. Table 1 shows the system capability of treating the occlusion, in which NP abbreviates the number of persons. The ground-truth data were manually marked; the evaluation criterion is that at least 70% of the human body is included in the detection window. To sum up, the mean-shift algorithm achieves an average tracking accuracy of 76.4% during the occlusion events. The particle-filter algorithm achieves an average accuracy of 92.6%, our previous algorithm is correct for 88.7% of the frames when occlusion occurs, and our current algorithm is correct for 89.9% of the frames under the same conditions.

Table 1 System performance comparison

        Occlusion type   NP  MS     PF     Han et al. [8]  Our algorithm
Clip1   Walk towards     2   100%   93.9%  100%            100%
Clip2   Walk towards     2   60.8%  85.9%  84.4%           85.1%
Clip3   Walk towards     2   100%   100%   100%            100%
Clip4   Walk towards     2   86.7%  97.2%  94.3%           95.1%
Clip5   Walk circlewise  2   80.6%  99.5%  95.6%           96.1%
Clip6   Walk circlewise  2   54.8%  99.2%  79.6%           82.2%
Clip7   Walk circlewise  2   100%   100%   100%            100%
Clip8   Walk circlewise  3   90.1%  81.1%  90%             90.5%
Clip9   Walk circlewise  3   70.2%  85.5%  93.1%           94.8%
Clip10  Walk circlewise  3   40.3%  89.8%  71.3%           74.7%
Clip11  Walk circlewise  3   56.7%  85.4%  70.1%           73.8%


These results prove that our weight-based template-matching algorithm has better performance when tracking people during occlusion. Note that the most common failure was caused by situations where our people-locating algorithm mistakenly indicates the position of a person. Figure 8 portrays four examples, showing the tracking results before, during and after the occlusion. The first and third videos were captured at our office, the second one is from the PETS2004 database and the last video shows a real delivery training course at a hospital.

In addition to comparing the performance of these four algorithms, we have also evaluated their efficiency. Figure 9 gives the time consumed by each tracking algorithm during the occlusion, as well as the execution times of the whole system. In the occlusion situation, the average execution times per frame for the tracking component of mean-shift, particle-filter, our previous work and our current technique are 20.45, 504.92, 9.12 and 9.87 ms, respectively.

Fig. 8 Occlusion handling. Left: prior to occlusion; middle: during occlusion; right: after occlusion


Fig. 9 System efficiency. a Running time of the algorithm during the occlusion. b Running time of the whole system

Apparently, our previous technique [8] is the most efficient algorithm, and our current technique is slightly more expensive than our previous work, because it requires computing the weights for the overlapped pixels. Moreover, the execution times per frame for the complete system (mean-shift, particle-filter, and our current method), including object segmentation, object tracking and trajectory generation, are 54.52, 553.78 and 49.86 ms, respectively. The graph in Fig. 9 reveals that our proposal is a near real-time system with small fluctuations between successive frames.

8 Conclusion

In this paper, a new system for the automatic analysis of the behavior of a small group of persons based on video signals has been introduced, concentrating on three primary contributions. First, we take texture information into account when designing the energy function for DMRF-based object segmentation, so that the silhouette of the object can be extracted accurately. Second, to address the occlusion problem, we have proposed an approach in which we model the silhouette-related image features as a feature vector that is optimized using a nonlinear regression algorithm. Therefore, when object occlusion occurs, the object positions can be located by finding the peak points of a smooth curve. The last contribution is that we embedded a camera-calibration algorithm into our framework. By doing so, the real-world speed of each person can be obtained, which is invariant to changes of the camera position. It has been shown that our algorithm operates at near real-time speed with around 90% tracking accuracy during occlusion.

Future work will be the creation of the behavior-analysis phase based on the input visual features, such as the real-world speed and trajectory of each person, which are provided by the introduced system. Such a complete and automatic evaluation system will benefit delivery training in two aspects. Firstly, it facilitates finding the mistakes made by trainees during the training course, allowing them to further develop their clinical skills. Secondly, an automatic system remains highly efficient as the number of trainees increases.

Acknowledgements The authors would like to thank Mr. Minwei Feng for undertaking a part of the algorithm implementation and evaluation work. We would also like to thank our hospital partner (Maxima Medisch Centrum of Eindhoven) for the kind cooperation.

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

References

1. Aach T, Kaup A (1995) Bayesian algorithms for adaptive change detection in image sequences using Markov random fields. Signal Process Image Commun 7:147–160

2. Antonakaki P, Kosmopoulos D, Perantonis S (2009) Detecting abnormal human behaviour using multiple cameras. Signal Process 89(9):1723–1738

3. Chen D, Yang J, Malkin R, Wactlar H (2007) Detecting social interactions of the elderly in a nursing home environment. ACM Trans Multimed Comput Commun Appl 3:1–22

4. Comaniciu D, Ramesh V, Meer P (2003) Kernel-based object tracking. IEEE Trans Pattern Anal Mach Intell 25(5):564–577

5. Cutler R, Davis L (1998) View-based detection. In: Proc. Int. Conf. Pattern Recognition, vol 1, pp 495–500

6. Elgammal A, Harwood D, Davis L (2002) Background and foreground modeling using non-parametric kernel density estimation for visual surveillance. Proc IEEE 90(7):1151–1163

7. Han J, Farin D, de With PHN (2008) Broadcast court-net sports video analysis using fast 3-D camera modeling. IEEE Trans Circuits Syst Video Technol 18(11):1628–1638

8. Han J, Feng M, de With PHN (2008) A real-time video surveillance system with human occlusion handling using nonlinear regression. In: Proc. Conf. ICME, vol 1, pp 305–308

9. Haritaoglu I, Harwood D, Davis L (1998) W4: who, when, where, what: a real time system for detecting and tracking people. In: Proc. Conf. Face and Gesture Recognition, pp 222–227

10. Haritaoglu I, Harwood D, Davis L (2000) Real-time surveillance of people and their activities. IEEE Trans Pattern Anal Mach Intell 22(8):809–830

11. Hazelhoff L, Han J, de With PHN (2008) Video-based fall detection in the home using principal component analysis. In: Proc. Int. Conf. on Advanced Concepts for Intelligent Vision Systems, vol 1, pp 298–309

12. Jain R, Nagel H (1979) On the analysis of accumulative difference pictures from image sequences of real world scenes. IEEE Trans Pattern Anal Mach Intell 1:206–214

13. Kohli P, Torr P (2007) Dynamic graph cuts for efficient inference in Markov random fields. IEEE Trans Pattern Anal Mach Intell 29(12):2079–2088

14. Koile K, Tollmar K, Demirdjian D, Shrobe H, Darrell T (2003) Activity zones for context-aware computing. In: Proc. Int. Conf. on Ubiquitous Computing, pp 90–106

15. Lao W, Han J, de With PHN (2007) A matching-based approach for human motion analysis. In: Proc. SPIE Int. Conf. on Multimedia Modeling, vol 2, pp 405–414

16. Nummiaro K, Koller-Meier E, Van Gool L (2003) An adaptive color-based particle filter. Image Vis Comput 21(1):99–110

17. Oliver M, Rosario B, Pentland A (2000) A Bayesian computer vision system for modeling human interactions. IEEE Trans Pattern Anal Mach Intell 22(8):831–843

18. Tsaig Y, Averbuch A (2000) A region-based MRF model for unsupervised segmentation of moving objects in image sequences. In: Proc. Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pp 889–896

19. Wren C, Azarbayejani A, Darrell T, Pentland A (1997) Pfinder: real-time tracking of the human body. IEEE Trans Pattern Anal Mach Intell 19(7):780–785

20. Zhou Y, Tao H (2003) A background layer model for object tracking through occlusion. In: Proc. Conf. ICCV, pp 1079–1085

21. Zhou Z, Chen X, Chung Y, He Z, Han T, Keller J (2008) Activity analysis, summarization and visualization for indoor human activity monitoring. IEEE Trans Circuits Syst Video Technol 18(11):1489–1498

22. Zivkovic Z (2004) Improved adaptive Gaussian mixture model for background subtraction. In: Proc. Int. Conf. Pattern Recognition (ICPR), pp 28–31

Jungong Han received the B.S. degree (with honors) in control and measurement engineering from Xidian University, China, in 1999. In 2004, he received his Ph.D. degree in communication and information engineering from Xidian University. In 2003, he was a visiting scholar at the Internet Media group of Microsoft Research Asia, China, working on scalable video coding. In 2005, he joined the department of Signal Processing Systems at the Eindhoven University of Technology, The Netherlands, where he is leading the research on video content analysis. His research interests are content-based video analysis, video compression and scalable video coding. He is currently involved in creating research programs to actively exploit content analysis in a few hospitals in the Netherlands for improving healthcare.


Peter H. N. de With (IEEE Fellow) graduated in electrical engineering from the University of Technology in Eindhoven and received his Ph.D. degree from the University of Technology Delft, The Netherlands, in 1992. He joined Philips Research Labs Eindhoven in 1984, where he became a member of the Magnetic Recording Systems Department. From 1985 to 1993, he was involved in several European projects on SDTV and HDTV recording. In this period, he contributed as a principal coding expert to the DV standardization for digital camcording. In 1994, he became a member of the TV Systems group at Philips Research Eindhoven, where he was leading the design of advanced programmable video architectures. In 1996, he became senior TV systems architect and in 1997, he was appointed full professor at the University of Mannheim, Germany, at the faculty of Computer Engineering. In Mannheim he was heading the chair on Digital Circuitry and Simulation with the emphasis on video systems. Between 2000 and 2007, he was with LogicaCMG (now Logica) in Eindhoven as a principal consultant. Early 2008, he joined CycloMedia Technology, The Netherlands, as vice-president for video technology. Since 2000, he has been professor at the University of Technology Eindhoven, at the faculty of Electrical Engineering, leading a chair on Video Coding and Architectures. He has written and co-authored over 200 papers on video coding, architectures and their realization. Regularly, he is a teacher at the Philips Technical Training Centre and for other post-academic courses. In 1995 and 2000, he co-authored papers that received the IEEE CES Transactions Paper Award, in 2004 the VCIP Best Paper Award and in 2006 the ISCE paper award. In 1996, he obtained a company Invention Award. Mr. de With is a Fellow of the IEEE, program committee member of the IEEE ICCE, ICIP and VCIP, board member of the IEEE Benelux Chapters for Information Theory and Consumer Electronics, and scientific advisor to Philips divisions, other companies, and the Dutch Imaging school ASCII.