Detection, Segmentation, and Tracking of Moving Objects in UAV Videos
Michael Teutsch and Wolfgang Krüger
Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB)
Fraunhoferstr. 1, 76131 Karlsruhe, Germany
Email: {michael.teutsch, wolfgang.krueger}@iosb.fraunhofer.de
Abstract—Automatic processing of videos coming from small UAVs offers high potential for advanced surveillance applications but is also very challenging. These challenges include camera motion, high object distance, varying object background, multiple objects near to each other, weak signal-to-noise ratio (SNR), and compression artifacts. In this paper, a video processing chain for the detection, segmentation, and tracking of multiple moving objects is presented that deals with the mentioned challenges. The foundation is the detection of local image features that are not stationary. By clustering these features and subsequent object segmentation, regions are generated that represent object hypotheses. Multi-object tracking is introduced using a Kalman filter and considering the camera motion. Split or merged object regions are handled by fusion of the regions and the local features. Finally, a quantitative evaluation of object segmentation and tracking is provided.
I. INTRODUCTION
UAV-based camera surveillance is widely used nowadays for reconnaissance, homeland security, or border protection. Robust tracking of single or multiple moving objects is important but difficult to achieve. Camera motion, small object appearances of only a few pixels in the image, changing object background, object aggregation, shading, and noise are prominent among the challenges.
In this paper, we present a three-layer processing chain for multi-object tracking that deals with these challenges. A small UAV is used with a visual-optical camera directed perpendicularly to the ground. In the first layer, local image features are detected and categorized into stationary and moving features. Object hypotheses are generated in the second layer by moving-feature clustering and appearance-based object segmentation. In the third layer, the outputs of the first and second layers are used for tracking, including the handling of split and merge situations, which occur when vehicles overtake each other. Experiments with quantitative results demonstrate the effectiveness of our processing chain.
Related Work: We discuss related work that at least partially covers all three layers and uses aerial image data. Perera et al. [1] use KLT [2] features for image registration and Stauffer-Grimson background modeling for moving object detection. Multi-object tracking is performed using a Kalman filter and nearest-neighbor data association. Splits and merges are handled by track linking. Cao et al. [3] also use KLT features, which are tracked over two frames and used for image registration. Vehicles are detected as blobs in the difference image, and tracking is implemented using motion grouping. Kumar et al. [4] use motion-compensated difference images and change detection to find moving objects. Tracking is based on motion, appearance, and shape features. Yao et al. [5] compensate camera motion by estimating a global affine parametric motion model based on sparse optical flow. Blobs are extracted from the difference image with morphological operations and tracked using HSV color and geometrical features. Trajectories are stored in a graph structure for split and merge handling. For image registration, Ibrahim et al. [6] use SIFT/SURF features and RANSAC. Moving objects are detected by Gaussian mixture learning applied to difference images and shape/size estimation. Tracking is based on temporal matching of object characteristics such as shape, size, area, orientation, or color. Xiao et al. [7] simultaneously detect moving objects using a three-frame difference image and perform multi-object tracking with a probabilistic relation graph matching approach with an embedded vehicle behavior model. Reilly et al. [8] use Harris corners, SIFT descriptors, and RANSAC for image registration. Moving objects are detected using a difference image calculated over 10 frames. For tracking, bipartite graph matching is applied. Finally, in the work of Mundhenk et al. [9], moving objects are detected by subtraction of global motion from local motion maps. For object segmentation, a Mean Shift kernel density estimator is applied. Tracking is performed by fitting a Generalized Linear Model (GLM) between two consecutive frames for track verification. Object-related fingerprints are calculated and stored for reacquisition, if necessary.
II. INDEPENDENT MOTION DETECTION
The concept of the processing chain is visualized in Fig. 1. In the independent motion detection layer, local image features are detected that show independent motion relative to the static background of the monitored scene. Corner feature tracking [2] is used to estimate homographies as global image transformations [10] for frame-to-frame alignment as well as local relative image velocities. Independent motion is detected at features with significant relative velocities to discriminate features at moving objects from features at the static background.
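As an illustration, the alignment-and-velocity test described above can be sketched as follows. This is a minimal numpy-only sketch: the DLT-based estimator, the function names, and the 1-pixel velocity threshold are our assumptions for illustration, not the authors' implementation (which builds on [2] and [10]).

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate a 3x3 homography from matched background points via the
    direct linear transform (least squares on the A h = 0 system)."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)          # null-space vector = homography entries
    return H / H[2, 2]

def relative_velocities(H, pts_prev, pts_curr):
    """Map previous positions into the current frame with H; the residual
    displacement is the motion relative to the static background."""
    p = np.hstack([pts_prev, np.ones((len(pts_prev), 1))]) @ H.T
    p = p[:, :2] / p[:, 2:3]          # dehomogenize
    return pts_curr - p

def classify_moving(H, pts_prev, pts_curr, thresh=1.0):
    """A feature counts as moving if its relative velocity is significant."""
    v = relative_velocities(H, pts_prev, pts_curr)
    return np.linalg.norm(v, axis=1) > thresh
```

Features at the static background align with the homography and yield near-zero residuals; features on vehicles keep a significant residual velocity.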
[Figure 1 block diagram: image sequence → independent motion detection (local feature detection and tracking, homography estimation, independent motion detection) → object segmentation (local feature clustering, object segmentation algorithms 1..n, fusion) → multi-object tracking (Kalman filter with init/prediction/update, assignment of regions and features to tracks, split and merge handling).]

Figure 1. Concept of the UAV video processing chain.
This approach does not require camera calibration and is largely independent of object appearance. Using homographies instead of plane+parallax decompositions with multi-view geometric constraints [11] is adequate, since our image data mainly comes from UAVs operating at higher altitudes. We do not use motion-compensated difference images [4], [7], because we achieved more reliable results compared to [12], which is based on difference images.
The estimated frame-to-frame homographies are used at two steps in our processing chain to compensate for global camera motion. During independent motion detection they are needed to estimate relative image velocities from feature tracks, and in the multi-object tracking layer the homographies are used to generate control input for the Kalman filter.
The main output of the independent motion detection layer is the set of local image features classified as moving relative to the static background. The output attributes for each feature are image position and relative velocity. First, the moving features are used in the object segmentation layer to find motion regions by feature clustering and to trigger additional appearance-based region segmentation. Second, the moving features are used in conjunction with the segmented regions for multi-object tracking, including track initialization.
Fig. 2 gives an idea of typical results from the independent motion detection layer. Shown are the estimated relative velocity vectors for all 5356 feature tracks, of which 297 have been correctly classified as coming from moving features. Note the large number of feature tracks at parked vehicles, which have been correctly classified as part of the static background. The independent motion layer is able to reliably estimate and classify sub-pixel relative motion.

Figure 2. Example of detected and tracked stationary (red) and moving (yellow) local features.
III. OBJECT SEGMENTATION
The aim of object segmentation is to generate object hypotheses from the moving local features. To this end, clustering is applied, followed by spatial fusion of several object segmentation algorithms to improve the reliability of the hypotheses.
A. Local Feature Clustering
The first processing step in the object segmentation layer is to cluster the detected moving features. We employ single-linkage clustering using position and velocity estimates. The selection of distance thresholds is based on the known Ground Sampling Distance (GSD) and the expected size of vehicles.
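A minimal sketch of such single-linkage clustering follows. The union-find implementation and the threshold values are illustrative assumptions; in the paper, the position threshold would be derived from the GSD and the expected vehicle size.

```python
import numpy as np

def single_linkage_clusters(positions, velocities, d_pos, d_vel):
    """Single-linkage clustering of moving features: two features are linked
    if both their positions and their velocities are closer than the given
    thresholds; connected components of the link graph are the clusters."""
    n = len(positions)
    parent = list(range(n))

    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if (np.linalg.norm(positions[i] - positions[j]) < d_pos and
                    np.linalg.norm(velocities[i] - velocities[j]) < d_vel):
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [remap[r] for r in roots]  # compact labels in order of appearance
```

Because the linkage is transitive, a chain of nearby features with similar motion forms one cluster even if its endpoints are far apart.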
Especially in crowded scenes, over- or under-segmentation of objects that are close to each other and have similar motion cannot be avoided, and additional appearance-based image features should be exploited.
B. Object Segmentation Algorithms
The calculated image features are different kinds of gradients. Each feature value is written to a feature-specific accumulator image. This accumulator is needed because we calculate multi-scale features for higher robustness and store the results in the same image. Three different kinds of gradient features have been implemented:
1) Korn gradient [13]: This is a linear gradient calculation method similar to Canny but with a normalized filter matrix. We directly use the gradient magnitudes without directions or non-maximum suppression.
2) Morphological gradient [14]: By using the morphological operations erosion (⊖) and dilation (⊕) as well as quadratic structuring elements s_i of different size, multi-scale gradient magnitudes are non-linearly calculated for image I and stored in accumulator A:

A = Σ_{i=1..n} ((I ⊕ s_i) − (I ⊖ s_i)) .   (1)
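Equation (1) can be sketched directly in code. The following numpy-only version is an illustration, not the authors' implementation; the edge padding and the default structuring-element sizes are assumed parameters.

```python
import numpy as np

def _window_reduce(img, k, fn):
    """Apply fn (min for erosion, max for dilation) over a k x k square
    structuring element, with edge padding at the borders."""
    p = k // 2
    padded = np.pad(img.astype(float), p, mode='edge')
    stack = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
             for dy in range(k) for dx in range(k)]
    return fn(np.stack(stack), axis=0)

def morphological_gradient_accumulator(img, sizes=(3, 5, 7)):
    """Eq. (1): sum over scales of dilation minus erosion with square
    structuring elements s_i of the given sizes."""
    A = np.zeros(img.shape, dtype=float)
    for k in sizes:
        A += _window_reduce(img, k, np.max) - _window_reduce(img, k, np.min)
    return A
```

The accumulator is zero on constant regions and large near intensity edges, with larger structuring elements responding to a wider band around each edge.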
3) Local Binary Pattern (LBP) gradient: Rotation-invariant uniform Local Binary Patterns LBP^{riu2}_{P,R} [15] are used to create a filter, which calculates the LBP for each pixel position and tests whether it is a texture primitive such as an edge or corner. The assumption is that all LBPs which are not texture primitives are the result of noise. Hence, they are not considered for gradient calculation. For all accepted pixel positions, the local variance VAR_{P,R} [15] of the LBP neighbors is calculated as gradient magnitude; P denotes the number of LBP neighbors and R the LBP radius. By calculating multi-scale LBPs [15] and using the local standard deviation instead of the variance, higher robustness is achieved in accumulator A:

A = Σ_{r=R_1..R_n} √(VAR_{P,r}) , if LBP_{P,r} accepted.   (2)
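A rough illustration of this accumulator for P = 8 follows. The uniformity test via the number of circular 0/1 transitions matches the riu2 definition in [15], but the square-ring neighbor sampling, the edge padding, and the function name are our assumptions.

```python
import numpy as np

def lbp_gradient_accumulator(img, radii=(1,)):
    """Sketch of Eq. (2): at each pixel the 8-neighbor LBP is tested for
    uniformity (at most two circular 0/1 transitions, i.e. a texture
    primitive such as an edge or corner); where accepted, the standard
    deviation of the neighbors is added to the accumulator."""
    A = np.zeros(img.shape, dtype=float)
    h, w = img.shape
    for r in radii:
        f = img.astype(float)
        # 8 neighbors in circular ring order (square ring approximation)
        shifts = [(-r, -r), (-r, 0), (-r, r), (0, r),
                  (r, r), (r, 0), (r, -r), (0, -r)]
        pad = np.pad(f, r, mode='edge')
        nb = np.stack([pad[r + dy:r + dy + h, r + dx:r + dx + w]
                       for dy, dx in shifts])           # 8 x H x W
        bits = (nb >= f).astype(int)                    # LBP bit pattern
        trans = np.abs(bits - np.roll(bits, 1, axis=0)).sum(axis=0)
        accepted = trans <= 2                           # uniform pattern
        A += np.where(accepted, nb.std(axis=0), 0.0)
    return A
```

Flat regions are uniform but contribute zero standard deviation, while noisy (non-uniform) pixels are rejected entirely, so only structured gradients accumulate.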
Contour pixels are detected in the accumulators by a standard connected-component labeling algorithm supported by quantile-based adaptive thresholding. This way, a binary image is generated, which is post-processed by morphological closing to fill holes in the object contours or blobs. The best-fitting bounding boxes of these blobs are the final result; the whole process is visualized in Fig. 3.

Figure 3. Example of correct object segmentation (red) for a local feature cluster (cyan) with under-segmentation.
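The thresholding and labeling step can be sketched as follows. The quantile value, the BFS-based 4-connected labeling, and the omission of the morphological closing are simplifying assumptions of this sketch.

```python
import numpy as np
from collections import deque

def segment_blobs(acc, q=0.9):
    """Quantile-based adaptive threshold on the accumulator, then 4-connected
    component labeling; returns one bounding box (x0, y0, x1, y1) per blob.
    (The morphological closing of the binary image is omitted here.)"""
    binary = acc > np.quantile(acc, q)
    labels = np.zeros(acc.shape, dtype=int)
    boxes, next_label = [], 0
    h, w = acc.shape
    for sy, sx in zip(*np.nonzero(binary)):
        if labels[sy, sx]:
            continue                       # pixel already belongs to a blob
        next_label += 1
        labels[sy, sx] = next_label
        queue, ys, xs = deque([(sy, sx)]), [sy], [sx]
        while queue:                       # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < h and 0 <= nx < w and
                        binary[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = next_label
                    queue.append((ny, nx)); ys.append(ny); xs.append(nx)
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

The quantile threshold adapts automatically to the overall gradient energy of each accumulator, so no absolute magnitude threshold has to be tuned.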
Difference images [4], [5], [6], [7] can be used for object blob calculation, too, but especially slow vehicles need the difference over many consecutive images to create a continuous object blob and avoid over-segmentation, while objects driving in convoy may cause under-segmentation. Instead of connected-component labeling, we also tried watershed segmentation [16]. However, due to the lack of contrast between object and background, the whole region was often flooded.
C. Spatial Fusion
Since all calculated gradient features are similar, but not identical, spatial fusion is implemented by writing all gradient magnitudes to one common accumulator. Therefore, we first normalize all feature-specific accumulators by mapping the accumulation values to the value range [0, 255]. Then, the values are added pixelwise and stored in the common accumulator. An alternative is pixelwise multiplication, which performed slightly worse in our tests. Over- and under-segmentation cannot be totally avoided, but a significant improvement is reached compared to local feature clustering, as we show in Section V. Spatio-temporal fusion [17] performs even better, but so far that approach does not run in real-time.
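A sketch of this additive fusion; the handling of constant (zero-range) accumulators is our assumption.

```python
import numpy as np

def fuse_accumulators(accs):
    """Spatial fusion: each feature-specific accumulator is normalized to
    [0, 255] and the normalized values are added pixelwise. (Pixelwise
    multiplication is the alternative that performed slightly worse.)"""
    fused = np.zeros(accs[0].shape, dtype=float)
    for a in accs:
        a = a.astype(float)
        rng = a.max() - a.min()
        # A constant accumulator carries no gradient evidence; skip it.
        norm = (a - a.min()) * (255.0 / rng) if rng > 0 else np.zeros_like(a)
        fused += norm
    return fused
```

Normalizing before adding gives each gradient algorithm equal weight regardless of its raw magnitude range.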
IV. MULTI-OBJECT TRACKING
With multi-object tracking, stable object tracks are achieved, and further improvement of the segmentation results is investigated, especially in cases where vehicles overtake each other. Spatial information provided by segmentation and motion information provided by the local features are fused to handle such situations [18]. We chose a Kalman filter since object and camera motion are mostly linear in our application. Furthermore, it is easy to implement and fast. Five parameters are tracked by the Kalman filter: object center (x, y), size (w, l), and orientation α.
A. Assignment of Regions to Tracks
We call the oriented bounding boxes resulting from object segmentation regions. They are assigned as measurements to already existing tracks and are also used to initialize new tracks. A region is assigned to a specific track if a minimum threshold for the bounding box intersection area of the region and the Kalman prediction is exceeded. The threshold is chosen small for high tolerance during validation gating. If this assignment is ambiguous for one or more regions or tracks, split and merge handling has to be applied (see Section IV-C).
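This gating step can be sketched as follows. The axis-aligned box format, the function names, and the area threshold are illustrative assumptions (the paper's regions are oriented boxes).

```python
def intersection_area(a, b):
    """Intersection area of two axis-aligned boxes (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def assign_regions(regions, predictions, min_area=1.0):
    """Validation gating: a region is a candidate measurement for every
    track whose Kalman-predicted box it overlaps by at least min_area.
    A region or track appearing in more than one pair signals an ambiguous
    assignment, i.e. a split/merge situation."""
    return [(i, j) for i, r in enumerate(regions)
            for j, p in enumerate(predictions)
            if intersection_area(r, p) >= min_area]
```

A small `min_area` keeps the gate tolerant, as the text requires; ambiguity is then detected by counting duplicate region or track indices in the returned pairs.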
B. Assignment of Features to Tracks
Local features are assigned to already existing tracks only; they are not used for track initialization. The idea is to use them as support for split and merge handling. There are four criteria for local feature assignment [18]: 1) the feature is not assigned to any track, 2) its position is inside the Kalman prediction, 3) its position is not inside another Kalman prediction, and 4) it has similar motion (magnitude and direction) as the track.
If a feature is assigned, the related track parameters and the relative position within the track bounding box are stored for measurement reconstruction in case of a split or merge. There is a maximum limit of 20 assigned features per track. Outliers with respect to position or motion are removed from the set.
C. Split and Merge Handling
Merge handling is needed mainly in overtaking situations, where object segmentation is not able to split the objects correctly. Each assigned feature reconstructs the measurement (region) of its track using the stored track-related parameters. This set of reconstructed measurements is fused with a median filter for more stability. The power of this approach is demonstrated in Fig. 4. Four objects are under-segmented into the same cluster (left cyan cluster). Object segmentation is only able to segment regions in which the upper two and the lower two objects are still under-segmented. Merge handling is able to guarantee correct tracking (green boxes) based on the earlier assigned local features (green dots). Unassigned features are visualized as yellow dots.
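The per-feature measurement reconstruction can be sketched as follows. The stored-offset representation and the function signature are our assumptions; the median fusion follows the text.

```python
import numpy as np

def reconstruct_region(feature_positions, stored_offsets, stored_size):
    """Merge handling sketch: each assigned feature stored its offset from
    the track's box center; from its current position it reconstructs a
    candidate box center, and the per-feature candidates are fused with the
    median for robustness against outliers."""
    centers = (np.asarray(feature_positions, dtype=float)
               - np.asarray(stored_offsets, dtype=float))
    cx, cy = np.median(centers, axis=0)
    w, l = stored_size                 # the stored track-related box size
    return (cx - w / 2, cy - l / 2, cx + w / 2, cy + l / 2)
```

Because the box size comes from the stored track parameters rather than the (merged) segmentation, the reconstructed measurement keeps the track at its own object even when segmentation returns one large merged region.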
Split handling is necessary in overtaking situations where already merged objects enter the camera's field of view. One track is initialized, and during the overtaking it is very difficult to split the objects. However, as soon as the regions are split correctly by object segmentation, the track will concentrate on one of the regions after some time, and a new track is initialized for the other region. This process can be accelerated by assigning local features directly to the regions to estimate and compare their relative motion. If the relative motion difference is large enough, the track concentrates on only one region earlier. Furthermore, the split regions of one object, which can be a failure of object segmentation, have similar motion and, thus, are correctly merged and assigned to one track. An example is given in Fig. 4 for the right cyan local feature cluster.

Figure 4. Over-/under-segmented local feature clusters (cyan) with incorrect segmentation (red), but correct split/merge handling for tracking (green boxes) using assigned local features (green dots).
D. Tracking with Kalman Filter
As soon as an unambiguous assignment of measurements (regions) to tracks is achieved, the Kalman filter is applied. The bounding boxes after the Kalman update, which are considered stable objects after some tracking time, are the final result of the whole processing chain. Kalman prediction is performed for all tracks for the next time step. If a track does not get any assigned region or local feature, it is kept alive for a few time steps using Kalman prediction before it is deleted.
The camera motion parameters are used to set up the control vector for the Kalman filter. This way, camera motion is considered in Kalman update and prediction.
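A minimal sketch of such a filter with camera-motion control input. The identity motion and measurement models (F = B = H = I) and the noise values are simplifying assumptions; the actual filter in [18] may differ.

```python
import numpy as np

class TrackKF:
    """Minimal track filter over the state (x, y, w, l, alpha): an identity
    motion model, with the frame-to-frame camera motion entering as the
    control input u that shifts the prediction."""
    def __init__(self, state, p=10.0, q=1.0, r=1.0):
        self.x = np.asarray(state, dtype=float)   # (x, y, w, l, alpha)
        self.P = np.eye(5) * p                    # state covariance
        self.Q = np.eye(5) * q                    # process noise
        self.R = np.eye(5) * r                    # measurement noise

    def predict(self, u):
        self.x = self.x + u        # F = I, B = I: apply camera-motion shift
        self.P = self.P + self.Q

    def update(self, z):
        K = self.P @ np.linalg.inv(self.P + self.R)   # Kalman gain, H = I
        self.x = self.x + K @ (np.asarray(z, dtype=float) - self.x)
        self.P = (np.eye(5) - K) @ self.P
```

Feeding the homography-derived camera motion in through `u` means the filter's process model only has to explain the object's own (mostly linear) motion, not the platform's.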
V. EXPERIMENTAL RESULTS
The main test sequence consists of 370 frames with a resolution of 687 × 547 pixels. Along the sequence, 43 moving objects appear, including several split and merge situations. The standard vehicle size is about 15 × 5 pixels. The evaluation is split into two parts: experiments on the stability of the local image features as well as on the completeness and precision of object segmentation and tracking.
A. Evaluation of the Local Image Features
In summary, 5401 different moving features were detected and tracked during the whole test sequence. The mean lifetime of each feature was 21.46 frames, and the upper histogram of Fig. 5 shows the distribution of the features with respect to their lifetime. The first bin contains all features with a lifetime of 10 frames or less, the second 11 to 50 frames, and so on. For better visualization, the vertical axis scale is logarithmic. There are 221 features which have a lifetime of 100 frames or more.
Along the test sequence, there were 8863 assignments of local features to tracks. Several features have been assigned
[Two histograms, number of local features (logarithmic scale) over the tracking time and the track assignment time of local features, in frames.]

Figure 5. Lifetime (yellow) and track assignment time (green) for all local features during the test sequence of 370 frames.
Table I
EVALUATION OF OBJECT SEGMENTATION COMPLETENESS: CORRECT, UNDER-/OVER-SEGMENTATION (US/OS), AND MISS RATES.

method                   | correct | US   | OS   | miss
multi-object tracking    | 0.93    | 0    | 0    | 0.07
spatial fusion           | 0.80    | 0.08 | 0.04 | 0.08
morphological gradient   | 0.78    | 0.09 | 0.03 | 0.10
LBP gradient             | 0.79    | 0.10 | 0.02 | 0.09
Korn gradient            | 0.77    | 0.08 | 0.05 | 0.10
local feature clustering | 0.56    | 0.35 | 0.09 | 0
more than once, especially if they were in the track's border area, being sometimes inside and sometimes outside of the Kalman prediction. In the lower histogram of Fig. 5, all features are counted with respect to the time of being assigned to a track. 44 features were assigned to a track for 100 frames or more. Since there were 20 Kalman tracks with a lifetime of 100 frames or more, this means that more than two local features accompany each track for its whole lifetime. Each of these long-living tracks had 15.57 assigned features on average (20 is the maximum) and 2.3 feature additions/losses per frame.
B. Evaluation of Segmentation and Tracking
Segmentation and multi-object tracking were evaluated for completeness and precision. 15 objects were manually labeled for position and size over 100 frames. Table I shows the completeness. Instances of an object being found, under-segmented (US), over-segmented (OS), or missed are counted. This means that two merged objects (US) are counted as two mistakes, while one object with two segments (OS) is counted as one mistake. There are 56% correctly found, 35% under-segmented, and 9% over-segmented objects for local feature clustering. The correct rates improve for the single object segmentation approaches and for the fusion. Finally, there is no under-/over-segmentation for tracking and 93% correctly found objects.
Table II
EVALUATION OF OBJECT SEGMENTATION PRECISION: MEAN ERRORS IN PIXELS FOR POSITION (x, y) AND SIZE (w, l).

method                   | e_x  | e_y  | e_w  | e_l
multi-object tracking    | 0.95 | 1.97 | 4.48 | 8.39
spatial fusion           | 1.31 | 2.68 | 5.62 | 9.93
morphological gradient   | 1.29 | 2.94 | 5.55 | 10.04
LBP gradient             | 1.44 | 2.58 | 5.78 | 10.53
Korn gradient            | 1.33 | 2.99 | 5.82 | 10.42
local feature clustering | 2.32 | 5.37 | 8.68 | 26.69
Table II shows the precision, represented by mean errors for position x and y as well as width w and length l. Under-/over-segmentation produces the highest position and size errors. Hence, local feature clustering performed worst. As in the evaluation of completeness, the results improve for the segmentation algorithms as well as for the fusion. The highest precision is achieved by multi-object tracking, with a mean error of 2.2 pixels for position, 4.5 pixels for width, and 8.4 pixels for length. When considering the known GSD, this corresponds to mean errors of 0.76 m for position, 1.55 m for w, and 2.9 m for l. Vertical object shading causes the large error difference between w and l. Example results for each processing chain layer are shown in Fig. 6.
VI. CONCLUSIONS
In this paper, a processing chain is presented for precise tracking of multiple moving objects in UAV videos. Local image features are detected and tracked for frame-to-frame homography estimation. Stationary features are used for the compensation of camera motion, and moving features are used to detect and cluster independent motion for initial object hypotheses. These hypotheses are improved by advanced gradient-based object segmentation algorithms, which are spatially fused for higher robustness. Finally, multi-object tracking is introduced using the object segments (regions) as measurements for a Kalman filter, the moving features for split and merge handling, and the camera motion parameters as control vector for the Kalman filter. On our UAV data, we achieved 93% correctly detected moving objects and mean errors of 0.76 m for position, 1.55 m for width, and 2.9 m for length estimation.
REFERENCES
[1] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, andW. Hu, “Multi-Object Tracking Through Simultaneous LongOcclusions and Split-Merge Conditions,” in Proc. of the IEEECVPR, New York, NY, USA, 2006.
[2] J. Shi and C. Tomasi, “Good features to track,” in Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition (CVPR), Seattle, WA, USA, 1994.
[3] X. Cao, J. Lan, P. Yan, and X. Li, “KLT Feature Based VehicleDetection and Tracking in Airborne Videos,” in Proc. of theIntern. Conf. on Image and Graphics, Hefei, China, 2011.
[Figure 6 panels: original image; local features and clustering; object segmentation; multi-object tracking.]

Figure 6. Example image (street) with independent motion detection (yellow vectors), local feature clustering (cyan boxes), object detection (red boxes), and multi-object tracking (green boxes) including assigned local features (green dots) and not assigned features (yellow dots).
[4] R. Kumar, H. Sawhney, S. Samarasekera, S. Hsu, H. Tao,Y. Guo, K. Hanna, A. Pope, R. Wildes, D. Hirvonen,M. Hansen, and P. Burt, “Aerial video surveillance andexploitation,” Proc. of the IEEE, vol. 89, no. 10, Oct. 2001.
[5] F. Yao, A. Sekmen, and M. J. Malkani, “Multiple movingtarget detection, tracking, and recognition from a movingobserver,” in Proc. of the IEEE Intern. Conf. on Informationand Automation (ICIA), Hunan, China, Jun. 2008.
[6] A. W. N. Ibrahim, P. W. Ching, G. Seet, M. Lau, and W. Cza-jewski, “Moving Objects Detection and Tracking Frameworkfor UAV-based Surveillance,” in Proc. of the Pacific-RimSymposium on Image and Video Technology, Singapore, 2010.
[7] J. Xiao, H. Cheng, H. Sawhney, and F. Han, “Vehicle De-tection and Tracking in Wide Field-of-View Aerial Video,”in Proc. of the IEEE Conference on Computer Vision andPattern Recognition (CVPR), San Francisco, CA, USA, 2010.
[8] V. Reilly, H. Idrees, and M. Shah, “Detection and Trackingof Large Number of Targets in Wide Area Surveillance,” inProceedings of the 11th European Conference on ComputerVision (ECCV), Heraklion, Greece, Sep. 2010.
[9] T. N. Mundhenk, K.-Y. Ni, Y. Chen, K. Kim, and Y. Owechko,“Detection of unknown targets from aerial camera and ex-traction of simple object fingerprints for the purpose of targetreacquisition,” in Proc. SPIE Vol. 8301, 2012.
[10] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[11] M. Irani and P. Anandan, “A unified approach to movingobject detection in 2D and 3D scenes,” IEEE Transactionson Pattern Analysis and Machine Intelligence, vol. 20, no. 6,pp. 577–589, Jun. 1998.
[12] N. Heinze, M. Esswein, W. Kruger, and G. Saur, “Automaticimage exploitation system for small UAVs,” in Proceedingsof SPIE Vol. 6946, 2008.
[13] A. Korn, “Toward a Symbolic Representation of IntensityChanges in Images,” IEEE Transactions on Pattern Analysisand Machine Intelligence, vol. 10, no. 5, pp. 610–625, 1988.
[14] J. S. J. Lee, R. M. Haralick, and L. G. Shapiro, “Morphologicedge detection,” IEEE Journal of Robotics and Automation,vol. 3, no. 2, pp. 142–156, Apr. 1987.
[15] T. Ojala, M. Pietikainen, and T. Maenpaa, “MultiresolutionGray-Scale and Rotation Invariant Texture Classification withLocal Binary Patterns,” IEEE Transact. on Pattern Analysisand Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.
[16] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal ofSoftware Tools, 2000.
[17] M. Teutsch and W. Kruger, “Spatio-Temporal Fusion of Ob-ject Segmentation Approaches for Moving Distant Targets,” inProceedings of the International Conference on InformationFusion (FUSION), Singapore, Jul. 2012.
[18] M. Teutsch, W. Kruger, and J. Beyerer, “Fusion of Region andPoint-Feature Detections for Measurement Reconstruction inMulti-Target Kalman Tracking,” in Proc. of the Intern. Conf.on Information Fusion (FUSION), Chicago, IL, USA, 2011.
Year: 2012
Author(s): Teutsch, Michael; Krüger, Wolfgang
Title: Detection, segmentation, and tracking of moving objects in UAV videos
DOI: 10.1109/AVSS.2012.36 (http://dx.doi.org/10.1109/AVSS.2012.36)
© 2012 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Details: Institute of Electrical and Electronics Engineers (IEEE); IEEE Computer Society: IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, AVSS 2012. Proceedings: 18-21 September 2012, Beijing, China. Los Alamitos, Calif.: IEEE Computer Society Conference Publishing Services (CPS), 2012. ISBN: 978-0-7695-4797-8; ISBN: 978-1-4673-2499-1 (Print). pp. 313-318