
OBJECT TRACKING WITH A MOVING CAMERA
An Application of Dynamic Motion Analysis

P. J. Burt, J. R. Bergen, R. Hingorani, R. Kolczynski, W. Lee, A. Leung, J. Lubin, H. Shvaytser
David Sarnoff Research Center, Subsidiary of SRI International
Princeton, NJ 08543-5300

Abstract

The task of detecting and tracking moving objects is particularly challenging if it must be performed with a camera that is itself moving. Yet, in applications such as automated surveillance and navigation, this task must be performed continuously, in real time, and using only modest computing hardware.

Dynamic analysis techniques provide a key to real-time vision. Through strategies analogous to foveation and eye tracking in humans, these techniques direct analysis to critical regions of a scene, and decompose the complex motion problem into a sequence of relatively simple tasks.

I. Introduction

A vision system can easily detect and track objects that are moving relative to an otherwise stationary background. But if this task must be performed with a camera that is itself moving, it becomes quite challenging.

Camera induced scene motion is often significantly larger than that of the objects of interest. This means analysis must be precise if it is to separate objects from background. At the same time, camera induced motion can be complex, due to motion parallax.

In practical applications, the challenge is compounded by the fact that analysis must be performed continuously, in real time, and with only modest computing hardware. For example, systems for surveillance or navigation must process enormous amounts of data. Yet these systems must meet severe constraints on power, weight and cost.

Dynamic analysis techniques provide a key to practical, real-time vision. These techniques direct computing resources dynamically to critical regions of a scene, transforming the complex motion analysis task into a sequence of simpler tasks.

Two basic strategies of dynamic analysis are familiar to us from our own visual experience. Consider the surveillance task illustrated in Figure 1. An airborne vision system is required to detect and track a vehicle moving over a landscape. A human would not perform this task while staring in a fixed direction from the airplane window. Rather he would foveate a point on the landscape, then rotate his eyes to hold this point in the center of gaze. Eye rotation, or tracking, stabilizes a portion of the landscape on his retinae, so that any vehicle moving relative to the landscape can be readily detected. He would then redirect his gaze to foveate and track the vehicle.

Figure 1. Example task: locate a car moving over a landscape. (Panels: image with moving object; planar surface model.)

In a computer vision system analogous foveation and tracking strategies greatly reduce the amount of data that must be processed and the complexity of the analysis itself. However they are implemented not by physically rotating the camera, but through selective processing of the signal obtained from the camera.

In this paper we first outline a computational framework for dynamic motion analysis. We then describe our implementation of major components of this system. Simulation results show that these techniques provide precise, robust results for the object tracking task. We briefly examine the issues of hardware implementation, and conclude that a practical system could be constructed to perform this task at video rates.

Although we address the problem of object tracking, the same dynamic motion analysis techniques are appropriate for other real-time vision tasks, such as autonomous vehicle navigation.

CH2716-9/89/0000/0002 $01.00 © 1989 IEEE

Figure 2. A framework for dynamic motion analysis. Local analysis: flow vectors, v(i,j), are computed between image frames F_{t-1} and F_t (shown: an image frame, F_t, with a road and a sign); each vector is computed within a local window, W. Focal analysis: motion within the focal analysis region is represented as a few coherently moving surfaces; the k-th surface has velocity V_k and support S_k. Global analysis: a global motion model, M, is built from a sequence of focal probes; large areas are represented at low resolution and critical regions at higher resolution.

II. A Framework for Dynamic Motion Analysis

Motion analysis must be considered in the context of the larger vision task it supports, such as surveillance or navigation. Dynamic analysis techniques then achieve speed and precision by focusing analysis selectively within regions of a scene that are critical to this task, and by moving the analysis region dynamically as events unfold.

Here we outline a computational framework for a dynamic motion analysis system. The framework also supports highly efficient analysis algorithms, and provides a match to special purpose hardware.

This framework divides analysis into three levels, as shown in Figure 2. At the lowest, or 'local', level motion is analyzed and described in terms of a basic flow field, while at the highest, or 'global', level it is described in terms of segmented, coherently moving surfaces. An intermediate, 'focal', level implements dynamic analysis strategies analogous to foveation and tracking.

Processing begins at the local level. Successive frames of the image sequence, F_{t-1} and F_t, are analyzed to obtain an array of 'flow vectors', v(i,j). Each vector represents frame to frame displacement at the corresponding point in frame F_t, and is based on computations within a local window, W.

The flow vectors are interpreted at the focal level of analysis in terms of extended regions of coherent surface motion. Focal analysis is restricted to a 'focal analysis region', R. But this region is moved from moment to moment in a sequence of focal probes to examine motion in different portions of the scene, as required by the vision task. As the analysis region moves, it changes size and examines data of different resolution, thus providing coarse estimates of motion over extended areas of the visual field, or detailed estimates within small selected areas. In general, image data is represented at a sample density that is proportional to resolution, and resolution is inversely proportional to the size of R. Hence the samples within R, and the cost of computations, are roughly the same for foveal probes of all sizes. (For example, a probe spanning 512 image samples analyzed at pyramid level 5 contains the same 16 by 16 sample array as a probe spanning 64 samples analyzed at level 2.) Depending on the demands of the vision task, and the capabilities of the vision machine, there may be one or tens of focal probes per frame time.

The focal analysis region is limited in size and resolution. Motion within the analysis region can therefore be represented as a small number of coherently moving subregions corresponding to surfaces in the physical world. The k-th such component is specified by a velocity function, V_k(x,y), that incorporates translation, dilation, rotation and skew, and a support region, S_k, that indicates the set of local flow vectors consistent with motion V_k. The 6 parameter function, V_k, describes a planar (tilted) surface moving relative to the camera.

Focal analysis serves as a basic building block for constructing a model for motion over the full field of view. This model, M, is maintained at the global level of the analysis framework. It includes both velocity and segmentation information. The model is developed and updated continuously, in real time, but includes precise, detailed estimates of scene motion only where these are required by the analysis task.

A dynamic control process determines the parameters of successive focal probes. Based on requirements of the vision task, and the outcome of previous probes as indicated in the motion model, this process determines where the next probe should be directed and its size and resolution.

The three levels of the dynamic motion analysis framework can be identified with different components in a physical implementation of the system. The global model, M, represents motion in the format required for the vision task. Local analysis is homogeneous over the analysis region, so is amenable to implementation in special purpose hardware. Focal analysis, based on highly efficient dynamic analysis algorithms, may be implemented within a general purpose computer.


In this paper we describe implementation of the local and focal levels of this framework. We do not include an implementation of the dynamic control process or the global motion model, M.

III. Focal Analysis

Dynamic analysis decomposes the complex task of analyzing motion within an extended scene into a sequence of simpler analysis stages. The decomposition is first in terms of the domain covered, as the focal probes operate on selected pieces of the overall scene. Within the focal analysis region the analysis task is further simplified through such techniques as two-stage motion computation, tracking, and one-component-at-a-time segmentation.

1. Notation

Dynamic analysis is an evolving process that takes place over extended sequences of images. We consider analysis only within the focal analysis region, R, between frames F_{t-1} and F_t.

Processing is implemented within a pyramid structure, to provide ready control of resolution and sample density, as well as to control the image domain over which focal analysis is performed.
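All of the procedures below lean on this pyramid machinery, so a brief sketch may help fix ideas. The following Python fragment builds Gaussian and Laplacian (bandpass) pyramids in the style of [8]; the 5-tap kernel weights and the bilinear EXPAND used here are common illustrative choices, not necessarily the paper's exact filters.

```python
import numpy as np
from scipy import ndimage

def gaussian_pyramid(img, levels):
    """Gaussian pyramid via a standard separable 5-tap generating kernel."""
    k = np.array([1., 4., 6., 4., 1.]) / 16.0
    pyr = [img.astype(float)]
    for _ in range(levels):
        blurred = ndimage.convolve1d(
            ndimage.convolve1d(pyr[-1], k, axis=0), k, axis=1)
        pyr.append(blurred[::2, ::2])      # REDUCE: blur, then subsample by 2
    return pyr

def laplacian_pyramid(img, levels):
    """Laplacian pyramid: each level is a Gaussian level minus its
    expanded coarser neighbor, i.e. a bandpass component of the image."""
    g = gaussian_pyramid(img, levels)
    lap = []
    for a, b in zip(g[:-1], g[1:]):
        expanded = ndimage.zoom(b, 2, order=1)[:a.shape[0], :a.shape[1]]
        lap.append(a - expanded)
    lap.append(g[-1])                      # lowpass residual at the top
    return lap
```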

Analysis within the focal region is based on the flow vectors, v(i,j), computed at the local level of processing. These represent translation between successive frames, so can be stated as v_x(i,j) and v_y(i,j), the displacements in the x and y directions, respectively. Similarly, the k-th coherent motion component identified in the focal region, V_k, is a vector function composed of displacements V_x and V_y in x and y.

We take the center of the analysis region, R, to be the origin of the coordinate system used in focal analysis. In general, we use continuous variables (x, y) to indicate position at the resolution of the original image, and integer indices (i, j) to indicate positions of the discrete sample points. At level ℓ of a pyramid the sample distance is 2^ℓ. Thus v(i,j) is located at position (x = 2^ℓ i, y = 2^ℓ j) relative to the center of R.

2. Two-Stage Motion Computation

Suppose for the moment that we wish to approximate motion within the analysis region by a single coherently moving surface with velocity V. The associated support region, S, then includes all samples in R.

As we have indicated, this derivation is performed in two stages. First local motion flow vectors, v, are computed at a uniform array of points within the analysis region. Then the coherent motion V is determined by fitting an appropriate motion model to the local vectors.

The computation of flow vectors, v, will be described in Section IV. For the present we assume these vectors are available.

The motion V of a planar surface relative to the camera can be specified by components V_x and V_y that are linear in x and y:

$$V_x(x,y) = ax + by + c, \qquad V_y(x,y) = dx + ey + f. \tag{1}$$

The six parameters, a, b, ..., f, indicate a particular surface motion. Values are assigned to these parameters that minimize the squared difference between v and V, i.e., that minimize the two errors E_x and E_y:

$$E_x = \sum_{(i,j) \in S} \left[ v_x(i,j) - V_x(x,y) \right]^2, \qquad E_y = \sum_{(i,j) \in S} \left[ v_y(i,j) - V_y(x,y) \right]^2. \tag{2}$$

(A similar formulation of surface motion was used by Adiv [2].)
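The fit of Eqs. (1)-(2) is an ordinary linear least-squares problem, and E_x and E_y decouple into two independent solves. A minimal sketch in Python/NumPy (the function name and the optional confidence weighting are our own additions):

```python
import numpy as np

def fit_affine_motion(xs, ys, vx, vy, weights=None):
    """Fit the six-parameter planar motion model of Eqs. (1)-(2).

    xs, ys : coordinates of flow samples relative to the center of R
    vx, vy : measured local flow components at those points
    Returns (a, b, c, d, e, f) minimizing the squared errors E_x, E_y.
    """
    # Design matrix: each row is [x, y, 1] for one sample point.
    A = np.stack([xs, ys, np.ones_like(xs)], axis=1)
    if weights is not None:
        w = np.sqrt(weights)[:, None]
        A, vx, vy = A * w, vx * w[:, 0], vy * w[:, 0]
    # E_x and E_y decouple, so solve two independent least-squares problems.
    (a, b, c), *_ = np.linalg.lstsq(A, vx, rcond=None)
    (d, e, f), *_ = np.linalg.lstsq(A, vy, rcond=None)
    return a, b, c, d, e, f
```

The optional weights correspond to the confidence weighting of local vectors discussed below.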

It is not necessary to implement the derivation of coherent motion in two stages, as is done here. For example, Rom et al. compute the translation and rotation components of V directly from the image data [20]. However a two stage implementation has several potential advantages. First, as we have already observed, it provides a match to computing resources. The first stage computation is local and homogeneous over the analysis region, and hence is amenable to implementation in special purpose hardware. The second stage computation is global to the analysis region and is more complex than local analysis, but it entails the manipulation of much reduced data, so is suited for implementation in a general purpose computer.

Second, the two stage computation provides a mechanism for balancing contributions to V over the analysis region. In a single stage computation, an isolated high contrast feature can dominate the estimate, even though it represents a small fraction of R. In the two stage computation, all local vectors can contribute equally, or they can be weighted to reflect confidence.

Third, the two stage description provides a mechanism for determining whether observed motion in the analysis region can be reasonably well approximated as motion of a planar surface. If the first stage flow vectors have high confidence (i.e., if local image data has good contrast and frame to frame displacements are not too large), yet do not fit well to the second stage planar surface description, then motion cannot be represented by a single surface at the scale and resolution of the current analysis region.

Fourth, the two stage implementation provides an intermediate level of description at which image segmentation can be performed. This will be described below.

Finally, it should be noted that a model-fitting approach to motion analysis provides a dense motion field directly, even though pattern information may be available only at scattered points within the analysis region.


3. Tracking

The cost of directly computing estimates of local motion grows rapidly with the range of velocities a system must handle and with the precision that it must achieve. Indeed, local analysis algorithms are often limited to estimating frame to frame displacements that are less than one sample interval.

However, even large velocities can be detected with high precision if the computation is performed as a sequence of tracking and refinement steps. This procedure is analogous to eye tracking in human vision, and serves to reduce, and ultimately zero, motion of selected image regions.

To begin, local flow vectors are computed at low resolution. Since low resolution samples are far apart, this ensures that displacement is less than a sample interval. These flow vectors are used to compute the initial, crude, estimate of coherent surface motion. The estimated surface motion is used in turn to shift (translate, rotate, dilate, etc.) the first image frame relative to the second within the analysis region, thus reducing the frame to frame displacement. Local analysis can now be repeated at higher resolution to obtain a refined estimate of motion. These steps are iterated until sufficient precision is achieved.

Let v_m and V_m be the estimated local and surface motions within the region R after m iterations of the tracking process, and let F_{t-1,m} be the shifted frame F_{t-1} after m iterations.

Initially we let V_0 = 0 and F_{t-1,0} = F_{t-1}. Then for m ≥ 1 computations follow four steps (see Figure 3):

1. Compute residual flow. Local flow vectors, Δv_m, are computed from F_{t-1,m-1} and F_t.

2. Compute residual surface motion. A residual motion, ΔV_m, is computed from the flow vectors (Eqs. 1, 2).

3. Refine surface motion estimate. The coherent motion estimate is updated:

$$V_m = V_{m-1} + \Delta V_m.$$

4. Apply shift process. Frame F_{t-1} is shifted towards frame F_t in accordance with the current estimate of surface motion, V_m. This shift is performed with (bilinear) interpolation to obtain subpixel accuracy.
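The four steps above can be expressed compactly in code. The sketch below reuses fit_affine_motion from the earlier fragment; estimate_flow_samples stands for any local flow estimator returning sample coordinates and flow components (an assumed interface, not the paper's), the additive parameter update mirrors V_m = V_{m-1} + ΔV_m, and the shift's sign convention and fixed iteration count are illustrative simplifications. The paper moves to higher resolution data on later iterations, which is omitted here.

```python
import numpy as np
from scipy import ndimage

def track_surface(frame_prev, frame_next, estimate_flow_samples, n_iters=4):
    """Iterative tracking refinement (Figure 3), a minimal sketch."""
    params = np.zeros(6)                    # V_0 = 0  (a, b, c, d, e, f)
    shifted = frame_prev.astype(float)      # F_{t-1,0} = F_{t-1}
    rows, cols = np.indices(frame_prev.shape)
    x = cols - frame_prev.shape[1] / 2.0    # coordinates relative to
    y = rows - frame_prev.shape[0] / 2.0    # the center of R
    for m in range(1, n_iters + 1):
        # 1. Residual flow between the shifted frame and F_t.
        xs, ys, vx, vy = estimate_flow_samples(shifted, frame_next)
        # 2.-3. Residual surface motion via the least-squares fit of
        # Eqs. (1)-(2), then V_m = V_{m-1} + dV_m.
        params += np.array(fit_affine_motion(xs, ys, vx, vy))
        a, b, c, d, e, f = params
        # 4. Shift F_{t-1} towards F_t; bilinear interpolation (order=1)
        # gives subpixel accuracy. Sampling at (x + V_x, y + V_y) assumes
        # flow is defined at sample points of F_t.
        shifted = ndimage.map_coordinates(
            frame_prev.astype(float),
            [rows + (d * x + e * y + f), cols + (a * x + b * y + c)],
            order=1, mode="nearest")
    return params
```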

Again, these analysis steps typically begin with low resolution image data and move to higher resolution data with successive iterations.

This tracking process replaces the demanding task of directly computing large velocities with a short sequence of relatively simple refinement steps. The same or better precision is achieved with much reduced cost.

Figure 3. Feedback diagram for basic tracking. (Loop: compute residual flow; compute residual surface motion; refine surface motion estimate; apply shift process.)

Multiresolution and coarse-to-fine techniques have been used extensively in motion analysis (e.g., [3][7][11][12][14][26]). Successive refinement is also commonly used, but most often as a relaxation process to solve a large set of simultaneous equations (e.g., [5][6][15]). Such techniques differ from tracking in that the image data are held fixed through the sequence of iterations. Here the data are changed with each iteration, as one image is shifted relative to the next. Image to image displacement is made small, so that simple first order computations can yield very precise estimates of motion. (Techniques similar to tracking are proposed in [19].)

It should be observed also that tracking provides a simple and effective means for resolving aperture ambiguities that are common in motion analysis. Local motion estimates can only be obtained accurately in the direction of the local intensity gradient, perpendicular to edge-like features in the image. Here model fitting at the level of focal analysis combines local measures to obtain a complete specification of motion. If it happens that image features within the analysis region fall within a narrow range of orientation, the initial estimate of coherent motion will be biased in the direction perpendicular to this orientation. Still, a few refinement steps may correct initial errors in the direction parallel to feature orientation.

4. One-Component-at-a-Time Segmentation

In defining the tracking procedure above we assumed that the analysis region contained a single coherent motion component. In practice it will often contain multiple differently moving components. Only one of these can be tracked at a time. In this case tracking serves segmentation as well as velocity estimation functions, determining not only the velocity of a component, V_k, but its support area, S_k. A common situation is illustrated in Figure 4. A relatively small object, here a car, is seen moving relative to an extended background. Local motion vectors are similar over most of the analysis region, and are due to the camera motion.


In order to locate and track the car, it is expedient first to estimate camera motion. This is achieved by applying the tracking algorithm to the majority motion in the scene: flow vectors that differ significantly from the rest, such as those around the car, are not included in the estimation process. Tracking locks to background motion, as suggested in the figure. The car is then readily detected, and may be foveated or tracked in a second application of the tracking procedure.

Multiple coherent motion components within the analysis region can be determined through repeated applications of this majority tracking algorithm. After the k-th component has been identified, a set of local flow vectors, S̄_k, is determined that are not consistent with previously identified motion components. The algorithm is then applied to vectors in this set, further dividing it into a majority set, S_{k+1}, representing another coherently moving component, and a remainder, S̄_{k+1}.

The segmentation steps are combined with velocity estimation in the tracking algorithm (Figure 5). Let V_{k,m} and S_{k,m} be the velocity and support regions estimated for the k-th motion component after m refinement steps. The tracking procedure defined above is refined as follows:

1. Compute residual flow. Local flow vectors Δv_m are computed from F_{t-1,m-1} and F_t.

2. Compute residual surface motion. A residual motion, ΔV_{k,m}, is computed from the flow vectors in the support region S_{k,m-1} (Eqs. 1, 2).

3. Refine surface motion estimate. The estimate of the k-th surface component is updated:

$$V_{k,m} = V_{k,m-1} + \Delta V_{k,m}.$$

The support region is also updated, to include those flow vectors not in previously identified components which differ from the estimated motion ΔV_{k,m} by less than a threshold T:

$$S_{k,m} = \{(i,j) \notin S_1 \cup \cdots \cup S_{k-1} : |\Delta v_m(i,j) - \Delta V_{k,m}| < T\}.$$

4. Apply shift process. Frame F_{t-1} is shifted based on the current estimate of coherent motion, V_{k,m}.

These steps are repeated until estimates of support and velocity become stable.
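A sketch of this majority tracking, restricted for brevity to a fixed set of flow vectors (the frame-shifting of steps 1 and 4 is omitted); all names and the scalar residual test are illustrative assumptions:

```python
import numpy as np

def segment_one_component(xs, ys, vx, vy, excluded, threshold, n_iters=4):
    """One-component-at-a-time segmentation over a set of flow vectors:
    alternate between fitting the majority motion and updating its
    support region S_k. `excluded` marks vectors already assigned to
    previously identified components."""
    support = ~excluded               # start from all unassigned vectors
    params = np.zeros(6)
    for _ in range(n_iters):
        # Fit the planar motion model to vectors currently in the support set.
        a, b, c, d, e, f = fit_affine_motion(
            xs[support], ys[support], vx[support], vy[support])
        params = np.array([a, b, c, d, e, f])
        # Residual of every vector against the fitted motion.
        rx = vx - (a * xs + b * ys + c)
        ry = vy - (d * xs + e * ys + f)
        # Update S_k: unassigned vectors within threshold T of the motion.
        support = ~excluded & (np.hypot(rx, ry) < threshold)
    return params, support
```

Applied repeatedly, with each pass adding its final support set to `excluded`, this yields the sequence of components S_1, S_2, ... described above.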

Again, one-component-at-a-time segmentation can be described as a technique for decomposing a complex analysis task into a sequence of simpler tasks. It can be very difficult to identify a number of differently moving image components simultaneously, particularly when some have a small support area, and differ in velocity only slightly from larger, neighboring components. Such small differences are easily masked by noise in estimates of the local flow vectors.

Figure 4. Local flow vectors for a scene in which a small object (a car) moves relative to an extended background; most vectors reflect camera-induced background motion.
Figure 5. Combined tracking and segmentation.

… in the analysis region. The system tracks one (the dominant) component over several frames, and difference images are formed for this sequence. The pattern that is tracked cancels in the difference sequence, while the other remains. The second motion can then be identified by applying the tracking algorithm to the difference sequence.

IV. Local Motion Estimation

At the lowest level of analysis, local flow vectors, v(i,j), are computed between image frames F_{t-1} and F_t, within the analysis region R. Each vector represents frame to frame displacement and is computed within a local window, w. Window size and the resolution of image data used in estimating motion vary with the size of the analysis region and with requirements of the vision task.

Various techniques may be considered for computing the local flow vectors. The technique selected should be defined within a pyramid structure to provide ready control of resolution and window size, and to provide a match to motion processing at the focal level. The technique should also be suited for implementation in special purpose hardware.

It is not critical that the local computations yield particularly accurate motion estimates, but errors should become small as frame to frame displacements become small. High precision is then readily achieved through iterative refinement and tracking.

We have implemented several different local analysis algorithms. These yield virtually identical results when combined with focal analysis and applied to example image sequences. Here we outline two algorithms used in the examples described in the next section.

1. Local Correlation

Motion estimates may be derived from cross-correlation functions computed between images F_{t-1} and F_t within the local windows, w. An efficient, pyramid-based correlation algorithm is shown in Figure 6. This computes local correlation for all window locations and for a range of window sizes simultaneously [1]. There are six steps in the computation:

1. Laplacian pyramid construction. Laplacian pyramids L̂_ℓ and L_ℓ are constructed for the source images F_{t-1} and F_t. (Here a 'hat' is used to designate the pyramid derived from the first image of the pair, F_{t-1}. The subscript indicates pyramid level [8].) This has the effect of decomposing each image into a set of bandpass components. By computing cross-correlation measures between bandpass images the measures are made largely insensitive to changes in illumination.

2. Select bandpass level ℓ_1. A level ℓ_1 is chosen at which the sample distance is just larger than the largest expected residual motion displacement between frames. Typically ℓ_1 will decrease in successive iterations of the analysis process, as the residual motion becomes small.

3. Shift and multiply. Only cross-correlation estimates corresponding to image shifts of plus and minus one level ℓ_1 sample distance need be computed; expected image displacements are smaller than this distance. The computation begins with the formation of nine product images, P_{mn,ℓ_1}. For shifts m, n = -1, 0, 1:

$$P_{mn,\ell_1}(i,j) = \hat L^d_{\ell_1}(i+m,\, j+n)\; L^d_{\ell_1}(i,j).$$


Figure 6. Fast procedure for computing local correlation functions. (Stages: 1. build Laplacian pyramid; 2. select level ℓ_1; 3. shift and multiply; 4. integrate locally; 5. form correlation function; 6. select level ℓ_2.)

(The superscript d indicates that these pyramid levels are represented at 'double' sample density [9].)

4. Local integration. The correlation values for local image windows can now be computed simultaneously for all image regions through the construction of a Gaussian pyramid for each product image. Let P_{mn,ℓ_1,ℓ} be the ℓ-th level of this integration pyramid for P_{mn,ℓ_1}. Each sample of P_{mn,ℓ_1,ℓ} represents the result of integrating samples of the corresponding product image within a window, w_ℓ, that is Gaussian-like in shape, and that doubles in size with each increment in level ℓ.

5. Select integration level ℓ_2. A level ℓ_2 is selected at which the window size and sample spacing are appropriate for the size of the analysis region, R.

6. Formation of correlation functions. Correlation functions are now assembled for each image window. For the window w_{ℓ_2}, centered at the point (i,j) of pyramid level ℓ_2, samples of the correlation function with displacements (m,n) are obtained from the corresponding integration pyramids:

$$C_{ij,\ell_1,\ell_2}(m,n) = P_{mn,\ell_1,\ell_2}(i,j).$$

A local velocity vector can now be computed for each image window by finding the position of the peak of the corresponding cross-correlation function. This function is represented by a 3 by 3 array of samples that straddles the peak. A surface (e.g., a second order polynomial) is fit to the samples, and the position of its maximum point is determined analytically.
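As an illustration, the shift-and-multiply, local integration, and peak localization steps might look as follows at a single pyramid level. Gaussian smoothing stands in for the integration pyramid, and a separable parabola fit stands in for the full second-order surface fit; function names are our own.

```python
import numpy as np
from scipy import ndimage

def local_correlation_flow(lap_prev, lap_next, sigma=2.0):
    """Steps 3-6 at one pyramid level: nine shift-and-multiply product
    images, Gaussian local integration, and analytic subpixel peak
    localization. Assumes residual displacement under one sample."""
    corr = np.empty((3, 3) + lap_next.shape)
    for m in (-1, 0, 1):
        for n in (-1, 0, 1):
            # Product image P_mn: shifted first frame times second frame.
            shifted = np.roll(lap_prev, (-m, -n), axis=(0, 1))
            # Local integration over a Gaussian-like window w.
            corr[m + 1, n + 1] = ndimage.gaussian_filter(
                shifted * lap_next, sigma)

    def vertex(cm, c0, cp):
        # Peak of the parabola through three correlation samples.
        den = cm + cp - 2.0 * c0
        den = np.where(np.abs(den) < 1e-12, -1e-12, den)  # guard /0
        return 0.5 * (cm - cp) / den

    vy = vertex(corr[0, 1], corr[1, 1], corr[2, 1])  # row displacement
    vx = vertex(corr[1, 0], corr[1, 1], corr[1, 2])  # column displacement
    return vx, vy
```

Because only a 3 by 3 neighborhood of correlation samples is formed, the measurable displacement is limited to about one sample interval at the chosen level, which is exactly why step 2 selects ℓ_1 to bound the residual motion.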

2. Phase Shift Motion Measure

A second approach to local motion estimation is based on a local phase shift computed within the Fourier domain [22]. This technique is computationally efficient and simple. It is closely related to spatiotemporal models of human motion perception [1][24], and to earlier gradient based estimates of local motion [13][17].

Let p(m,n) be a patch of frame F_{t-1} centered at point (i,j) and within a rectangular window w of width N and uniform weight: p(m,n) = F_{t-1}(i+m, j+n). Let q(m,n) be the corresponding patch of frame F_t. The discrete Fourier transform of p is P:

$$P(u,v) = \sum_{m,n} p(m,n)\, e^{-2\pi i (um + vn)/N}.$$

Similarly let Q(u,v) be the transform for q.

Let Δx and Δy be the x and y components of velocity at point (i,j) of frame F_t. Then p(m,n) = q(m + Δx, n + Δy), and (ignoring boundary effects)

$$P(u,v) = Q(u,v)\, e^{2\pi i (u\Delta x + v\Delta y)/N}.$$

It follows that any two linearly independent points in the frequency domain can be used to determine Δx and Δy. In particular, consider frequency components (u,v) = (0,1) and (1,0). Then

$$P(1,0) = Q(1,0)\, e^{2\pi i \Delta x / N}, \qquad P(0,1) = Q(0,1)\, e^{2\pi i \Delta y / N}. \tag{3}$$

Equations 3 are invariant under a constant additive change in the illumination, since this can affect only the (0,0) Fourier coefficients of P and Q. Invariance under multiplication is obtained by considering only the phase in Equations 3:

$$\Delta x = \frac{N}{2\pi}\left[\arg P(1,0) - \arg Q(1,0)\right].$$

If P = P_r + iP_i, then

$$\arg P = \tan^{-1}(P_i / P_r).$$

By symmetry,

$$\Delta y = \frac{N}{2\pi}\left[\arg P(0,1) - \arg Q(0,1)\right].$$

Finally, velocity at point (i,j) is given by v_x(i,j) = Δx and v_y(i,j) = Δy. These are multiplied by 2^ℓ if computed at pyramid level ℓ.
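A direct transcription of this measure, under the stated assumptions (uniform window of width N, patch fully inside the frame), is only a few lines; np.angle of the coefficient ratio wraps the phase difference into (-π, π], which bounds the measurable displacement:

```python
import numpy as np

def phase_shift_flow(frame_prev, frame_next, i, j, N=5):
    """Phase shift motion estimate at point (i, j), a sketch of Eq. (3)
    and the phase-difference formulas above: displacement is read off
    as the phase difference of the lowest nonzero-frequency DFT
    coefficients of two N x N patches."""
    half = N // 2
    p = frame_prev[i - half:i + half + 1, j - half:j + half + 1]
    q = frame_next[i - half:i + half + 1, j - half:j + half + 1]
    P, Q = np.fft.fft2(p), np.fft.fft2(q)
    # Axis 1 of the array is x (columns), axis 0 is y (rows), so the
    # (u,v) = (1,0) coefficient sits at index [0, 1] and vice versa.
    dx = (N / (2 * np.pi)) * np.angle(P[0, 1] / Q[0, 1])
    dy = (N / (2 * np.pi)) * np.angle(P[1, 0] / Q[1, 0])
    return dx, dy
```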


For small neighborhood size (we use N = 5) the calculation of phase estimates of motion requires only small weighting functions and a few arithmetic operations.

3. Hierarchical Warp Motion

Coarse-to-fine procedures can be used in the computation of flow vectors, v, at the local analysis level, as well as in the computation of coherent surface motion, V, at the focal analysis level. In the following algorithm a coarse estimate of the flow field is first computed at a low resolution level of a pyramid representation of the image frames. The estimated flow is used to warp the first image towards the second at the next higher resolution pyramid level. The local computation is then repeated to obtain a refined estimate of the flow field. This warp motion algorithm was developed by Bergen and Adelson [7]. Similar procedures have been described by Quam [21] and Dengler [12].

Here we use the warp algorithm for computing flow as the first step in the tracking procedure described above. The warp and tracking procedures are closely related, however. Although applied at different levels of our analysis framework, both procedures obtain precise motion estimates by iteratively shifting image data and recomputing residual displacements.

1. Build Gaussian pyramids. Gaussian pyramids Ĝ_ℓ and G_ℓ are constructed for image frames F_{t-1} and F_t, respectively.

2. Select low resolution level ℓ_1. A level is selected at which the sample distance, 2^{ℓ_1}, is expected to be just larger than the largest displacement between image frames within region R.

3. Compute local flow. Flow vectors v_{ℓ_1}(i,j) are computed from the image pyramids at this level, Ĝ_{ℓ_1} and G_{ℓ_1}. In our implementation this step uses the phase shift estimates of motion described above, although similar results have been obtained with a least-square-error estimator [7][18][19].

4. Apply warp process. Analysis now moves to level ℓ_1 - 1. The image frame Ĝ_{ℓ_1-1} is warped in accordance with estimated local flow. Here v_{ℓ_1,1} indicates the level ℓ_1 vector field, v_{ℓ_1}, interpolated to match the level ℓ_1 - 1 sample density. Interpolation is also used in computing the warped image, Ĝ_{ℓ_1-1,1}, because displacements are less than a pixel.

5. Compute residual motion. A residual flow field, Δv_{ℓ_1-1}, is computed from the warped image Ĝ_{ℓ_1-1,1} and the original G_{ℓ_1-1} (Eqs. 1, 2).

6. Refine motion estimates. The level ℓ_1 motion estimate and the level ℓ_1 - 1 residual are combined to obtain a refined estimate of local flow:

$$v_{\ell_1-1}(i,j) = v_{\ell_1,1}(i,j) + \Delta v_{\ell_1-1}(i,j).$$

7. Select resolution level ℓ_2. Steps 4 to 6 are repeated until a desired resolution level, ℓ_2, is reached.

This process assigns a local velocity vector to each sample point at pyramid level ℓ_2. The actual support region, or window, that contributes to the value of the vector can be quite large, and is determined by the starting level, ℓ_1.
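A sketch of the coarse-to-fine loop, assuming image dimensions divisible by 2^{level_start} and a dense flow estimator estimate_flow(a, b) returning (vx, vy) arrays (e.g., the phase shift measure applied at every sample); a block-mean REDUCE stands in for the paper's pyramid filters:

```python
import numpy as np
from scipy import ndimage

def hierarchical_warp_flow(frame_prev, frame_next, estimate_flow,
                           level_start, level_stop=0):
    """Coarse-to-fine warp motion, a minimal sketch of steps 1-7."""
    # 1. Gaussian pyramids via 2x block averaging.
    def pyramid(img, levels):
        pyr = [img.astype(float)]
        for _ in range(levels):
            a = pyr[-1]
            pyr.append(0.25 * (a[0::2, 0::2] + a[1::2, 0::2]
                               + a[0::2, 1::2] + a[1::2, 1::2]))
        return pyr
    gp = pyramid(frame_prev, level_start)
    gn = pyramid(frame_next, level_start)
    # 2.-3. Initial flow at the coarsest selected level l1.
    vx, vy = estimate_flow(gp[level_start], gn[level_start])
    for lev in range(level_start - 1, level_stop - 1, -1):
        # 4. Expand the flow to the next finer level (2x sample density,
        # 2x displacement magnitude), then warp F_{t-1} towards F_t.
        vx = 2 * ndimage.zoom(vx, 2, order=1)
        vy = 2 * ndimage.zoom(vy, 2, order=1)
        rows, cols = np.indices(gp[lev].shape)
        warped = ndimage.map_coordinates(gp[lev], [rows + vy, cols + vx],
                                         order=1, mode="nearest")
        # 5.-6. Residual flow against the original frame, added in.
        dvx, dvy = estimate_flow(warped, gn[lev])
        vx, vy = vx + dvx, vy + dvy
    return vx, vy
```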

As noted, this warp motion algorithm is closely related to the tracking algorithm described above. The procedures differ in several important respects, however. For example, they place rather different constraints on the computed motion. The tracking algorithm represents motion as surfaces moving relative to the camera, and the associated shift operator is limited to translation, rotation, dilation and skew. The warp algorithm represents a general motion flow, and the associated warp operation can apply an independent motion vector to each image sample. On the other hand, the warp process constrains motion not to change too fast from sample to sample, while the tracking process can segment the analysis region into several differently moving surfaces separated by sharp boundaries. Also, the warp algorithm obtains estimates for the entire flow field at once, in a single pass through the coarse-to-fine procedure. The tracking algorithm identifies the motion of surface components, selectively, one at a time.

V. Examples

We have applied the dynamic motion analysis technique to a number of image sequences. An example is shown in Figure 7. Selected images from the original sequence are shown along the top row. In this toy scene, the camera is moving towards the right at roughly 5.1 pixels per frame, and upwards at roughly 1.1 pixels per frame. A tank moves relative to the background in Frames 9 through 13, at a rate of approximately two pixels per frame. The first three frames of the sequence are shown, along with Frame 11.

The second row in the figure shows difference images formed by subtracting each frame from the next frame in the sequence, without tracking. Due to camera motion these difference images have relatively large values, and this obscures motion of the tank in Frame 11.

The third row in the figure shows difference images formed between each image frame and the predicted image based on the previous frame and the system's estimated camera motion. In this example the analysis region is taken to be the full field of view, and only one refinement is computed each frame time. The correlation algorithm is used to compute local motion vectors.

Two aspects of the sequence are of particular interest. First, the initial three frames show rapid convergence from a zero estimate of background motion, Frame 1, to a nearly perfect estimate at Frame 3. Once tracking has locked onto the background motion, the difference image between current and predicted frames shows almost complete cancellation. Second, when the tank moves relative to the background in Frame 11, its motion stands out clearly.

Figure 8 shows the error between the actual background velocity and the estimated velocity over the sequence of 24 frames. At the start of this sequence the estimate has been set at a default value of zero. Since the actual background velocity is roughly 5 pixels per frame (5.1 to the right and 1.1 up), the error at Frame 0 is equal to this value. The process converges rapidly to match the camera motion, so that within three iterations the error is roughly 5% of a pixel. The tank's motion takes place over Frames 9 to 13, and averages 2 pixels per frame relative to the background. While the error in background tracking increases to 15% of a pixel interval in Frame 13, it is clear that the segmentation techniques used in combining local vectors have largely eliminated contamination by object motion.


Figure 7. Example showing background tracking. First row: original Frames 1, 2, 3 and 11. The tank moves in Frame 11. Second row: frame to frame difference images without tracking. Third row: frame to frame difference with tracking.


A second example is shown in Figure 9. The top row is three successive frames of an actual video sequence of a helicopter flying over a terrain. The sequence was obtained from a camera moving relative to the helicopter. The second row shows difference images without tracking and the third shows difference images as the background (majority region) is tracked.

In this example the analysis region is again taken to be the full image frame. The phase shift motion measure and warp motion algorithms are used to obtain local velocity estimates.

Here again, the helicopter stands out clearly when the background is tracked. An interesting point is that the helicopter's propeller blades can also be seen rotating in the tracked difference images, even though they are hardly visible to human observers in the original sequence. This is an example of transparent motion, and demonstrates the power of one-component-at-a-time tracking to reveal subtle components of motion in an image sequence.

Figure 8. Error in estimated motion over a sequence of 24 frames. (Vertical axis: error in pixels, 0.0 to 0.8; horizontal axis: frame, 0 to 24.)


Figure 9. Second tracking example. First row: original helicopter sequence (Frames 1, 2, 3). Second row: frame to frame difference images without tracking. Third row: frame to frame difference with tracking.

VI. Hardware

The approach to dynamic motion analysis proposed here has been motivated in part by architectural considerations. Since real-time motion analysis entails the processing of vast amounts of image data, it is essential that the flow of data through the system be organized to minimize delays and storage.

The motion analysis techniques described here are well suited for implementation within a lattice pipeline architecture [10]. We define a lattice pipeline to be a set of image processing pipelines that run in parallel, that may merge and diverge, but that do not contain loops. Image computations are decomposed into a sequence of elementary filter and arithmetic operations. These are performed on image data at different resolutions and sample rates, and within focal analysis regions that move dynamically. All operations are performed homogeneously over the corresponding arrays of image data. This means that, at most, only a few rows of image data need to be stored at each processing stage to support neighborhood operations, and that data can move continuously from stage to stage, in flow-through fashion.
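To make the flow-through idea concrete, the toy stage below applies a vertical 3-tap filter while buffering only three rows, emitting output rows as input rows arrive. The interface is hypothetical, purely for illustration:

```python
from collections import deque
import numpy as np

def streaming_3tap_stage(rows, kernel=(0.25, 0.5, 0.25)):
    """Flow-through pipeline stage: a vertical 3-tap filter that stores
    only three image rows at a time, yielding output rows as soon as
    enough input has streamed in."""
    buf = deque(maxlen=3)
    for row in rows:
        buf.append(np.asarray(row, dtype=float))
        if len(buf) == 3:
            yield (kernel[0] * buf[0] + kernel[1] * buf[1]
                   + kernel[2] * buf[2])
```

Stages of this kind can be chained as generators feeding generators, and can merge or split, forming a lattice without ever holding a full frame in memory.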

The diagram for computing local correlation, Figure 6, can be interpreted directly as a lattice pipeline. Image data enters the system on the left, flows continuously through, and exits on the right. This processing network can be assembled from a set of basic modules that may be arranged in different configurations to serve different tasks.

A prototype system, known as the Pyramid Vision Machine, based on these principles, has been built at the Sarnoff Labs [4][23]. This is capable of performing all computations required for the present motion analysis algorithm, although not at full video rate. We estimate that a system capable of processing 512 by 512 image frames, at a rate of 30 per second, could be assembled on two PC-AT size circuit boards if custom integrated components are used for pyramid construction. Systems of similar complexity have been implemented on a Data Cube [16] and a PIPE [25].

VII. Summary and Discussion

There has been a strong disposition in the computer vision research community to develop parallel approaches to motion analysis. Only through parallel computation, it is argued, can sufficient computing power be harnessed to achieve real-time analysis. However a parallel approach can make the analysis task considerably more complex, as all components of motion must be included at once in the computation.


As we have observed here, it is often possible to decompose a complex motion analysis task into a sequence of much simpler tasks. Processing is arranged as a sequence of stages, each of which identifies a conspicuous aspect of scene motion, then removes that aspect of motion from the image, thus simplifying the task as presented to the next stage.

Dynamic motion analysis achieves efficiency through this sequential decomposition of a complex analysis task into simpler tasks, by peeling off complexity, and by directing analysis to portions of a scene that are most critical to the vision task.

We have described four basic techniques for implementing dynamic analysis: foveation, two-stage motion computation, tracking, and one-component-at-a-time segmentation. Each process entails several iterations of a basic operation, but convergence is fast and the computations themselves can be relatively crude.

The dynamic approach to motion analysis holds the promise of performing real-time processing to obtain precise, robust results, within practical hardware. Here we have demonstrated application of dynamic techniques to the problem of object tracking with a moving camera. However, the techniques are appropriate, we believe, for other tasks in which real-time performance is essential.

References

[1] E. H. Adelson and J. R. Bergen, Spatiotemporal energy models for the perception of motion, J. Opt. Soc. Am. A, 2, pp. 284-299, 1985.
[2] G. Adiv, Determining 3-D motion and structure from optical flow generated by several moving objects, IEEE Trans. Pattern Analysis and Machine Intelligence, 7, pp. 384-401, 1985.
[3] P. Anandan, A unified perspective on computational techniques for the measurement of visual motion, First Intl. Conf. on Computer Vision, pp. 219-230, 1987.
[4] C. H. Anderson, P. J. Burt, and G. S. van der Wal, Change detection and tracking using pyramid transform techniques, Proc. SPIE Conf. on Intell. Robots and Computer Vision, Boston, pp. 72-78, 1985.
[5] S. T. Barnard, Stereo matching by hierarchical, microcanonical annealing, Proc. DARPA Image Understanding Workshop, pp. 592-797, 1987.
[6] S. T. Barnard and W. B. Thompson, Disparity analysis of images, IEEE Trans. Pattern Analysis and Machine Intelligence, 2, pp. 333-340, 1980.
[7] J. R. Bergen and E. H. Adelson, Hierarchical, computationally efficient motion estimation algorithm, J. Opt. Soc. Am. A, 4, p. 35, 1987.
[8] P. J. Burt, Fast filter transforms for image processing, Computer Vision, Graphics, and Image Processing, 16, pp. 20-51, 1981.
[9] P. J. Burt, A family of pyramid structures for multiresolution image processing, submitted, 1989.
[10] P. J. Burt and G. S. van der Wal, Iconic image analysis within the Pyramid Vision Machine, IEEE Workshop on Computer Architecture for Pattern Analysis and Machine Intelligence, Seattle, 1987.
[11] P. J. Burt, C. Yen, and X. Xu, Multiresolution flow-through motion analysis, IEEE Computer Vision and Pattern Recognition Conf. Proceedings, Washington D.C., pp. 246-252, 1983.
[12] J. Dengler, Local motion estimation with the dynamic pyramid, Eighth International Conf. on Pattern Recognition, Paris, pp. 1289-1292, 1986.
[13] C. L. Fennema and W. B. Thompson, Velocity determination in scenes containing several moving objects, Computer Graphics and Image Processing, 9, pp. 301-315, 1979.
[14] D. J. Heeger, Optical flow from spatiotemporal filters, First Intl. Conf. on Computer Vision, pp. 181-190, 1987.
[15] B. K. P. Horn and B. G. Schunck, Determining optical flow, Artificial Intelligence, 17, pp. 185-203, 1981.
[16] J. S. Lee and C. Lin, A novel approach to real-time motion detection, Proc. Conf. on Computer Vision and Pattern Recognition, Ann Arbor, 1988.
[17] J. O. Limb and J. A. Murphy, Estimating the velocity of moving images in television signals, Computer Graphics and Image Processing, 4, pp. 311-327, 1975.
[18] B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, Proc. DARPA Image Understanding Workshop, pp. 121-130, 1981.
[19] D. M. Martinez, Model-based motion estimation and its application to restoration and interpretation of motion pictures, PhD thesis, MIT, 1986.
[20] H. Rom, S. Peleg, and D. Keren, Motion based segmentation, submitted, 1989.
[21] L. Quam, Hierarchical warp stereo, Proc. DARPA Image Understanding Workshop, New Orleans, pp. 149-155, 1984.
[22] H. Shvaytser, J. R. Bergen, R. Hingorani, and J. Lubin, A robust and efficient algorithm for computing optical flow, submitted, 1989.
[23] G. S. van der Wal and J. O. Sinniger, Real time pyramid transform architecture, Proc. Intelligent Robots and Computer Vision, Boston, pp. 300-305, 1985.
[24] J. P. H. van Santen and G. Sperling, Temporal covariance model of human motion perception, J. Opt. Soc. Am. A, 1, pp. 451-473, 1984.
[25] A. M. Waxman, J. Wu, and F. Bergholm, Convected activation profiles and the measurement of visual motion, Proc. Conf. on Computer Vision and Pattern Recognition, Ann Arbor, 1988.
[26] R. Y. Wong and E. L. Hall, Sequential hierarchical scene matching, IEEE Trans. Computers, pp. 359-366, 1978.

Acknowledgment

We would like to thank Richard Sims of the US Army Missile Command for providing the helicopter images used in our second example.
