
This is an Open Access document downloaded from ORCA, Cardiff University's institutional repository: http://orca.cf.ac.uk/47331/

This is the author's version of a work that was submitted to / accepted for publication.

Citation for final published version:

Lu, Shao-Ping, Zhang, Song-Hai, Wei, Jin, Hu, Shi-Min and Martin, Ralph Robert 2013. Timeline editing of objects in video. IEEE Transactions on Visualization and Computer Graphics 19 (7), pp. 1218-1227. 10.1109/TVCG.2012.145

Publishers page: http://dx.doi.org/10.1109/TVCG.2012.145

Please note: Changes made as a result of publishing processes such as copy-editing, formatting and page numbers may not be reflected in this version. For the definitive version of this publication, please refer to the published source. You are advised to consult the publisher's version if you wish to cite this paper.

This version is being made available in accordance with publisher policies. See http://orca.cf.ac.uk/policies.html for usage policies. Copyright and moral rights for publications made available in ORCA are retained by the copyright holders.


Time-Line Editing of Objects in Video

Shao-Ping Lu, Song-Hai Zhang, Jin Wei, Shi-Min Hu, Member, IEEE, and Ralph R. Martin

Abstract—We present a video editing technique based on changing the time-lines of individual objects in video, which leaves them in their original places but puts them at different times. This allows the production of object-level slow motion effects, fast motion effects, or even time reversal. This is more flexible than simply applying slow or fast motion to whole frames, as new relationships between objects can be created. As we restrict object interactions to the same spatial locations as in the original video, our approach can produce high-quality results using only coarse matting of video objects. Coarse matting can be done efficiently using video object segmentation, avoiding tedious manual matting. To design the output, the user interactively indicates the desired new life-spans of objects, and may also change the overall running time of the video. Our method rearranges the time-lines of objects in the video whilst applying appropriate object interaction constraints. We demonstrate that although this editing technique is somewhat restrictive, it still allows many interesting results.

Index Terms—Object-level motion editing, Foreground/background reconstruction, Slow motion, Fast motion.

1 INTRODUCTION

Visual special effects can make movies more entertaining, allowing the impossible to become possible, and bringing dreams, illusions, and fantasies to life. Special effects are an indispensable post-production tool to help convey a director's ideas and artistic concepts.

Time-line editing during post-production is an important strategy to produce special effects. Fast motion is a creative way to indicate the passage of time. Accelerated clouds, city traffic or crowds of people are often depicted in this way. Slowing down a video can enhance emotional and dramatic moments: for example, comic moments are often more appealing when seen in slow motion. However, time-line editing is normally applied to entire frames, so that the whole scene in a section of video undergoes the same transformation of time coordinate. Allowing time-line changes for individual objects in a video has the potential to offer the director much more freedom of artistic expression, and allows new relationships between objects to be constructed. Several popular films have used such effects: e.g. characters move while time stands still in the film 'The Matrix'. Usually, such effects are captured on the set.

• S.-P. Lu, S.-H. Zhang, J. Wei and S.-M. Hu are with Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China. E-mail: [email protected], [email protected], [email protected], [email protected].

• R.R. Martin is with the School of Computer Science and Informatics, Cardiff University, Cardiff, Wales CF24 3AA, UK. E-mail: [email protected].

The time-lines of individual objects in video may be changed by cutting, transforming and pasting them back into the video during post-production. This requires fine-scale object matting, as in general new object interactions may occur within the space-time video volume: objects may newly touch or overlap in certain frames, or an occlusion of one object by another may no longer happen. Typical video object segmentation and composition methods still need intensive user interaction, especially in regions where objects interact. On the other hand, automatic or semi-automatic tracking approaches, such as mean shift tracking [1], particle filtering [2] and space-time optimization [3], can readily provide coarse matting results, e.g. a bounding ellipse that contains the tracked object together with some background pixels. We take advantage of such methods to provide an easy-to-use object-level time-line editing system. Our key idea is to retain and reuse the original interactions between moving objects, and also the relationships between moving objects and the background. In particular, moving objects are kept at their original spatial locations, as are the original object interactions, but these may occur at a different time. The user can specify a starting and ending time for the motion of each object, and a speed function (constant by default) in that interval, subject to these constraints.

We use an optimization method to adjust video objects' time-lines to best meet user-specified requirements, while satisfying constraints to keep object interactions at their original spatial positions. This optimization process is fast, taking just a second to perform for a dozen objects over 100 frames, allowing interactive editing of video. The user can clone objects, speed them up or slow them down, or even reverse time for objects, to achieve special effects in video.

Fig. 1. Time-line editing of video objects puts them in the same places but at different times, resulting in new temporal relationships between objects. Top: original video; bottom: with a slower cat. Left: trajectories of the cat and the woman in the video.

2 RELATED WORK

Video editing based on object matting and compositing is often used in digital media production. Schodl and Essa [4] extract video objects using blue screen matting, and generate new videos by controlling the trajectories of the objects and rendering them in arbitrary video locations. Video matting and compositing entail a tedious amount of user interaction to extract objects, even for a short video [5], [6], [7], [8]. A variety of approaches can be used to alleviate this, such as contour tracking [9], optical flow assisted Bayesian matting [10], 3D graph cut [11], mean shift segmentation [12], and local classifiers [13]. Even so, current methods still require intensive user interaction to perform accurate video object matting, and cannot handle objects without clear shape boundaries such as smoke, or objects with motion blur. Although path arrangement has been extensively considered in 3D animation, such as group motion editing [14], it cannot be directly used in video object path editing due to the difficulty of object extraction and compositing. In contrast, our system avoids the need for accurate object matting as moving objects are always placed at their original locations, albeit at different times, finessing the compositing problem. Even a bounding ellipse determined by straightforward tracking of the object can provide adequate matting results.

Various approaches have been devised to provide temporal analysis and editing tools for video. For example, video abstraction [15], [16] allows fast video browsing by automatically creating a shorter version which still contains all important events. A common approach to representing video evolution over time is to use key frame selection and arrangement [17], but, being frame-based, it does not permit manipulation of individual objects. Video montage [18] is similar to video summarization, but extracts informative spatio-temporal portions of the input video and fuses them in a new way into an output video volume. Barnes et al. [19] visualize the video in the style of a tapestry without hard borders between frames, providing spatial continuity yet also allowing continuous zooms to finer temporal resolutions. Again, the aim is automatic summarization, rather than user-controlled editing. Video condensation based on seam carving [20], [21], [22] is another flexible approach for removing frames to adjust the overall length of a video. The above methods generally handle information at the frame or pixel level, whereas our tools allow the user to modify objects, which is a more flexible way to rearrange video content.

Goldman et al. [2] advocate that object tracking, annotation and composition can lead to enriched video editing applications. The systems in [23], [24] enable users to navigate video using computed time-lines of moving objects and other interesting content. Liu et al. [25] present a system for magnifying micro-repetitive motions by motion based pixel clustering and inpainting. Recent work [26] provides a video mesh structure to interactively achieve depth-aware compositing and relighting. Scholz et al. [27] present a fine segmentation and composition framework to produce object deformation and other editing effects, allowing spatial changes of objects; it requires intensive user interaction as well as foreground extraction and 3D inpainting. Rav-Acha et al. [28], [29] introduce a dynamic scene mosaic editing approach to generate temporal changes, but, to avoid artifacts, require that the moving objects should not interact. Our method analyzes and models object interactions, and avoids artifacts in the output by constraining the kinds of editing allowed. We further use a visual interface to efficiently manipulate objects' life-spans and speeds in the video volume.

Object-based video summarization methods also exist, rearranging objects into an image or a short video in the style of a static or moving storyboard [30], [31]. Goldman et al. [32] present a method for visualizing short video clips as a single image, using the visual language of storyboards. Pritch et al. [33] shorten an input video by simultaneously showing several actions which originally occurred at different times. These techniques represent activities as 3D objects in the space-time volume, and produce shortened output by packing these objects more tightly along the time axis while avoiding object collisions. These approaches shift object interactions through time while keeping their spatial locations intact to avoid visual artifacts. We use the same approach of keeping object interactions at the same spatial locations, optimizing objects' locations in time to best meet the user's requirements whilst also satisfying constraints. Our method not only allows video to be condensed, but also allows the user to determine the life-spans of individual objects, including changing their starting times, speeding them up or slowing them down, or even reversing their time-lines.

Fig. 2. Pipeline. Preprocessing includes panoramic background construction and coarse extraction of moving objects. Video tubes are constructed for each extracted object. The user specifies starting times and speed for edited objects. Trajectories of video objects are optimized to preserve original interactions between objects, and relationships to the background. Resampling of objects from suitable frames produces the output.

3 APPROACH

Our system allows the user to edit objects' time-lines, and produces artifact-free video output while only needing coarse video object matting (see the bottom of Figure 4). Our approach relies on reinserting each object in the output at the same place as before, with the same background, but at a different time. Carefully handling object interactions is the key to our approach. When one object partly or wholly occludes another, or their coarse segmentations overlap, any changes to their interaction will prevent direct composition of these objects with the background. We thus disallow such changes: output is produced using an optimization approach which imposes two hard constraints, while also best satisfying the (possibly conflicting) user requests:

• Moving objects must remain in their original spatial positions (and orientations) and can only be transformed to a new time.

• Interacting objects (i.e. objects which are very close or overlapping) must still interact in the same way, at the same relative speed, although maybe at a different time.

• The user may specify new starting and ending times for objects, as well as a speed function within that duration; weights prioritise such requirements for different objects.

• Certain frames for an object may be marked as important, with a greater priority of selection in the output.

• The user may lengthen or shorten the entire video.

We allow the background to move (pan), in which case keeping objects and their interactions at the same spatial positions does not mean at the same pixel coordinates, but at the same location relative to a global static background for the scene. Thus, our method builds a panoramic background for all frames and coarsely extracts tubes representing the spatio-temporal presence of each moving object in the video. Object extraction is performed using an interactive keyframe-interpolation system, which coarsely determines a bounding ellipse in each frame for each moving object, forming a tube in video space-time. After detecting all bounding ellipses in each frame, SIFT features are extracted from the remaining background in each frame, and we follow the approach in [34] to register frames to generate the panoramic background image. We extract SIFT features for all images and also optical flow as the correspondence between adjacent frames; RANSAC is then used to match all frames to a base frame and compute a perspective matrix for each frame. We then use the homography parameters obtained from the above approach to transform each frame and its bounding ellipses to global coordinates. The bounding ellipses are labeled and adjusted on key frames by the user; in other frames they are obtained by interpolation or by applying the homography parameters, and can be manually corrected where the interpolation is inaccurate. To perform coarse matting, we directly extract the difference between each aligned image and the background image inside each object's ellipse to produce that object's alpha information for the current frame.
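As a concrete illustration of this preprocessing, the sketch below registers a frame to a base frame with SIFT features and a RANSAC homography, and derives a coarse alpha matte as the background difference inside an object's bounding ellipse. It is a minimal sketch using OpenCV, not the authors' implementation; the function names register_to_base and coarse_alpha and the thresholds are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of frame registration and coarse matting.
import cv2
import numpy as np

def register_to_base(frame, base, ellipse_mask=None):
    """Estimate a homography mapping `frame` into the coordinates of `base`.
    Pixels inside `ellipse_mask` (the moving-object ellipses) are excluded from matching."""
    sift = cv2.SIFT_create()
    mask = None if ellipse_mask is None else cv2.bitwise_not(ellipse_mask)
    k1, d1 = sift.detectAndCompute(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), mask)
    k2, d2 = sift.detectAndCompute(cv2.cvtColor(base, cv2.COLOR_BGR2GRAY), None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe ratio test
    src = np.float32([k1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)   # RANSAC, as in the text
    return H

def coarse_alpha(aligned_frame, background, ellipse_mask, thresh=30):
    """Coarse matte: absolute difference to the background, kept only inside the ellipse."""
    diff = cv2.absdiff(aligned_frame, background).max(axis=2)
    return ((diff > thresh) & (ellipse_mask > 0)).astype(np.uint8) * 255
```

A frame would then be warped into global coordinates (e.g. with cv2.warpPerspective) before differencing against the panoramic background.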

Having determined the background and moving objects, video object trajectory rearrangement involves two steps: adjusting the shapes of the video tubes within the video volume, and resampling the tubes at new times. The basic principle used in the first step is that all objects should follow their original spatial pathways, but at different times from before. Initially, the user sets a new time-line for each object to be changed, including its starting and ending time, and its speed function. These user-selected time-lines may cause conflicts with the interaction constraints, so we optimize the time-lines for all objects to best meet the user's input requests while strictly preserving the nature and spatial locations of object interactions. We also take into account any change in overall video length. This optimization may be weighted to indicate that some tubes are more important than others. The result is a new video tube for the optimized life-span of each object; this may be shorter or longer than the original. The new video tubes are now resampled, and stitched with the background to produce the overall result. Resampling is done by means of a per-object frame selection process which also takes into account that user-prioritized frames should be preferentially kept. As objects still appear in their original spatial locations, and have the same relationships with the background and other objects (but at different times), the main visual defects which may arise are due to illumination changes over time. Alpha matting with illumination adjustment overcomes this problem to a large degree.

Fig. 3. The video volume. Top-left: original trajectories, with the start and end of an interaction marked by green circles. Bottom-left: object trajectories are subdivided into sub-tubes at these points. The second sub-tube in the purple trajectory interacts with the gold trajectory. Right: trajectories mapped onto the x-y and x-t planes. Note that the red circle is not a real interaction event.

We now give some notation. We suppose the input video has $N$ frames, and the chosen number of output frames is $M$. The spatio-temporal track of a moving object is a tube made up of pixels with spatial coordinates $(x, y)$ in frames at time $t$. See the top-left of Figure 3. The purple tube representing one object interacts with another gold tube, green circles marking the beginning and end of the interaction. In such cases, we subdivide these tubes into sub-tubes at the start and end interaction points, as shown at the bottom-left of Figure 3. A point at $(x, y)$ in frame $t$ which belongs to sub-tube $i$ is denoted by $p_i(x, y, t)$. Sub-tubes are given an index $i$; all sub-tubes for a given object have consecutive indices. For example, if there were only 2 tubes, with 3 and 4 parts respectively, their sub-tubes would have indices 0, 1, 2 and 3, 4, 5, 6. We use $t_{is}$ and $t_{ie}$ to represent the start and end times of each sub-tube respectively. The second sub-tube of the purple tube (see the bottom-left of Figure 3) is an interaction sub-tube shared with the gold tube. Within such a sub-tube, both interacting objects must retain their original relative temporal relationship, so that they interact in the same way, and so that the original frames still represent a valid appearance for the interaction.
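The bookkeeping implied by this notation can be captured in a small record per sub-tube; the following is a minimal sketch, with class and field names of our own rather than from the paper.

```python
# Minimal sketch of per-sub-tube bookkeeping; names are illustrative, not from the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubTube:
    index: int                      # global sub-tube index i
    object_id: int                  # parent tube; an object's sub-tubes have consecutive indices
    t_start: int                    # t_is, first input frame of the sub-tube
    t_end: int                      # t_ie, last input frame of the sub-tube
    interacts_with: List[int] = field(default_factory=list)  # sub-tubes sharing an interaction

    @property
    def life_span(self) -> int:
        return self.t_end - self.t_start   # L_i used in Equation (2)
```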

The top-right of Figure 3 shows all tubes mapped onto the x-y plane. Potential interaction points like the red circle are not an actual intersection in the video volume, but have the potential to become one if object tubes were adjusted independently. When optimizing the new object tubes, we do so in a way which minimizes the possibility of a potential interaction becoming an actual interaction.

Our video editing process rearranges object time-lines using affine transformations of time for each sub-tube. First, however, all user-defined speed functions are applied as pre-mapping functions which adjust the trajectories of the tubes while keeping the life-spans. Thus, each sub-tube $i$ is transformed to a new output sub-tube $i'$ given by

$$p_{i'}(x, y, t) = p_i(x, y, A_i t + B_i), \qquad (1)$$

where $A_i$ determines time scaling and $B_i$ determines time shift. We find $A_i$ and $B_i$ for each sub-tube by seeking a solution which is as close as possible to the user's requests for time-line modification while meeting the hard constraints.

3.1 Optimization

Determining appropriate affine transformations of time is done by taking into account the following considerations; appropriate scaling is applied if the overall video length is to be changed. These requirements may conflict, so we seek an overall solution which is the best compromise in meeting them all. For brevity, we ignore cases in which objects are reversed in time; these can easily be handled by straightforward modifications.

Duration: Durations of life-spans should remain unchanged for unedited objects. Edited objects should have new life-spans with lengths as close as possible to those specified by the user.


Fig. 4. Original interaction preservation. Left, center: the segmented regions for two cars, taken from the original frames, shown on the global background. Right: composed result containing both cars simultaneously. Note that their original interaction is preserved. Bottom: corresponding matting results.

Temporal location: The life-spans of unedited objects should start and end as near as possible to their original starting and ending times, with time progressing uniformly between start and finish. Edited objects should start and end as close as possible to user-specified times, with time progressing uniformly in between. For objects with a user-specified speed function, the new space-time distribution of the tube is applied after mapping the original tube by the speed function.

To meet the first requirement, we define an energy term $E_D(i)$ whose effect is to enforce the appropriate life-span for each sub-tube:

$$E_D(i) = \left\| H_i - \frac{A_i L_i}{M} \right\|. \qquad (2)$$

Here $H_i$ is the desired life-span of sub-tube $i$, and $L_i = (t_{ie} - t_{is})$ is its original life-span. For edited objects, $H_i$ is given by the desired life-span of its parent tube as determined by the user, while it is set to $M L_i / N$ for unedited objects.

To meet the second requirement, we define an energy $E_L(i)$ which penalizes moving sub-tube $i$ from its desired position in time; we do so by considering the time at which each frame of the object should occur:

$$E_L(i) = \frac{1}{t_{ie} - t_{is}} \sum_{t = t_{is}}^{t_{ie}} \left\| \frac{A_i t + B_i}{M} - \frac{t}{N} \right\|. \qquad (3)$$

(This equation can be simplified to avoid explicit summation.)

These two terms are combined in an overall energy function to balance these requirements; a per-tube weight $w_i$ allows the user to indicate the importance of meeting the requirements for each sub-tube (tubes with higher weight should more closely meet the user's requirements):

$$E = \sum_i w_i \left( \lambda E_D(i) + E_L(i) \right), \qquad (4)$$

where $\lambda$ controls the relative importance of these requirements, and is set to 2.5 by default.

Fig. 5. New interaction prevention. Top: if new interactions are prevented, the cars' tubes remain separate. Bottom: otherwise, unwanted interactions may arise between tubes, resulting in the blue and white cars overlapping in an unrealistic manner.

Before we can minimize this energy, we also apply several hard constraints, as described shortly. This leads to a non-linear convex optimization problem whose unknowns are $A_i$ and $B_i$. We use CVX [35] to efficiently find an optimal solution.

3.2 Constraints

When minimizing the energy, several constraints must be imposed in addition to those described earlier.

Affine parameters: Affine parameters can only take on certain values if each object is to fit into the target number of output frames. Temporal scaling and shifting parameters should thus satisfy

$$\frac{\max(N, M)}{L_i} \ge A_i \ge 0, \qquad \max(N, M) \ge B_i \ge \min(-N, -M).$$

Tube continuity: Consecutive sub-tubes belonging to the same tube must remain connected. Continuity between the end of sub-tube $i$ and the start of sub-tube $j$ requires

$$A_i t_{ie} + B_i + 1 = A_j (t_{ie} + 1) + B_j. \qquad (5)$$

Original interaction preservation: To preserve original interaction points at which different objects start to interact, relevant object sub-tubes for the interacting objects must connect in space-time. If one object's sub-tube $i$ starts at time $t_k$ and interacts with sub-tube $j$ of another object trajectory, preservation of the initial interaction point under the affine transformation requires that

$$A_i t_k + B_i = A_j t_k + B_j. \qquad (6)$$

An example preserving the interaction between two cars is shown in the top row of Figure 4.

New interaction prevention: To prevent potential interaction points between different object tubes from becoming real interaction points, we should ensure that different objects go through them at different times. These times should be sufficiently distinct that we can rely on coarse object matting when placing objects in the final output. We thus impose the following temporal separation constraint. If sub-tube $i$ and sub-tube $j$ share a potential interaction point, we require

$$\|(A_i t_i + B_i) - (A_j t_j + B_j)\| > \epsilon, \qquad (7)$$

where $t_i$ and $t_j$ are the corresponding times for each object. In our implementation, we set $\epsilon$ to between 5 and 10, taking into account both the sizes of the objects and the speeds at which they are moving, ensuring a separation of at least 5 frames as a safety margin (as the objects may be sampled differently; see Section 3.3). Figure 5 shows the undesirable results that can happen if this constraint is not added.
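To make the optimization concrete, the sketch below assembles Equations (2)-(4) and constraints (5)-(7) for a small toy example using CVXPY (in place of the CVX/Matlab package cited above). The toy data, and the fixed sign used for the separation constraint (7) (which, taken literally, is disjunctive and therefore non-convex), are assumptions: here each potential interaction is simply kept in its original temporal order.

```python
# Minimal sketch of the time-line optimization, Eqs. (2)-(7); data and the sign
# chosen for constraint (7) are assumptions, not taken from the paper.
import numpy as np
import cvxpy as cp

N, M = 100, 100                  # input / output frame counts
lam, eps = 2.5, 5.0              # lambda of Eq. (4), epsilon of Eq. (7)

# toy sub-tubes: (t_start, t_end, desired life-span H_i, weight w_i)
subs = [(0, 39, 40.0, 1.0), (40, 69, 60.0, 2.0), (10, 79, 70.0, 1.0)]
K = len(subs)
A, B = cp.Variable(K), cp.Variable(K)

obj, cons = 0, []
for i, (ts, te, H, w) in enumerate(subs):
    L = te - ts
    t = np.arange(ts, te + 1)
    E_D = cp.abs(H - A[i] * L / M)                            # Eq. (2)
    E_L = cp.sum(cp.abs((A[i] * t + B[i]) / M - t / N)) / L   # Eq. (3)
    obj += w * (lam * E_D + E_L)
    cons += [A[i] >= 0, A[i] <= max(N, M) / L,                # affine-parameter bounds
             B[i] >= min(-N, -M), B[i] <= max(N, M)]

# Eq. (5): sub-tube 1 directly follows sub-tube 0 within the same tube
cons += [A[0] * subs[0][1] + B[0] + 1 == A[1] * (subs[0][1] + 1) + B[1]]
# Eq. (6): sub-tubes 1 and 2 start interacting at input frame t_k = 40
cons += [A[1] * 40 + B[1] == A[2] * 40 + B[2]]
# Eq. (7), keeping the original passing order (an assumption): the object of
# sub-tube 0 crosses the shared point at t = 30, that of sub-tube 2 at t = 50
cons += [(A[2] * 50 + B[2]) - (A[0] * 30 + B[0]) >= eps]

prob = cp.Problem(cp.Minimize(obj), cons)
prob.solve()
print(A.value, B.value)
```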

3.3 Object Resampling

Having determined the affine temporal scaling for each sub-tube, we separately resample each transformed sub-tube to give the object's appearance in each output frame. We use uniform resampling with user-defined weighting curves to produce the output frames. While interpolation between appropriate input samples would seem an alternative and perhaps more intuitive solution, it can introduce artifacts for various reasons, as explained in e.g. [36], and more importantly, is incompatible with our coarse matting approach.

Clearly, output samples will not necessarily fall precisely at transformed input samples. Thus, instead we select input frames for each object to generate the final output. Given a sequence of input samples, and times at which output samples are required, we use the input sample occurring at the time nearest to that of the desired output sample. Other work has also used frame-based resampling [31], [37].
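A minimal sketch of this per-object frame selection follows, assuming (as in Equations (3)-(7)) that $A_i t + B_i$ is the output time assigned to input frame $t$ of a sub-tube; the function name is ours and the user weighting curves are omitted for simplicity.

```python
# Minimal sketch of nearest-frame selection; weighting curves are omitted.
import numpy as np

def select_source_frames(A, B, t_start, t_end):
    """Return {output frame -> nearest input frame} for one sub-tube,
    so each output frame reuses an original input frame directly."""
    t_in = np.arange(t_start, t_end + 1)
    t_mapped = A * t_in + B                    # transformed (output) time of each input frame
    first = int(np.ceil(t_mapped.min()))
    last = int(np.floor(t_mapped.max()))
    return {t_out: int(t_in[np.argmin(np.abs(t_mapped - t_out))])
            for t_out in range(first, last + 1)}
```

For example, select_source_frames(0.5, 10, 0, 39) plays input frames 0-39 at roughly double speed, starting at output frame 10, by skipping approximately every other input frame.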

3.4 User Interface

Our interface offers various controls for object time-line editing (see the supplemental video), including specifying the start and end time (the $t_{is}$ and $t_{ie}$ values), the life-span ($H_i$), and graphical entry of the speed function for individual moving objects (resampling weighting), as well as the overall video length ($M$). A default resampling weight of 1 is used; values are allowed to range from 0 to 5. The user may edit the weights using the speed curve to give desired resampling weights for each frame. The user may also specify time reversal for objects, object cloning and object deletion. The interface also allows sub-tubes and output frames to be marked as having greater importance ($w_i$) in the output. During editing, unedited objects which interact with edited objects are also adjusted to ensure interactions are preserved. For example, if the user clones an object, any other interacting objects will also be cloned. If this is not desired, the user should carefully limit the portion of the object's life which is cloned to that part without interactions with other objects.

TABLE 1
Performance of the system.

Video Clip     | Fig. 1 | Fig. 6 | Fig. 7 | Fig. 8 | Fig. 9
Frames         | 150    | 270    | 240    | 160    | 230
Width          | 960    | 720    | 960    | 1280   | 640
Height         | 540    | 560    | 540    | 720    | 480
Objects        | 2      | 5      | 4      | 10     | 3
Sub-tubes      | 5      | 7      | 8      | 16     | 10
Preprocessing  | 225s   | 570s   | 300s   | 760s   | 695s
Optimization   | 0.48s  | 0.71s  | 0.66s  | 0.85s  | 0.7s
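For illustration, the controls listed above could be gathered into a per-object edit request before being passed to the optimizer; the record below is hypothetical, with field names of our own rather than from the described system.

```python
# Hypothetical record of the per-object requests exposed by the interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

@dataclass
class ObjectEditRequest:
    object_id: int
    new_start: Optional[int] = None                   # desired first output frame
    new_end: Optional[int] = None                     # desired last output frame
    life_span: Optional[int] = None                   # desired H_i; None keeps the original
    speed: Callable[[int], float] = lambda t: 1.0     # speed function, constant by default
    frame_weights: Dict[int, float] = field(default_factory=dict)  # resampling weights in [0, 5]
    importance: float = 1.0                           # w_i in Eq. (4)
    reverse: bool = False                             # reverse the object's time-line
    clones: int = 0                                   # number of extra copies requested
    delete: bool = False
```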

3.5 Performance

Our goal is to provide a system allowing objects to be extracted without laborious interaction, and which provides real-time interactive control and visualization of the editing results. The most time consuming step in our system is the preprocessing used to construct the panoramic background image and interactively extract the moving objects. The user marks a bounding ellipse on key frames for each moving object; the system takes about 0.3s per frame to track each object in a complex scene. Background construction takes 0.1s per frame. Although the preprocessing needs hundreds of seconds, it is executed just once, and after that the user is free to experiment with many different rearrangements of the video. As shown in Table 1, once the user has set parameters, optimization takes under 1 second in our examples, mostly to solve the energy of Equation 4 to find optimal object rearrangements. This time depends mainly on the number of constraints, which in turn is determined by the number of object interactions. Overall, we can readily achieve realtime performance for user interaction on a typical PC with an Intel 2.5 GHz Core 2 Duo CPU and 2 GB memory. To further improve performance, we merge video sub-tubes with start or end points less than 5 frames apart, marking the result as an intersection sub-tube. Table 1 shows timings for all examples in the paper, with numbers of objects and corresponding relation constraints.

4 RESULTS AND DISCUSSION

Fig. 6. Video editing. Top row: input video. Middle row: summarization into half the original time. Bottom row: editing to fast forward and then reverse the blue car. Frames are shown at equal time separations in each case. Note that life-spans of unedited objects are preserved to the extent possible.

Fig. 7. Video object rearrangement. Top right: order of appearance of the 4 penguins in the original video. Bottom right: output frames after interactively rearranging the penguins to appear simultaneously. Left: corresponding tubes.

We have tested our video editing algorithm with many examples, producing a variety of visual effects which demonstrate its usefulness, and that would be difficult or tedious to obtain by other means. Figures 1 and 8 show examples of object-level fast and slow motion effects and the corresponding original videos. In Figure 1 the cat moves slower. A faster cat is shown in the supplemental material. In order to preserve the original spatial location of the interaction between the cat and the girl (see the fourth column), the faster moving character enters the scene at a later time. In Figure 8, we have shortened the entire video and interactively rearranged the cars to have approximately the same speed, and to be approximately equally spaced. In Figure 7 we have moved the life-spans of the penguins to make them appear in the scene simultaneously. Figure 9 shows object cloning and reversal examples in which we make three copies of a girl on a slide; we also make the girl go up the slide rather than down. Unlike [30], which leads to ghosting, our method automatically arranges tubes to prevent object overlaps. The videos are available in the supplement.

While such applications are the main target of our approach, other less obvious effects can also be achieved. For example, we can selectively shorten a video to a user-specified length (see Figure 6) by setting object lifetimes to either their original lengths, or the desired total length, whichever is smaller. Our approach differs from previous approaches to video summarization, which either produce a summary no shorter than the longest lifetime of a single object, or, for shorter results, unnaturally cut the video tubes for objects and move parts of them in time (e.g. Pritch's method in [33]). Instead, we can speed objects up to reduce the overall time. The bottom row of Figure 6 further shows object fast forward, reverse motion and object duplication, as well as local speed editing.

As Figure 5 shows, if object inter-relationships are ignored when moving objects in time, unwanted overlaps may arise between objects originally crossing the same region of space at different times. By preventing new object interactions, we avoid such collisions between object trajectories in the output video. Unlike in Rav-Acha et al.'s method [28], [29], by preserving real interactions in our algorithm we can effectively edit objects while avoiding visual artifacts at interactions. In Figure 6, more objects are shown per frame as the overall video time is reduced. Objects follow their original paths, but spatial relationships are also well preserved. We also note that each sub-tube (not tube) is adjusted in terms of time scale and offset to meet the user's desired object time-lines. It would be extremely difficult and tedious for the user to manually adjust time-lines so as to preserve existing interactions and prevent new interactions, especially for multiple objects.

We use ellipses for masking for simplicity of implementation and ease of manipulation. The user can easily draw an ellipse in key frames to initiate object tracking. Furthermore, when tracking is inaccurate in non-key frames, the user can quickly manually correct the ellipse: a little additional user interaction can overcome minor failures in tracking. This avoids the cost and difficulty of implementation of highly sophisticated tracking methods, which still are not guaranteed to always work. Exactly how coarse matting is done is unimportant: the key idea is that accurate matting is not needed when the background is robustly reconstructed and objects retain their original locations.

In practice, it is not always necessary to extract all moving objects, provided that the ones of interest (e.g. football players) consistently occupy a different spatial area of the video to the others (e.g. spectators), so that the two groups do not interact. In this case a moving background can be used. It too must be resampled if a different length video is required, using a similar approach to that for object resampling.

5 LIMITATIONS

Although we have obtained encouraging results for various applications of our video object editing framework, our approach can provide poor quality results in certain cases. Our method is appropriate for video for which a panoramic background can be readily constructed and video objects can be tracked (as individuals, or suitable groups). In such cases, temporal adjustment and rearrangement at the object level makes it possible to produce special visual effects. Clearly, our system can fail if there is a failure to track and extract foreground objects or the background. Less obviously, if the user places unrealistic or conflicting requirements on the rearranged objects, this may result in an unsolvable optimisation problem; this may also happen if a scene is very complex and there is insufficient freedom to meet all of a user's seemingly plausible requests. Finally, if large changes occur in the lifetimes of object sub-tubes, motion of objects in the output video may appear unnatural due to the use of a frame-selection process. Different changes to lifetimes of adjacent sub-tubes may also result in unnatural accelerations or decelerations. We now discuss these issues further.

Complex backgrounds and camera motions: Our method may work poorly in the presence of background change (e.g. in lighting, even if background objects remain static) and errors in background reconstruction. Each frame in the output video includes moving objects and the panoramic background, and their composition is performed according to the alpha values obtained by coarse matting. We note that the coarse matting includes part of the background as well as the moving object, so if the background changes more than a little, visible artifacts may arise due to composition of the elliptical region with the background. Furthermore, good video composition results rely on successful background reconstruction. The panoramic background image is generated under an assumption of a particular model of camera motion, which may not be accurate; even if it is, a single static image may not exist for complex camera motions, e.g. due to parallax effects. Our method thus shares limitations in handling complex backgrounds and camera motions with several other papers [27], [30]. Robust camera stabilization and background reconstruction for more general camera motions are still challenging topics in computer vision [38], [39]. Our method can potentially be extended to such cases given advances in those areas.

Fig. 8. Video object rearrangement. Top: two frames of the original video. Bottom: two output frames after interactively rearranging the cars to be approximately equally spaced.

Fig. 9. Video object reversal and cloning. First row: girl on slide cloned. Second row: girl going up the slide.

Time-line conflicts: Preserving original interactions and preventing new ones are imposed as constraints during video editing. If the user manipulates objects inconsistently, this may lead to visual artifacts; it may even lead to an unsolvable optimization problem if there are many complex interactions and insufficient freedom to permit the desired editing operations. To avoid artifacts, and gain extra freedom, the user may resolve conflicts by moving or trimming parts of objects' time-lines to produce the desired result, or even delete whole objects. For example, in the time reversal example shown, we trimmed the last part of the girl's tube to ensure the problem was solvable.

Implausible speeds: Our algorithm focuses on preserving interrelations between paths in the time dimension. If many objects in the video have the potential to intersect (i.e. to cross a shared spatial location at different times), and certain objects are weighted for preservation, other objects can suffer unnatural accelerations or temporal jittering. This could perhaps be avoided to some extent by using a more sophisticated resampling method (as in [31]) or content-aware interpolation. The former would reduce, but not completely eliminate, motion jitter. Simple 2D interpolation can often produce visual artifacts even with accurate motion estimation, as objects have 3D shapes; such artifacts may be no more acceptable to the viewer than minor motion jitter. An alternative solution to alleviate such artifacts would introduce motion blur (e.g. using simple box filtering of adjacent frames after realigning objects' centroids [40]) for fast moving objects. Such cases could also be handled better if spatial adjustments were allowed as well as temporal ones during video tube optimization, but doing so is incompatible with our framework based on coarse segmentation and matting. Indeed, allowing spatial editing would give a much more flexible system overall. Nevertheless, it may be possible to be a little more flexible without offering full generality of spatial adjustment. If we were to restrict spatial changes to locations with similar constant coloured backgrounds, for example, we might be able to still use coarse matting, perhaps with some combination of video inpainting and graphcut matching to find an optimal new location.

Our method may also produce unnatural results if the output video is excessively stretched (or compressed), due to the use of frame selection: frames would be repeated, and motion would tend to jump. Again, an interpolation scheme of some kind could overcome this issue. A further problem which may arise is sudden changes in speed between adjacent sub-tubes, resulting in implausibly large accelerations or decelerations. This could be solved by introducing a higher-order smoothing term into our optimization framework, or even by constructing a new speed-aware optimization scheme with greater freedom. Currently, sub-tube motion rearrangement assumes an affine transformation with a time shift and scaling, giving little freedom to edit the speed in the presence of higher-order constraints. Speed-oriented modeling could be used to precisely edit the motion at the frame level, but would require more complicated user interaction. In the current system, we have preferred simplicity over intricate control, but accept that for some applications, detailed control would be desirable.

6 CONCLUSIONS

We have presented a novel realtime method for modifying object motions in video. The key idea in our algorithm is to keep object interactions at the same spatial locations with respect to the background while modifying the interaction times. This allows us to avoid the need for precise matting, reducing the need for much tedious user interaction. We optimize object trajectories to meet user requests concerning temporal locations and speeds of objects, while at the same time including constraints to preserve interrelations between individual object trajectories.

ACKNOWLEDGMENTS

We would like to thank all the anonymous reviewers for their valuable comments. This work was supported by the National Basic Research Project of China (Project Number 2011CB302205) and the Natural Science Foundation of China (Project Numbers 61120106007 and 60970100). This work was also partly supported by an EPSRC Travel Grant.

REFERENCES

[1] G. R. Bradski, "Computer vision face tracking for use in a perceptual user interface," Intel Technology Journal, vol. 2, pp. 12–21, 1998.

[2] D. B. Goldman, C. Gonterman, B. Curless, D. Salesin, and S. M. Seitz, "Video object annotation, navigation, and composition," in Proceedings of the 21st Annual ACM Symposium on User Interface Software and Technology, Oct. 2008, pp. 3–12.

[3] X. L. K. Wei and J. X. Chai, "Interactive tracking of 2D generic objects with spacetime optimization," in Proceedings of the 10th European Conference on Computer Vision: Part I, 2008, pp. 657–670.

[4] A. Schodl and I. A. Essa, "Controlled animation of video sprites," in ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2002, pp. 121–127.

[5] S. Yeung, C. Tang, M. Brown, and S. Kang, "Matting and compositing of transparent and refractive objects," ACM Transactions on Graphics, vol. 30, no. 1, p. 2, 2011.

[6] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun, "A global sampling method for alpha matting," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 2049–2056.

[7] Y. Zhang and R. Tong, "Environment-sensitive cloning in images," The Visual Computer, pp. 1–10, 2011.

[8] Z. Tang, Z. Miao, Y. Wan, and D. Zhang, "Video matting via opacity propagation," The Visual Computer, pp. 1–15, 2011.

[9] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, pp. 321–331, 1988.

[10] Y.-Y. Chuang, A. Agarwala, B. Curless, D. H. Salesin, and R. Szeliski, "Video matting of complex scenes," ACM Transactions on Graphics, vol. 21, pp. 243–248, Jul. 2002.

[11] Y. Li, J. Sun, and H.-Y. Shum, "Video object cut and paste," ACM Transactions on Graphics, vol. 24, pp. 595–600, Jul. 2005.

[12] J. Wang, P. Bhat, R. A. Colburn, M. Agrawala, and M. F. Cohen, "Interactive video cutout," ACM Transactions on Graphics, vol. 24, pp. 585–594, Jul. 2005.

[13] X. Bai, J. Wang, D. Simons, and G. Sapiro, "Video SnapCut: robust video object cutout using localized classifiers," ACM Transactions on Graphics, vol. 28, pp. 70:1–70:11, Jul. 2009.

[14] T. Kwon, K. H. Lee, J. Lee, and S. Takahashi, "Group motion editing," ACM Transactions on Graphics, vol. 27, no. 3, pp. 80:1–80:8, Aug. 2008.

[15] Y. Li, T. Zhang, and D. Tretter, "An overview of video abstraction techniques," HP Laboratory, Tech. Rep. HP-2001-191, 2001.

[16] B. T. Truong and S. Venkatesh, "Video abstraction: A systematic review and classification," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 3, pp. 1–37, 2007.

[17] C. W. Ngo, Y. F. Ma, and H. J. Zhang, "Automatic video summarization by graph modeling," in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 104–109.

[18] H. W. Kang, X. Q. Chen, Y. Matsushita, and X. Tang, "Space-time video montage," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. II: 1331–1338.

[19] C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein, "Video tapestries with continuous temporal zoom," ACM Transactions on Graphics, vol. 29, pp. 89:1–89:9, Jul. 2010.

[20] Z. Li, P. Ishwar, and J. Konrad, "Video condensation by ribbon carving," IEEE Transactions on Image Processing, vol. 18, pp. 2572–2583, 2009.

[21] K. Slot, R. Truelsen, and J. Sporring, "Content-aware video editing in the temporal domain," in Proceedings of the 16th Scandinavian Conference on Image Analysis, 2009, pp. 490–499.

[22] B. Chen and P. Sen, "Video carving," in Eurographics 2008, Short Papers, 2008.

[23] S. Pongnumkul, J. Wang, G. Ramos, and M. F. Cohen, "Content-aware dynamic timeline for video browsing," in Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology, 2010, pp. 139–142.

[24] T. Karrer, M. Weiss, E. Lee, and J. Borchers, "Dragon: A direct manipulation interface for frame-accurate in-scene video navigation," in Proceedings of the Twenty-Sixth Annual SIGCHI Conference on Human Factors in Computing Systems, Apr. 2008, pp. 247–250.

[25] C. Liu, A. Torralba, W. T. Freeman, F. Durand, and E. H. Adelson, "Motion magnification," ACM Transactions on Graphics, vol. 24, no. 3, pp. 519–526, Jul. 2005.

[26] J. Chen, S. Paris, J. Wang, W. Matusik, M. Cohen, and F. Durand, "The video mesh: A data structure for image-based video editing," in Proceedings of the IEEE International Conference on Computational Photography, 2011, pp. 1–8.

[27] V. Scholz, S. El-Abed, H.-P. Seidel, and M. A. Magnor, "Editing object behaviour in video sequences," Computer Graphics Forum, vol. 28, no. 6, pp. 1632–1643, 2009.

[28] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg, "Evolving time fronts: Spatio-temporal video warping," Hebrew University, Tech. Rep. HUJI-CSE-LTR-2005-10, Apr. 2005.

[29] A. Rav-Acha, Y. Pritch, D. Lischinski, and S. Peleg, "Dynamosaicing: Mosaicing of dynamic scenes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 10, pp. 1789–1801, Oct. 2007.

[30] C. D. Correa and K.-L. Ma, "Dynamic video narratives," ACM Transactions on Graphics, vol. 29, pp. 88:1–88:9, Jul. 2010.

[31] E. P. Bennett and L. McMillan, "Computational time-lapse video," ACM Transactions on Graphics, vol. 26, pp. 102–108, Jul. 2007.

[32] D. B. Goldman, B. Curless, D. Salesin, and S. M. Seitz, "Schematic storyboarding for video visualization and editing," ACM Transactions on Graphics, vol. 25, pp. 862–871, Jul. 2006.

[33] Y. Pritch, A. Rav-Acha, and S. Peleg, "Nonchronological video synopsis and indexing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1971–1984, Nov. 2008.

[34] M. Brown and D. G. Lowe, "Recognising panoramas," in Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003, pp. 1218–1227.

[35] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 1.21," http://cvxr.com/cvx, Dec. 2010.

[36] Y. Weng, W. Xu, S. Hu, J. Zhang, and B. Guo, "Keyframe based video object deformation," in International Conference on Cyberworlds, 2008, pp. 142–149.

[37] K. Peker, A. Divakaran, and H. Sun, "Constant pace skimming and temporal sub-sampling of video using motion activity," in IEEE International Conference on Image Processing, 2001, pp. 414–417.

[38] F. Liu, M. Gleicher, J. Wang, H. Jin, and A. Agarwala, "Subspace video stabilization," ACM Transactions on Graphics, vol. 30, no. 1, p. 4, 2011.

[39] Z. Farbman and D. Lischinski, "Tonal stabilization of video," ACM Transactions on Graphics, vol. 30, no. 4, p. 89, 2011.

[40] A. Finkelstein, C. E. Jacobs, and D. H. Salesin, "Multiresolution video," in Computer Graphics (Proceedings of SIGGRAPH 96), 1996, pp. 281–290.

Shao-Ping Lu is a Ph.D. candidate at the Department of Computer Science and Technology, Tsinghua University. His research interests include image and video processing. He is a student member of ACM and CCF.

Song-Hai Zhang obtained his Ph.D. in 2007 from Tsinghua University. He is currently a lecturer of computer science at Tsinghua University, China. His research interests include image and video processing, and geometric computing.

Jin Wei is currently a Research Assistant at the Department of Computer Science and Technology, Tsinghua University. He received his MS degree in Computer Science from Tsinghua University and his BS degree from Huazhong University of Science and Technology. His research interests include digital geometry modeling and processing, video processing and computational cameras.

Shi-Min Hu received the PhD degree from Zhejiang University in 1996. He is currently a professor in the Department of Computer Science and Technology at Tsinghua University, Beijing. His research interests include digital geometry processing, video processing, rendering, computer animation, and computer-aided geometric design. He is associate Editor-in-Chief of The Visual Computer (Springer), and on the editorial boards of Computer-Aided Design and Computers & Graphics (Elsevier). He is a member of the IEEE and ACM.

Ralph R. Martin obtained his PhD in 1983 from Cambridge University. Since then he has been at Cardiff University, as Professor since 2000, where he leads the Visual Computing research group. His publications include over 200 papers and 10 books covering such topics as solid modeling, surface modeling, reverse engineering, intelligent sketch input, mesh processing, video processing, computer graphics, vision based geometric inspection and geometric reasoning. He is a Fellow of the Learned Society of Wales, the Institute of Mathematics and its Applications, and the British Computer Society. He is on the editorial boards of "Computer Aided Design", "Computer Aided Geometric Design", "Geometric Models", the "International Journal of Shape Modeling", "CAD and Applications", and the "International Journal of CAD/CAM".