

Unconstrained 2D to Stereoscopic 3D Image and Video Conversion using Semi-Automatic Energy Minimization Techniques

Raymond Phan, Richard Rzeszutek and Dimitrios Androutsos Dept. of Electrical & Computer Engineering Ryerson University – Toronto, Ontario, Canada

Thursday, October 24th, 2012 Chinese 6 Theatre – Hollywood, California, USA


Outline of Presentation
• Introduction & Motivation
• Conversion Framework for Images
  – Random Walks
  – Graph Cuts / Depth Priors
• Conversion Framework for Video
  – Keyframes to Label
  – Label Tracking
• Results – Images and Videos


Introduction
• Creating stereoscopic content from single-view footage / 2D to 3D Conversion
  – Huge surge of popularity: Converting legacy content is very appealing
  – Current accepted method is quite accurate, but very labor intensive and difficult
  – Known as rotoscoping: Manual operators extract objects in a single frame to create left-right views
  – Much research in 2D to 3D conversion performed to alleviate difficulty, minimize time & cost


Introduction – (2)
• Goal of 2D to 3D Conversion: Depth Map
  – B & W image showing depth at each point
  – Depth maps are the main tool for conversion
• Ultimate Goal: Automatic Conversion
  – Most methods concentrate here
  – Problem: Errors cannot be easily corrected
  – May require extensive pre/post-processing
• Solution? Semi-Automatic
  – Happy medium between auto & manual


Motivation
• Semi-auto: Some user effort, rest is automatic
  – User marks objects/regions in the image based on what is close/far from camera (dark/light intensities or colors)
  – For video: Mark several keyframes
    • Allow for label propagation from the first frame, minimizing user effort
  – Using the above info, the goal is to solve for the other depths in the entire image, or entire video
  – Results: Single depth map or a sequence of them
• How do we solve?
  – Using a mixture of two semi-automatic segmentation algorithms: Random Walks & Graph Cuts


Conversion Framework – Images
• Random Walks: Energy Minimization Scheme
  – Starting from a user-defined label, what is the likelihood of a random walker visiting all unlabeled pixels in the image?
  – Goal: Classify every pixel as belonging to one of K labels
  – Pixel gets the label generating the highest likelihood
• Modify Random Walks to create depth maps
  – Likelihoods are probabilities: Span the set [0,1]
  – User-defined depths and solved depths span the same set
  – Goal is to solve for one label: The depth!


Conversion Framework – Images – (2)
  – Use Scale-Space Random Walks in our framework
    • Pyramidal sampling scheme, with Random Walks applied to each resolution → merged via geometric mean
  – User chooses values from [0,1] and brushes over the image
  – 0: Dark intensity / color, 1: Light intensity / color
  – Resulting solved probabilities are directly used as depths
• Is this valid?
  – Yes! Psych. study done at Tel Aviv Uni. in Israel
  – As long as the user is perceptually consistent in marking


Conversion Framework – Images – (3)
• Random Walks does have its issues though
  – Allows for internal depth variation, but for weak borders, results in a "bleeding effect"
  – Regions of one depth may leak into regions of another


Conversion Framework – Images – (4)
• Internal depth variation = Good!
  – Minimizes cardboard cutout effect
  – RW generates depths not originally defined by the user
  – But, we need to respect object boundaries
• Idea: Combine Random Walks with Graph Cuts
  – Graph Cuts = Hard segmentation
    • Only creates a result with depths/labels provided by the user
  – GC solves the MAP-MRF problem with user labels
  – Consider the image as a weighted connected graph
    • Solution is the max-flow/min-cut of this graph


Conversion Framework – Images – (5)
• NB: Making depth maps = Segmentation problem
  – Specifically, a multi-label segmentation problem
  – But! Graph Cuts: Binary segmentation problem (FG/BG)
  – Graph Cuts also has an integer labeling, not from [0,1]
• Must modify the above to become multi-label
  – Each unique user-defined label is given an integer label
  – Binary segmentation is performed for each label
    • FG: Label in question, BG: All of the other labels
  – Rest of the pixels are those we wish to label


Conversion Framework – Images – (6)
  – For each label, the maximum flow values are stored
  – Label assigned to a pixel is the one with the highest flow
• Problems?
  – Object boundaries respected, but no depth variation


Conversion Framework – Images – (7)
• But! We can make use of this
  – Merge Random Walks & Graph Cuts together
  – Create a depth prior: An initial depth map estimate
  – This is essentially Graph Cuts!
  – We merge by feeding the depth prior as additional information into RW
• Before we merge…
  – Depth maps for RW and GC must be compatible with each other
    • RW has depths in [0,1], GC has integer labels
  – Map the user-defined labels from RW in [0,1] to an integer set
  – Perform Graph Cuts using this integer set, and map the integer set back to [0,1] → use a lookup table to do this


Conversion Framework – Images – (8)
• Summary of Method:
  – Place user-defined strokes on the image & create a depth prior
  – Feed the depth prior with the same strokes into RW and solve
  – To modify the depth map, append strokes & re-run the algorithm


Conversion Framework – Video
• Essentially the same as images, but we need to:
  – Mark more than one image / keyframe
    • Result will be a sequence of depth maps
  – Assume no abrupt changes in video
    • If there are, separate manually, or use shot detection
• Also must be aware of memory constraints
  – Intuitive to process each frame individually
    • Fits well within memory, and can compute depth maps in parallel
    • However, this breaks the temporal relationship → flickering
  – Ideal to process all frames simultaneously in memory
    • But this will exhaust all available memory!


Conversion Framework – Video – (2)
• How do we solve?
  – Use block processing to preserve temporal coherency
  – Process blocks of frames without exhausting memory
  – Overlapping frames within blocks are left unused
    • Each block is independent of the others
• Back to labeling: How many frames do we label?
  – We allow the user the option of manually choosing which ones to label
  – However, labeling only a small set of frames will result in depth artifacts


Conversion Framework – Video – (3)
• Example – From the Sintel Sequence (http://www.sintel.org)
  – Shows what happens with 3 frames labeled & when all are labeled
Depth maps: 3 frames labeled only


Conversion Framework – Video – (4)
Depth maps: All frames labeled
• Labeling all frames is better
  – Not doing this results in depth artifacts
  – For frames having no labels, moving points quickly "fade" in depth
  – Labeling all frames is better, but can be very time consuming!


Conversion Framework – Video – (5)
• Labeling all frames is better – Part II
  – Instead of manually labeling all frames, label the first frame and use a tracking algorithm to propagate the labels
  – Adjust depths of labels when object depths change
• Label Tracking?
  – Would be very taxing to track all points in a stroke
  – Decompose a stroke at a particular depth into N points
  – Track each of these points separately
  – Reconstruct the stroke using spline interpolation


Conversion Framework – Video – (6)
• Tracker used: Tracking-Learning-Detection (TLD)
  – Long-term tracker for unconstrained video by Kalal et al.
  – Simply draw a bounding box around the object in the first frame
  – After, the trajectory is determined for the rest of the frames, accounting for size and illumination changes
• How do we use this?
  – For each point in each decomposed stroke, surround it with a bounding box and track the region
  – Reconstruct each stroke using the tracked points


Conversion Framework – Video – (7)
• Modify TLD to account for object depth changes
  – Let S = {s0, s1, …, si, …, sM-1} represent the scales of the initial bounding box drawn for a stroke point
  – Create a mapping function correlating depth with scale → the smaller the bounding box, the farther the depth
  – We know: s0 / sM-1 → farthest / closest user-defined depths d0 / dmax, & scale 1.0 is the depth of the stroke, du
  – Assume a parabolic relationship D(x) = ax² + bx + c, & solve for the coefficients (x = the bounding box scale)


Results – Images
Single Frame from Avatar Trailer


Results – Images – (2)
Shots of downtown Boston


Results – Videos – (1)
Big Buck Bunny – User-defined frame #1 + tracked frames

Frame #1


Results – Videos – (2)
Big Buck Bunny – Depth Maps

Frame #1


Results – Videos – (3)
Shell-Ferrari-Partizan – User-defined frame #1 + tracked frames
Frame #1


Results – Videos – (4)
Shell-Ferrari-Partizan – Depth Maps
Frame #1


Conclusions
• Made a semi-auto method for 2D-3D conversion
  – Auto: Needs error correction & pre/post-processing
  – Manual: Time-consuming and expensive
  – Happy medium between the two
  – Allows the user to correct errors instantly and re-run quickly
• Works for both images and video
  – Merged two segmentation algorithms together
    • Combines the merits of both methods for better accuracy
  – Video: Modified a robust tracking algorithm to track user-defined labels as well as dynamically adjust depths


The authors are solely responsible for the content of this technical presentation. The technical presentation does not necessarily reflect the official position of the Society of Motion Picture and Television Engineers (SMPTE), and its printing and distribution does not constitute an endorsement of views which may be expressed. This technical presentation is subject to a formal peer-review process by the SMPTE Board of Editors, upon completion of the conference. Citation of this work should state that it is a SMPTE meeting paper. EXAMPLE: Author's Last Name, Initials. 2011. Title of Presentation, Meeting name and location.: SMPTE. For information about securing permission to reprint or reproduce a technical presentation, please contact SMPTE at [email protected] or 914-761-1100 (3 Barker Ave., White Plains, NY 10601).

SMPTE Meeting Presentation

Unconstrained 2D to Stereoscopic 3D Image and Video Conversion using Semi-Automatic Energy Minimization Techniques

Raymond Phan, B.Eng., M.A.Sc., E.I.T., Ph.D. Candidate Richard Rzeszutek, B.Eng., M.A.Sc., Ph.D. Candidate Dimitrios Androutsos, B.A.Sc., M.A.Sc., Ph.D., P.Eng., SMIEEE

Department of Electrical & Computer Engineering, Ryerson University, 350 Victoria St., Toronto, ON, M5B 2K3, Canada. [email protected], [email protected], [email protected]

Written for presentation at the SMPTE 2012 Annual Technical Conference – Advances in 3D Technology Track

Abstract. We present a method for semi-automatically converting unconstrained 2D images and video content into stereoscopic 3D. The user is presented with the image to convert, and brushes user-defined depth strokes over certain areas. These correspond to a rough estimate of the scene depths at those points. The rest of the depths are then solved using this information, producing a depth map used to create stereoscopic 3D content. For video, the user chooses several keyframes for brushing, and the depths for the entire video are found on a volumetric basis. Additionally, for video, the user has the option of minimizing effort by employing a robust tracking algorithm, where only the first frame needs to be labeled. The labels are then propagated throughout the entire video, ultimately increasing accuracy as more frames become labeled. Our work combines the merits of two energy minimization techniques: Graph Cuts and Random Walks. The former respects boundaries, but does not have suitable depth diffusion, making the scene look like "cardboard cutouts". The latter has good depth diffusion, but object boundaries are blurred. Therefore, combining the merits of both leads to a higher quality result. Current efforts rely on automatic conversion or manual conversion by rotoscopers. The former prohibits user intervention, while the latter is time consuming, prohibiting its use in smaller studios. Semi-automatic conversion is a compromise that allows for faster and more accurate conversion, decreasing the time for studios to release 3D content. The results shown in this paper are good quality stereoscopic depth maps generated with minimal effort.

Keywords. 2D to 3D Image Conversion, 2D to 3D Video Conversion, Random Walks, Graph Cuts, Depth Maps, Depth-Label Tracking, Stereoscopy, Semi-Automatic, Image Segmentation


Introduction
Creating stereoscopic content from single-view footage, or 2D to 3D conversion, has recently seen a surge in popularity. Most of the appeal comes from the ability to convert legacy material into its stereoscopic counterpart. However, the currently accepted method for high quality conversion is a labour-intensive manual process, commonly known as rotoscoping. Specifically, two novel views must be generated for each single frame or image, using information from the frame and, in the case of video, from some frames before or after the current frame, if required. An animator extracts objects from the frame and manually manipulates them to create the left and right eye views. While producing very convincing results, this is difficult and time consuming, and inevitably quite expensive, due to the large number of manual operators required. This is prohibitive to all but the largest of studios, and thus makes conversion difficult for smaller studios, amateur film makers, and even consumers.

Despite these problems, 2D to 3D conversion is very important to stereoscopic post-production and should not be dismissed. Natural stereoscopic filming is an option, but can be difficult and expensive, and converting single-view footage into stereoscopic 3D is useful in cases where filming directly in 3D is too costly or difficult. Research into conversion techniques is ongoing in order to minimize this labour-intensive process. The goal of 2D to 3D conversion is to create a depth map, a monochromatic image that describes how much depth each point in a frame has, and which is used to create stereoscopic views [1]. The intensities in the image are directly proportional to how close those points are to the camera. The ultimate goal for conversion is a system that will automatically convert any 2D footage into 3D with minimal user input. Not only does this make the conversion more affordable, it allows for easier generation of 3D content.

Over the last few years, most methods have focused on automatic approaches to extract depth information from an image or frame. However, even though they minimize the amount of user intervention for faster conversion, it can become extremely difficult to control the results of the conversion. Also, any errors that occur in the process cannot be easily corrected. No provision is in place to correct objects that appear at the wrong depths during conversion, and extensive pre-processing or post-processing may be required to correct them. Therefore, there are advantages to pursuing a user-guided, semi-automatic approach to 2D to 3D conversion, which is our focus. In this approach, for a single image or frame, the user simply marks objects and regions according to what they believe is close to or far from the camera, denoted by lighter and darker intensities respectively. The depth labeling does not need to be accurate; it only has to be perceptually consistent, and more often than not, the labeling of the user will meet this criterion [2]. The depths for the rest of the pixels in the image are then estimated using this information. For the case of video, the user marks certain keyframes, each in the same fashion as a single image, and the rest of the depths over the entire video are estimated. By transforming the process into a semi-automatic one, the user can correct depth errors that surface during the evolution of the algorithm, should they arise. This is the ultimate reason why we focus on semi-automatic methods: to allow for faster and more accurate 2D to 3D conversion, providing a more cost-effective and economical solution, and serving as a happy medium between complex methods, such as rotoscoping, and fully automatic ones.

Motivation
The work by Guttmann et al. [2] is perhaps the best example of a semi-automated conversion method. From the user strokes and an optimization, they arrive at a sequence of depth maps, one per frame. However, their system is quite complex, requiring many processing steps to obtain


the final depth map. Specifically, four types of equations describing piecewise continuity of the depth values spatially and temporally are used, and a linear system is solved. Afterwards, training and classification with a Support Vector Machine (SVM) classifier is used to obtain the final depth values. We propose a stereoscopic processing chain that is simpler, yet demonstrates the same quality of depth map generation. For videos, depth maps are generated for each frame in the sequence, whereas for images, only a single depth map is produced. In our framework, we use a combination of two popular image segmentation methods, Graph Cuts [3][4] and Random Walks [5], to find an optimal labeling of the depths for the source images or videos, given an initial set of depth labels / strokes.

Conversion Framework
Our framework relies on the user providing the system with an initial estimate of depth, where the user sparsely marks certain objects and regions as closer to or farther from the camera. The user need not mark the entire frame, which would defeat the purpose of our framework. Closer and farther regions are denoted by lighter and darker markings respectively. Additionally, the user has the option of marking with different colors, so long as the colors range from darker to lighter. This is advantageous in case the image or sequence to convert is monochromatic; the colors help eliminate confusion when marking the depths.

Random Walks for Images
Random Walks [5] is an energy minimization scheme: starting from a label marked by the user at some pixel in the image, it determines the likelihood of a random walker visiting each of the unlabelled pixels in the image. From an image segmentation viewpoint, the goal is to classify every pixel in an image as belonging to one of K possible labels. Random Walks determines the probability of each pixel belonging to each of these labels, and the pixel is classified as the label with the highest probability. This is performed by solving a linear system of equations, and we refer the reader to [5] for a full derivation of the method.

To use Random Walks for generating depth maps, we modify the methodology in the following fashion. The probabilities within Random Walks span the real interval [0,1]. We also allow the user-defined depths, and ultimately the depths to be solved for the rest of the image or frame, to come from this interval. The goal is then to solve for only one label, which is the depth of the image or frame. As such, the user chooses values from [0,1] to brush over the image or frame, where 0 represents a dark colour or intensity, and 1 represents a bright colour or intensity. The resulting solved probabilities can be used directly as the depths for generating stereoscopic 3D content. As a means of increasing accuracy, and to combat noise, we employ Scale-Space Random Walks (SSRW) [6], which samples the image using a multi-resolution pyramid. Random Walks is applied to each scale within the pyramid, and the results are upsampled and merged using the geometric mean.
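To make this concrete, below is a minimal sketch (for illustration, not the authors' implementation) of the single-label Random Walks depth solve: seeded pixels carry user depths in [0,1], edge weights follow the usual exponential intensity-difference weighting from Grady's formulation, and the unlabelled depths come from solving the resulting sparse Laplacian system. The SSRW pyramid, geometric-mean merging, and colour images are omitted for brevity, and the parameter beta is a placeholder value.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def random_walks_depth(img, seed_depth, beta=90.0):
    """Solve depths in [0,1] for all pixels given sparse user seeds.

    img        : (H, W) float grayscale image in [0, 1]
    seed_depth : (H, W) array, NaN where unlabelled, depth in [0,1] at seeds
    beta       : contrast sensitivity of the edge weights (placeholder value)
    """
    H, W = img.shape
    n = H * W
    idx = np.arange(n).reshape(H, W)

    # 4-connected edges with exponential weights on intensity differences
    edges, weights = [], []
    for (a, b) in [(idx[:, :-1], idx[:, 1:]), (idx[:-1, :], idx[1:, :])]:
        d = (img.ravel()[a.ravel()] - img.ravel()[b.ravel()]) ** 2
        edges.append(np.stack([a.ravel(), b.ravel()], axis=1))
        weights.append(np.exp(-beta * d) + 1e-6)
    edges = np.vstack(edges)
    weights = np.concatenate(weights)

    # Graph Laplacian L = D - A (A built symmetrically from the edge list)
    A = sp.coo_matrix((np.r_[weights, weights],
                       (np.r_[edges[:, 0], edges[:, 1]],
                        np.r_[edges[:, 1], edges[:, 0]])), shape=(n, n)).tocsr()
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

    seeds = ~np.isnan(seed_depth.ravel())
    xs = seed_depth.ravel()[seeds]                       # known (user) depths
    u, s = np.where(~seeds)[0], np.where(seeds)[0]

    # Dirichlet problem: L_uu * x_u = -B * x_s
    Luu = L[u][:, u]
    B = L[u][:, s]
    xu = spsolve(Luu.tocsc(), -(B @ xs))

    depth = np.empty(n)
    depth[s] = xs
    depth[u] = xu
    return depth.reshape(H, W)
```

In the SSRW setting, this solve would be applied at each level of the image pyramid and the per-level results merged with a geometric mean, as described above.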

Unfortunately, Random Walks has issues. A consequence of the above modification is that it is possible for depths to be produced that were not originally specified. This is certainly the desired effect, as it allows for internal depth variation within objects, eliminating the perception of objects as "cardboard cutouts". However, across borders with weak contrast this results in a "bleeding" effect, where regions at a closer depth bleed, or merge, into regions of farther depth, even if these regions have a well defined border. We show an example of this in Fig. 4, which is a snapshot of an area on the Ryerson University campus. The image on the left shows the original image, the one in the middle shows the user-defined scribbles, and the one on the right is the depth map produced by SSRW. As seen in the figure, the user only needs to mark a subset of the image, and a reasonable depth map is produced as a result. Though there is a good amount of depth variation on the objects and the background, there is evidence of


bleeding around well-defined object boundaries. This is unlike Graph Cuts: as we will see later, Graph Cuts provides a hard segmentation, and its depth map consists only of labels provided by the user.

Figure 4. Generating a depth map using SSRW. Left: Original Image. Middle: User-defined depths superimposed. Right: Resulting Depth Map. Depth Label Legend: Darker colors (red) – Far points. Brighter colors (orange and yellow) – Closer points. White pixels – Very close points.

Graph Cuts for Images
Graph Cuts solves the Maximum-A-Posteriori Markov Random Field (MAP-MRF) labeling problem with user-defined constraints [3]. The solution to the problem is the most likely labeling for all pixels given these constraints. By considering the image as a weighted connected graph, the solution is the max-flow/min-cut of this graph [4]. Efficient algorithms and software have been created to perform this minimization, and we refer the reader to [4] for more details.

With the previous observations made, depth map generation in a semi-automatic approach can be considered a multi-label classification problem. However, Graph Cuts is solely a binary classification problem, where each pixel is classified as one of two labels: foreground and background. Also, the labels in Graph Cuts come from an integer label set B = {1, …, ND}, where ND is the total number of unique depths in the user-defined labeling. To perform depth estimation by Graph Cuts, each unique user-defined depth value is assigned an integer label from B. A binary segmentation is performed separately for each label b ∈ B: the user-defined labels having the label b are assigned as foreground, while the other user-defined labels serve as background. The rest of the pixels are those we wish to label. Graph Cuts is run a total of ND times, once for each label, and the maximum flow value of the graph is recorded for each label in B. If a pixel was assigned only one label, then that is the label it receives. If a pixel was assigned multiple labels, we assign the label with the highest maximum flow, which corresponds to the least energy required to classify the pixel. In some cases, even this will leave some pixels unclassified, but region-filling methods can be used to correct this.
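A minimal sketch of this label-by-label binary segmentation is shown below, using the PyMaxflow bindings to the Boykov-Kolmogorov max-flow solver (an assumed third-party library, not necessarily what the authors used). The constant smoothness weight and the hard-constraint capacities are placeholders; a fuller implementation would use contrast-sensitive pairwise terms and proper data terms.

```python
import numpy as np
import maxflow  # PyMaxflow (assumed available): Boykov-Kolmogorov max-flow

def multilabel_graph_cuts(img, seeds, hard=1e9, pairwise=10.0):
    """Assign each pixel one of the integer labels present in `seeds`.

    img   : (H, W) grayscale image (unused here; kept for a real data term)
    seeds : (H, W) int array, 0 = unlabelled, 1..ND = user-defined labels
    """
    labels = np.unique(seeds[seeds > 0])
    H, W = img.shape
    best_flow = np.full((H, W), -np.inf)
    out = np.zeros((H, W), dtype=int)          # 0 = never claimed by any label

    for b in labels:
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes((H, W))

        # Constant smoothness term between 4-neighbours
        # (a contrast-sensitive term would be used in practice)
        g.add_grid_edges(nodes, weights=pairwise, symmetric=True)

        # Hard constraints: label b -> source (FG), other labels -> sink (BG)
        src = np.where(seeds == b, hard, 0.0)
        snk = np.where((seeds > 0) & (seeds != b), hard, 0.0)
        g.add_grid_tedges(nodes, src, snk)

        flow = g.maxflow()
        fg = ~g.get_grid_segments(nodes)        # True where pixel stays with FG

        # Keep, per pixel, the label whose binary cut had the highest flow
        update = fg & (flow > best_flow)
        out[update] = b
        best_flow[update] = flow
    return out
```

Pixels that are never claimed by any label remain 0, which corresponds to the unclassified regions mentioned above that could be fixed with region filling.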

Fig. 5 illustrates a depth map example in the same style as Fig. 4, using the same labels as Fig. 4 for consistency. As can be seen, only certain portions of the image are labeled, and a reasonably consistent depth map is generated. It should be noted that some portions of the image failed to be assigned any label, such as the area to the right of the garbage can. One possibility to rectify this is to use region-filling methods, but we will demonstrate later that this is not needed. Though the result respects object boundaries well, the internal depths of each object, as well as some regions in the background, do not have any internal depth variation, or depth gradients. If this were used for stereoscopic visualization, the objects would appear as "cardboard cutouts". Noting the merits of Random Walks, where depth variation within objects is desired, and combining this with the hard results of Graph Cuts, Random Walks should allow for depth variation to make objects more realistic, while Graph Cuts can eliminate the bleeding effect and respect hard boundaries.


Figure 5. Generating a depth map with Graph Cuts. Left: Original Image. Middle: User-defined depths superimposed. Right: Resulting Depth Map. Depth Label Legend: Darker colors (red) – Far points. Brighter colors (orange and yellow) – Closer points. White pixels – Very close points

The Use of Depth Priors
To merge the two depth maps together and achieve greater accuracy, we introduce the notion of a depth prior [7]. This is an initial depth estimate, providing a rough sketch of the overall depth in the scene. Here, the depth prior is essentially the Graph Cuts depth map, and should help in maintaining the strong boundaries in the Random Walks depth map. Before we merge the two together, we must modify the depth prior. The depth map of Random Walks is in the continuous range [0,1], while the depth map of Graph Cuts is in the integer range [1,ND]. The two depth maps correspond to each other, but one needs to be transformed so that they are compatible. As such, when the Graph Cuts depth map is generated, it is processed through a lookup table T[k], where k is an integer label in the Graph Cuts depth map. The goal of the lookup table is to transform the integer labeling into a labeling compatible with Random Walks. This is done by taking the user-defined depths in [0,1] for Random Walks and sorting them into a unique list in ascending order. Each depth in the list is assigned an integer label from {1, 2, …, ND}, keeping track of which continuous depth corresponds to which integer label. When Graph Cuts has completed, this correspondence is used to map back to the continuous range. To finally merge the two depth maps together, the depth prior is fed into the Random Walks algorithm as an additional channel of information.
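The lookup table T[k] can be built directly from the user's stroke depths; a small sketch, with illustrative values, follows.

```python
import numpy as np

def build_depth_lut(user_depths):
    """Map user depths in [0,1] to integer labels 1..ND and back.

    user_depths : 1-D array of the depths attached to the user's strokes
    Returns (to_int, lut): to_int[d] gives the integer label for a stroke
    depth d, and lut[k] recovers the continuous depth for integer label k.
    """
    unique_depths = np.unique(user_depths)            # sorted, ascending
    to_int = {d: k for k, d in enumerate(unique_depths, start=1)}
    lut = {k: d for d, k in to_int.items()}           # T[k]: integer -> [0,1]
    return to_int, lut

# Example: strokes drawn at depths 0.1, 0.5 and 0.9
to_int, lut = build_depth_lut(np.array([0.5, 0.1, 0.9, 0.5]))
# to_int maps 0.1 -> 1, 0.5 -> 2, 0.9 -> 3, and lut[2] recovers 0.5
```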

To briefly summarize, we first generate a depth prior. This information is then fed into Random Walks to generate the final depth map, which is the one ultimately used for 2D to 3D conversion. Fig. 6 shows an example of the merged results. The left image shows the depth map generated by SSRW, and the middle shows the depth map created by Graph Cuts. Finally, the right shows the merged depth map. When compared to the left and middle images, the merged result contains the most desirable aspects of the two. The depth prior has consistent and noticeable borders for the objects in the scene, while the Random Walks depth map contributes subtle texture and gradients to those objects. The trees and shrubbery in the original image are now much better differentiated from the background and neighboring objects than before. In addition, the areas that were left unclassified in the depth prior, the "holes", have been filled in the final depth map without using any region-filling methods. This is because Random Walks treats the depth prior as an additional channel of information, which can be weighted accordingly to emphasize the object boundaries more.
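One simple way to realize "feeding the depth prior in as an additional channel" is to append the prior (mapped back to [0,1]) to each pixel's feature vector before computing the Random Walks edge weights, with a weight controlling how strongly the prior's boundaries are emphasized. The sketch below assumes this interpretation, and alpha and beta are made-up parameter names; it is not the authors' exact weighting.

```python
import numpy as np

def edge_weight_with_prior(c_p, c_q, prior_p, prior_q, beta=90.0, alpha=2.0):
    """Edge weight between neighbouring pixels p and q.

    c_p, c_q         : colour/intensity feature vectors of the two pixels
    prior_p, prior_q : depth-prior values (from Graph Cuts, mapped to [0,1])
    alpha            : how strongly the prior's boundaries are emphasized
    """
    d_img = np.sum((np.asarray(c_p, float) - np.asarray(c_q, float)) ** 2)
    d_prior = (prior_p - prior_q) ** 2
    # Treat the prior as one more channel, weighted by alpha
    return np.exp(-beta * (d_img + alpha * d_prior)) + 1e-6
```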

Figure 6. Merging the two frameworks together. Left – Depth map via SSRW. Middle – Depth map via Graph Cuts. Right – Combined result using a depth prior / Graph Cuts.

Conversion Framework for Video
In our framework, we consider video as a special case: a single image is a video sequence consisting of one frame, and a video is essentially a sequence of images at a given frame rate. Generating depth maps for video is essentially the same as for images, except that instead of labeling only one frame, several keyframes need to be labeled. The number of keyframes to label is an issue that we discuss later. In our framework, we assume that there are no abrupt changes in the video being processed. If a video has any shot changes, it must be split up into smaller portions, where each portion does not contain a shot change. This can be done either manually, or using an automated shot detection system.
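The automated shot detection mentioned above could be as simple as thresholding global histogram differences between consecutive frames; the following sketch is illustrative only, and the bin count and threshold are arbitrary placeholders.

```python
import numpy as np

def split_on_shot_changes(frames, bins=32, threshold=0.5):
    """Split a frame sequence into shots via global histogram differences.

    frames    : iterable of (H, W) grayscale frames as floats in [0, 1]
    threshold : L1 difference between normalized histograms that flags a cut
    Returns (start, end) index pairs, end exclusive, one per detected shot.
    """
    frames = list(frames)
    cuts, prev = [0], None
    for i, f in enumerate(frames):
        hist = np.histogram(f, bins=bins, range=(0.0, 1.0))[0] / f.size
        if prev is not None and np.abs(hist - prev).sum() > threshold:
            cuts.append(i)                      # shot boundary before frame i
        prev = hist
    cuts.append(len(frames))
    return list(zip(cuts[:-1], cuts[1:]))
```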

There are some caveats to look out for when dealing with video. One of them is memory. The most intuitive option is to process each frame independently, which requires little memory and allows the frames to be processed in parallel. Unfortunately, this breaks the temporal relationship and manifests itself as a flickering effect: subtle changes between frames due to camera motion, or the position of the labels, can result in different depths in regions of the frame that have not actually changed. The ideal situation would be to process the entire sequence in a single volumetric fashion, but this does not bode well with memory. As a compromise, and to preserve temporal coherency, we use a block processing scheme, where frames are processed in overlapping blocks. The size of a block is large enough so that the depth maps can be generated without completely exhausting the available memory. The overlapping frames are left unused, since each block is treated as independent of the others. While this breaks the temporal relationship between blocks, the overlap minimizes the change between them. This provides a good balance between memory, temporal coherency and processing time.
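The overlapping block scheme can be sketched as follows; the block length and overlap below are illustrative values, not the paper's, and solve_depth_volume is a hypothetical stand-in for the volumetric solver.

```python
def overlapping_blocks(num_frames, block_len=24, overlap=4):
    """Yield (start, end) frame ranges, end exclusive, for block processing.

    Consecutive blocks share `overlap` frames; one block's results for the
    shared frames are discarded, which keeps blocks independent while
    limiting the jump between them.
    """
    start = 0
    while start < num_frames:
        end = min(start + block_len, num_frames)
        yield start, end
        if end == num_frames:
            break
        start = end - overlap

# Each block would then be stacked into a volume and passed to the
# volumetric solver (hypothetical function name):
# for s, e in overlapping_blocks(len(frames)):
#     depth_block = solve_depth_volume(frames[s:e], labels[s:e])
```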

Keyframes to Label
Besides memory, the major problem is applying labels to keyframes. Specifically, one may ask how many frames must be labeled in order to obtain accurate depth maps. We allow the user the option of labeling as many frames in the sequence as they wish. While it is possible to label only a small set of frames to alleviate most of the work, this results in a number of depth artifacts. Fig. 7 illustrates a sample of frames from the Sintel sequence (http://www.sintel.org), with their labels overlaid using a color scheme. Fig. 8 illustrates some depth maps within the sequence, showing what happens when just the frames from Fig. 7 are labeled. Finally, Fig. 9 illustrates what happens when all frames within the sequence are labeled, in the same style as Fig. 8. In both Fig. 8 and Fig. 9, from left to right, top to bottom, are four frames from the sequence that were not marked. The color scheme for the labels is the same as in Figs. 4, 5 and 6.

Figure 7. Labeling for three frames in the Sintel sequence.


Figure 8. Depth map results for the Sintel sequence using only the frames of Fig. 7, for labeling.

Figure 9. Depth map results for the Sintel sequence when all frames are labeled.

For the frames that had no labels, the rapidly moving parts of those frames quickly "fade" in depth, i.e. they appear to move away from the camera. This is clearly not the case, as can be seen in Fig. 7. If all of the frames are labelled appropriately, then the depth remains consistent over all frames. While labeling every frame produces the best results, it is not an easy task: even for a modest sequence, labeling each frame manually is quite tedious and would take a considerable amount of time. However, if the labeling is performed in a more automated fashion, this would certainly simplify the task of creating a more detailed labeling, and thus increase the accuracy of the depth maps. We do this through the use of a computer vision tracking algorithm for tracking objects in unconstrained video. The idea is to label only the first frame with the user-defined depth labels; these labels are then tracked throughout the entire video sequence. In most video shots, there will eventually be areas where the depth changes from what the user defined it to be, so there has to be a way of dynamically adjusting those depths when those situations arise. We have devised such a method, and we now describe it in greater detail.


Label Tracking
As user-defined depth strokes will inevitably vary in shape and length, it would be computationally intensive to track all points assigned an initial depth, hindering the system's real-time property. As such, we consider a single stroke S, with an associated user-defined depth du, to be an ordered set of N points. If the stroke is uniformly subsampled to contain roughly five to ten points, it can be represented faithfully, with a high degree of fidelity, by cubic spline interpolation between those points. In addition, the width of each stroke needs to be stored for faithful reconstruction. As such, in our framework, each stroke is first thinned using binary morphological thinning, reducing it to a minimally connected stroke. The stroke is then uniformly subsampled and represented by these points. In the case where the user draws overlapping strokes at the same depth, a clustering step merges them into a single stroke. All subsampled points are individually tracked using a robust tracking algorithm, and their depths are automatically adjusted when necessary. After the points have been tracked, cubic spline interpolation with the stored widths is used to construct the final strokes. This process effectively creates "user-defined" strokes across all frames, except that a tracking algorithm produced them. It reduces the amount of interaction, while maintaining the same accuracy of marking as would be achieved by the user.
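Assuming each stroke is available as an ordered polyline of recorded points (stroke thinning, width bookkeeping, and clustering of overlapping strokes are omitted), the subsampling and spline reconstruction steps might look like the sketch below, using SciPy's parametric splines.

```python
import numpy as np
from scipy.interpolate import splprep, splev   # parametric cubic splines

def subsample_stroke(points, n_keep=8):
    """Keep roughly n_keep uniformly spaced points from an ordered polyline.

    points : (N, 2) array of (x, y) positions recorded while drawing
    """
    idx = np.linspace(0, len(points) - 1, num=min(n_keep, len(points)))
    return points[np.round(idx).astype(int)]

def reconstruct_stroke(tracked_points, samples=200):
    """Rebuild a dense stroke from tracked control points by spline fitting.

    Assumes at least two distinct control points survived tracking.
    """
    pts = np.asarray(tracked_points, dtype=float)
    k = min(3, len(pts) - 1)                    # cubic when enough points
    tck, _ = splprep([pts[:, 0], pts[:, 1]], s=0, k=k)
    u = np.linspace(0.0, 1.0, samples)
    x, y = splev(u, tck)
    return np.stack([x, y], axis=1)             # dense (x, y) samples
```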

Robust Tracking Algorithm
In order to robustly track the label points, a long-term tracker should be employed, one that can properly handle unconstrained video; most classical techniques do not handle these situations very well. Recently, Kalal et al. [8][9] designed such a framework, closely integrating adaptive tracking with online learning of appearances specific to the object being tracked. The user simply draws a bounding box around the object of interest in the first frame, and its trajectory is determined over the entire video sequence using only information from the first frame, together with the online learning. The online detector is designed to incorporate the multiple appearances of the object, including size and illumination changes, that will inevitably be encountered in the sequence, making it very robust. We use this to track each thinned point along a stroke individually, with a bounding box centered at each point.
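As a rough illustration of per-point tracking, the sketch below uses the TLD implementation shipped with OpenCV's contrib package (exposed as cv2.legacy.TrackerTLD_create in recent versions); whether this matches the exact variant used by the authors is an assumption, and the bounding box size is a placeholder.

```python
import cv2  # requires opencv-contrib-python for the legacy TLD tracker

def track_stroke_points(frames, points, box_size=24):
    """Track each stroke control point with its own TLD tracker.

    frames   : list of BGR frames; the first one is the labeled keyframe
    points   : list of (x, y) control points in the first frame
    box_size : side length of the bounding box centred on each point (placeholder)
    Returns per-frame lists of (x, y, scale) per point, or None if lost.
    """
    trackers = []
    for (x, y) in points:
        t = cv2.legacy.TrackerTLD_create()      # assumed OpenCV contrib API
        bbox = (int(x - box_size / 2), int(y - box_size / 2), box_size, box_size)
        t.init(frames[0], bbox)
        trackers.append(t)

    results = []
    for frame in frames[1:]:
        per_frame = []
        for t in trackers:
            ok, (bx, by, bw, bh) = t.update(frame)
            if not ok:
                per_frame.append(None)          # point lost in this frame
                continue
            scale = bw / float(box_size)        # used later to adjust depth
            per_frame.append((bx + bw / 2.0, by + bh / 2.0, scale))
        results.append(per_frame)
    return results
```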

Robust Adjustment of Depths
When objects move farther from or closer to the camera, the depths need to be automatically adjusted to reflect this. To realize this, the main feature used is the ability of the aforementioned framework to detect objects over multiple scales. Let S = {s0, s1, …, si, …, sM-1} represent the scales of the initial bounding box of size R x C created in the first frame, such that for a scale si, the bounding box has dimensions siR x siC. When tracking is performed, as the object moves closer to the camera, the size of the bounding box required for the object will inevitably increase, as its perceived size increases; the inverse holds when it moves farther away. It should be noted that this kind of motion will most likely be non-linear in nature. Therefore, we use a mapping function that relates the scale of the detected bounding box to the depth assigned to that point, modelled as a simple parabola. If D(x) represents the mapping function for a scale x ∈ S, then D(x) = ax² + bx + c. In addition, two details are already known: the smallest scale s0 corresponds to the farthest depth, d0, and the largest scale sM-1 corresponds to the closest depth, dmax. Finally, the initial bounding box has a scale of 1.0, with the depth assigned by the user, du. The coefficients are found by solving the following system:


\begin{bmatrix} s_0^2 & s_0 & 1 \\ 1 & 1 & 1 \\ s_{M-1}^2 & s_{M-1} & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} d_0 \\ d_u \\ d_{max} \end{bmatrix} \qquad (16)

For each frame, the bounding box scale automatically adjusts the depth at this point along the stroke. This relationship between scales and depths functions quite well: if the current bounding box is at the original scale, only horizontal motion exists, and as it moves closer or farther, the depth within this bounding box is automatically adjusted. This also works well when different parts of the object appear at different depths (i.e. when the object is at an angle).
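Since equation (16) is just a 3×3 linear system, the coefficients can be solved for directly; a minimal sketch follows, with made-up example values.

```python
import numpy as np

def fit_depth_scale_map(s0, sM, d0, dmax, du):
    """Fit D(x) = a*x**2 + b*x + c from the three known (scale, depth) pairs.

    s0, sM : smallest and largest bounding-box scales (s_0 and s_{M-1})
    d0     : depth at the smallest scale (farthest)
    dmax   : depth at the largest scale (closest)
    du     : user-assigned depth at the initial scale 1.0
    """
    A = np.array([[s0 ** 2, s0, 1.0],
                  [1.0,     1.0, 1.0],
                  [sM ** 2, sM, 1.0]])
    a, b, c = np.linalg.solve(A, np.array([d0, du, dmax]))
    return lambda x: a * x ** 2 + b * x + c

# Example (made-up values): scales span 0.5..2.0, user depth 0.6,
# far depth 0.2, near depth 0.95
D = fit_depth_scale_map(0.5, 2.0, 0.2, 0.95, 0.6)
# D(1.0) equals 0.6 by construction; D(1.4) gives the adjusted depth
# as the tracked bounding box grows and the object approaches the camera.
```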

Results

Image Examples
In this section, we show examples using images or single frames. We start with a frame from the Avatar film trailer, showing the Unobtanium mineral floating on a levitation pad. Fig. 11 illustrates this image on the left, with the associated depth map on the right, using our framework for converting images. For this example, we decided to use gray-level intensities to reflect the user-defined strokes: darker and lighter strokes denote farther and closer points respectively. The user-defined strokes are shown superimposed on the Avatar frame. As we can see in the depth map, the depths of the hands, the levitation pad, as well as the rock are captured quite well. The internal variation of the objects will minimize the cardboard cutout effect. With respect to the background, there is a smooth depth gradient for areas farther from the camera. In addition, the depth prior respects the edges and object boundaries, while there is still some depth variation within their interiors, clearly showing the merit of combining both methods to give a more accurate result. Also, by examining the image on the left, not many user-defined depth strokes are required to get a good quality depth map.

Figure 11. An Avatar frame along with its associated user-defined labeling (left), and the final depth map (right).

For our final image example, Fig. 12 illustrates a shot of downtown Boston, where the image on the left has its labels superimposed, while the image on the right is the resulting depth map. For the user-defined labeling, we decided to use a color scheme to clearly illustrate the depths at these strokes. The resulting depth map shows a very good representation of depth in the scene. With respect to the background, there is also a smooth depth gradient for areas farther from the camera. The edges and object boundaries, especially those along the buildings, are well defined, with internal depth variation like that seen in Fig. 11. The depth prior also helps in creating a good quality depth map, and only a few brush strokes are required to create this result.


Figure 12. Downtown Boston with its associated labeling (left), and the final depth map (right)

Video Example
We present an example where an object is non-stationary, taken from the Shell-Ferrari-Partizan sequence. Only the first frame is labeled, and what is interesting is that the race car is, at first, close to the camera, and then moves farther and farther away. This clip is a good test of how well the robust tracker can adjust the depth of the stroke as the shot proceeds in time. Figs. 13 and 14 show the tracking results and the depth map results. For both figures, the first row shows the first frame, while the next two rows, from left to right, top to bottom, illustrate frames as the sequence progresses. It should be noted that these rows are results from the tracker and are not user-defined. As seen in Fig. 13, the depth labels along the body of the car are automatically adjusted as the car moves away from the camera. In addition, the rest of the scene is stationary, and so the robust tracker does not move the corresponding points, as expected. The depth maps in Fig. 14 reflect this depth adjustment: the depths in the scene are stationary except for those of the car, which become farther from the camera as the car moves away, also as expected. The object boundaries are well respected, with smooth depth gradients within them, as seen in previous examples.

Fig. 13 – The Shell-Ferrari-Partizan sequence. Top row – Original user-defined labeled frame. Second and third rows – Various tracked frames throughout the sequence.

Fig. 14 – The Shell-Ferrari-Partizan sequence – Depth Maps. Top row – Original user-defined labeled frame. Second and third rows – Various tracked frames throughout the sequence.

Conclusions
We presented a semi-automatic framework for obtaining depth maps for images and video sequences, converting them from 2D into stereoscopic 3D. Semi-automated algorithms are preferable to automated ones, as we can directly control the perceived depth of objects in the scene. Our work is similar to Guttmann et al.'s, but simpler. We have incorporated two existing image segmentation algorithms in a novel way to produce stereoscopic images and video sequences. The incorporation of Graph Cuts into the Random Walks paradigm produces a result that is better than either on its own. However, the quality of the final map depends on both the user input and the depth prior: if the user has made an error in their labelling, then the result will also be affected. In practice, the actual labelling process is quite intuitive and straightforward. Once the user has provided the keyframes, and if there are errors, they can adjust the labels so that they are properly placed on the different objects in the scene. Verifying that the depths are correct can be done very quickly, by finding the depths for just a single frame without considering any of the other frames. Once the user is satisfied that the labeling will work, they can allow the system to solve for the entire sequence. We provide the user with a method to track labels, based on a robust computer vision tracker, as a means of ensuring high accuracy while greatly reducing user input. Finally, the presented framework is conceptually simpler than other approaches, while achieving similar results. Because our work is based on a semi-automatic approach, a user can correct errors in the depth maps, something not easily possible with automatic methods.

References
[1] C. Fehn, R. de la Barre and S. Pastoor, "Interactive 3-DTV: Concepts and Key Technologies", Proc. of the IEEE, 94(3): 524-538, March 2006.

[2] M. Guttmann, L. Wolf and D. Cohen-Or, "Semi-automatic Stereo Extraction from Video Footage", Proc. IEEE ICCV, 2009.

[3] Y. Boykov, O. Veksler and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts”, IEEE Trans. on PAMI, 23(11): 1222-1239, 2002.

[4] Y. Boykov and G. Funka-Lea, "Graph Cuts and Efficient N-D Image Segmentation", Intl. Jnl. of Comp. Vis., 70(2): 109-131, 2006.



[5] L. Grady, “Random Walks for Image Segmentation”, IEEE Trans. on PAMI, 28(11): 1768-1783, 2006.

[6] R. Rzeszutek, T. El-Maraghi and D. Androutsos, “Interactive Rotoscoping through Scale-Space Random Walks”, Proc. IEEE ICME, pp. 1334-1337, 2009

[7] R. Phan, R. Rzeszutek and D. Androutsos, “Semi-Automatic 2D to 3D Image Conversion using Scale-Space Random Walks and a Graph Cuts Based Depth Prior”, Proc. IEEE ICIP, pp. 865-868, 2011.

[8] Z. Kalal, J. Matas, and K. Mikolajczyk, “Online Learning of Robust Object Detectors during Unstable Tracking,” Proc. IEEE ICCV - 3rd On-Line Learning for Comp. Vis. Workshop, 2009.

[9] Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints”, Proc. IEEE CVPR, 2010.
