


IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY 2013 129

Casual Stereoscopic Photo Authoring

Feng Liu, Yuzhen Niu, and Hailin Jin

Abstract—Stereoscopic 3D displays have become increasingly popular in recent years. However, authoring high-quality stereoscopic 3D content remains challenging. In this paper, we present a method for easy stereoscopic photo authoring with a regular (monocular) camera. Our method takes two images or video frames captured with a monocular camera as input and transforms them into a stereoscopic image pair that provides a pleasant viewing experience. The key technique of our method is a perceptually plausible image rectification algorithm that warps the input image pairs to meet the stereoscopic geometric constraint while avoiding noticeable visual distortion. Our method uses spatially-varying mesh-based image warps. Our warping method encodes a variety of constraints to best meet the stereoscopic geometric constraint and minimize visual distortion. Since each energy term is quadratic, our method eventually formulates the warping problem as a quadratic energy minimization, which is solved efficiently using a sparse linear solver. Our method also allows both local and global adjustments of the disparities, an important property for adapting the resulting stereoscopic images to different viewing conditions. Our experiments demonstrate that our spatially-varying warping technique can better support casual stereoscopic photo authoring than existing methods, and our results and user study show that our method can effectively use casually taken photos to create high-quality stereoscopic photos that deliver a pleasant 3D viewing experience.

Index Terms—Stereoscopic photo authoring, stereoscopic photography, image rectification.

I. INTRODUCTION

STEREOSCOPIC photography records a pair of images of a scene as it is seen by the two eyes of a viewer. Compared to a single image, a stereoscopic image pair has the advantage of enhanced depth perception thanks to an additional depth cue, stereopsis, which is present only between two images. Since stereoscopic photography re-creates the illusion of depth, it provides a lifelike viewing experience.

Stereoscopic photography first appeared around one hundred years ago. However, stereoscopic 3D displays have not been as accessible to common users as regular 2D displays, and creating appropriate stereoscopic content has been even harder than creating 2D

Manuscript received July 24, 2011; revised December 05, 2011 and March 14, 2012; accepted June 09, 2012. Date of publication October 16, 2012; date of current version December 12, 2012. This work was supported by the Portland State University Faculty Enhancement Grant. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xian-Sheng Hua.

F. Liu and Y. Niu are with the Computer Science Department, Portland State University, Portland, OR 97207 USA (e-mail: [email protected]; [email protected]).

H. Jin is with the Advanced Technology Labs, Adobe Systems Incorporated, San Jose, CA 95110 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2012.2225033

Fig. 1. Stereo camera rig. A stereo camera rig is rectified to simulate the human vision system. The optical axes of the two cameras are parallel to each other and perpendicular to the baseline, as shown in (b). A pair of images casually taken by a regular monocular camera, however, usually does not meet this requirement, as shown in (a). This paper presents a technique to rectify two such images into a stereo image pair as if they were taken by a rectified stereo camera rig. (a) Unrectified camera rig. (b) Rectified camera rig.

images and videos. As a result, stereoscopic 3D remained unpopular until recent years. Lately we have observed a tremendous resurgence of interest in stereoscopic 3D, particularly in the entertainment and consumer electronics industries. A wide variety of 3D devices are available, ranging from 3D televisions to a range of mobile devices. However, creating high-quality stereoscopic 3D content still remains a challenge for common users, as stereoscopic photography not only requires special devices but also more expertise than traditional (monocular) photography.

A good stereoscopic image pair has to satisfy a particular geometric constraint in order for the human visual system to fuse the images and create depth perception: the corresponding parts in the two images must have the same vertical coordinates. This constraint arises from the fact that the coordinate systems of the two human eyes are mostly parallel. In this paper, we refer to this constraint as the stereo constraint.

A stereoscopic image pair can be captured in many ways, for instance, using a custom-built rig of two cameras to simulate two human eyes. As shown in Fig. 1(b), a stereoscopic camera system has two cameras (lenses). These two cameras have the same intrinsic camera parameters and the same orientation. Their optical axes are parallel to each other and perpendicular to the baseline. The two cameras are typically separated by 2.5 inches, which is roughly the distance between two human eyes. Occasionally, the two cameras are carefully toed in slightly for better depth composition. Such camera rigs are difficult for common users to design and use. Emerging consumer-level binocular camera systems, such as the FinePix REAL 3D W3 camera, make it easier to create a stereo image pair. However, professional binocular cameras would be more difficult to manufacture and use due to the necessarily large form factor.

1520-9210/$31.00 © 2012 IEEE


In this paper, we are interested in making a stereoscopic photo using two images captured by a hand-held monocular camera, which we call casual stereoscopic photography. One can first take an image, move the camera horizontally about 2.5 inches, and take the second image. These two images generally do not satisfy the stereo constraint, as depicted in Fig. 1(a).

Image rectification methods in computer vision can be used to bring these two images to meet the stereo constraint [10]. However, most image rectification algorithms are designed for machine vision rather than human vision. In particular, they try to satisfy the stereo constraint across the entire images through projective transformations. As shown in Fig. 2(b) and (d), while projective transformations can be used to meet the geometric constraint required by the human visual system in terms of fusing the two images, they tend to introduce visual distortion that is objectionable to a human observer and lose content when the images are cropped for rectangular displays.

We argue that a desirable image rectification algorithm for casual stereoscopic photography should satisfy three requirements: enforce the stereo constraint, avoid introducing visual distortion, and keep as much content as possible during the cropping stage. Since projective transformations cannot always satisfy these three requirements, we turn to more general image transformations. Inspired by recent work on image and video warping [14], [16], [27], [29], [30], we propose to use mesh-based image warps. We cast the problem of image rectification as energy-minimization-based image warping. Our energy is specifically designed to enforce the stereo constraint, minimize visual distortion, and prevent potential content loss in cropping. As shown in Fig. 2(c) and (e), our method is able to outperform existing projective-transformation-based methods in meeting all three requirements simultaneously.

Our mesh-based warping formulation enjoys another advantage that the projective-transformation-based rectification methods do not: for stereoscopic photography, it is important to be able to adjust the disparities between two images according to different viewing conditions, such as screen sizes and viewing distances [8], [15], [20]. While projective-transformation-based methods cannot account for recently proposed disparity adjustment operators [15], our warping-based rectification algorithm can easily support these operators.

The rest of this paper is organized as follows. We first give a brief overview of existing image rectification methods as well as recent work on stereoscopic photo and video authoring and editing in Section II. We then describe our spatially-varying warping-based perceptually plausible rectification method and how it supports casual stereoscopic photo authoring in Section III. We compare our method to state-of-the-art methods, report our user study, and discuss our method in Section IV. We finally conclude this paper in Section V.

II. RELATED WORK

In this section, we first review research on image rectification, as it is most relevant to our key technique. We then give a brief overview of relevant work on stereoscopic content authoring and manipulation.

Image rectification is a well-studied problem in computer vision. Given two images, a rectification algorithm determines

Fig. 2. Most existing image rectification algorithms are designed for stereo matching. They apply homographies to image rectification and often produce results with large visual distortion. For example, the buildings are slanted in (b). Our method uses general mesh-based image warps and is able to both rectify the images and avoid visual distortion (c). In (d) and (e), we show cropped versions of (b) and (c) rendered as red-cyan anaglyphs. A red-cyan color anaglyph image is created by taking its red channel from the left image and its blue and green channels from the right image. It provides a stereoscopic 3D perception when it is viewed with red-cyan glasses. One can see that our method is also able to preserve more content than [9]. (a) Two input images. (b) Classic rectification results [9]. (c) Our rectification results. (d) Cropped anaglyph of (b). (e) Cropped anaglyph of (c).
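The red-cyan anaglyph construction described in the caption is simple channel mixing and can be sketched in a few lines (a minimal NumPy sketch; the tiny 1×1 "images" are made-up placeholders, not data from the paper):

```python
import numpy as np

def make_anaglyph(left, right):
    """Compose a red-cyan anaglyph: red channel from the left image,
    green and blue channels from the right image (H x W x 3, RGB)."""
    anaglyph = np.empty_like(left)
    anaglyph[..., 0] = left[..., 0]     # red from the left view
    anaglyph[..., 1:] = right[..., 1:]  # green and blue from the right view
    return anaglyph

# Tiny 1x1 "images": left is reddish, right is cyanish.
left = np.array([[[200, 10, 20]]], dtype=np.uint8)
right = np.array([[[5, 150, 160]]], dtype=np.uint8)
print(make_anaglyph(left, right))  # [[[200 150 160]]]
```

Viewed through red-cyan glasses, each eye then sees (approximately) only its own view, which is why the same construction is used for the comparison figures in this paper.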

one transformation for each image such that corresponding epipolar lines coincide and are parallel to one of the image axes (typically the horizontal one). The significance of image rectification is that it makes computing point correspondence easier, because the correspondence search can be restricted to an image axis. Existing algorithms rectify two images by finding a pair of homographies and using them to transform the two images. A homography is a 2D perspective matrix described by 8 parameters. The relationship between corresponding points $p$ and $p'$ in two images $I$ and $I'$ can be described by a homography $H$ as follows:

$\tilde{p}' \sim H \tilde{p}$    (1)


where $\tilde{p} = (x, y, 1)^T$ and $\tilde{p}' = (x', y', 1)^T$ are the homogeneous coordinates of the two points in images $I$ and $I'$, respectively. Poorly chosen homographies often lead to large perspective distortion in rectification results.

Most existing rectification algorithms can be categorized into two classes, calibrated and uncalibrated, according to the assumptions they make on cameras. Calibrated rectification algorithms [2], [6] assume full knowledge of camera calibration, i.e., both camera intrinsic and extrinsic parameters are known. The transformations to be determined are 3D camera rotations. As a result, the amount of visual distortion that calibrated rectification algorithms introduce is typically small. However, they have limited practical applicability because of the strong assumptions they impose on camera parameters.

On the other hand, uncalibrated rectification algorithms [1], [9], [17], [25], [7], [24], [19], [31], [5], [21] only require knowledge of the point correspondence between two images and therefore have much wider applicability. In the uncalibrated case, the transformations to be determined are 2D homographies. Since poorly chosen homographies often lead to large visual distortion, a significant part of the uncalibrated algorithms is focused on finding homographies that cause minimal visual distortion. Hartley [9] suggested minimizing the disparity range between the two images. Mallon and Whelan [19] and Monasse et al. [21] proposed to maximize viewpoint similarities between the original and rectified images. Loop and Zhang [17] proposed to use homographies that are close to affine transformations, while Fusiello and Irsara [5] proposed to use homographies derived from a quasi-Euclidean reconstruction. Gluckman and Nayar [7] proposed to minimize over-sampling and under-sampling. Among all the uncalibrated algorithms, the work of Zhou and Li [31] is the most closely related to ours, as they were interested in the same problem of finding transformations that are suitable for stereoscopic displays. They proposed to use homographies that yield similar disparities for points with similar depth.

All the aforementioned rectification algorithms use homographies as the transformations and therefore have limited power in combating visual distortion. Instead of homographies, we propose to use general mesh-based image warps. Our work is motivated by the success of recent image warping techniques, such as shape manipulation [12], image retargeting [30], [29], [14], video stabilization [16], and artistic perspective manipulation [3].

Creating and manipulating stereo content has attracted interest in computer vision and graphics. Converting monocular videos to stereoscopic ones has become a particularly active topic (cf. [22], [13], [8], [26]). One of the most important conversion steps is depth estimation. A range of methods have been employed, including structure from motion [22], [13], learning-based methods [26], and incorporating user scribbles [8]. With the depth information, image-based rendering methods, such as [13], or warping-based methods, such as [8], are used for stereo-pair synthesis.

Different from 2D content, stereoscopic photos need to be adapted to different viewing scenarios [20]. Niu et al. developed a method to crop stereoscopic photos for heterogeneous displays [23]. Wang and Sawchuk developed a disparity manipulation system that combines image warping and data-filling techniques for novel view synthesis according to a new disparity map [28]. Lang et al. further discussed the important perceptual aspects of stereo vision and their implications for stereoscopic content creation, provided a set of basic disparity mapping operators accordingly, and developed a stereoscopic warping method to process the input video streams and achieve the desired disparity distribution [15]. Chang et al. recently developed a technique that non-uniformly resizes a stereoscopic photo and fits it into target displays with different sizes and aspect ratios [4]. These papers inspired our research. Their disparity mapping operators can be easily implemented in our framework to provide better support for disparity adjustment.

III. PERCEPTUALLY-PLAUSIBLE RECTIFICATION FOR CASUAL STEREO PHOTOGRAPHY

Our method takes two images as input and outputs two images that make a good stereoscopic photo by meeting the stereo constraint with minimal visual distortion. As shown in Fig. 3, we divide each image into a uniform grid mesh of size $M \times N$, where $M$ and $N$ are computed according to the original image size $W \times H$ and the mesh cell size $d \times d$ as $M = W/d$ and $N = H/d$. In our system, the cell size is $10 \times 10$ pixels. Our method formulates rectification as a mesh-based spatially-varying warping problem, where we encode the epipolar constraint and visual distortion using quadratic energy terms, and finally solves the quadratic minimization problem for the stereoscopic image pair.

Enforcing the stereo constraint requires establishing the correspondence between the left and right image. Our method estimates a sparse set of corresponding feature pairs between the two images and applies the constraint on these feature points, because the human vision system is sensitive to these features in images. We denote the feature correspondences as $\{(p_i, p'_i)\,|\,i = 1, \ldots, n\}$, where $p_i$ and $p'_i$ are matching feature points in the left and right image. Our method uses a SIFT-based feature matching method [18]. We first detect SIFT features in the stereo photo pair. Then, we match the features between the left and right image. The best candidate match for each SIFT point in one image is found by identifying its nearest neighbor in the other image. As suggested by Lowe [18], a criterion for a good SIFT match is the ratio between the distance of the closest neighbor and that of the second-closest neighbor. Once we find the candidate matching pairs, we eliminate the outliers using the epipolar geometry constraint of two stereo images [10].

Previous research on video stabilization [16] shows that a mesh-based spatially-varying warping method works best if the amount of local deformation required to meet the warping objective is small. So our method first estimates a good approximation before the final warping step. Specifically, we compute a pair of affine transformations that best meet all the requirements for a good stereoscopic photo. Then, we compute the target feature positions according to the optimal affine transformations and use them to guide the mesh-based warping. Below we first describe our spatially-varying warping method for image rectification in Section III.A, where we assume that the target feature positions have been obtained. Then we describe how we actually compute the target feature positions by computing a pair of


Fig. 3. Rectification algorithm overview. Our method detects feature points in the left and right input images and estimates their correspondence, as shown in (a). Then our method divides each image into a uniform grid mesh and formulates rectification as a mesh-based warping problem. Our method warps each input image guided by the target feature points (green points in (b)). (c) shows the cropping results from (b). (a) Input (Top: Left, Bottom: Right). (b) Warping results. (c) Cropped results (Bottom: Anaglyph).
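The matching criterion used in the feature-correspondence step, Lowe's ratio test, can be illustrated on toy data. The sketch below implements only the ratio test with brute-force nearest-neighbor search; the 2-D "descriptors" are synthetic stand-ins for real 128-D SIFT descriptors, and the subsequent epipolar outlier-rejection step is omitted:

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Lowe's ratio test: keep a match only if the nearest neighbor in
    desc_b is sufficiently closer than the second-nearest neighbor."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        if nearest < ratio * second:
            matches.append((i, int(order[0])))
    return matches

# Synthetic descriptors: feature 0 has one clear match in desc_b,
# feature 1 has two nearly identical candidates and is rejected as ambiguous.
desc_a = np.array([[0.0, 0.0], [5.0, 5.0]])
desc_b = np.array([[0.1, 0.0], [9.0, 9.0], [5.0, 5.2], [5.2, 5.0]])
print(ratio_test_matches(desc_a, desc_b))  # [(0, 0)]
```

Rejecting ambiguous matches this way keeps the sparse correspondence set reliable before the epipolar-geometry check removes the remaining outliers.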

affine transformations as an approximation for image rectification in Section III.B.

A. Spatially-Varying Warping for Image Rectification

We denote $V_{i,j}$ as the grid vertex at position $(i, j)$ in the left input image and $\hat{V}_{i,j}$ as the corresponding vertex in the left output image. The notation for the right image is defined similarly, with the superscript $r$. The unknowns of our problem are the output vertex positions $\{\hat{V}^l_{i,j}\}$ and $\{\hat{V}^r_{i,j}\}$. For clarity, we omit the superscripts whenever appropriate.

Stereo Constraint: Our method enforces the stereo constraint between the two images by applying the constraint to a sparse set of matching feature points, because the human vision system is sensitive to these features. The stereo constraint says that the $y$-coordinates of corresponding feature points in both images should be the same. Suppose we know the $y$-coordinate of each feature pair in the output images, denoted $\hat{y}_i$. We describe how to compute $\hat{y}_i$ in Section III.B. Since in general a matching feature pair $(p_i, p'_i)$ are not grid vertices in either image, we cannot apply the constraint to them directly. We need to transfer the constraint to the unknowns $\{\hat{V}_{i,j}\}$. We solve this problem in the same way as in [16]. Specifically, we represent $\hat{p}_i$ as a weighted combination of the 4 vertices that enclose $p_i$ as follows:

$\hat{p}_i = \sum_{k=1}^{4} w_{i,k} \hat{V}_{i,k}$    (2)

where $\hat{V}_{i,k}$ are the 4 vertices enclosing $\hat{p}_i$ and $w_{i,k}$ are the corresponding bilinear interpolation coefficients, which sum to 1. We calculate $w_{i,k}$ by finding and inverting the bilinear interpolation process from $V_{i,k}$ to $p_i$ [11]. We do the same to $p'_i$. We measure the violation against the stereo constraint as follows:

$E^l_s = \sum_i \Big( \sum_k w_{i,k}\, \hat{V}^l_{i,k,y} - \hat{y}_i \Big)^2$    (3)

where $E^l_s$ is the cost for the left image and $\hat{V}_{i,k,y}$ is the $y$-coordinate of vertex $\hat{V}_{i,k}$. We define $E^r_s$, the cost for the right image, in a similar way.

Disparity: Our method achieves the user-edited disparity distribution by encouraging the $x$-coordinate of each feature point to be close to the desired coordinate as follows:

$E^l_x = \sum_i \Big( \sum_k w_{i,k}\, \hat{V}^l_{i,k,x} - \hat{x}_i \Big)^2$    (4)

where $\hat{V}_{i,k,x}$ is the $x$-coordinate of vertex $\hat{V}_{i,k}$ and $\hat{x}_i$ is the desired $x$-coordinate of feature point $\hat{p}_i$. We describe how to determine $\hat{x}_i$ in Sections III.B and III.C. Similarly, we define $E^r_x$, the cost for the right image.

Visual Distortion: Like previous methods [3], [12], [16], our method encourages each local mesh cell to undergo a similarity transformation to minimize local geometric distortion. We use the quadratic energy term from Igarashi et al. [12] to measure the violation against the similarity transformation constraint. Specifically, a similarity transformation only allows an object to be translated, uniformly scaled, and rotated. As shown in Fig. 4, if a cell only undergoes a similarity transformation, the coordinates of a vertex $V_1$ in the local coordinate system defined by the other two vertices $V_2$ and $V_3$ shall remain the same after transformation. With the local coordinates $(u, v)$, we can calculate $\hat{V}^e_1$, the expected position of $\hat{V}_1$, in the local coordinate system defined by $\hat{V}_2$ and $\hat{V}_3$ after transformation as follows:

$\hat{V}^e_1 = \hat{V}_2 + u(\hat{V}_3 - \hat{V}_2) + v R_{90} (\hat{V}_3 - \hat{V}_2), \quad R_{90} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$    (5)

As shown in Fig. 4, the coordinates of $V_1$ in the local coordinate system defined by $V_2$ and $V_3$ are $(u, v)$. We can rewrite the above equation as

$\hat{V}^e_1 = \hat{V}_2 + (u I + v R_{90})(\hat{V}_3 - \hat{V}_2)$, where $I$ is the $2 \times 2$ identity matrix.    (6)

Page 5: IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 15, NO. 1, JANUARY ... links/Mtech/matlab/BASEPAPER… · warping technique can better support casual stereoscopic photo authoring than existing

LIU et al.: CASUAL STEREOSCOPIC PHOTO AUTHORING 133

Fig. 4. $V_1$, $V_2$, and $V_3$ are vertices of an input mesh cell, and $\hat{V}_1$, $\hat{V}_2$, and $\hat{V}_3$ are the corresponding output vertices. Given $\hat{V}_2$, $\hat{V}_3$, and the local coordinates $(u, v)$, the expected position for $\hat{V}_1$ is $\hat{V}^e_1$ if the cell undergoes a similarity transformation. We measure the violation against the similarity transformation as the distance between $\hat{V}_1$ and $\hat{V}^e_1$.
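The construction in Fig. 4 can be checked numerically. The sketch below (with a made-up example cell and transform, not values from the paper) recovers the local coordinates $(u, v)$ of a vertex, predicts its position under a candidate warp, and confirms that a pure similarity transformation yields zero violation:

```python
import numpy as np

R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # 90-degree rotation matrix

def local_coords(v1, v2, v3):
    """Express v1 in the local frame spanned by (v3 - v2) and its
    90-degree rotation: v1 = v2 + u*(v3 - v2) + v*R90@(v3 - v2)."""
    e = v3 - v2
    basis = np.column_stack([e, R90 @ e])
    u, v = np.linalg.solve(basis, v1 - v2)
    return u, v

def expected_position(u, v, w2, w3):
    """Predicted output position of the vertex if the cell undergoes
    only a similarity transformation (translate + rotate + uniform scale)."""
    e = w3 - w2
    return w2 + u * e + v * (R90 @ e)

# Example cell vertices and a similarity transform (rotate 30 deg, scale 2, translate).
theta, s, t = np.pi / 6, 2.0, np.array([5.0, -3.0])
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
sim = lambda p: s * (R @ p) + t

v1, v2, v3 = np.array([0.0, 10.0]), np.array([0.0, 0.0]), np.array([10.0, 0.0])
u, v = local_coords(v1, v2, v3)
pred = expected_position(u, v, sim(v2), sim(v3))
violation = np.linalg.norm(sim(v1) - pred) ** 2
print(round(violation, 9))  # 0.0 for a pure similarity transformation
```

Under a non-similarity warp (e.g., a shear), the same prediction deviates from the actual vertex position, and the squared distance is exactly the per-vertex distortion penalty.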

The expected positions for the other three vertices of each cell can be calculated similarly. We define the violation against the similarity transformation as the distance between the expected position of each vertex and its actual position as follows:

$E^l_d = \sum_{c} \omega_c \sum_{\hat{V} \in c} \big\| \hat{V} - \hat{V}^e \big\|^2$    (7)

where $c$ is a cell and $\omega_c$ is the saliency value of the cell $c$. Our method encourages the salient regions to undergo a more restrictive similarity transformation than the less salient ones. A variety of methods exist for saliency estimation. Our method uses a method from [16] that computes the saliency value for each cell as the color variance within the cell. A user could also use a brush tool to mark the saliency values for an image. We define the violation $E^r_d$ for the right image in a similar way.

Line Constraint: Long straight edges are strong visual cues in an image. Bending a long straight edge often leads to objectionable artifacts. Having conflicting slopes for the same straight edge in the two images is also considered a distortion in stereo [20]. It is desirable to keep the corresponding straight edges in the two views roughly parallel. To achieve this goal, our method allows a user to mark straight lines in each image and specify desired slopes for them. Based on the user input, we first obtain $\{q_j\}$, a set of points uniformly sampled along each line $L$, and then measure the violation against this line constraint as the sum of the differences between the target slope and the slope of each line segment $(q_j, q_{j+1})$. A common metric for measuring the slope difference between two lines is the angle between them. However, this leads to a non-quadratic energy term. Instead, we use the following approximation:

$E^l_{line} = \sum_{L} \sum_{j} \big( (\hat{q}_{j+1} - \hat{q}_j) \times \mathbf{e}_L \big)^2$    (8)

where $\times$ is the cross product operator between two 2D vectors, defined as $\mathbf{a} \times \mathbf{b} = a_x b_y - a_y b_x$, and $\mathbf{e}_L$ is a unit vector along the expected output line with the desired slope. In this way, the energy term for measuring the line constraint is quadratic. Again, we compute this term for both the left and the right image.

The total energy is a weighted sum of all the aforementioned terms:

$E = \omega_s (E^l_s + E^r_s) + \omega_x (E^l_x + E^r_x) + \omega_d (E^l_d + E^r_d) + \omega_{line} (E^l_{line} + E^r_{line})$    (9)

where $\omega_s$, $\omega_x$, $\omega_d$, and $\omega_{line}$ are weights, with default values 1.0, 4.0, 1.0, and 10.0, respectively. These values were chosen empirically to achieve a good balance among all the competing goals in creating a good stereoscopic photo. Since there is no interaction between the variables of the left and right images, we can solve for each image independently. Specifically, we solve the following problem for the left image:

$\{\hat{V}^l_{i,j}\} = \arg\min\; \omega_s E^l_s + \omega_x E^l_x + \omega_d E^l_d + \omega_{line} E^l_{line}$    (10)

We solve (10) using a standard sparse linear solver and obtain the output mesh for the left image. We render the final result using texture mapping. We create the right output image in the same way.
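The reduction from a sum of quadratic energy terms to a linear solve can be seen on a toy problem: each quadratic term contributes one row to an overdetermined system $Ax = b$, and minimizing the total energy is a linear least-squares solve. The full method assembles a sparse system over all mesh vertices; the dense, 1-D, three-vertex version below is only an illustration with made-up terms:

```python
import numpy as np

# Toy 1-D "warp": unknowns are three vertex positions v0, v1, v2.
# Two data terms pin v0 and v2; a smoothness term, weighted by lam,
# asks v1 to stay midway between its neighbors (2*v1 - v0 - v2 = 0).
lam = 10.0
w = np.sqrt(lam)  # a weight on an energy term scales its row by sqrt(weight)
A = np.array([
    [1.0, 0.0, 0.0],     # data term: v0 -> 0
    [0.0, 0.0, 1.0],     # data term: v2 -> 10
    [-w, 2.0 * w, -w],   # weighted smoothness term
])
b = np.array([0.0, 10.0, 0.0])
v, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(v, [0.0, 5.0, 10.0]))  # True: all three terms satisfied exactly
```

In the real system the rows come from (3), (4), (7), and (8), the matrix is large and sparse, and a sparse solver replaces the dense `lstsq` call, but the structure of the problem is the same.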

B. Target Feature Position Estimation

We describe below how we compute the target feature positions for the two images to guide the warping process described in the previous section. A good target point set for each image meets the stereo constraint, preserves the quality of the original image (minimal visual distortion and content loss), and respects user specifications.

Multi-view geometry research shows that estimating a pair of homographies and applying each homography to the corresponding image can exactly meet the stereo constraint [10]. However, a homography will introduce perspective distortion and cause significant loss of image content after rectification and cropping. Our idea is to use a transformation that is free from perspective distortion, such as an affine transformation. An affine transformation between two images $I$ and $I'$ can be described as

$\begin{bmatrix} x' \\ y' \end{bmatrix} = A \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$    (11)

where $A$ is the $2 \times 3$ affine transformation matrix between $I$ and $I'$, and $(x, y)$ and $(x', y')$ are the coordinates of two matching feature points in these two images, respectively.

A pair of affine transformations cannot meet the stereo constraint exactly, but they introduce less objectionable visual distortion than homographies. We denote the affine transformation pair to be estimated as $(A^l, A^r)$, where $A^l$ and $A^r$ are the affine transformations for the left and right image, respectively. Below we first describe how to find an optimal pair of affine transformations that best meet the stereo constraint and minimize the visual distortion and content loss. We use this pair of affine transformations to produce the target feature positions. Then we describe how to post-process the affine transformation results to exactly meet the stereo constraint.

Stereo Constraint: Similar to our warping method, we enforce the stereo constraint by requiring the $y$-coordinates of matching feature points in the two images to be the same:

(12)

where and are the second rows of matrices and ,respectively.Coverage and Visual Distortion: The rectification results

need to be cropped to create a stereoscopic image. Cropping


will not only lose image content but also damage the original image composition. We measure the content preserved after rectification and cropping as the overlap between the cropping window and the original images; we refer to this overlap as coverage. If the four corner points of each image stay at the same positions as in the original image, we have maximal coverage. So our method maximizes the coverage by encouraging the corner points of each image to stay as close as possible to their original positions:

E_c = sum_j ( || A_l c_j^l - c_j^l ||^2 + || A_r c_j^r - c_j^r ||^2 ),   (13)

where c_j^l and c_j^r, j = 1, ..., 4, are the four corner points of the left and right image, respectively, and A_l and A_r are the affine transformations of the two images. Since this term encourages an identity transformation, it also encourages minimal distortion.

Line Constraint: Similar to our warping method, we enforce

the line slope constraint. Let s be the original slope of a line which a user specifies in an image, and let a_11, a_12, a_21, a_22 be the entries of the linear part of the affine transformation applied to that image. The slope of this line after the affine transformation is given by

s' = (a_21 + a_22 s) / (a_11 + a_12 s).   (14)

We measure the slope difference between the transformed line and the desired line as:

E_line = sum_l ( (a_21 + a_22 s_l) - ŝ_l (a_11 + a_12 s_l) )^2,   (15)

where ŝ_l is the desired slope for line l. Equation (15) is the cross-multiplied form of the difference between s' and ŝ_l, so it remains quadratic in the affine parameters.

Alignment Constraint: The left and right image in a good

stereo image pair are often very close to each other. So we encourage the two images to be aligned horizontally:

E_a = sum_i ( A_{l,1} p_i^l - A_{r,1} p_i^r )^2,   (16)

where A_{l,1} and A_{r,1} are the first rows of matrices A_l and A_r, respectively. As we describe below, we weight this constraint significantly less than the other terms, so our method preferably preserves a small amount of horizontal parallax. This provides a good initialization for a user to adjust the disparity distribution as described below.

We combine the energy terms above into the following quadratic minimization problem:

min_{A_l, A_r}  λ_s E_s + λ_c E_c + λ_line E_line + λ_a E_a,   (17)

where E_s, E_c, E_line, and E_a denote the stereo (12), coverage (13), line slope (15), and horizontal alignment (16) terms, and the λ's are their weights. The default values of the weighting parameters are 1, 10, 10, and 1000, with the alignment weight, as noted above, kept significantly smaller than the others; we select these values empirically. We solve this minimization problem using a standard linear solver.

We apply the affine transformation pair to the input image pair

to match them approximately. Then we compose the matched image pair together and provide the user with an initial result of the stereoscopic photo. A user can then edit the disparity distribution using the disparity operators described in Section III.C. If there is no user input, the disparity after affine rectification is kept as the default. We evenly distribute the disparity change between the two input images. Accordingly, we calculate the x-coordinates of the output feature points as follows:

x̂_i^l = u_i^l - (d̂_i - d_i)/2,   x̂_i^r = u_i^r + (d̂_i - d_i)/2,   (18)

where u_i^l and u_i^r are the x-coordinates of the i-th matching feature pair after the affine transformations, d_i = u_i^r - u_i^l is its current disparity, and d̂_i is the desired disparity after user adjustment.

While the affine transformation pair cannot exactly meet the stereo constraint, it provides a good approximation. To further respect this constraint, our method computes the y-coordinates of the output feature points as the mean of the y-coordinates in the transformed left and right image. That is, we calculate the y-coordinates of the output feature points as ŷ_i^l = ŷ_i^r = (v_i^l + v_i^r)/2, where v_i^l and v_i^r are the y-coordinates of the i-th matching feature pair after the affine transformations.
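The affine-pair estimation of this section can be sketched as one stacked linear least-squares problem. The code below is our illustration under simplifying assumptions (the line slope term (15) is omitted for brevity, and the weights and point sets are made up, not the paper's values): the stereo (12), coverage (13), and alignment (16) terms each contribute weighted rows that are linear in the twelve affine parameters.

```python
import numpy as np

def estimate_affine_pair(pts_l, pts_r, corners_l, corners_r,
                         w_stereo=1.0, w_cover=10.0, w_align=0.1):
    """Jointly estimate 2x3 affines (A_l, A_r) by stacking weighted
    linear constraints into one least-squares system.

    Unknowns: x = [A_l row-major (6), A_r row-major (6)].
    - stereo: y-coordinates of matched points agree after warping
    - cover:  image corners stay near their original positions
    - align:  x-coordinates of matched points agree (weak weight,
              so some horizontal parallax can survive)
    """
    def h(p):  # homogeneous row vector for one 2D point
        return np.array([p[0], p[1], 1.0])

    rows, rhs = [], []
    for pl, pr in zip(pts_l, pts_r):
        # stereo: (A_l row 2) . h(pl) - (A_r row 2) . h(pr) = 0
        r = np.zeros(12); r[3:6] = h(pl); r[9:12] = -h(pr)
        rows.append(w_stereo * r); rhs.append(0.0)
        # alignment: (A_l row 1) . h(pl) - (A_r row 1) . h(pr) = 0
        r = np.zeros(12); r[0:3] = h(pl); r[6:9] = -h(pr)
        rows.append(w_align * r); rhs.append(0.0)
    for corners, base in ((corners_l, 0), (corners_r, 6)):
        for c in corners:  # corners should map to themselves
            for k, off in ((0, base), (1, base + 3)):
                r = np.zeros(12); r[off:off + 3] = h(c)
                rows.append(w_cover * r); rhs.append(w_cover * c[k])

    x, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return x[:6].reshape(2, 3), x[6:].reshape(2, 3)

# Toy usage: the two images already coincide, so both transforms
# should come out as (near) identity.
pts = np.array([[100.0, 100.0], [300.0, 120.0], [200.0, 400.0]])
corners = np.array([[0.0, 0.0], [640.0, 0.0], [0.0, 480.0], [640.0, 480.0]])
A_l, A_r = estimate_affine_pair(pts, pts, corners, corners)
```

Because every term is linear in the affine parameters, the whole objective stays quadratic and one `lstsq` call suffices.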

C. Disparity Adjustment

We provide several disparity adjustment operators for a user to manipulate the disparity distribution. First, we allow a user to shift one image horizontally to uniformly change the disparity distribution. We also implement the global linear and nonlinear operators proposed by Lang et al. [15] for a user to adjust the disparity range and distribution. For example, the global linear operator is defined as follows:

d' = s d + t,   (19)

where d' and d are the output and input disparity, respectively, s is a scaling coefficient, and t is the amount of horizontal shift. The global nonlinear operators can be similarly defined based on nonlinear functions, such as exponential and logarithmic functions; please refer to Lang et al. [15] for details. Finally, we provide a region selection tool and support local adjustment using the above operators. The new disparity values are used in (18) to compute the positions of the output feature points, which are then used in our warping framework (4).
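A minimal sketch of such operators (our illustration; the exact form of the nonlinear mapping is an assumption, and the sample disparities are toy numbers): the global linear operator rescales the disparity range by a factor s and shifts it by t pixels, and a local variant applies any operator only inside a selected region.

```python
import numpy as np

def linear_op(d, s=1.0, t=0.0):
    """Global linear operator: d' = s * d + t (cf. Lang et al. [15])."""
    return s * np.asarray(d, float) + t

def log_op(d, s=1.0):
    """A sign-preserving logarithmic compression of the disparity
    range; one possible nonlinear operator, not the paper's exact form."""
    d = np.asarray(d, float)
    return np.sign(d) * s * np.log1p(np.abs(d))

def local_op(d, mask, op, **kw):
    """Apply an operator only inside a user-selected region."""
    d = np.asarray(d, float).copy()
    d[mask] = op(d[mask], **kw)
    return d

disp = np.array([-10.0, -2.0, 0.0, 4.0, 20.0])
halved = linear_op(disp, s=0.5)                     # compress range by 50%
front = local_op(disp, disp < 0, linear_op, s=0.5)  # only negative disparities
```

The last line mirrors the "compress the negative disparity range" edit used in the examples of Section IV.C.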

IV. EXPERIMENTS

We first report the comparisons between our method and two representative image rectification methods [5] and [9], and show examples of a variety of disparity adjustment operators that are supported by our rectification framework. We then report the performance of our method and discuss limitations. We use red-cyan anaglyphs when appropriate to show both left and right images simultaneously. For our user study, we used an ASUS VG236H 120 Hz 3D monitor with shuttered glasses and the Nvidia GeForce 3D Vision solution for a better viewing experience.
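For readers who want to reproduce the red-cyan presentation, a minimal sketch (one common channel-mixing scheme; more elaborate mixes exist, and we do not claim this is the one used for the figures):

```python
import numpy as np

def red_cyan_anaglyph(left, right):
    """Compose a red-cyan anaglyph from an RGB pair: the red channel
    comes from the left image, green and blue from the right."""
    out = np.asarray(right, float).copy()
    out[..., 0] = np.asarray(left, float)[..., 0]
    return out

# Toy usage on tiny constant images (values in [0, 1]).
left = np.zeros((2, 2, 3)); left[..., 0] = 0.8    # reddish left image
right = np.zeros((2, 2, 3)); right[..., 2] = 0.6  # bluish right image
ana = red_cyan_anaglyph(left, right)
```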

A. Comparisons

To evaluate our method, we collected 20 pairs of images. The aspect ratios of these images include 4:3, 3:4, 2:3, 3:2, and 16:9. Each image was resized so that its long dimension is 900 pixels. These image pairs were taken by simulating a stereoscopic camera using a monocular camera: we took one image, moved the monocular camera horizontally, and took the other image. Note that all these images were taken with a handheld camera, so while we tried to move the camera horizontally, we always introduced non-horizontal motions. In addition, although we took care, 4 out of these 20 image pairs still have scene motion. Moreover, taking two images from two different viewpoints will usually introduce occlusion. While occlusion happens as well when


Fig. 5. Four examples show that our method suffers less distortion than the homography-based methods PR [9] and QEUR [5]. (a) Input images. (b) Results of PR (Rows 1 & 2) and QEUR (Rows 3 & 4). (c) Our results.

our two eyes look at a non-planar scene, large occlusion between the two images will, as we show later, also damage the 3D viewing experience. If we put the left and right images side by side, we can quickly detect obvious occlusions for 5 pairs of images.

We compared our method with a classic projective transformation based rectification method [9] (PR) and a more recent one [5] (QEUR). For PR, we used the OpenCV implementation, and for QEUR, we used the implementation shared by its authors. Note that both [9] and [5] are designed to minimize distortion. In our experiment, both PR and QEUR use the same set of feature points as ours. For this comparison, our method does not involve any user adjustment or user-defined line constraint. We ran PR, QEUR, and our method on the image collection described above. Fig. 5 shows four representative results. One can observe that our method is able to produce results with significantly less visual distortion than PR and QEUR. For the PR result shown in the middle of the first row, the top of the sculpture is anisotropically distorted. For the QEUR result shown in the middle of the last row, the whole image is perspectively distorted; the persons on the right, as indicated by the red rectangle, are seriously stretched.

In addition to qualitatively measuring visual distortion, we

quantitatively measured how well the stereo constraint was satisfied by the three algorithms. In theory, projective transformation based methods, such as PR and QEUR, should be able to eliminate vertical disparities. In practice, however, many issues, such as errors in feature matching and fundamental matrix estimation, poor feature distributions, and lens distortion, can lead to imperfect rectification results. We collected the average vertical disparity of PR, QEUR, our affine rectification step, and our mesh-based warping method and show them in Fig. 6(a). One can see that the affine rectification gives the largest vertical disparities on average, which is expected since affine transformations are a weaker model than homographies and our mesh-based warps. Our warping method achieves the smallest vertical disparities: the average vertical disparity of our method is 63.6% of that of PR and 60.4% of that of QEUR.

Rectified images often need to be cropped to obtain a stereoscopic image for regular displays. Cropping leads to loss of content. We compared our method with PR and QEUR in terms of the preserved area in the original images. For all three methods, we performed cropping by computing a maximal rectangle inside the overlapping area of the two rectified images. We then projected this rectangle back to the original image and computed the coverage rate as the percentage of original pixels preserved in the cropping result. We show the coverage rates for the three methods in Fig. 6(b). On average, our method preserves 15.6% more area than PR and 9.7% more than QEUR.
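The coverage-rate metric itself is simple to illustrate. The sketch below is a deliberately simplified stand-in (assuming an axis-aligned crop rectangle already projected back into the original image; the paper's maximal-rectangle search inside the warped overlap is more involved):

```python
def coverage_rate(crop, image_size):
    """Fraction of original pixels kept by an axis-aligned crop.

    crop: (x0, y0, x1, y1) rectangle in original-image coordinates;
    it is intersected with the image bounds before measuring area.
    """
    w, h = image_size
    x0, y0, x1, y1 = crop
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1, iy1 = min(x1, w), min(y1, h)
    kept = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    return kept / (w * h)

# Toy usage: a crop that trims a 50 x 30 pixel margin off one corner
# of a 900 x 600 image.
rate = coverage_rate((50, 30, 900, 600), (900, 600))
```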


Fig. 6. Experiments on 20 pairs of images show that our method is able to reduce more vertical disparities and preserve more area than [9] and [5]. In each figure, we also show the average performance of each method in the rightmost column. (a) Vertical disparity. The vertical axis is truncated to [0, 1.5] for clarity. (b) Area coverage rate.

We acknowledge that comparing our method to others in terms of the vertical disparity or coverage rate alone is somewhat unfair. But it is important to note that our method outperforms PR and QEUR in terms of both vertical disparities and area coverage, and does not introduce unpleasant visual distortion.

B. User Studies

The quality of a stereoscopic photo is subjective, so we performed two user studies to evaluate our method. In our first study, we tested whether our method can create stereoscopic photos that deliver a pleasant 3D viewing experience. In our second study, we tested whether viewers preferred our results to the PR and QEUR results.

1) 3D Perception and Comfort Test: Our method enforces the stereo constraint on salient features instead of the whole image. While the above experiments show that our method can successfully meet the stereo constraint on feature points, it remains unclear whether a viewer can easily fuse the resulting stereo image pair to create depth perception. We are also curious how a viewer perceives the results from PR and QEUR. In order to answer this question, we carried out a user study. There were in total 10 participants with normal stereopsis in our study, including undergraduate students, graduate students, and researchers. We did not explain to them how the stereoscopic photos were created. In our study, they sat roughly 2 feet away from the 23-inch monitor. We showed each participant one stereoscopic photo created using each of the three methods at a time, and we showed 20 × 3 = 60 stereoscopic photos in total. There was no time constraint for the study; the participant could freely select to view the next or previous image by controlling the photo browser. Each time, we asked the following two questions.
• Is it easy to perceive 3D?
• Do you feel comfortable viewing the photo?

For each question, we asked the participant to rate from 1 to 3, with 3 being most positive. We computed the average scores and the standard deviations and report them in Table I. For our method, the average scores for the first and second questions are 2.72 out of 3 and 2.44 out of 3, and the standard deviations are 0.53 and 0.67, respectively. PR and QEUR achieved similar results, although our method was moderately better. The p-values of the paired two-sample t-tests between our method and PR are 0.26 and 0.02 on the depth perception and comfort tests, respectively. The p-values of the comparison between our method and QEUR are 0.15 and 0.10 on depth


Fig. 7. Overview of our stereoscopic photo authoring system. Given an input pair of images, a user selects a few lines and decides their target orientations (a). Our method then computes their affine rectification approximations. These user-specified line constraints are sometimes important: without them, the lines in the left and right image cross each other, as shown in (b), instead of being roughly parallel (c). After seeing the approximations, a user can adjust the disparity map and create the final stereoscopic image shown in (d). In this example, the negative disparity range is linearly compressed by 50%. The average disparities for the approximation in (c) and the final result in (d) are 0.73 and 0.43 pixels, respectively. (a) Input. (b) Approximation w/o line constraint. (c) Approximation w/ line constraint. (d) Final result.

TABLE I
3D PERCEPTION AND COMFORT TEST RESULTS. OUR METHOD IS SLIGHTLY BETTER THAN PR AND QEUR. THE P-VALUES OF THE COMPARISONS (OURS VERSUS PR ON 3D PERCEPTION, OURS VERSUS PR ON COMFORT TEST, OURS VERSUS QEUR ON 3D PERCEPTION, AND OURS VERSUS QEUR ON COMFORT TEST) ARE 0.26, 0.02, 0.15, AND 0.10

perception and comfort tests, respectively. These suggest that all these methods can in general create stereoscopic photos that provide a pleasant viewing experience. The major complaint from participants was that sometimes they could not fuse the two photos to enjoy the 3D perception. We looked into these cases and found that they were mainly caused by scene dynamics.

2) Subjective Comparison: The previous study showed that

both our method and the PR and QEUR methods can in general create stereoscopic photos that deliver a reasonable 3D viewing experience, although our method is moderately better. In the second study, we further compared our results to the others. This study involved the same 10 participants as the previous study, and the testing environment was the same. In this study, for each image, we compared our result to each of the PR and QEUR results. Specifically, at each time, we placed our result and one of the other results side by side. Whether our result was shown on the left or right was randomized to avoid position bias. In this way, there were 20 × 2 = 40 comparisons in total. Each participant was shown one comparison at a time. Like the first study, there was no time limit and the participant could control the study pace. For each comparison, we asked the participant which image she/he liked more, and the participant only needed to answer left or right. Interestingly, many users provided narrative feedback during the study, which helped us better interpret the study results.

To assess our method, we counted the number of trials where

our results were selected. On average, participants chose our result over a PR result 13.80 out of 20 times (69.0%), and over a QEUR result 15.20 out of 20 times (76.0%). These results suggest that our method is moderately preferred to the other methods. We computed the significance using one-sample t-tests. For each two-alternative forced choice, if the participant chose

TABLE II
SUBJECTIVE COMPARISON. OUR RESULTS ARE MODERATELY PREFERRED TO THE OTHERS

our result, the sample data was 1; otherwise, it was 0. We found all the p-values to be below the significance level. We report the results in Table II. According to the narrative user feedback, participants preferred our results mostly because they were less distorted. We also found that for about 4 images, it was difficult for users to tell which results they liked more.
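The statistic behind this test is easy to sketch (our illustration with hypothetical data, not the study's raw responses): treat each forced choice as 0 or 1 and test the mean against the chance level 0.5; the p-value then follows from the t distribution with n - 1 degrees of freedom.

```python
import numpy as np

def one_sample_t(x, mu=0.5):
    """One-sample t statistic of 0/1 forced-choice data against
    a chance level mu."""
    x = np.asarray(x, float)
    return (x.mean() - mu) / (x.std(ddof=1) / np.sqrt(len(x)))

# Hypothetical data: our result chosen in 138 of 200 trials (69%),
# mirroring the preference rate reported above.
choices = np.array([1.0] * 138 + [0.0] * 62)
t = one_sample_t(choices)
```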

C. Disparity Adjustment

Our rectification algorithm allows manual adjustment of disparities using the operators described in Section III.C. In Fig. 7 we show several examples of how these adjustments are used toward creating visually pleasant stereoscopic images. Fig. 7(a) shows two input images. The user specified several lines with desired slopes. Our method first computed an affine rectification approximation without the line constraints using the method in Section III.B. We can see that without the line constraints the window frames are not parallel between the left and right images, as shown in (b). In (c) we show the affine approximation with line constraints; the window frames are now approximately parallel. We show the final warping results with disparity adjustments in (d). In this example, we compressed the negative disparity range linearly by 50%. The average disparities for the approximation in (c) and the final result in (d) are 0.73 and 0.43 pixels, respectively.

In Fig. 8, we show more examples of disparity adjustments.

Fig. 8(a) shows the affine rectification approximation result, which provides a starting point for a user to obtain the final desirable disparity map. One can see that the overall disparity range, and particularly the negative disparity in the flower regions, is large. We shifted one image to make the disparity of the flowers smaller, which makes it easy to fuse the flowers, as shown in (b). However, this operation led to even larger disparity in the background. We then linearly compressed the disparity range across the entire image by reducing each disparity


Fig. 8. Global disparity adjustment. (a) An affine-rectification result, where the disparity range and the negative disparity are very big. (b) We shifted one image to align the flowers, which makes it easy to focus on the flowers. However, this led to even larger disparities in the background. (c) We linearly compressed the disparity range globally. (d) We shifted and compressed the disparity range to move the sculpture closer to the screen. (a) Rectification approximation. (b) Uniform disparity shift. (c) Global disparity range compression. (d) Move the sculpture closer to the screen (from left to right).

value by 40%, as shown in (c). Fig. 8(d) shows that we can move the object of interest (the sculpture) closer to the screen in a similar way to the previous example.

We also developed local versions of the global disparity adjustment operators. Our system provides various region selection tools for a user to apply disparity adjustment only to selected regions. Fig. 9 shows an example where our method is able to reduce the disparities of the statue by 40%, 60%, and 80%, respectively, while roughly maintaining the disparities in the rest of the image.

D. Performance

We implemented our casual stereoscopic photo authoring system in C++. In our experiments, our method divided each image into a uniform grid mesh with a grid cell size of 10 × 10 pixels. Our method is robust to different grid cell sizes. However, a very large grid cell size, which leads to a coarse grid mesh, can sometimes make the disparity and stereo constraints difficult to meet well. On the other hand, a very small grid cell size brings more variables into the energy function and takes more time to solve the optimization problem; moreover, it sometimes leads to visible geometric distortion.

Our system takes about 10 seconds to process two input images of size 900 × 600. The majority of the computational cost is in SIFT feature extraction and matching, which takes around 9-10 seconds. For images of size 900 × 600, there are typically 1000 to 2000 SIFT features. The time spent on the affine approximation step is negligible. The warping step, mainly solving the quadratic energy in Section III, takes about 0.5 seconds. Our experiments use the same set of feature points for PR and QEUR, so the cost of the feature extraction and matching step in these two methods is the same as ours. The actual rectification cost for PR is negligible, and for QEUR it is around 1.5 seconds. All the experiments were run on a PC with a 3 GHz Intel dual-core CPU and 3 GB memory.
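The grid construction itself is simple enough to sketch (our illustration): for a 900 × 600 image and 10-pixel cells this yields a 61 × 91 vertex grid, i.e. about 5500 vertices and twice that many unknowns per image in the warping energy.

```python
import numpy as np

def make_grid_mesh(width, height, cell=10):
    """Vertex grid of a uniform mesh over a width x height image with
    cell x cell pixel cells; the last row/column of vertices is
    clamped to the image border, so edge cells may be smaller."""
    xs = np.append(np.arange(0, width, cell), width)
    ys = np.append(np.arange(0, height, cell), height)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx, gy], axis=-1)  # shape: (n_rows, n_cols, 2)

mesh = make_grid_mesh(900, 600, cell=10)
```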

E. Limitations

Our further experiments revealed two main limitations of our method. First, our method cannot handle scene dynamics. Scene dynamics tends to introduce ghosting artifacts, as shown in Fig. 10(a) and (b). This is a fundamental problem in capturing a stereoscopic image using a hand-held camera, where the two images are recorded at different times. Second, our method sometimes has trouble accounting for significant occlusions and dis-occlusions in a visually plausible way, as shown in Fig. 10(c) and (d). This relates to the general limit on how much mesh-based warping methods can deform an image in a visually plausible way [16], [27].

V. CONCLUSION

In this paper, we presented a method for casual stereoscopic photo authoring from two images taken casually by a monocular camera. Our method rectifies these two images into a stereo image pair to deliver a pleasant viewing experience. Different from existing image rectification methods that use homographies, our method uses general mesh-based image warps. A major advantage of using such warps is that we are able to both


Fig. 9. Local disparity adjustment. Given the affine rectification approximation result (a), our method reduces the disparity of the statues by 40% (b), 60% (c), and 80% (d), respectively. (a) Rectification approximation. (b) Local disparity reduction by 40%. (c) Local disparity reduction by 60%. (d) Local disparity reduction by 80%.

Fig. 10. (a) shows a pair of input photos with people walking around. This leads to the strong ghosting artifacts in our result (b). (c) shows a pair of photos that exhibit significant occlusion. This leads to retinal rivalry (d). (a) Input photos. (b) Our result. (c) Input photos. (d) Our result.

enforce minimal vertical disparity, which is crucial for stereo fusion, and minimize visual distortion and content loss, which is desired for creating aesthetic images. We formulate the problem as an energy minimization where the stereo constraint and visual distortion minimization are encoded using quadratic terms. The experiments show that our method outperforms classic homography-based methods in terms of simultaneously reducing more vertical disparity, suffering less from visual distortion, and preserving more image content. We also demonstrate that our method is able to support disparity adjustment and user-specified line constraints.

ACKNOWLEDGMENT

The authors would like to thank the editor and reviewers for their insightful and constructive comments.

REFERENCES

[1] N. Ayache and C. Hansen, "Rectification of images for binocular and trinocular stereovision," in Proc. IEEE Int. Conf. Pattern Recognition, 1988, pp. 11-16.

[2] N. Ayache and F. Lustman, "Trinocular stereo vision for robotics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 73-85, 1991.

[3] R. Carroll, A. Agarwala, and M. Agrawala, "Image warps for artistic perspective manipulation," ACM Trans. Graph., vol. 29, no. 4, pp. 127:1-127:9, 2010.


[4] C.-H. Chang, C.-K. Liang, and Y.-Y. Chuang, "Content-aware display adaptation and interactive editing for stereoscopic images," IEEE Trans. Multimedia, vol. 13, no. 4, pp. 589-601, 2011.

[5] A. Fusiello and L. Irsara, "Quasi-Euclidean uncalibrated epipolar rectification," in Proc. IEEE Int. Conf. Pattern Recognition, 2008, pp. 1-4.

[6] A. Fusiello, E. Trucco, and A. Verri, "A compact algorithm for rectification of stereo pairs," Mach. Vis. Appl., vol. 12, no. 1, pp. 16-22, 2000.

[7] J. Gluckman and S. Nayar, "Rectifying transformations that minimize resampling effects," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001, pp. 111-117.

[8] M. Guttmann, L. Wolf, and D. Cohen-Or, "Semiautomatic stereo extraction from video footage," in Proc. IEEE Int. Conf. Computer Vision, 2009, pp. 136-142.

[9] R. I. Hartley, "Theory and practice of projective rectification," Int. J. Comput. Vis., vol. 35, no. 2, pp. 115-127, 1999.

[10] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2000.

[11] P. S. Heckbert, Fundamentals of Texture Mapping and Image Warping, UC Berkeley, Tech. Rep., 1989.

[12] T. Igarashi, T. Moscovich, and J. F. Hughes, "As-rigid-as-possible shape manipulation," ACM Trans. Graph., vol. 24, no. 3, pp. 1134-1141, 2005.

[13] S. Knorr and T. Sikora, "An image-based rendering (IBR) approach for realistic stereo view synthesis of TV broadcast based on structure from motion," in Proc. IEEE Int. Conf. Image Processing, 2007, pp. 572-575.

[14] P. Krähenbühl, M. Lang, A. Hornung, and M. Gross, "A system for retargeting of streaming video," ACM Trans. Graph., vol. 28, no. 5, pp. 126:1-126:10, 2009.

[15] M. Lang, A. Hornung, O. Wang, S. Poulakos, A. Smolic, and M. Gross, "Nonlinear disparity mapping for stereoscopic 3D," ACM Trans. Graph., vol. 29, no. 4, pp. 75:1-75:10, 2010.

[16] F. Liu, M. Gleicher, H. Jin, and A. Agarwala, "Content-preserving warps for 3D video stabilization," ACM Trans. Graph., vol. 28, no. 3, pp. 44:1-44:9, 2009.

[17] C. Loop and Z. Zhang, "Computing rectifying homographies for stereo vision," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1999, pp. 125-131.

[18] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91-110, 2004.

[19] J. Mallon and P. Whelan, "Projective rectification from the fundamental matrix," Image Vis. Comput., vol. 23, no. 7, pp. 643-650, 2005.

[20] B. Mendiburu, 3D Movie Making: Stereoscopic Digital Cinema From Script to Screen. New York: Focal Press, 2009.

[21] P. Monasse, J.-M. Morel, and Z. Tang, "Three-step image rectification," in Proc. British Machine Vision Conf., 2010, pp. 89.1-89.10.

[22] K. Moustakas, D. Tzovaras, and M. Strintzis, "Stereoscopic video generation based on efficient layered structure and motion estimation from a monoscopic image sequence," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 8, pp. 1065-1073, 2008.

[23] Y. Niu, F. Liu, W. Feng, and H. Jin, "Aesthetics-based stereoscopic photo cropping for heterogeneous displays," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 783-796, 2012.

[24] D. Oram, "Rectification for any epipolar geometry," in Proc. British Machine Vision Conf., 2001, pp. 653-662.

[25] M. Pollefeys, R. Koch, and L. V. Gool, "A simple and efficient rectification method for general motion," in Proc. IEEE Int. Conf. Computer Vision, 1999, pp. 496-501.

[26] A. Saxena, M. Sun, and A. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 824-840, 2009.

[27] A. Shamir and O. Sorkine, "Visual media retargeting," in ACM SIGGRAPH ASIA 2009 Courses, 2009, pp. 11:1-11:13.

[28] C. Wang and A. A. Sawchuk, "Disparity manipulation for stereo images and video," in Proc. SPIE, 2008, vol. 6803, pp. E1-E12.

[29] Y.-S. Wang, C.-L. Tai, O. Sorkine, and T.-Y. Lee, "Optimized scale-and-stretch for image resizing," ACM Trans. Graph., vol. 27, no. 5, pp. 118:1-118:8, 2008.

[30] L. Wolf, M. Guttmann, and D. Cohen-Or, "Non-homogeneous content-driven video-retargeting," in Proc. IEEE Int. Conf. Computer Vision, 2007, pp. 1-6.

[31] J. Zhou and B. Li, "Image rectification for stereoscopic visualization," J. Opt. Soc. Amer. A, vol. 25, no. 11, pp. 2721-2733, 2008.

Feng Liu is an Assistant Professor in the Department of Computer Science at Portland State University. His research interests are in the areas of computer graphics, vision, and multimedia. He earned his M.S. and Ph.D. in computer science from the University of Wisconsin, Madison, in 2006 and 2010, respectively. He received his B.S. and M.S. in computer science from Zhejiang University in 2001 and 2004, respectively.

Yuzhen Niu is a Postdoctoral Researcher in the Department of Computer Science at Portland State University. She received her B.S. and Ph.D. in computer science from Shandong University, Jinan, China, in 2005 and 2010, respectively. Her research interests are in the areas of computer graphics, vision, and multimedia.

Hailin Jin received his Bachelor's degree in automation from Tsinghua University, Beijing, China, in 1998. He then received his Master of Science and Doctor of Science degrees in electrical engineering from Washington University in Saint Louis in 2000 and 2003, respectively. Between fall 2003 and fall 2004, he was a postdoctoral researcher at the Computer Science Department, University of California at Los Angeles. Since October 2004, he has been a research scientist at Adobe Systems Incorporated.