
Inertial Sensors Aided Image Alignment and Stitching for Panorama on Mobile Phones

Qingxuan Yang, Google Research, Beijing, China, [email protected]
Chengu Wang, IIIS, Tsinghua University, Beijing, China, [email protected]
Yuan Gao, Department of Computer Science and Technology, Tsinghua University, Beijing, China, [email protected]
Hang Qu, Google Research, Beijing, China, [email protected]
Edward Y. Chang, Google Research, Beijing, China, [email protected]

ABSTRACT

In this paper, we propose using signals collected from inertial sensors on cameras to speed up image alignment for panorama construction. Inertial sensors, including accelerometers and gyroscopes, are first calibrated to improve sensing accuracy. These sensors are then used to estimate the position and orientation of each captured image frame. By knowing the relative displacement of image frames, alignment can be performed with good accuracy and computational efficiency. Through examples we illustrate the effectiveness of inertial-sensor-assisted panorama.

ACM Classification Keywords

I.3.3 Computer Graphics: Picture/Image Generation.

General Terms

Algorithms, Experimentation, Performance.

Author Keywords

image alignment, panorama, inertial navigation systems, image stitching

INTRODUCTION

While mobile phones are usually equipped with cameras, due to style and cost considerations very few are equipped with an advanced lens. However, new generation smart phones such as the iPhone and Nexus come with high-capacity computational/graphical processing units and inertial navigation systems. In this work, we show that these CPUs/GPUs and inertial navigation systems (INS) can be utilized to improve the quality of images taken by the inexpensive lenses of mobile phones.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MLBS'11, September 18, 2011, Beijing, China.

Copyright 2011 ACM 978-1-4503-0928-8/11/09...$10.00.

One useful imaging application on mobile phones is panorama, which aligns and stitches frames of limited field of view into ones with a larger field of view. With such capabilities, a mobile-phone camera could "scan" a document, a room, or an outdoor wide-angle scene, for example. Traditional panorama support requires a user to hold a camera steadily and then slowly pan it from one side to the other, sequentially snapping photos. Such requirements ensure that each frame is of good enough quality, and consistent enough with the other frames, that frame alignment can be performed effectively. However, such stringent usage requirements make for a poor user experience. Furthermore, an undesirable camera movement, such as a camera tilt/rotation or uneven-speed panning, can degrade stitching quality.

In this work, we show that using inertial sensors (accelerometers and gyroscopes) on mobile phones to register camera position, orientation, and moving trajectory can provide valuable information for the alignment, and subsequent stitching, of captured frames. In particular, this allows a user to scan a scene with any camera movement trajectory. Each frame can then be processed and adjusted based on the camera's orientation when the photo was taken. Next, alignment of adjacent frames can be performed using INS-provided information. The inter-frame displacement information provided by the sensors reduces the search space for finding matching inter-frame features. Since the search space is largely reduced, denser sampling can be conducted to ensure high-quality alignment and stitching, and both stitching quality and speed can be improved.

This paper makes two key contributions. First, we propose non-intrusive calibration methods to improve sensor accuracy. Second, we show how signals collected from inertial sensors can be used to obtain speed enhancements in panorama construction. In the following sections we first describe the detailed techniques, then show the experimental results, and finally discuss future work.


RELATED WORK

The alignment and subsequent stitching together of images are among the most widely studied areas in computer vision. Image alignment and stitching can mostly be divided into three steps, namely motion model registration, global alignment, and composition. In the motion model registration step, planar perspective motion, cylindrical coordinates, and spherical coordinates are the three most common motion models. The motion model used is either specified by the user or selected through a Bayesian algorithm. In the global alignment step, images are aligned using either pixel-based or feature-based methods to detect the overlapping area in each pair of images. In the composition step, the images that have been aligned in the previous step are stitched together.

Although the main aspects required to create a panorama can be clearly divided into three steps, several open issues remain. First, almost all the motion models mentioned assume the camera is fixed in one place and successively rotated to capture all the scenes; in particular, cylindrical coordinates require the camera to be level. Second, the method used to recognize the subset of images that compose a panorama either relies on manual input or depends on a time-intensive feature-based alignment between each pair of images. We believe that the incorporation of inertial sensors solves both problems in a straightforward way.

Beyond the aforementioned benefits of using inertial sensors in the construction of a panoramic image, state-of-the-art research projects focus on how to transplant current panorama techniques to mobile phones. This is a difficult task, since precisely aligning adjacent images and stitching them seamlessly using traditional methods requires high computational resources. Xiong and Pulli [10] designed a dynamic-programming seam search algorithm to realize fast image stitching and editing for panorama painting on mobile phones. However, their algorithm still relies on a registration transformation to extract overlapping areas between images. This is undesirable because it has been verified to be an overly complicated procedure no matter whether feature-based or pixel-based approaches are used, unless inertial sensors provide the displacement information. Ha et al. [5] proposed a computationally efficient and performance-improved image mosaic algorithm via integer arithmetic for mobile camera systems. The algorithm works only under the assumption that users move the camera in one direction when capturing overlapped scenes, which results in a poor user experience.

INERTIAL SENSORS AIDED IMAGE ALIGNMENT

When using a mobile phone to take several successive photos, the relative displacement among the photos can be derived from the position and orientation (angular position, or attitude) of the mobile phone. If one knew the exact position and orientation of each photo, one could stitch them directly without any knowledge of the content of the images. However, the sensors embedded in mobile phones suffer from errors due to low manufacturing quality that originates from cost constraints. To combat this, we need to first calibrate the sensors before use in order to estimate the position and orientation of the mobile phone.

In this section, we first introduce both the model and the method for calibrating the phone-equipped accelerometer and gyroscope. We then present a pipeline consisting of a high pass filter, zero velocity revision, and a noise gate to track both the orientation and displacement of the camera. Third, we explain how to use camera path tracking to predict the displacement between photos. Finally, we show that displacement prediction can speed up image alignment.

Sensor Calibration

There are two opportunities in which the inertial units in a mobile phone may be calibrated: at the manufacturer and at home. The calibration process at the manufacturer can rely on external, expensive devices. However, once a user has purchased a phone, the calibration process at home should be one-time, non-intrusive, and certainly cannot rely on external devices such as a turn table.

Preliminaries

The inertial sensors give us a measured value of the real world. The aim of calibration is to ensure the measured value coincides with the real-world value as much as possible. We denote by $a(t)$ and $\hat{a}(t)$ the real acceleration and the acceleration measured by the accelerometer, respectively; by $\omega(t)$ and $\hat{\omega}(t)$ the real angular velocity and the angular velocity measured by the gyroscope, respectively; and by $m(t)$ and $\hat{m}(t)$ the real magnetic field and the magnetic field measured by the digital compass, respectively. All these real and measured values are taken in reference to the phone coordinate reference system at time t.

Since the measured values are all taken in the phone coordinate reference system, we need to acquire the orientation of the phone when receiving each value. A rotation matrix is an orthogonal matrix with determinant one. We represent the orientation of the phone by a 3 × 3 rotation matrix R if left-multiplying by R moves the phone from the reference placement to its current placement. Therefore, left-multiplying the measured values by R converts them from values in the phone coordinate reference system to those in the absolute coordinate reference system.

Besides the rotation matrix, the quaternion is another convenient notation for representing rotation and orientation, and it is more numerically stable and efficient. A quaternion is a unit vector $(a, b, c, d)^T \in \mathbb{R}^4$. The relationships between the quaternion $(a, b, c, d)^T$ and the rotation matrix $R = (r_{ij})$ are given below:

$$R = \begin{pmatrix} a^2+b^2-c^2-d^2 & 2(bc-ad) & 2(bd+ac) \\ 2(bc+ad) & a^2-b^2+c^2-d^2 & 2(cd-ab) \\ 2(bd-ac) & 2(cd+ab) & a^2-b^2-c^2+d^2 \end{pmatrix},$$

$$a = \frac{\sqrt{1+r_{00}+r_{11}+r_{22}}}{2}, \quad b = \frac{r_{21}-r_{12}}{4a}, \quad c = \frac{r_{02}-r_{20}}{4a}, \quad d = \frac{r_{10}-r_{01}}{4a}.$$
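For concreteness, the conversion in both directions can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code; it assumes the quaternion is stored as (a, b, c, d) with a as the scalar part, and that 1 + r00 + r11 + r22 > 0 so that the square root is well defined.

```python
import numpy as np

def quat_to_rot(q):
    """Convert a unit quaternion (a, b, c, d) to a 3x3 rotation matrix."""
    a, b, c, d = q
    return np.array([
        [a*a + b*b - c*c - d*d, 2*(b*c - a*d),         2*(b*d + a*c)],
        [2*(b*c + a*d),         a*a - b*b + c*c - d*d, 2*(c*d - a*b)],
        [2*(b*d - a*c),         2*(c*d + a*b),         a*a - b*b - c*c + d*d],
    ])

def rot_to_quat(R):
    """Recover the quaternion from a rotation matrix (assumes 1 + trace(R) > 0)."""
    a = np.sqrt(1.0 + R[0, 0] + R[1, 1] + R[2, 2]) / 2.0
    b = (R[2, 1] - R[1, 2]) / (4.0 * a)
    c = (R[0, 2] - R[2, 0]) / (4.0 * a)
    d = (R[1, 0] - R[0, 1]) / (4.0 * a)
    return np.array([a, b, c, d])
```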


The quaternion can also be continuously updated by the angular velocity. Suppose a rigid body rotates at a uniform angular velocity ω from the orientation represented by q0 to that represented by q1 during a time span Δt. Then q1 can be derived from q0 using Equation 1:

$$q_1 = \left( \cos\!\left(\frac{|\omega|\Delta t}{2}\right) I + \frac{\sin\!\left(\frac{|\omega|\Delta t}{2}\right)}{|\omega|}\, F_q(\omega) \right) q_0, \qquad (1)$$

where $|\cdot|$ is the L2 norm of a vector and

$$F_q(\omega) = \begin{pmatrix} 0 & -\omega_x & -\omega_y & -\omega_z \\ \omega_x & 0 & \omega_z & -\omega_y \\ \omega_y & -\omega_z & 0 & \omega_x \\ \omega_z & \omega_y & -\omega_x & 0 \end{pmatrix}.$$

Furthermore, if a rotation from some orientation o1 to orientation o2 is represented by $q = (a, b, c, d)^T$, the inverse of q, i.e., the rotation from orientation o2 to o1, is $q^{-1} = (a, -b, -c, -d)^T$. The composition of two rotations $q = (a, b, c, d)^T$ and $q' = (a', b', c', d')^T$ (first rotate by q′ and then by q) is given by

$$qq' = (aa'-bb'-cc'-dd',\; ab'+a'b+cd'-c'd,\; ac'+a'c+db'-d'b,\; ad'+a'd+bc'-b'c)^T.$$

The rotation quaternion also retains the rotation axis and angle: the rotation axis is given by the normalization of $(b, c, d)^T$, and the rotation angle is $2\cos^{-1} a$.
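The update of Equation 1 and the composition rule translate directly into code. The following is a hypothetical Python/NumPy sketch, not the authors' implementation; `omega` is an angular velocity in rad/s and `dt` the time span in seconds.

```python
import numpy as np

def F_q(omega):
    """The 4x4 matrix F_q(omega) from Equation 1."""
    wx, wy, wz = omega
    return np.array([
        [0.0, -wx,  -wy,  -wz],
        [wx,   0.0,  wz,  -wy],
        [wy,  -wz,   0.0,  wx],
        [wz,   wy,  -wx,   0.0],
    ])

def integrate_quaternion(q0, omega, dt):
    """Rotate q0 by a uniform angular velocity omega applied for dt seconds (Equation 1)."""
    w = np.linalg.norm(omega)
    if w < 1e-12:                       # negligible rotation: orientation unchanged
        return q0
    half_angle = 0.5 * w * dt
    q1 = (np.cos(half_angle) * np.eye(4) + (np.sin(half_angle) / w) * F_q(omega)) @ q0
    return q1 / np.linalg.norm(q1)      # renormalize to counter numerical drift

def quat_multiply(q, qp):
    """Composition qq' (first rotate by q', then by q)."""
    a, b, c, d = q
    ap, bp, cp, dp = qp
    return np.array([
        a*ap - b*bp - c*cp - d*dp,
        a*bp + ap*b + c*dp - cp*d,
        a*cp + ap*c + d*bp - dp*b,
        a*dp + ap*d + b*cp - bp*c,
    ])
```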

Error Model

For the accelerometer, we assume a linear error on each axis, i.e., $\hat{a}(t) = S\,a(t) + \gamma + \epsilon$, where S is a 3 × 3 diagonal matrix, γ is a 3-dimensional vector, and ε is a zero-mean 3-dimensional Gaussian random vector.

For the gyroscope, we also assume a linear error model. Formally speaking, $\hat{\omega}(t) = K\,\omega(t) + \beta + \eta$, where K is a 3 × 3 matrix, β is a 3-dimensional vector, and η is a zero-mean 3-dimensional Gaussian random vector.

For the digital compass, we assume there are no random errors, but allow deterministic errors. Formally, $\hat{m}(t) = f(m(t))$ for a deterministic continuous function f. In other words, $\hat{m}(t') = \hat{m}(t'')$ whenever the phone's orientation at t″ can be reached from its orientation at t′ by a rotation about the axis pointing north.

Accelerometer Calibration

We use the method in [4] for the calibration of the accelerometer. The random error ε can be eliminated by averaging the accelerometer readings over a time window. Allan variance [2] is utilized to find the shortest time window over which the random error can be eliminated. After the removal of the random error, we use the constant acceleration of gravity to help calibrate the accelerometer, as detailed in [4].

Topology Calibration for Gyroscope

Allan variance is also used to eliminate the random error η in the gyroscope. We design a brand new scheme called topology calibration to determine the scale factor errors K as well as the biases β of the gyroscope.

When we rotate the phone freely in space, the real field $m(t)$ runs on a sphere S, and the measured field $\hat{m}(t)$ runs on a closed surface f(S). We find all points of intersection, i.e., all pairs (t′, t″) such that $m(t') = m(t'')$. For each pair (t′, t″), we compute the rotation R between them from the gyroscope readings. Using the fact that the rotation axis of R points north for each pair (t′, t″), we want to find the K and β that concentrate all the R's. The details are given presently.

First, we compute all points of intersection of $\hat{m}$. Because the values of $\hat{m}$ are given as discrete samples, we use linear interpolation to convert $\hat{m}$ into a continuous function. We take $m(t') = m(t'')$ if $\hat{m}(t') = \alpha\,\hat{m}(t'')$, where α is a positive real number. More specifically, we want to find all pairs $t' = \lambda t_i + (1-\lambda)t_{i+1}$ and $t'' = \mu t_j + (1-\mu)t_{j+1}$ such that

$$\lambda\,\hat{m}(t_i) + (1-\lambda)\,\hat{m}(t_{i+1}) = \alpha \left( \mu\,\hat{m}(t_j) + (1-\mu)\,\hat{m}(t_{j+1}) \right),$$

where $\alpha > 0$, $0 \le \lambda \le 1$, $0 \le \mu \le 1$, and $t_i, t_{i+1}, t_j, t_{j+1}$ are the times of the samples.

Then, we compute the orientation, represented by a quaternion. We initialize the quaternion $q(t_0) = (1, 0, 0, 0)^T$, and update the quaternion for each gyroscope sample $\omega(t_i)$ sequentially. Since the time between two consecutive gyroscope samples is quite small, the rotation is treated as one of uniform angular velocity $\omega(t_i)$. With this approximation, Equation 1 is refined to Equation 2:

$$q(t_{i+1}) = \left( \cos\!\left(\frac{|\omega(t_i)|\Delta t}{2}\right) I + \frac{\sin\!\left(\frac{|\omega(t_i)|\Delta t}{2}\right)}{|\omega(t_i)|}\, F_q(\omega(t_i)) \right) q(t_i), \qquad (2)$$

where $\Delta t = t_{i+1} - t_i$.

Next, for the i-th intersection $(t'_i, t''_i)$, we compute the rotation $q(t''_i)\, q(t'_i)^{-1}$ between them, and find its rotation axis $u_i$ and rotation angle $\theta_i$. Ideally, $u_i$ should point north in the absence of error. We then compute the average rotation axis $\bar{u}$ of all the $u_i$'s, weighted by $(1 - \cos\theta_i)$, and finally normalize $\bar{u}$. The variation of all the $u_i$'s is defined by

$$\sum_i \left(1 - u_i \cdot \bar{u}\right),$$

where $(1 - u_i \cdot \bar{u})$ measures the squared distance between $u_i$ and $\bar{u}$ on the unit sphere (up to a constant factor).

Finally, we find the K and β that minimize this variation using the Nelder–Mead (downhill simplex) method, starting from K = I and $\beta = (0, 0, 0)^T$. In each iteration of the Nelder–Mead method we have to recompute the orientations, but we do not need to recompute the intersections.
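A compressed sketch of this search might look as follows. This is hypothetical Python/SciPy code rather than the authors' implementation: it assumes timestamped raw gyroscope samples `gyro_t`, `gyro_w` and precomputed intersection index pairs `pairs`, applies a candidate calibration by inverting the error model $\hat{\omega} = K\omega + \beta$, and reuses `integrate_quaternion` and `quat_multiply` from the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize

def orientation_track(gyro_t, gyro_w, K, beta):
    """Integrate corrected gyroscope samples into one quaternion per sample (Equation 2)."""
    K_inv = np.linalg.inv(K)
    q = np.array([1.0, 0.0, 0.0, 0.0])
    qs = [q]
    for i in range(len(gyro_t) - 1):
        omega = K_inv @ (gyro_w[i] - beta)        # invert the error model: measured = K*real + beta
        q = integrate_quaternion(q, omega, gyro_t[i + 1] - gyro_t[i])
        qs.append(q)
    return np.array(qs)

def axis_angle(q):
    """Rotation axis (normalized (b, c, d)) and rotation angle 2*acos(a) of a quaternion."""
    a = np.clip(q[0], -1.0, 1.0)
    v = q[1:]
    n = np.linalg.norm(v)
    axis = v / n if n > 1e-12 else np.array([0.0, 0.0, 1.0])
    return axis, 2.0 * np.arccos(a)

def variation(params, gyro_t, gyro_w, pairs):
    """Spread of the per-intersection rotation axes; small when K and beta are right."""
    K, beta = params[:9].reshape(3, 3), params[9:]
    qs = orientation_track(gyro_t, gyro_w, K, beta)
    axes, weights = [], []
    for i1, i2 in pairs:                           # sample indices where m(t') = m(t'')
        q_inv = qs[i1] * np.array([1.0, -1.0, -1.0, -1.0])    # q(t')^{-1}
        u, theta = axis_angle(quat_multiply(qs[i2], q_inv))   # q(t'') q(t')^{-1}
        axes.append(u)
        weights.append(1.0 - np.cos(theta))
    axes = np.array(axes)
    u_bar = np.average(axes, axis=0, weights=weights)
    u_bar /= np.linalg.norm(u_bar)
    return float(np.sum(1.0 - axes @ u_bar))

# Start from K = I, beta = 0 and minimize the variation with Nelder-Mead:
# x0 = np.concatenate([np.eye(3).ravel(), np.zeros(3)])
# result = minimize(variation, x0, args=(gyro_t, gyro_w, pairs), method="Nelder-Mead")
```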

Camera Path Tracking

In this part, we utilize the sensors to track the movement of the camera. The position of a rigid body is represented by the combination of its linear position (or displacement) and its angular position (or orientation).

Orientation

The angular velocity $\omega = (\omega_x, \omega_y, \omega_z)^T$ measured by the gyroscope of the mobile phone is used to track the orientation, which can be represented in several ways, e.g., via a rotation matrix, Euler angles, a quaternion, and so on, as discussed in the Preliminaries. We use the quaternion in this work because it is simpler to compose and avoids the problem of gimbal lock


compared to Euler angles, and is more numerically stable compared to the rotation matrix. The quaternion is also updated by the angular velocity given by the calibrated gyroscope, as stated in Equation 2.

In order to determine the initial quaternion, the phone is required to be static for a fixed period of time. In this static state, the accelerometer can derive the direction of gravity and the magnetometer the direction of north, which together provide the initial orientation of the phone.

Displacement

Accelerometers in mobile phones give the acceleration in the phone reference coordinate system (denoted $a_p$). However, the calculation of displacement requires the acceleration in an absolute reference coordinate system (denoted $a_a$). $a_a$ is obtained from $a_p$ using Equation 3:

$$a_a = R\, a_p, \qquad (3)$$

where R is the rotation matrix. Gravity can then be directly subtracted from $a_a$ to isolate only the acceleration caused by camera motion. Hereafter, a is used to represent the acceleration without gravity in the absolute reference coordinate system.
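As a small illustration, Equation 3 and the gravity subtraction can be sketched as follows. This is hypothetical Python; `q` is the current orientation quaternion, `a_p` a calibrated phone-frame accelerometer reading, and the gravity vector and its sign convention are assumptions that depend on the platform.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, 9.81])    # assumed absolute-frame gravity; axis and sign are platform dependent

def motion_acceleration(q, a_p):
    """Rotate a phone-frame acceleration into the absolute frame (Equation 3) and remove gravity."""
    R = quat_to_rot(q)                  # orientation as a rotation matrix (see the earlier sketch)
    a_a = R @ a_p                       # Equation 3: a_a = R a_p
    return a_a - GRAVITY                # acceleration caused by camera motion only
```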

Generally speaking, displacement can be calculated by applying a second-order numerical integration algorithm to a. However, calibration is only able to decrease bias and scale factor errors, leaving other errors, such as nonlinear ones, unresolved. These small unaccounted-for errors in the measurement of acceleration and angular velocity are integrated into progressively larger errors in velocity, which are then further compounded into even greater errors in displacement. Therefore, we design a pipeline of three steps, consisting of a high pass filter, a zero velocity revision, and a noise gate, which we apply during the integration process to reduce the residual displacement errors that calibration cannot remove.

Step 1. High Pass Filter

Because the biases of the gyroscope depend on temperature, the calibration procedure cannot totally eliminate them. This makes the rotation matrix tilt the phone's estimated orientation. As a result, gravity still influences the acceleration measured along the horizontal plane (i.e., along the x-y axes). The high pass filter aims at decreasing this influence. The acceleration a is fed to the filter as a sequence, and we let $a_i$ denote the i-th acceleration update. The output of the high pass filter, $\mathrm{HPF}(a_i)$, is calculated using Equation 4:

$$\mathrm{HPF}(a_i) = a_i - \mathrm{LPF}(a_i) = a_i - \left[ (1 - e^{-\Delta t/\beta})\, a_i + e^{-\Delta t/\beta}\, \mathrm{LPF}(a_{i-1}) \right] = e^{-\Delta t/\beta} \left( a_i - \mathrm{LPF}(a_{i-1}) \right), \qquad (4)$$

where $\mathrm{LPF}(a_i)$ is the result of passing $a_i$ through a low pass filter determined by an attenuation factor β, and Δt is the time interval between $a_i$ and $a_{i-1}$.
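Equation 4 amounts to only a few lines of code. The sketch below is hypothetical Python rather than the authors' implementation; `accels` is the sequence of acceleration vectors, `times` their timestamps, and `beta` the attenuation factor (1 second in Figure 1).

```python
import numpy as np

def high_pass(accels, times, beta=1.0):
    """Apply the exponential high pass filter of Equation 4 to a sequence of accelerations."""
    lpf = accels[0]                                   # initialize the low-pass state with the first sample
    out = [np.zeros_like(accels[0])]                  # HPF of the first sample is taken as zero
    for i in range(1, len(accels)):
        k = np.exp(-(times[i] - times[i - 1]) / beta)
        lpf = (1.0 - k) * accels[i] + k * lpf         # LPF(a_i)
        out.append(accels[i] - lpf)                   # HPF(a_i) = a_i - LPF(a_i)
    return np.array(out)
```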

Figure 1 shows the effect of the high pass filter defined in Equation 4, where β is set to 1 second. Obviously, the accelerations along the two orthogonal horizontal axes are both drawn back to near zero.

Figure 1. The effect of the high pass filter acting on the acceleration of the camera during the 10-photo shoot of Figure 8. The vertical black lines mark the times at which the photos were taken. The green line shows the red-line signal after passing through the high pass filter, and the pink line shows the blue-line signal after passing through the high pass filter.

Step 2. Zero Velocity Revision

The output of the high pass filter still has errors. As shown in Figure 1, the adjusted acceleration is still approximately a non-zero constant even when the camera is at rest. Consequently, the calculated velocity deviates away from zero over time, as shown in Figure 2. In this step, we remove this kind of error based on the assumption that the velocity of the camera is small when the user takes photos. This assumption is reasonable for cell-phone cameras, because a photo taken while the phone is moving quickly will be blurry.

We denote the moment when the shutter opens for the i-th photo by $t^{(i)}$, and assume the velocity at $t^{(i)}$ is small. We want to add a constant $c^{(i)}$ to the accelerations between $t^{(i)}$ and $t^{(i+1)}$ such that the velocity at $t^{(i+1)}$ is zero. This models a constant error in the acceleration which, propagated into the velocity, produces a linear drift (see Figure 2). Under this constant-error assumption, we have the equation

$$\int_{t^{(i)}}^{t^{(i+1)}} \left( a_H(t) + c^{(i)} \right) dt = 0,$$

where $a_H(t)$ is the output acceleration of the high pass filter. Solving the equation gives

$$c^{(i)} = -\frac{\int_{t^{(i)}}^{t^{(i+1)}} a_H(t)\, dt}{t^{(i+1)} - t^{(i)}}.$$

Therefore, our algorithm is as follows: we integrate the accelerations to compute $v^{(i+1)}$ as before (possibly non-zero). Then we subtract $-c^{(i)} = v^{(i+1)} / (t^{(i+1)} - t^{(i)})$ from the accelerations between $t^{(i)}$ and $t^{(i+1)}$. Finally, we compute the velocity again.
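The revision can be sketched as follows (hypothetical Python, not the authors' code; `a_h` is the array of high-pass-filtered accelerations, `times` their timestamps, and `shots` the sample indices at which photos were taken):

```python
import numpy as np

def zero_velocity_revision(a_h, times, shots):
    """Add a constant to the accelerations between consecutive shots so that the
    integrated velocity returns to zero at every shot (constant-error assumption)."""
    a = a_h.copy()
    for i0, i1 in zip(shots[:-1], shots[1:]):
        dt = np.diff(times[i0:i1 + 1])[:, None]
        v_end = np.sum(a[i0:i1] * dt, axis=0)          # velocity accumulated over the segment
        a[i0:i1] -= v_end / (times[i1] - times[i0])    # i.e. add c^{(i)} to every sample in the segment
        # integrating a[i0:i1] again now ends the segment with (approximately) zero velocity
    return a
```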

Figure 2 shows the velocities. The green and pink lines are much more accurate than the original red and blue lines.


Figure 2. The effect of zero velocity revision in the experiment shown in Figure 8. The red and blue lines are the velocities along two axes output by the high pass filter, corresponding to the red and blue lines in Figure 1. They are reset when photos are taken, and shown by the green and pink lines, respectively. The vertical black lines mark the times at which photos were taken.

Step 3. Noise Gate

In the beginning, the measured acceleration a(t) is corrupted with errors, so we apply a high pass filter to it. The resulting acceleration $a_H(t)$ seems good, but the velocity derived from it deviates from zero. As a result, we "reset" the velocity at snapshot points to produce a more accurate velocity approximation. Now, we come to the final step of calculating the displacement via the integration of velocity, illustrated by the red and blue lines in Figure 3. The approximated displacements stray away from zero with time because, just as the non-zero acceleration causes a deviation of the velocity in Step 2, the velocity is non-zero when the camera is actually still. From Figure 2, we see that the velocity is almost zero most of the time. So, in this step, we set the velocity to zero when it is sufficiently small.

A noise gate is parameterized by an open threshold, a close threshold, and a hold time. The noise gate opens when the input exceeds the open threshold, and it closes when the input drops below the close threshold or stays below the open threshold for the hold time.
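A noise gate over the speed signal could be sketched as below; this is hypothetical Python, with the thresholds and hold time matching those used in Figure 3. Samples where the returned mask is False have their velocity forced to zero before the displacement is re-integrated.

```python
import numpy as np

def noise_gate(speeds, times, open_th=0.04, close_th=0.01, hold=0.5):
    """Return a mask that is True where the gate is open and False where it is closed."""
    keep = np.zeros(len(speeds), dtype=bool)
    is_open = False
    below_since = None                         # time since the input last dropped under the open threshold
    for i, (s, t) in enumerate(zip(speeds, times)):
        if not is_open:
            if s > open_th:                    # open the gate when the input exceeds the open threshold
                is_open, below_since = True, None
        else:
            if s < close_th:                   # close immediately below the close threshold
                is_open = False
            elif s < open_th:                  # close after staying under the open threshold for `hold`
                below_since = t if below_since is None else below_since
                if t - below_since >= hold:
                    is_open, below_since = False, None
            else:
                below_since = None
        keep[i] = is_open
    return keep
```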

In Figure 3, we can see that the displacement computed with the noise gate (shown as the green and pink lines) stays constant while the gate is closed. This results in a more faithful approximation.

Prediction of Displacement between Photos

In this part, we make use of our approximation of the camera's movement to compute the displacement between photos in pixels.

When the i-th photo is taken, we denote the position of the phone by $p_i$, the orientation matrix by $A_i$, and the distance between the scenery and the phone by $d_i$. Without loss of generality, we assume the camera points at $r = (0, 0, 1)^T$ in the reference system of the phone.

The position of the object the camera points to is $A_i d_i r + p_i$.

Figure 3. The effect of applying the noise gate. The red and blue lines are the displacements along the two non-vertical axes. We apply the noise gate to the velocity and recompute the displacement, shown by the green and pink lines. We choose the open threshold to be 0.04 m/s, the close threshold to be 0.01 m/s, and the hold time to be 0.5 s.

If we take two successive photos, which are very similar but have an offset, we call this offset the displacement between the two photos. Formally speaking, assume a real-world object is projected onto $(x_1, y_1)$ in the first image and onto $(x_2, y_2)$ in the second image; this implies that the object lies somewhere in the overlapped area of the two images. We define the displacement of the two successive images as $(x_1 - x_2,\, y_1 - y_2)$.

We choose the placement of the phone during the first photo as the starting reference, so $A_1 = I$ and $p_1 = 0$. We assume the two photos overlap and the movement of the phone is small. When the object is far from the camera, we can assume that $d_1 = d_2$. Denote the horizontal view angle of the camera by θ and the width of the photo by w. Consider a simple case where the phone does not rotate around r, i.e., the eigenvector of $A_2$ is orthogonal to r. In this scenario, the displacement between the two photos can be estimated by the x and y coordinates of the vector

$$\frac{A_2 d_2 r + p_2}{d_2} \cdot \frac{w}{2\tan\frac{\theta}{2}}.$$

Figure 4. Displacement between photos.
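Under these assumptions (d1 = d2 = d and no rotation about r), the prediction can be sketched as follows. This is hypothetical Python; `A2` and `p2` are the orientation matrix and position at the second photo, `theta` the horizontal view angle in radians, and `w` the image width in pixels.

```python
import numpy as np

def predict_displacement(A2, p2, d, theta, w):
    """Predicted (x, y) displacement in pixels between two successive photos."""
    r = np.array([0.0, 0.0, 1.0])               # the camera points along r in the phone frame
    v = (A2 @ (d * r) + p2) / d                 # the object the second camera points at, scaled by depth
    scale = w / (2.0 * np.tan(theta / 2.0))     # pixels per unit of normalized image coordinate
    return v[0] * scale, v[1] * scale
```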

Alignment Speed Up

Image alignment is used to discover the correspondence relationship among overlapping images. Pixel-based alignment and feature-based alignment are two popular approaches.

Pixel-based alignment attempts to minimize the distance between two images, where the distance is defined to be the


summation of the distances between each pair of pixels. Formally speaking, for two images $I_1$ and $I_2$ of size w × h, let $I_i(x, y)$ be the color of the (x, y) pixel. The distance between the two images is

$$\mathrm{dist}(I_1, I_2) = \sum_{(x,y)\in I_1 \cap I_2} \mathrm{dist}\big(I_1(x, y),\, I_2(x, y)\big).$$

The distance between two colors can be defined in many ways. For example, for gray images, we can define the distance between two colors $c_1$ and $c_2$ in the following ways:

$$\mathrm{dist}(c_1, c_2) = |c_1 - c_2|,$$

$$\mathrm{dist}(c_1, c_2) = (c_1 - c_2)^2,$$

$$\mathrm{dist}(c_1, c_2) = \begin{cases} 1 & \text{if } |c_1 - c_2| > a \\ 0 & \text{otherwise,} \end{cases}$$

or

$$\mathrm{dist}(c_1, c_2) = \frac{(c_1 - c_2)^2}{1 + (c_1 - c_2)^2 / a^2}.$$

A typical algorithm is to try every offset (u, v) between the two images, compute the distance, and then choose the best one. More specifically, we want to minimize the average distance

$$\frac{\sum_{(x,y)\in J} \mathrm{dist}\big(I_1(x, y),\, I_2(x - u, y - v)\big)}{|J|},$$

where J is the intersection between $I_1$ and $I_2$ shifted by (u, v), and $-w \le u \le w$, $-h \le v \le h$.

We can speed up the search process if we have a good prediction of the displacement. We first search for (u, v) around our prediction. If our prediction is not far from the real displacement, we will find the best alignment after only a short time.
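For example, the brute-force offset search can be restricted to a small window around the sensor prediction. The following is a hypothetical Python sketch (not the authors' code) using grayscale images stored as float NumPy arrays; `predicted` is the sensor-based displacement in pixels and `radius` the assumed search window.

```python
import numpy as np

def best_offset(I1, I2, predicted, radius=20):
    """Find the offset (u, v) minimizing the mean squared pixel distance,
    searching only within `radius` pixels of the sensor prediction."""
    h, w = I1.shape
    pu, pv = int(round(predicted[0])), int(round(predicted[1]))
    best, best_cost = (pu, pv), np.inf
    for u in range(pu - radius, pu + radius + 1):
        for v in range(pv - radius, pv + radius + 1):
            # overlapping region of I1 and I2 shifted by (u, v), in I1 coordinates
            x0, x1 = max(0, u), min(w, w + u)
            y0, y1 = max(0, v), min(h, h + v)
            if x1 <= x0 or y1 <= y0:
                continue
            diff = I1[y0:y1, x0:x1] - I2[y0 - v:y1 - v, x0 - u:x1 - u]
            cost = np.mean(diff ** 2)
            if cost < best_cost:
                best, best_cost = (u, v), cost
    return best
```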

Feature-based alignment is more robust than pixel-based alignment. There are many feature detectors and descriptors, e.g., Harris [6], FAST [8, 9], SIFT [7], and SURF [3]. All of these employ two steps: feature detection and feature matching. Feature detection finds the key points and computes their descriptors; one assumes that the descriptor of a key point does not change too much between two images that both contain the key point. In the feature matching step, we search for pairs of key points that have similar descriptors.

With a prediction of the displacements, both feature detection and feature matching can be performed faster. In the feature detection step, we do not need to find key points over the whole image if we have a prediction of the alignment. Instead, we can focus on finding only those key points that lie in the overlapped regions. The less the images overlap, the faster feature detection can be accomplished. In the matching step, we can speed up once again: we do not compare every pair of key points, but only try to match those pairs within the overlapping region that lie near our prediction of their displacement. This allows us to reduce the number of key point comparisons significantly.
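In the matching step, candidate pairs can be filtered by the predicted displacement before their descriptors are compared. The following is a hypothetical Python sketch; `kp1`, `kp2` hold key point coordinates, `desc1`, `desc2` their descriptors, and `tol` is an assumed tolerance on how far a pair may deviate from the prediction.

```python
import numpy as np

def match_near_prediction(kp1, desc1, kp2, desc2, predicted, tol=30.0):
    """Match only key point pairs whose coordinate difference lies near the
    sensor-predicted displacement, instead of comparing every pair."""
    matches = []
    for i, (p1, d1) in enumerate(zip(kp1, desc1)):
        best_j, best_dist = -1, np.inf
        for j, (p2, d2) in enumerate(zip(kp2, desc2)):
            if np.linalg.norm((p1 - p2) - predicted) > tol:
                continue                              # inconsistent with the sensor prediction: skip
            dist = np.linalg.norm(d1 - d2)            # descriptor distance
            if dist < best_dist:
                best_j, best_dist = j, dist
        if best_j >= 0:
            matches.append((i, best_j))
    return matches
```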

Furthermore, if we have more images to align, the displacement given by the sensors helps even more. The predicted displacements tell us which pairs of images overlap and which do not. Typically, the images are sparse (each image intersects a constant number of other images), which allows us to reduce the number of pairs of images to compare from Ω(n²) to O(n), where n is the number of images. Using INS, we can speed up the alignment process and improve the correctness of the panorama, especially in the pixel-based approach.

EXPERIMENT RESULTS

In the experiments, we use the Nexus S [1], an Android mobile phone equipped with a 3-axis gyroscope, an accelerometer, a magnetometer, and front and rear facing cameras. The Android API gives the view angle by Camera.Parameters.getHorizontalViewAngle() or Camera.Parameters.getVerticalViewAngle().

We show not only the results for distant views (Figures 5, 6, and 7), but also those for close shots (Figures 8 and 9). Using sensors to predict movement does indeed speed up the alignment in both the SIFT feature-based approach (Figures 5, 6, 7, and 8) and the pixel-based approach (Figure 9).

Accuracy of Prediction of Movement

Figure 5(c), Figure 6(b), and Figure 7(b) show the camera paths estimated by the sensors compared to the real paths. The red crosses mark the estimated camera path, the blue squares are the estimated displacements on this path, and the filled squares are the "real" displacements of the camera from the stitched image, as computed by the alignment algorithm. All of these paths start at (0, 0). Figures 8(c) and 9(b) show the estimated camera path when taking close shots: Figure 8(c) shows the estimated displacement of the camera in meters, and Figure 9(b) shows the displacement in pixels. We first calculate the displacement in meters and then convert it into pixels using the estimated distance between the scenery and the camera.

Because Figures 5(b), 6(a), and 7(a) are distant views, the moving paths in these views are more sensitive to camera rotation than to translational motion. On the other hand, Figures 8(b) and 9(a) are close shots, and camera rotation is minimal (otherwise there would be a large change in the depth of field, the processing of which is left as future work). As a result, the moving path for close panoramas is sensitive to the translational motion of the camera. The accelerometer plays an important role in both of these cases.

In Figures 7(b) and 8(c), the moving paths of the camera are both long. Predicted displacement errors grow with time because of cumulative effects in the integration process. These cumulative errors are manageable, because photos are aligned sequentially and only the displacement between two successive photos is used. Also, the displacement calculated by alignment can help to adjust the estimated displacement.


Figure 5. Four photos of distant views. (a) The original four photos. (b) The SIFT-based stitching result of the four photos. (c) The estimated displacements between the centers of the photos, starting from (0,0). The blue empty squares and the black filled squares mark the displacements estimated from INS signals and the "real" displacements calculated by SIFT-based alignment, respectively. The red crosses show the estimated camera path over the whole procedure of taking the four photos.

Figure 6. Six photos of distant views. (a) The SIFT-based stitching result of the six photos. (b) The estimated displacements between the centers of the photos, starting from (0,0). The legend has the same meaning as in Figure 5(c).

Figure 7. Fifteen photos of distant views. (a) The SIFT-based stitching result of the 15 photos. (b) The estimated displacements between the centers of the photos, starting from (0,0). The legend has the same meaning as in Figure 5(c). Unlike Figures 5(c) and 6(b), the 15 photos were taken with free-hand camera movement.


Figure 8. Ten close-shot photos. (a) The original ten photos. (b) The SIFT-based stitching result of the 10 photos. (c) The estimated movement path of the camera in meters, starting from (0,0). The squares mark the positions of the camera when the photos were taken.

Speed Improvement

Table 1 shows that we save time when we utilize the information provided by the sensors in the phone.

The largest amount of time is saved in Figure 9(a) with the pixel-based method. This is mainly because the search space decreases dramatically thanks to our accurate displacement prediction, as shown in Figure 9(b).

The speed in computing Figure 5(b) is greatly improved as well. In the SIFT feature-based method, searching for key points and computing their descriptors are the most time-consuming steps in the whole procedure. In this example, the overlapped area is small, so we only need to find the key points in a markedly reduced area.

REMARKS AND FUTURE WORK

As Table 1 shows, the running time of our program is still slow: users have to wait several seconds after taking each photo. In the future, we can improve the running time by choosing smaller image sizes, adjusting the parameters of SIFT, or replacing SIFT with other feature-based methods such as SURF. We believe that, no matter which alignment method is used, the predicted moving path given by the sensors can speed up the alignment process.

Figure 9. Two close-shot photos. (a) The pixel-based stitching result of the two photos. (b) The estimated displacements between the centers of the photos in pixels, starting from (0,0). The squares mark the centers of the photos.

In the Experiment Results section, we only tested translational camera motion in close shots. If the camera is rotated while taking consecutive pictures at close range, it is hard to maintain the orientation of the camera precisely and in real time, so we cannot easily isolate and remove gravity from the acceleration readings. Since gravity is much greater than the acceleration caused by the camera's motion, the result is a large error in the predicted displacement. Therefore, we wish to find better approaches to compute the orientation and displacement of the camera in cases that involve complicated motion.

Figure   Method          Time without sensors   Time with sensors   Rate of decrease
6(a)     feature-based   19.9 s                 18.4 s              7.5%
5(b)     feature-based   11 s                   6.3 s               43%
7(a)     feature-based   69 s                   50 s                28%
8(b)     feature-based   49 s                   43 s                12%
9(a)     pixel-based     7.3 s                  1.7 s               77%

Table 1. Comparison of creating panoramas with/without sensors.


REFERENCES

1. Nexus S. http://www.google.com/nexus/#.

2. D. Allan and J. Barnes. A modified Allan variance with increased oscillator characterization ability. In Thirty Fifth Annual Frequency Control Symposium, pages 470–475. IEEE, 1981.

3. H. Bay, T. Tuytelaars, and L. J. V. Gool. SURF: Speeded up robust features. In A. Leonardis, H. Bischof, and A. Pinz, editors, ECCV (1), volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer, 2006.

4. Y. Gao, Q. Yang, G. Li, E. Chang, D. Wang, C. Wang, H. Qu, P. Dong, and F. Zhang. XINS: The Anatomy of an Indoor Positioning and Navigation Architecture (unpublished). In The First International Workshop on Mobile Location-Based Service. ACM, 2011.

5. S. Ha, H. Koo, S. Lee, N. Cho, and S. Kim. Panorama mosaic optimization for mobile camera systems. IEEE Transactions on Consumer Electronics, 53(4):1217–1225, 2007.

6. C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, page 50. Manchester, UK, 1988.

7. D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, pages 1150–1157, 1999.

8. E. Rosten and T. Drummond. Fusing points and lines for high performance tracking. In IEEE International Conference on Computer Vision, volume 2, pages 1508–1511, October 2005.

9. E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.

10. Y. Xiong and K. Pulli. Fast image stitching and editing for panorama painting on mobile phones. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 47–52. IEEE, 2010.
