Real-time Feature-based Video Mosaicing at 500 fps
Ken-ichi Okumura, Sushil Raut, Qingyi Gu, Tadayoshi Aoyama, Takeshi Takaki, and Idaku Ishii
K. Okumura, S. Raut, Q. Gu, T. Aoyama, T. Takaki, and I. Ishii are with Hiroshima University, Hiroshima 739-8527, Japan (corresponding author: I. Ishii; Tel: +81-82-424-7692; e-mail: [email protected]).
Abstract: We conducted high-frame-rate (HFR) video mosaicing for real-time synthesis of a panoramic image by implementing an improved feature-based video mosaicing algorithm on a field-programmable gate array (FPGA)-based high-speed vision platform. In the implementation of the mosaicing algorithm, feature point extraction was accelerated by implementing a parallel processing circuit module for Harris corner detection in the FPGA on the high-speed vision platform. Feature point correspondence matching can be executed for hundreds of selected feature points in the current frame by searching for those in the previous frame within their neighbor ranges, assuming that frame-to-frame image displacement becomes considerably smaller in HFR vision. The system we developed can mosaic 512×512 images at 500 fps into a single synthesized image in real time by stitching the images based on their estimated frame-to-frame changes in displacement and orientation. The results of an experiment, in which an outdoor scene was captured using a hand-held camera head that was quickly moved by hand, verify the performance of our system.
I. INTRODUCTION
Video mosaicing is a video processing technique in which
multiple images are merged into a composite image that
covers a larger, more seamless view than the field of view
of the camera. Video mosaicing has been used to generate
panoramic pictures in many applications such as tracking,
surveillance, and augmented reality. Most video mosaicing
methods [1] determine the transform parameters between
images by calculating feature points using feature extractors
such as the KLT tracker [2] and SIFT features [3] to match
feature point correspondences between images. Many real-
time video mosaicing approaches for fast image registration
have been proposed. Kourogi et al. proposed a fast image reg-
istration method that uses pseudo motion vectors estimated
as gradient-based optical flow [4]. Civera et al. developed
a drift-free video mosaicing system that obtains a consistent
image from 320240 images at 30 fps using an EKF-SLAMapproach when previously viewed scenes are revisited [5].
de Souza et al. performed real-time video mosaicing without
over-deformation of mosaics using a non-rigid deformation
model [6]. Most of these proposals only apply to standard videos at relatively low frame rates (e.g., NTSC 30 fps).
Further, the camera has to be operated slowly, without any
large displacement between frames, in order to prevent diffi-
culties in matching feature point correspondences. However,
in scenarios involving rapid camera motion, video images
at such low frame rates are insufficient for tracking feature
points in the images. Thus, there is a demand for real-time
high-frame-rate (HFR) video mosaicing at rates of several
hundred frames per second and greater.
Many field-programmable gate array (FPGA)-based HFR
vision systems have been developed for real-time video
processing at hundreds of frames per second and greater
[7]–[9]. Although HFR vision systems can simultaneously calculate scalar image features, they cannot transfer the entire image at high speed to a personal computer (PC) owing to limitations in the transfer speed of their inner buses. We recently
developed IDP Express [10], an HFR vision platform that
can simultaneously process an HFR video and directly map
it onto memory allocated in a PC. Implementing a real-
time function to extract feature points in images and match their correspondences between frames on such an FPGA-
based high-speed vision platform would make it possible to
accelerate video mosaicing at a higher frame rate than 30
fps, even when the camera moves quickly.
In this study, we developed a real-time video mosaicing
system that operates on 512×512 images at 500 fps. The system can simultaneously extract feature points in images, match their correspondences between frames for estimation of frame-to-frame displacement, and stitch the captured images into a panoramic image to give a seamless, wider field of view.
II. ALGORITHM
Most video mosaicing algorithms are realized by executing the following processes [1]: (1) feature point detection, (2)
feature point matching, (3) transform estimation, and (4)
image composition. In many applications, image composition
at dozens of frames per second is sufficient for the human
eye to monitor a panoramic image. However, feature tracking
at dozens of frames per second is not always accurate when
the camera is moving rapidly, because most feature points
are mistracked when image displacements between frames
become large. In HFR vision, frame-to-frame displacement
in feature points becomes considerably smaller. Narrowing
the search range by assuming HFRs reduces the inaccuracy
and computational load needed to match feature points
between frames, even when hundreds of feature points are observed from a rapidly moving camera, whereas image
composition at an HFR requires a heavy computational load.
We introduce a real-time video mosaicing algorithm that
can resolve this trade-off between tracking accuracy and
computational load in video mosaicing for rapid camera
motion. In our algorithm, steps (1)–(3) are executed at an
HFR on an HFR vision platform, and step (4) is executed
at dozens of frames per second to enable the human eye
to monitor the panoramic image. In this study, we focus on
acceleration of steps (1)–(3) for HFR feature point tracking
at hundreds of frames per second or more on the basis of
the following concepts:
(a) Feature point detection accelerated by hardware logic
Generally, feature point detection requires pixel-level computation of the order O(N²), where N² is the image size. Gradient-based feature point detection is suitable for acceleration by hardware logic because the local calculation of brightness gradients can be easily parallelized as pixel-level computation. In this study, we accelerate gradient-based feature point detection by implementing its parallel circuit logic on an FPGA-based HFR vision platform.
(b) Reduction in the number of feature points
Thousands of feature points are not always required for
estimating transform parameters between images in video
mosaicing. The computational load can be reduced by se-
lecting a smaller number of feature points because feature-
level computation of the order O(M²) is generally required in feature point matching; M is the number of selected
feature points. In this study, the number of feature points used
in feature point matching is reduced by excluding closely
crowded feature points, as they often generate correspondence errors between images in feature point matching.
(c) Narrowed search range by assuming HFR video
Feature point matching still requires heavy computation of the order O(M²) if all the feature points must be matched with each other between frames, even when a reduced number of feature points is used. In an HFR video, it can be assumed that frame-to-frame image displacement becomes considerably smaller, which allows a smaller search range to be used for points corresponding to points in the preceding frame. This narrowed search range reduces the computational load of matching corresponding feature points between frames to the order of O(M).
Our algorithm for real-time HFR video mosaicing consists
of the following processes:
(1) Feature point detection
(1-a) Feature extraction
To extract feature points such as upper-left vertices, we use the following brightness gradient matrix C(x, t):

C(x,t) = \sum_{x' \in N_a(x)} \begin{pmatrix} I'^2_x(x',t) & I'_x(x',t)\,I'_y(x',t) \\ I'_x(x',t)\,I'_y(x',t) & I'^2_y(x',t) \end{pmatrix},   (1)

where N_a(x) is the a×a adjacent area of pixel x = (x, y), and I'_x(x, t) and I'_y(x, t) indicate the following positive values of I_x(x, t) and I_y(x, t), respectively:

I'_\xi(x,t) = \begin{cases} I_\xi(x,t) & (I_\xi(x,t) > 0) \\ 0 & (\text{otherwise}) \end{cases} \quad (\xi = x, y),   (2)

where I_x(x, t) and I_y(x, t) are the x and y differentials of the input image I(x, t) at pixel x at time t.
λ(x, t) is defined as a feature for feature tracking using Harris corner detection [11] as follows:

\lambda(x,t) = \det C(x,t) - \kappa\,(\operatorname{Tr} C(x,t))^2.   (3)

Here, κ is a tunable sensitivity parameter; values in the range 0.04–0.15 have been reported as feasible.
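For illustration, the feature map of Eqs. (1)–(3) can be prototyped in a few lines of NumPy/SciPy; this is a minimal software sketch, not the authors' FPGA implementation. The 3×3 Prewitt differentials and κ = 0.0625 follow the hardware description in Section III, while the function and variable names are ours.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def harris_feature(img, a=3, kappa=0.0625):
    """Feature map lambda(x, t) of Eq. (3) for one grayscale frame."""
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
    Ix = convolve(img.astype(float), kx)      # 3x3 Prewitt x-differential
    Iy = convolve(img.astype(float), kx.T)    # 3x3 Prewitt y-differential
    Ixp = np.maximum(Ix, 0.0)                 # positive gradients, Eq. (2)
    Iyp = np.maximum(Iy, 0.0)
    # Sums of gradient products over the a x a area N_a(x), Eq. (1)
    Sxx = uniform_filter(Ixp * Ixp, size=a) * a * a
    Syy = uniform_filter(Iyp * Iyp, size=a) * a * a
    Sxy = uniform_filter(Ixp * Iyp, size=a) * a * a
    # lambda = det C - kappa * (Tr C)^2, Eq. (3)
    return Sxx * Syy - Sxy * Sxy - kappa * (Sxx + Syy) ** 2
```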
(1-b) Feature point detection
Thresholding is conducted for λ(x, t) with a threshold T to obtain a map of feature points, R(x, t), as follows:

R(x,t) = \begin{cases} 1 & (\lambda(x,t) > T) \\ 0 & (\text{otherwise}) \end{cases}.   (4)

The number of feature points where R(x, t) = 1 is counted in the p×p adjacent area of x as follows:

P(x,t) = \sum_{x' \in N_p(x)} R(x',t),   (5)

where P(x, t) indicates the density of the feature points.
(1-c) Selection of feature points
To reduce the number of feature points, closely crowded
feature points are excluded by counting the number of feature
points in the neighborhood. The reduced set of feature points R'(t) is calculated at time t as follows:

R'(t) = \{\, x \mid P(x,t) \le P_0 \,\},   (6)

where P_0 is a threshold used to sparsely select feature points. We assume that the number of selected feature points is less than M.
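A matching sketch of steps (1-b) and (1-c), reusing the imports above; T, p, and P_0 are the thresholds of Eqs. (4)–(6), and the comparison P ≤ P_0 implements the sparse selection of Eq. (6):

```python
def select_sparse_features(lam, T, p=5, P0=3):
    """Sparse feature point list from the feature map, Eqs. (4)-(6)."""
    R = lam > T                                    # feature point map, Eq. (4)
    # Feature point density P(x, t) in the p x p area N_p(x), Eq. (5)
    P = uniform_filter(R.astype(float), size=p) * p * p
    keep = R & (P <= P0 + 0.5)                     # Eq. (6); +0.5 guards rounding
    ys, xs = np.nonzero(keep)
    return np.stack([xs, ys], axis=1)              # (x, y) rows, at most ~M points
```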
(2) Feature point matching
(2-a) Template matching
To enable correspondence between feature points at the current time t and those at the previous time t_p = t − nΔt, template matching is conducted for all the selected feature points in an image; Δt is the shortest frame interval of the vision system, and n is an integer. For template matching that relates the i-th feature point at time t_p belonging to R'(t_p), x_i(t_p) (1 ≤ i ≤ M), to the i'-th feature point at time t belonging to R'(t), x_{i'}(t) (1 ≤ i' ≤ M), the sum of absolute differences (SAD) is calculated as follows:

E(i,i';t,t_p) = \sum_{\delta=(\delta_x,\delta_y) \in W_m} \left| I(x_{i'}(t)+\delta,\, t) - I(x_i(t_p)+\delta,\, t_p) \right|,   (7)

where W_m is an m×m window for template matching.

To reduce the number of mismatched points, x̂(x_i(t_p); t), which indicates the feature point at time t corresponding to the i-th feature point x_i(t_p) at time t_p, and x̂(x_{i'}(t); t_p), which indicates the feature point at time t_p corresponding to the i'-th feature point x_{i'}(t) at time t, are bidirectionally searched by selecting the feature points for which E(i, i'; t, t_p) is at a minimum in their adjacent areas as follows:

\hat{x}(x_i(t_p); t) = x_{i'(i)}(t) = \arg\min_{x_{i'}(t) \in N_b(x_i(t_p))} E(i,i';t,t_p),   (8)

\hat{x}(x_{i'}(t); t_p) = x_{i(i')}(t_p) = \arg\min_{x_i(t_p) \in N_b(x_{i'}(t))} E(i,i';t,t_p),   (9)

where i'(i) and i(i') are the index numbers of the feature point at time t corresponding to x_i(t_p) and of the feature point at time t_p corresponding to x_{i'}(t), respectively. A pair of feature points between times t and t_p is selected when the corresponding feature points mutually select each other as follows:

\tilde{x}(x_i(t_p); t) = \begin{cases} \hat{x}(x_i(t_p); t) & (i = i(i'(i))) \\ \emptyset & (\text{otherwise}) \end{cases},   (10)
where ∅ indicates that no feature point at time t corresponds to the i-th feature point x_i(t_p) at time t_p; feature points at time t that have no corresponding point at time t_p are adjudged to be newly appearing feature points.

We assume that the image displacement between times t and t_p is small, so that the feature point x_i(t_p) can be matched with a feature point at time t inside the b×b adjacent area N_b(x_i(t_p)). The processes described in Eqs. (8)–(10) are conducted for all the feature points belonging to R'(t_p) and R'(t), and the computational load of feature point matching is reduced to the order of O(M) by setting a narrowed search range with a small value of b.
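The bidirectional matching of Eqs. (7)–(10) can be sketched in software as follows; the window size m and search size b are the paper's parameters, border handling is omitted, and the brute-force candidate scan stands in for the neighbor search a real implementation would index spatially:

```python
def sad(img_t, img_tp, pt_t, pt_tp, m=7):
    """Sum of absolute differences over the m x m window W_m, Eq. (7)."""
    r = m // 2
    x1, y1 = pt_t
    x0, y0 = pt_tp
    w1 = img_t[y1 - r:y1 + r + 1, x1 - r:x1 + r + 1].astype(float)
    w0 = img_tp[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1].astype(float)
    return np.abs(w1 - w0).sum()

def match_bidirectional(img_t, img_tp, pts_t, pts_tp, m=7, b=15):
    """Mutually consistent pairs (x_i(t_p), x_i'(t)), Eqs. (8)-(10)."""
    r = b // 2
    pairs = []
    for i, p0 in enumerate(pts_tp):
        # Candidates at time t inside the b x b area N_b(x_i(t_p))
        fwd = [j for j, p1 in enumerate(pts_t)
               if abs(p1[0] - p0[0]) <= r and abs(p1[1] - p0[1]) <= r]
        if not fwd:
            continue                               # Eq. (10): no correspondence
        j = min(fwd, key=lambda j: sad(img_t, img_tp, pts_t[j], p0, m))  # Eq. (8)
        # Backward search from x_i'(t), Eq. (9)
        bwd = [k for k, q in enumerate(pts_tp)
               if abs(q[0] - pts_t[j][0]) <= r and abs(q[1] - pts_t[j][1]) <= r]
        k = min(bwd, key=lambda k: sad(img_t, img_tp, pts_t[j], pts_tp[k], m))
        if k == i:                                 # mutual selection, Eq. (10)
            pairs.append((p0.astype(float), pts_t[j].astype(float)))
    return pairs
```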
(2-b) Frame interval selection
To avoid accumulated registration errors caused by small frame-to-frame displacements, the frame interval of feature point matching is adjusted for the matching at time t + Δt using the following averaged distance among the corresponding feature points at times t and t_p:

d(t;t_p) = \frac{1}{S(Q(t;t_p))} \sum_{x_i(t_p) \in Q(t;t_p)} \left| \tilde{x}(x_i(t_p); t) - x_i(t_p) \right|,   (11)

where Q(t; t_p) is the set of feature points x_i(t_p) at time t_p that satisfy x̃(x_i(t_p); t) ≠ ∅ (i = 1, …, M), and S(Q(t; t_p)) is the number of elements belonging to Q(t; t_p).

Depending on whether d(t; t_p) is larger than a threshold d_0, n = n(t) is determined for feature point matching at time t + Δt; the frame interval of feature point matching is initially set to Δt.

(2-b-i) d(t; t_p) ≤ d_0
The image displacement between times t and t_p = t_p(t) is too small for feature point matching. To increase the frame-to-frame displacement at the next time t + Δt, the parameter n(t + Δt) is incremented by one as follows:

n(t + \Delta t) = n(t) + 1.   (12)

This indicates that the feature points at time t + Δt are matched with those at time t_p(t + Δt) = t − n(t)Δt. Without conducting any process in steps (3) and (4) below, the algorithm returns to step (1) for the next time t + Δt.

(2-b-ii) d(t; t_p) > d_0
To reset the frame-to-frame displacement at time t + Δt, n(t + Δt) is set to one as follows:

n(t + \Delta t) = 1.   (13)

This indicates that the feature points at time t + Δt are matched with those at time t_p(t + Δt) = t. Thereafter, the algorithm proceeds to steps (3) and (4); the selected pairs of feature points, x_{i'}(t) and x̂(x_{i'}(t); t_p), are used in steps (3) and (4).
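The interval rule of Eqs. (11)–(13) then reduces to a small update function, assuming `pairs` is the matcher output above and `d0` is the displacement threshold:

```python
def update_frame_interval(pairs, n, d0):
    """Return (next n, whether steps (3)-(4) should run), Eqs. (11)-(13)."""
    if not pairs:
        return n + 1, False
    # Averaged displacement d(t; t_p) over the matched set Q(t; t_p), Eq. (11)
    d = np.mean([np.hypot(p1[0] - p0[0], p1[1] - p0[1]) for p0, p1 in pairs])
    if d <= d0:
        return n + 1, False   # Eq. (12): displacement too small, skip (3)-(4)
    return 1, True            # Eq. (13): reset the interval and run (3)-(4)
```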
(3) Transform estimation
Using the selected pairs of feature points, the affine parameters between the two images at times t and t_p are estimated as follows:

\tilde{x}_i^T(t) = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \hat{x}^T(x_{i'}(t); t_p) + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}   (14)

 = A(t;t_p)\,\hat{x}^T(x_{i'}(t); t_p) + b^T(t;t_p),   (15)

where a_j (j = 1, …, 6) are the components of the transform matrix A(t; t_p) and translation vector b(t; t_p) that express the affine transform relationship between the two images at times t and t_p. In this study, they are estimated by minimizing the following Tukey's biweight evaluation functions, E_1 and E_2:

E_l = \sum_{x_i(t_p) \in Q(t;t_p)} w_{li}\, d_{li}^2 \quad (l = 1, 2).   (16)

The deviations d_{1i} and d_{2i} are given as follows:

d_{1i} = \left| \tilde{x}_i - (a_1 \hat{x}_i + a_2 \hat{y}_i + a_5) \right|,   (17)

d_{2i} = \left| \tilde{y}_i - (a_3 \hat{x}_i + a_4 \hat{y}_i + a_6) \right|,   (18)

where x̂(x_{i'}(t); t_p) = (x̂_i, ŷ_i), and x̃_i(t) = (x̃_i, ỹ_i) indicates the temporally estimated location of the i-th feature point at time t in Tukey's biweight method. The following processes are iteratively executed to reduce estimation errors caused by distantly mismatched pairs of feature points. Initially, the weights w_{1i} and w_{2i} are set to one, and x̃_i(t) is set to x_{i'}(t).
(3-a) Affine parameter estimation
On the basis of the least squares method to minimize E_1 and E_2, the affine parameters a_j (j = 1, …, 6) are estimated by calculating the weighted product sums of the xy coordinates of the selected feature points as follows:

\begin{pmatrix} a_1 \\ a_2 \\ a_5 \end{pmatrix} = \begin{pmatrix} \sum_i w_{1i}\hat{x}_i^2 & \sum_i w_{1i}\hat{x}_i\hat{y}_i & \sum_i w_{1i}\hat{x}_i \\ \sum_i w_{1i}\hat{x}_i\hat{y}_i & \sum_i w_{1i}\hat{y}_i^2 & \sum_i w_{1i}\hat{y}_i \\ \sum_i w_{1i}\hat{x}_i & \sum_i w_{1i}\hat{y}_i & \sum_i w_{1i} \end{pmatrix}^{-1} \begin{pmatrix} \sum_i w_{1i}\hat{x}_i\tilde{x}_i \\ \sum_i w_{1i}\hat{y}_i\tilde{x}_i \\ \sum_i w_{1i}\tilde{x}_i \end{pmatrix},   (19)

\begin{pmatrix} a_3 \\ a_4 \\ a_6 \end{pmatrix} = \begin{pmatrix} \sum_i w_{2i}\hat{x}_i^2 & \sum_i w_{2i}\hat{x}_i\hat{y}_i & \sum_i w_{2i}\hat{x}_i \\ \sum_i w_{2i}\hat{x}_i\hat{y}_i & \sum_i w_{2i}\hat{y}_i^2 & \sum_i w_{2i}\hat{y}_i \\ \sum_i w_{2i}\hat{x}_i & \sum_i w_{2i}\hat{y}_i & \sum_i w_{2i} \end{pmatrix}^{-1} \begin{pmatrix} \sum_i w_{2i}\hat{x}_i\tilde{y}_i \\ \sum_i w_{2i}\hat{y}_i\tilde{y}_i \\ \sum_i w_{2i}\tilde{y}_i \end{pmatrix}.   (20)

Using the affine parameters estimated in Eqs. (19) and (20), the temporally estimated locations of the feature points are updated as follows:

\tilde{x}_i^T(t) = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \hat{x}^T(x_{i'}(t); t_p) + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}.   (21)
(3-b) Updating weights
To reduce the errors caused by distantly mismatched pairs of feature points, w_{1i} and w_{2i} are updated for the i-th pair of feature points using the deviations d_{1i} and d_{2i} as follows:

\begin{pmatrix} w_{1i} \\ w_{2i} \end{pmatrix} = \begin{cases} \begin{pmatrix} (1 - d_{1i}^2/W_u^2)^2 \\ (1 - d_{2i}^2/W_u^2)^2 \end{pmatrix} & (d_{1i} \le W_u,\; d_{2i} \le W_u) \\ \begin{pmatrix} 0 \\ 0 \end{pmatrix} & (\text{otherwise}) \end{cases}.   (22)

W_u is a parameter that determines the cut point of the evaluation functions at the u-th iteration.
As the number of iterations u increases, W_u is set to a smaller value. u is incremented by one, and the process returns to step (3-a) while u ≤ U.
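Step (3) is an iteratively reweighted least-squares fit. The sketch below solves the weighted systems of Eqs. (19) and (20) with `numpy.linalg.lstsq` rather than the explicit 3×3 inverses, keeps the matched target points fixed across iterations, and applies the Tukey weights per coordinate, a slight simplification of the joint cutoff in Eq. (22); the decreasing cut points W_u are passed in as a list.

```python
def estimate_affine(pairs, cuts=(8.0, 4.0, 2.0)):
    """Affine A(t; t_p), b(t; t_p) by Tukey's biweight, Eqs. (16)-(22)."""
    src = np.array([p0 for p0, p1 in pairs])         # x_hat, points at time t_p
    dst = np.array([p1 for p0, p1 in pairs])         # matched points at time t
    D = np.column_stack([src, np.ones(len(pairs))])  # rows (x_hat, y_hat, 1)
    w1 = np.ones(len(pairs))
    w2 = np.ones(len(pairs))
    for Wu in cuts:                                  # u = 1, ..., U
        s1, s2 = np.sqrt(w1), np.sqrt(w2)            # weighted least squares
        ax = np.linalg.lstsq(D * s1[:, None], dst[:, 0] * s1, rcond=None)[0]
        ay = np.linalg.lstsq(D * s2[:, None], dst[:, 1] * s2, rcond=None)[0]
        d1 = np.abs(dst[:, 0] - D @ ax)              # deviations, Eqs. (17)-(18)
        d2 = np.abs(dst[:, 1] - D @ ay)
        w1 = np.where(d1 <= Wu, (1 - d1**2 / Wu**2) ** 2, 0.0)  # Eq. (22)
        w2 = np.where(d2 <= Wu, (1 - d2**2 / Wu**2) ** 2, 0.0)
    A = np.array([[ax[0], ax[1]], [ay[0], ay[1]]])   # (a1 a2; a3 a4)
    b = np.array([ax[2], ay[2]])                     # (a5, a6)
    return A, b
```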
After U iterations, the affine parameters between the two input images at times t and t_p are determined, and the affine transform matrix and vector are obtained. A(t; t_p) and b(t; t_p) are accumulated with the affine parameters estimated at time t_p as follows:

A(t) = A(t;t_p)\,A(t_p),   (23)

b^T(t) = A(t_p)\,b^T(t;t_p) + b^T(t_p),   (24)

where A(t) and b(t) indicate the affine parameters of the input image at time t with respect to the input image at time 0; A(0) and b(0) are initially given as the unit matrix and zero vector, respectively.
(4) Image composition
To monitor the panoramic image at intervals of Δt', the panoramic image G(x, t) is updated by attaching the affine-transformed input image at time t to the panoramic image G(x, t − Δt') as follows:

G(x,t) = \begin{cases} I(x', t) & (\text{if } I(x', t) \ne \emptyset) \\ G(x, t - \Delta t') & (\text{otherwise}) \end{cases},   (25)

where the panoramic image G(x, t) is initially given as an empty image at time 0, and x' indicates the affine-transformed coordinate vector at time t as follows:

x'^T = A(t)\,x^T + b^T(t).   (26)
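The accumulation and composition of Eqs. (23)–(26) map naturally onto OpenCV's backward warping: with `WARP_INVERSE_MAP`, `cv2.warpAffine` samples the source frame at x' = A(t)x + b(t) for each panorama pixel x, which is exactly the mapping of Eq. (26). A sketch, with Eq. (24) coded as printed above:

```python
import cv2  # OpenCV, used here only for the affine warp

def accumulate(A_rel, b_rel, A_tp, b_tp):
    """Accumulate frame-to-frame parameters into A(t), b(t), Eqs. (23)-(24)."""
    A_t = A_rel @ A_tp                       # Eq. (23)
    b_t = A_tp @ b_rel + b_tp                # Eq. (24), as printed in the text
    return A_t, b_t

def compose(panorama, frame, A_t, b_t):
    """Attach the frame at time t to the panorama G(x, t), Eqs. (25)-(26)."""
    M = np.hstack([A_t, b_t.reshape(2, 1)]).astype(np.float32)
    h, w = panorama.shape[:2]
    flags = cv2.INTER_NEAREST | cv2.WARP_INVERSE_MAP
    warped = cv2.warpAffine(frame, M, (w, h), flags=flags)
    valid = cv2.warpAffine(np.ones(frame.shape[:2], np.uint8), M, (w, h),
                           flags=flags)
    panorama[valid > 0] = warped[valid > 0]  # Eq. (25): overwrite where defined
    return panorama
```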
Our video mosaicing algorithm can be efficiently executed as a multi-rate video processing method, as sketched below. The interval of steps (1) and (2) can be set to the shortest frame interval Δt for accurate feature point tracking in HFR vision, while that of step (3) depends on the frame-to-frame image displacement: it equals the shortest frame interval when there is significant camera motion and becomes larger as the camera motion decreases. On the other hand, the interval of step (4), Δt', can be independently set to dozens of milliseconds to enable the human eye to monitor the panoramic image.
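Putting the pieces together, a minimal single-threaded version of this multi-rate loop might look as follows. `grab` and `show` are hypothetical camera and display callbacks, the frames are assumed grayscale, the helpers are the sketches above, and keeping the reference frame fixed until d(t; t_p) exceeds d_0 plays the role of the interval multiplier n:

```python
import itertools

def mosaic_loop(grab, show, T, d0, pano_shape=(1200, 3200), display_every=20):
    A, b = np.eye(2), np.zeros(2)      # A(0), b(0): unit matrix, zero vector
    ref = grab()                       # reference frame at time t_p
    ref_pts = select_sparse_features(harris_feature(ref), T)
    panorama = np.zeros(pano_shape, dtype=ref.dtype)  # empty G(x, 0)
    n = 1
    for k in itertools.count(1):       # one iteration per camera frame
        img = grab()
        pts = select_sparse_features(harris_feature(img), T)   # step (1)
        pairs = match_bidirectional(img, ref, pts, ref_pts)    # step (2-a)
        n, run = update_frame_interval(pairs, n, d0)           # step (2-b)
        if run:                        # step (3) only on large displacement
            A_rel, b_rel = estimate_affine(pairs)
            A, b = accumulate(A_rel, b_rel, A, b)
            ref, ref_pts, n = img, pts, 1
        if k % display_every == 0:     # step (4) at the slower display rate
            show(compose(panorama, img, A, b))
```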
III. SYSTEM IMPLEMENTATION
A. High-speed vision platform: IDP Express
To realize real-time video mosaicing at an HFR, our
video-mosaicing method was implemented on the high-speed
vision platform IDP Express [10]. The implemented system
consisted of a camera head, a dedicated FPGA board (IDP
Express board), and a PC. The camera head was highly
compact and could easily be held by a human hand. Its dimensions and weight were 23×23×77 mm and 145 g, respectively. On the camera head, eight-bit RGB images of 512×512 pixels could be captured at 2000 fps with a Bayer filter on its CMOS image sensor. The IDP Express
board is designed for high-speed processing and recording
of 512×512 images transferred from the camera head at 2000 fps. On the IDP Express board, hardware logic can be
implemented on a user-specified FPGA (Xilinx XC3S5000-
4FG900) for acceleration of image processing algorithms.
The 512×512 input images and the processed results on the IDP Express board can be memory-mapped in real time at 2000 fps via 16-lane PCI-e buses onto the allocated memories in the PC, and arbitrary processing for the execution of software on the PC can be programmed. The PC used an ASUSTek P5E mainboard, a Core 2 Quad Q9300 2.50 GHz CPU, 4 GB of memory, and a Windows XP Professional 32-bit OS.

[Figure 1 here: data flow from the color input image F(x, t) through a 4-parallel color-to-gray converter submodule, a Harris corner detector submodule (gradient matrix C; trace and determinant; multiplication and subtraction yielding the feature λ(x, t)), a thresholding submodule producing the feature point map B(x, t), and a feature point counter giving the number of feature points P(x, t).]
Fig. 1. Schematic data flow of the feature point extraction circuit module.
B. Implemented hardware logic
To accelerate steps (1-a) and (1-b) in our video mosaicing method, we designed a feature point extraction circuit module in the user-specified FPGA on the IDP Express; steps (1-a) and (1-b) have a computational complexity of O(N²). The other processes (steps (1-c), (2), (3), and (4)) were implemented in software on the PC because, with their computational complexity of O(M), they can be accelerated by reducing the number of feature points.
Figure 1 shows a schematic data flow of the feature point extraction circuit module implemented in the FPGA on the IDP Express board. It consists of a color-to-gray converter submodule, a Harris corner detector submodule, and a feature point counter submodule. Raw eight-bit 512×512 input color images with a Bayer filter, F(x, t), are scanned in units of four pixels from the upper left to the lower right using X and Y signals at 151.2 MHz. The color-to-gray converter submodule converts the RGB images into eight-bit gray-level images I(x, t), in parallel for the four pixels, after RGB conversion of the Bayer-filtered input images.

In the Harris corner detector submodule, I'_x and I'_y are calculated using 3×3 Prewitt operators, and ΣI'²_x, ΣI'_xI'_y, and ΣI'²_y in the adjacent 3×3 pixels (a = 3) are calculated as 20-bit data using 160 adders and 12 multipliers, in parallel in units of four pixels at 151.2 MHz. The feature λ(x, t) is calculated as 35-bit data by subtracting a three-bit-shifted value of (Tr C(x, t))² from det C(x, t) in units of four pixels when κ = 0.0625; 16 multipliers, 24 adders, and four three-bit shifters are implemented for calculating λ(x, t).

The feature point counter submodule obtains the positions of the feature points as a 512×512 binary map B(x, t) by thresholding λ(x, t) with a threshold T, in parallel in units of four pixels at 151.2 MHz; B(x, t) indicates whether a feature point is located at pixel x at time t. The number of feature points in the 5×5 adjacent area (p = 5) is counted for all the feature points as a 512×512 map P(x, t) using 96 adders in parallel in units of four pixels; P(x, t) is used for checking closely crowded feature points in step (1-c).
[Figure 5 here: ten panels at t = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5 s.]
Fig. 5. Synthesized panoramic images.
The affine transform matrix and translation vector were set as the unit matrix and zero vector, respectively, at t = 0. It can be seen that most of the feature points in the input images were correctly extracted. Figure 4 shows the frame interval of feature point matching, and the x and y coordinates of the translation vector b(t), for t = 0.0–5.0 s. Corresponding to the camera's motion, it can be seen that a short frame interval was set for the high-speed scene, and a long frame interval for the low-speed scene.
Figure 5 shows the synthesized panoramic images of 3200×1200 pixels, taken at intervals of 0.5 s. With time, the panoramic image was extended by accurately stitching affine-transformed input images over the panoramic image of the previous frame, and large buildings and background forests on the far side were observed in a single panoramic image at t = 4.5 s. These figures indicate that our system can accurately generate a single synthesized image in real time by stitching 512×512 input images at 500 fps when the camera head is quickly moved by a human hand.
V. CONCLUSION
We developed a high-speed video mosaicing system that can stitch 512×512 images at 500 fps into a single synthesized panoramic image in real time. On the basis of the
experimental results, we plan to improve our video mosaicing
system for more robust and long-time video mosaicing,
and to extend the application of this system to video-based
SLAM and other surveillance technologies in the real world.
REFERENCES
[1] B. Zitová and J. Flusser, "Image registration methods: A survey," Image Vis. Comput., 21, 977–1000, 2003.
[2] J. Shi and C. Tomasi, "Good features to track," Proc. IEEE Conf. Comput. Vis. Patt. Recog., 593–600, 1994.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., 60, 91–110, 2004.
[4] M. Kourogi, T. Kurata, J. Hoshino, and Y. Muraoka, "Real-time image mosaicing from a video sequence," Proc. Int. Conf. Image Proc., 133–137, 1999.
[5] J. Civera, A. J. Davison, J. A. Magallón, and J. M. M. Montiel, "Drift-free real-time sequential mosaicing," Int. J. Comput. Vis., 81, 128–137, 2009.
[6] R. H. C. de Souza, M. Okutomi, and A. Torii, "Real-time image mosaicing using non-rigid registration," Proc. 5th Pacific Rim Conf. on Advances in Image and Video Technology, 311–322, 2011.
[7] Y. Watanabe, T. Komuro, and M. Ishikawa, "955-fps real-time shape measurement of a moving/deforming object using high-speed vision for numerous-point analysis," Proc. IEEE Int. Conf. Robot. Autom., 3192–3197, 2007.
[8] S. Hirai, M. Zakoji, A. Masubuchi, and T. Tsuboi, "Realtime FPGA-based vision system," J. Robot. Mechat., 17, 401–409, 2005.
[9] I. Ishii, R. Sukenobe, T. Taniguchi, and K. Yamamoto, "Development of high-speed and real-time vision platform, H3 Vision," Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys., 3671–3678, 2009.
[10] I. Ishii, T. Tatebe, Q. Gu, Y. Moriue, T. Takaki, and K. Tajima, "2000 fps real-time vision system with high-frame-rate video recording," Proc. IEEE Int. Conf. Robot. Autom., 1536–1541, 2010.
[11] C. Harris and M. Stephens, "A combined corner and edge detector," Proc. 4th Alvey Vis. Conf., 147–151, 1988.