Real-time Feature-based Video Mosaicing at 500 fps
Ken-ichi Okumura, Sushil Raut, Qingyi Gu, Tadayoshi Aoyama, Takeshi Takaki, and Idaku Ishii
K. Okumura, S. Raut, Q. Gu, T. Aoyama, T. Takaki, and I. Ishii are with Hiroshima University, Hiroshima 739-8527, Japan (corresponding author: I. Ishii; Tel: +81-82-424-7692; e-mail: [email protected]).
Abstract: We conducted high-frame-rate (HFR) video mosaicing for real-time synthesis of a panoramic image by implementing an improved feature-based video mosaicing algorithm on a field-programmable gate array (FPGA)-based high-speed vision platform. In the implementation of the mosaicing algorithm, feature point extraction was accelerated by implementing a parallel processing circuit module for Harris corner detection in the FPGA on the high-speed vision platform. Feature point correspondence matching can be executed for hundreds of selected feature points in the current frame by searching for those in the previous frame within their neighbor ranges, assuming that frame-to-frame image displacement becomes considerably smaller in HFR vision. The system we developed can mosaic 512×512 images at 500 fps into a single synthesized image in real time by stitching the images based on their estimated frame-to-frame changes in displacement and orientation. The results of an experiment, in which an outdoor scene was captured using a hand-held camera head that was quickly moved by hand, verify the performance of our system.
I. INTRODUCTION
Video mosaicing is a video processing technique in which
multiple images are merged into a composite image that
covers a larger, more seamless view than the field of view
of the camera. Video mosaicing has been used to generate
panoramic pictures in many applications such as tracking,
surveillance, and augmented reality. Most video mosaicing
methods [1] determine the transform parameters between
images by calculating feature points using feature extractors
such as the KLT tracker [2] and SIFT features [3] to match
feature point correspondences between images. Many real-
time video mosaicing approaches for fast image registration
have been proposed. Kourogi et al. proposed a fast image reg-
istration method that uses pseudo motion vectors estimated
as gradient-based optical flow [4]. Civera et al. developed
a drift-free video mosaicing system that obtains a consistent
image from 320240 images at 30 fps using an EKF-SLAMapproach when previously viewed scenes are revisited [5].
de Souza et al. performed real-time video mosaicing without
over-deformation of mosaics using a non-rigid deformation
model [6]. Most of these proposals only apply to standard videos at relatively low frame rates (e.g., NTSC 30 fps).
Further, the camera has to be operated slowly, without any
large displacement between frames, in order to prevent diffi-
culties in matching feature point correspondences. However,
in scenarios involving rapid camera motion, video images
at such low frame rates are insufficient for tracking feature
points in the images. Thus, there is a demand for real-time
high-frame-rate (HFR) video mosaicing at rates of several
hundred frames per second and greater.
Many field-programmable gate array (FPGA)-based HFR
vision systems have been developed for real-time video
processing at hundreds of frames per second and greater
[7]–[9]. Although HFR vision systems can simultaneously calculate scalar image features, they cannot transfer the entire image at high speed to a personal computer (PC) owing to limitations in the transfer speed of their inner buses. We recently
developed IDP Express [10], an HFR vision platform that
can simultaneously process an HFR video and directly map
it onto memory allocated in a PC. Implementing a real-
time function to extract feature points in images and match their correspondences between frames on such an FPGA-
based high-speed vision platform would make it possible to
accelerate video mosaicing at a higher frame rate than 30
fps, even when the camera moves quickly.
In this study, we developed a real-time video mosaicing
system that operates on 512×512 images at 500 fps. The system can simultaneously extract feature points in images, match their correspondences between frames for estimation of frame-to-frame displacement, and stitch the captured images into a panoramic image to give a seamless, wider field of view.
II. ALGORITHM
Most video mosaicing algorithms are realized by executing the following processes [1]: (1) feature point detection, (2)
feature point matching, (3) transform estimation, and (4)
image composition. In many applications, image composition
at dozens of frames per second is sufficient for the human
eye to monitor a panoramic image. However, feature tracking
at dozens of frames per second is not always accurate when
the camera is moving rapidly, because most feature points
are mistracked when image displacements between frames
become large. In HFR vision, frame-to-frame displacement
in feature points becomes considerably smaller. Narrowing
the search range by assuming HFRs reduces the inaccuracy
and computational load needed to match feature points
between frames, even when hundreds of feature points are observed from a rapidly moving camera, whereas image
composition at an HFR requires a heavy computational load.
We introduce a real-time video mosaicing algorithm that
can resolve this trade-off between tracking accuracy and
computational load in video mosaicing for rapid camera
motion. In our algorithm, steps (1)–(3) are executed at an
HFR on an HFR vision platform, and step (4) is executed
at dozens of frames per second to enable the human eye
to monitor the panoramic image. In this study, we focus on
acceleration of steps (1)–(3) for HFR feature point tracking
at hundreds of frames per second or more on the basis of
the following concepts:
(a) Feature point detection accelerated by hardware logic
Generally, feature point detection requires pixel-level computation of the order O(N²), where N² is the image size. Gradient-based feature point detection is suitable for acceleration by hardware logic because the local calculation of brightness gradients can be easily parallelized as pixel-level computation. In this study, we accelerate gradient-based feature point detection by implementing its parallel circuit logic on an FPGA-based HFR vision platform.
(b) Reduction in the number of feature points
Thousands of feature points are not always required for
estimating transform parameters between images in video
mosaicing. The computational load can be reduced by se-
lecting a smaller number of feature points because feature-
level computation of the order O(M²) is generally required in feature point matching; M is the number of selected
feature points. In this study, the number of feature points used
in feature point matching is reduced by excluding closely
crowded feature points, as they often generate correspondence errors between images in feature point matching.
(c) Narrowed search range by assuming HFR video
Feature point matching still requires heavy computation of the order O(M²) if all the feature points must be matched with each other between frames, even when a reduced number of feature points is used. In an HFR video, it can be assumed that frame-to-frame image displacement becomes considerably smaller, which allows a smaller search range to be used for points corresponding to points in the preceding frame. This narrowed search range reduces the computational load of matching corresponding feature points between frames to the order of O(M).
Our algorithm for real-time HFR video mosaicing consists
of the following processes:
(1) Feature point detection
(1-a) Feature extraction
To extract feature points such as upper-left vertices, we use the following brightness gradient matrix C(x, t):

C(x,t) = \sum_{x' \in N_a(x)} \begin{pmatrix} I'^2_x(x',t) & I'_x(x',t)\,I'_y(x',t) \\ I'_x(x',t)\,I'_y(x',t) & I'^2_y(x',t) \end{pmatrix},   (1)

where N_a(x) is the a×a adjacent area of pixel x = (x, y), and I'_x(x, t) and I'_y(x, t) indicate the following positive values of I_x(x, t) and I_y(x, t), respectively:

I'_\xi(x,t) = \begin{cases} I_\xi(x,t) & (I_\xi(x,t) > 0) \\ 0 & (\text{otherwise}) \end{cases} \quad (\xi = x, y),   (2)

where I_x(x, t) and I_y(x, t) are the x and y differentials of the input image I(x, t) at pixel x at time t.
λ(x, t) is defined as a feature for feature tracking using Harris corner detection [11] as follows:

\lambda(x,t) = \det C(x,t) - \kappa\,(\operatorname{Tr} C(x,t))^2.   (3)

Here, κ is a tunable sensitivity parameter; values in the range 0.04–0.15 have been reported as feasible.
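For illustration, the feature map of Eqs. (1)–(3) can be prototyped in a few lines of NumPy/SciPy; this is a minimal software sketch, not the authors' FPGA implementation. The 3×3 Prewitt differentials and κ = 0.0625 follow the hardware description in Section III, while the function and variable names are ours.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def harris_feature(img, a=3, kappa=0.0625):
    """Feature map lambda(x, t) of Eq. (3) for one grayscale frame."""
    kx = np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float)
    Ix = convolve(img.astype(float), kx)      # 3x3 Prewitt x-differential
    Iy = convolve(img.astype(float), kx.T)    # 3x3 Prewitt y-differential
    Ixp = np.maximum(Ix, 0.0)                 # positive gradients, Eq. (2)
    Iyp = np.maximum(Iy, 0.0)
    # Sums of gradient products over the a x a area N_a(x), Eq. (1)
    Sxx = uniform_filter(Ixp * Ixp, size=a) * a * a
    Syy = uniform_filter(Iyp * Iyp, size=a) * a * a
    Sxy = uniform_filter(Ixp * Iyp, size=a) * a * a
    # lambda = det C - kappa * (Tr C)^2, Eq. (3)
    return Sxx * Syy - Sxy * Sxy - kappa * (Sxx + Syy) ** 2
```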
(1-b) Feature point detection
Thresholding is conducted for λ(x, t) with a threshold T to obtain a map of feature points, R(x, t), as follows:

R(x,t) = \begin{cases} 1 & (\lambda(x,t) > T) \\ 0 & (\text{otherwise}) \end{cases}.   (4)

The number of feature points where R(x, t) = 1 is counted in the p×p adjacent area of x as follows:

P(x,t) = \sum_{x' \in N_p(x)} R(x',t),   (5)

where P(x, t) indicates the density of the feature points.
(1-c) Selection of feature points
To reduce the number of feature points, closely crowded
feature points are excluded by counting the number of feature
points in the neighborhood. The reduced set of feature points R'(t) is calculated at time t as follows:

R'(t) = \{\, x \mid P(x,t) \le P_0 \,\},   (6)

where P_0 is a threshold used to sparsely select feature points. We assume that the number of selected feature points is less than M.
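A matching sketch of steps (1-b) and (1-c), reusing the imports above; T, p, and P_0 are the thresholds of Eqs. (4)–(6), and the comparison P ≤ P_0 implements the sparse selection of Eq. (6):

```python
def select_sparse_features(lam, T, p=5, P0=3):
    """Sparse feature point list from the feature map, Eqs. (4)-(6)."""
    R = lam > T                                    # feature point map, Eq. (4)
    # Feature point density P(x, t) in the p x p area N_p(x), Eq. (5)
    P = uniform_filter(R.astype(float), size=p) * p * p
    keep = R & (P <= P0 + 0.5)                     # Eq. (6); +0.5 guards rounding
    ys, xs = np.nonzero(keep)
    return np.stack([xs, ys], axis=1)              # (x, y) rows, at most ~M points
```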
(2) Feature point matching
(2-a) Template matching
To enable correspondence between feature points at the current time t and those at the previous time t_p = t − nΔt, template matching is conducted for all the selected feature points in an image; Δt is the shortest frame interval of the vision system, and n is an integer. For template matching that relates the i-th feature point at time t_p belonging to R'(t_p), x_i(t_p) (1 ≤ i ≤ M), to the i'-th feature point at time t belonging to R'(t), x_{i'}(t) (1 ≤ i' ≤ M), the sum of absolute differences (SAD) is calculated as follows:

E(i,i';t,t_p) = \sum_{\delta=(\delta_x,\delta_y) \in W_m} \left| I(x_{i'}(t)+\delta,\, t) - I(x_i(t_p)+\delta,\, t_p) \right|,   (7)

where W_m is an m×m window for template matching.

To reduce the number of mismatched points, x̂(x_i(t_p); t), which indicates the feature point at time t corresponding to the i-th feature point x_i(t_p) at time t_p, and x̂(x_{i'}(t); t_p), which indicates the feature point at time t_p corresponding to the i'-th feature point x_{i'}(t) at time t, are bidirectionally searched by selecting the feature points for which E(i, i'; t, t_p) is at a minimum in their adjacent areas as follows:

\hat{x}(x_i(t_p); t) = x_{i'(i)}(t) = \arg\min_{x_{i'}(t) \in N_b(x_i(t_p))} E(i,i';t,t_p),   (8)

\hat{x}(x_{i'}(t); t_p) = x_{i(i')}(t_p) = \arg\min_{x_i(t_p) \in N_b(x_{i'}(t))} E(i,i';t,t_p),   (9)

where i'(i) and i(i') are the index numbers of the feature point at time t corresponding to x_i(t_p) and of the feature point at time t_p corresponding to x_{i'}(t), respectively. A pair of feature points between times t and t_p is selected when the corresponding feature points mutually select each other as follows:

\tilde{x}(x_i(t_p); t) = \begin{cases} \hat{x}(x_i(t_p); t) & (i = i(i'(i))) \\ \emptyset & (\text{otherwise}) \end{cases},   (10)
where ∅ indicates that no feature point at time t corresponds to the i-th feature point x_i(t_p) at time t_p; feature points at time t that have no corresponding point at time t_p are adjudged to be newly appearing feature points.

We assume that the image displacement between times t and t_p is small, so that the feature point x_i(t_p) can be matched with a feature point at time t inside the b×b adjacent area N_b(x_i(t_p)). The processes described in Eqs. (8)–(10) are conducted for all the feature points belonging to R'(t_p) and R'(t), and the computational load of feature point matching is reduced to the order of O(M) by setting a narrowed search range with a small value of b.
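The bidirectional matching of Eqs. (7)–(10) can be sketched in software as follows; the window size m and search size b are the paper's parameters, border handling is omitted, and the brute-force candidate scan stands in for the neighbor search a real implementation would index spatially:

```python
def sad(img_t, img_tp, pt_t, pt_tp, m=7):
    """Sum of absolute differences over the m x m window W_m, Eq. (7)."""
    r = m // 2
    x1, y1 = pt_t
    x0, y0 = pt_tp
    w1 = img_t[y1 - r:y1 + r + 1, x1 - r:x1 + r + 1].astype(float)
    w0 = img_tp[y0 - r:y0 + r + 1, x0 - r:x0 + r + 1].astype(float)
    return np.abs(w1 - w0).sum()

def match_bidirectional(img_t, img_tp, pts_t, pts_tp, m=7, b=15):
    """Mutually consistent pairs (x_i(t_p), x_i'(t)), Eqs. (8)-(10)."""
    r = b // 2
    pairs = []
    for i, p0 in enumerate(pts_tp):
        # Candidates at time t inside the b x b area N_b(x_i(t_p))
        fwd = [j for j, p1 in enumerate(pts_t)
               if abs(p1[0] - p0[0]) <= r and abs(p1[1] - p0[1]) <= r]
        if not fwd:
            continue                               # Eq. (10): no correspondence
        j = min(fwd, key=lambda j: sad(img_t, img_tp, pts_t[j], p0, m))  # Eq. (8)
        # Backward search from x_i'(t), Eq. (9)
        bwd = [k for k, q in enumerate(pts_tp)
               if abs(q[0] - pts_t[j][0]) <= r and abs(q[1] - pts_t[j][1]) <= r]
        k = min(bwd, key=lambda k: sad(img_t, img_tp, pts_t[j], pts_tp[k], m))
        if k == i:                                 # mutual selection, Eq. (10)
            pairs.append((p0.astype(float), pts_t[j].astype(float)))
    return pairs
```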
(2-b) Frame interval selection
To avoid accumulated registration errors caused by small frame-to-frame displacements, the frame interval of feature point matching is adjusted for the matching at time t + Δt using the following averaged distance among the corresponding feature points at times t and t_p:

d(t;t_p) = \frac{1}{S(Q(t;t_p))} \sum_{x_i(t_p) \in Q(t;t_p)} \left| \tilde{x}(x_i(t_p); t) - x_i(t_p) \right|,   (11)

where Q(t; t_p) is the set of feature points x_i(t_p) at time t_p that satisfy x̃(x_i(t_p); t) ≠ ∅ (i = 1, …, M), and S(Q(t; t_p)) is the number of elements belonging to Q(t; t_p).

Depending on whether d(t; t_p) is larger than a threshold d_0, n = n(t) is determined for feature point matching at time t + Δt; the frame interval of feature point matching is initially set to Δt.

(2-b-i) d(t; t_p) ≤ d_0
The image displacement between times t and t_p = t_p(t) is too small for feature point matching. To increase the frame-to-frame displacement at the next time t + Δt, the parameter n(t + Δt) is incremented by one as follows:

n(t + \Delta t) = n(t) + 1.   (12)

This indicates that the feature points at time t + Δt are matched with those at time t_p(t + Δt) = t − n(t)Δt. Without conducting any process in steps (3) and (4) below, the algorithm returns to step (1) for the next time t + Δt.

(2-b-ii) d(t; t_p) > d_0
To reset the frame-to-frame displacement at time t + Δt, n(t + Δt) is set to one as follows:

n(t + \Delta t) = 1.   (13)

This indicates that the feature points at time t + Δt are matched with those at time t_p(t + Δt) = t. Thereafter, the algorithm proceeds to steps (3) and (4); the selected pairs of feature points, x_{i'}(t) and x̂(x_{i'}(t); t_p), are used in steps (3) and (4).
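The interval rule of Eqs. (11)–(13) then reduces to a small update function, assuming `pairs` is the matcher output above and `d0` is the displacement threshold:

```python
def update_frame_interval(pairs, n, d0):
    """Return (next n, whether steps (3)-(4) should run), Eqs. (11)-(13)."""
    if not pairs:
        return n + 1, False
    # Averaged displacement d(t; t_p) over the matched set Q(t; t_p), Eq. (11)
    d = np.mean([np.hypot(p1[0] - p0[0], p1[1] - p0[1]) for p0, p1 in pairs])
    if d <= d0:
        return n + 1, False   # Eq. (12): displacement too small, skip (3)-(4)
    return 1, True            # Eq. (13): reset the interval and run (3)-(4)
```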
(3) Transform estimation
Using the selected pairs of feature points, the affine parameters between the two images at times t and t_p are estimated as follows:

\tilde{x}_i^T(t) = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \hat{x}^T(x_{i'}(t); t_p) + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}   (14)

 = A(t;t_p)\,\hat{x}^T(x_{i'}(t); t_p) + b^T(t;t_p),   (15)

where a_j (j = 1, …, 6) are the components of the transform matrix A(t; t_p) and translation vector b(t; t_p) that express the affine transform relationship between the two images at times t and t_p. In this study, they are estimated by minimizing the following Tukey's biweight evaluation functions, E_1 and E_2:

E_l = \sum_{x_i(t_p) \in Q(t;t_p)} w_{li}\, d_{li}^2 \quad (l = 1, 2).   (16)

The deviations d_{1i} and d_{2i} are given as follows:

d_{1i} = \left| \tilde{x}_i - (a_1 \hat{x}_i + a_2 \hat{y}_i + a_5) \right|,   (17)

d_{2i} = \left| \tilde{y}_i - (a_3 \hat{x}_i + a_4 \hat{y}_i + a_6) \right|,   (18)

where x̂(x_{i'}(t); t_p) = (x̂_i, ŷ_i), and x̃_i(t) = (x̃_i, ỹ_i) indicates the temporally estimated location of the i-th feature point at time t in Tukey's biweight method. The following processes are iteratively executed to reduce estimation errors caused by distantly mismatched pairs of feature points. Initially, the weights w_{1i} and w_{2i} are set to one, and x̃_i(t) is set to x_{i'}(t).
(3-a) Affine parameter estimation
On the basis of the least squares method to minimize E_1 and E_2, the affine parameters a_j (j = 1, …, 6) are estimated by calculating the weighted product sums of the xy coordinates of the selected feature points as follows:

\begin{pmatrix} a_1 \\ a_2 \\ a_5 \end{pmatrix} = \begin{pmatrix} \sum_i w_{1i}\hat{x}_i^2 & \sum_i w_{1i}\hat{x}_i\hat{y}_i & \sum_i w_{1i}\hat{x}_i \\ \sum_i w_{1i}\hat{x}_i\hat{y}_i & \sum_i w_{1i}\hat{y}_i^2 & \sum_i w_{1i}\hat{y}_i \\ \sum_i w_{1i}\hat{x}_i & \sum_i w_{1i}\hat{y}_i & \sum_i w_{1i} \end{pmatrix}^{-1} \begin{pmatrix} \sum_i w_{1i}\hat{x}_i\tilde{x}_i \\ \sum_i w_{1i}\hat{y}_i\tilde{x}_i \\ \sum_i w_{1i}\tilde{x}_i \end{pmatrix},   (19)

\begin{pmatrix} a_3 \\ a_4 \\ a_6 \end{pmatrix} = \begin{pmatrix} \sum_i w_{2i}\hat{x}_i^2 & \sum_i w_{2i}\hat{x}_i\hat{y}_i & \sum_i w_{2i}\hat{x}_i \\ \sum_i w_{2i}\hat{x}_i\hat{y}_i & \sum_i w_{2i}\hat{y}_i^2 & \sum_i w_{2i}\hat{y}_i \\ \sum_i w_{2i}\hat{x}_i & \sum_i w_{2i}\hat{y}_i & \sum_i w_{2i} \end{pmatrix}^{-1} \begin{pmatrix} \sum_i w_{2i}\hat{x}_i\tilde{y}_i \\ \sum_i w_{2i}\hat{y}_i\tilde{y}_i \\ \sum_i w_{2i}\tilde{y}_i \end{pmatrix}.   (20)

Using the affine parameters estimated in Eqs. (19) and (20), the temporally estimated locations of the feature points are updated as follows:

\tilde{x}_i^T(t) = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix} \hat{x}^T(x_{i'}(t); t_p) + \begin{pmatrix} a_5 \\ a_6 \end{pmatrix}.   (21)
(3-b) Updating weights
To reduce the errors caused by distantly mismatched pairs of feature points, w_{1i} and w_{2i} are updated for the i-th pair of feature points using the deviations d_{1i} and d_{2i} as follows:

\begin{pmatrix} w_{1i} \\ w_{2i} \end{pmatrix} = \begin{cases} \begin{pmatrix} (1 - d_{1i}^2/W_u^2)^2 \\ (1 - d_{2i}^2/W_u^2)^2 \end{pmatrix} & (d_{1i} \le W_u,\; d_{2i} \le W_u) \\ \begin{pmatrix} 0 \\ 0 \end{pmatrix} & (\text{otherwise}) \end{cases}.   (22)

W_u is a parameter that determines the cut point of the evaluation functions at the u-th iteration.
As the number of iterations u increases, W_u is set to a smaller value. u is incremented by one, and the process returns to step (3-a) while u ≤ U.
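Step (3) is an iteratively reweighted least-squares fit. The sketch below solves the weighted systems of Eqs. (19) and (20) with `numpy.linalg.lstsq` rather than the explicit 3×3 inverses, keeps the matched target points fixed across iterations, and applies the Tukey weights per coordinate, a slight simplification of the joint cutoff in Eq. (22); the decreasing cut points W_u are passed in as a list.

```python
def estimate_affine(pairs, cuts=(8.0, 4.0, 2.0)):
    """Affine A(t; t_p), b(t; t_p) by Tukey's biweight, Eqs. (16)-(22)."""
    src = np.array([p0 for p0, p1 in pairs])         # x_hat, points at time t_p
    dst = np.array([p1 for p0, p1 in pairs])         # matched points at time t
    D = np.column_stack([src, np.ones(len(pairs))])  # rows (x_hat, y_hat, 1)
    w1 = np.ones(len(pairs))
    w2 = np.ones(len(pairs))
    for Wu in cuts:                                  # u = 1, ..., U
        s1, s2 = np.sqrt(w1), np.sqrt(w2)            # weighted least squares
        ax = np.linalg.lstsq(D * s1[:, None], dst[:, 0] * s1, rcond=None)[0]
        ay = np.linalg.lstsq(D * s2[:, None], dst[:, 1] * s2, rcond=None)[0]
        d1 = np.abs(dst[:, 0] - D @ ax)              # deviations, Eqs. (17)-(18)
        d2 = np.abs(dst[:, 1] - D @ ay)
        w1 = np.where(d1 <= Wu, (1 - d1**2 / Wu**2) ** 2, 0.0)  # Eq. (22)
        w2 = np.where(d2 <= Wu, (1 - d2**2 / Wu**2) ** 2, 0.0)
    A = np.array([[ax[0], ax[1]], [ay[0], ay[1]]])   # (a1 a2; a3 a4)
    b = np.array([ax[2], ay[2]])                     # (a5, a6)
    return A, b
```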
After U iterations, the affine parameters between the two input images at times t and t_p are determined, and the affine transform matrix and vector are obtained. A(t; t_p) and b(t; t_p) are accumulated with the affine parameters estimated at time t_p as follows:

A(t) = A(t;t_p)\,A(t_p),   (23)

b^T(t) = A(t_p)\,b^T(t;t_p) + b^T(t_p),   (24)

where A(t) and b(t) indicate the affine parameters of the input image at time t with respect to the input image at time 0; A(0) and b(0) are initially given as the unit matrix and zero vector, respectively.
(4) Image composition
To monitor the panoramic image at intervals of Δt', the panoramic image G(x, t) is updated by attaching the affine-transformed input image at time t to the panoramic image G(x, t − Δt') as follows:

G(x,t) = \begin{cases} I(x', t) & (\text{if } I(x', t) \ne \emptyset) \\ G(x, t - \Delta t') & (\text{otherwise}) \end{cases},   (25)

where the panoramic image G(x, t) is initially given as an empty image at time 0, and x' indicates the affine-transformed coordinate vector at time t as follows:

x'^T = A(t)\,x^T + b^T(t).   (26)
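The accumulation and composition of Eqs. (23)–(26) map naturally onto OpenCV's backward warping: with `WARP_INVERSE_MAP`, `cv2.warpAffine` samples the source frame at x' = A(t)x + b(t) for each panorama pixel x, which is exactly the mapping of Eq. (26). A sketch, with Eq. (24) coded as printed above:

```python
import cv2  # OpenCV, used here only for the affine warp

def accumulate(A_rel, b_rel, A_tp, b_tp):
    """Accumulate frame-to-frame parameters into A(t), b(t), Eqs. (23)-(24)."""
    A_t = A_rel @ A_tp                       # Eq. (23)
    b_t = A_tp @ b_rel + b_tp                # Eq. (24), as printed in the text
    return A_t, b_t

def compose(panorama, frame, A_t, b_t):
    """Attach the frame at time t to the panorama G(x, t), Eqs. (25)-(26)."""
    M = np.hstack([A_t, b_t.reshape(2, 1)]).astype(np.float32)
    h, w = panorama.shape[:2]
    flags = cv2.INTER_NEAREST | cv2.WARP_INVERSE_MAP
    warped = cv2.warpAffine(frame, M, (w, h), flags=flags)
    valid = cv2.warpAffine(np.ones(frame.shape[:2], np.uint8), M, (w, h),
                           flags=flags)
    panorama[valid > 0] = warped[valid > 0]  # Eq. (25): overwrite where defined
    return panorama
```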
Our video mosaicing algorithm can be efficiently executed as a multi-rate video processing method, as sketched below. The interval of steps (1) and (2) can be set to the shortest frame interval Δt for accurate feature point tracking in HFR vision, while that of step (3) depends on the frame-to-frame image displacement: it equals the shortest frame interval when there is significant camera motion and becomes larger as the camera motion decreases. On the other hand, the interval of step (4), Δt', can be independently set to dozens of milliseconds to enable the human eye to monitor the panoramic image.
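Putting the pieces together, a minimal single-threaded version of this multi-rate loop might look as follows. `grab` and `show` are hypothetical camera and display callbacks, the frames are assumed grayscale, the helpers are the sketches above, and keeping the reference frame fixed until d(t; t_p) exceeds d_0 plays the role of the interval multiplier n:

```python
import itertools

def mosaic_loop(grab, show, T, d0, pano_shape=(1200, 3200), display_every=20):
    A, b = np.eye(2), np.zeros(2)      # A(0), b(0): unit matrix, zero vector
    ref = grab()                       # reference frame at time t_p
    ref_pts = select_sparse_features(harris_feature(ref), T)
    panorama = np.zeros(pano_shape, dtype=ref.dtype)  # empty G(x, 0)
    n = 1
    for k in itertools.count(1):       # one iteration per camera frame
        img = grab()
        pts = select_sparse_features(harris_feature(img), T)   # step (1)
        pairs = match_bidirectional(img, ref, pts, ref_pts)    # step (2-a)
        n, run = update_frame_interval(pairs, n, d0)           # step (2-b)
        if run:                        # step (3) only on large displacement
            A_rel, b_rel = estimate_affine(pairs)
            A, b = accumulate(A_rel, b_rel, A, b)
            ref, ref_pts, n = img, pts, 1
        if k % display_every == 0:     # step (4) at the slower display rate
            show(compose(panorama, img, A, b))
```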
III. SYSTEM IMPLEMENTATION
A. High-speed vision platform: IDP Express
To realize real-time video mosaicing at an HFR, our
video-mosaicing method was implemented on the high-speed
vision platform IDP Express [10]. The implemented system
consisted of a camera head, a dedicated FPGA board (IDP
Express board), and a PC. The camera head was highly
compact and could easily be held by a human hand. Its dimensions and weight were 23×23×77 mm and 145 g, respectively. On the camera head, eight-bit RGB images of 512×512 pixels could be captured at 2000 fps with a Bayer filter on its CMOS image sensor. The IDP Express
board is designed for high-speed processing and recording
of 512×512 images transferred from the camera head at 2000 fps. On the IDP Express board, hardware logic can be
implemented on a user-specified FPGA (Xilinx XC3S5000-
4FG900) for acceleration of image processing algorithms.
The 512×512 input images and the processed results on the IDP Express board can be memory-mapped in real time at 2000 fps via 16-lane PCI-e buses onto the allocated memories in the PC, and arbitrary processing for the execution of software on the PC can be programmed. The PC used an ASUSTek P5E mainboard, a Core 2 Quad Q9300 2.50 GHz CPU, 4 GB of memory, and a Windows XP Professional 32-bit OS.

[Figure 1 here: data flow from the color input image F(x, t) through a 4-parallel color-to-gray converter submodule, a Harris corner detector submodule (gradient matrix C; trace and determinant; multiplication and subtraction yielding the feature λ(x, t)), a thresholding submodule producing the feature point map B(x, t), and a feature point counter giving the number of feature points P(x, t).]
Fig. 1. Schematic data flow of the feature point extraction circuit module.
B. Implemented hardware logic
To accelerate steps (1-a) and (1-b) in our video mosaicing method, we designed a feature point extraction circuit module in the user-specified FPGA on the IDP Express; steps (1-a) and (1-b) have a computational complexity of O(N²). The other processes (steps (1-c), (2), (3), and (4)) were implemented in software on the PC because, with their computational complexity of O(M), they can be accelerated by reducing the number of feature points.
Figure 1 shows a schematic data flow of the feature point extraction circuit module implemented in the FPGA on the IDP Express board. It consists of a color-to-gray converter submodule, a Harris corner detector submodule, and a feature point counter submodule. Raw eight-bit 512×512 input color images with a Bayer filter, F(x, t), are scanned in units of four pixels from the upper left to the lower right using X and Y signals at 151.2 MHz. The color-to-gray converter submodule converts the RGB images into eight-bit gray-level images I(x, t), in parallel for the four pixels, after RGB conversion of the Bayer-filtered input images.

In the Harris corner detector submodule, I'_x and I'_y are calculated using 3×3 Prewitt operators, and ΣI'²_x, ΣI'_xI'_y, and ΣI'²_y in the adjacent 3×3 pixels (a = 3) are calculated as 20-bit data using 160 adders and 12 multipliers, in parallel in units of four pixels at 151.2 MHz. The feature λ(x, t) is calculated as 35-bit data by subtracting a three-bit-shifted value of (Tr C(x, t))² from det C(x, t) in units of four pixels when κ = 0.0625; 16 multipliers, 24 adders, and four three-bit shifters are implemented for calculating λ(x, t).

The feature point counter submodule obtains the positions of the feature points as a 512×512 binary map B(x, t) by thresholding λ(x, t) with a threshold T, in parallel in units of four pixels at 151.2 MHz; B(x, t) indicates whether a feature point is located at pixel x at time t. The number of feature points in the 5×5 adjacent area (p = 5) is counted for all the feature points as a 512×512 map P(x, t) using 96 adders in parallel in units of four pixels; P(x, t) is used for checking closely crowded feature points in step (1-c).
[Figure 5 here: ten panels at t = 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, and 4.5 s.]
Fig. 5. Synthesized panoramic images.
The affine transform matrix and translation vector were set as the unit matrix and zero vector, respectively, at t = 0. It can be seen that most of the feature points in the input images were correctly extracted. Figure 4 shows the frame interval of feature point matching, and the x and y coordinates of the translation vector b(t), for t = 0.0–5.0 s. Corresponding to the camera's motion, it can be seen that a short frame interval was set for the high-speed scene, and a long frame interval for the low-speed scene.
Figure 5 shows the synthesized panoramic images of 3200×1200 pixels, taken at intervals of 0.5 s. With time, the panoramic image was extended by accurately stitching affine-transformed input images over the panoramic image of the previous frame, and large buildings and background forests on the far side were observed in a single panoramic image at t = 4.5 s. These figures indicate that our system can accurately generate a single synthesized image in real time by stitching 512×512 input images at 500 fps when the camera head is quickly moved by a human hand.
V. CONCLUSION
We developed a high-speed video mosaicing system that can stitch 512×512 images at 500 fps into a single synthesized panoramic image in real time. On the basis of the
experimental results, we plan to improve our video mosaicing
system for more robust and long-time video mosaicing,
and to extend the application of this system to video-based
SLAM and other surveillance technologies in the real world.
REFERENCES
[1] B. Zitová and J. Flusser, "Image registration methods: A survey," Image Vis. Comput., 21, 977–1000, 2003.
[2] J. Shi and C. Tomasi, "Good features to track," Proc. IEEE Conf. Comput. Vis. Patt. Recog., 593–600, 1994.
[3] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., 60, 91–110, 2004.
[4] M. Kourogi, T. Kurata, J. Hoshino, and Y. Muraoka, "Real-time image mosaicing from a video sequence," Proc. Int. Conf. Image Proc., 133–137, 1999.
[5] J. Civera, A. J. Davison, J. A. Magallón, and J. M. M. Montiel, "Drift-free real-time sequential mosaicing," Int. J. Comput. Vis., 81, 128–137, 2009.
[6] R. H. C. de Souza, M. Okutomi, and A. Torii, "Real-time image mosaicing using non-rigid registration," Proc. 5th Pacific Rim Conf. on Advances in Image and Video Technology, 311–322, 2011.
[7] Y. Watanabe, T. Komuro, and M. Ishikawa, "955-fps real-time shape measurement of a moving/deforming object using high-speed vision for numerous-point analysis," Proc. IEEE Int. Conf. Robot. Autom., 3192–3197, 2007.
[8] S. Hirai, M. Zakoji, A. Masubuchi, and T. Tsuboi, "Realtime FPGA-based vision system," J. Robot. Mechat., 17, 401–409, 2005.
[9] I. Ishii, R. Sukenobe, T. Taniguchi, and K. Yamamoto, "Development of high-speed and real-time vision platform, H3 Vision," Proc. IEEE/RSJ Int. Conf. Intell. Rob. Sys., 3671–3678, 2009.
[10] I. Ishii, T. Tatebe, Q. Gu, Y. Moriue, T. Takaki, and K. Tajima, "2000 fps real-time vision system with high-frame-rate video recording," Proc. IEEE Int. Conf. Robot. Autom., 1536–1541, 2010.
[11] C. Harris and M. Stephens, "A combined corner and edge detector," Proc. 4th Alvey Vis. Conf., 147–151, 1988.