2011 Annual IEEE India Conference (INDICON), Hyderabad, India, 16-18 December 2011
Optimality in Homography in 3D Reconstruction from Canonical Stereo Setup
Avik Chatterjee, CSIR-CMERI, Durgapur, India ([email protected])
Hothur Satheesh, NIT Durgapur, Durgapur, India ([email protected])
Prof. Indrajit Basak, NIT Durgapur, Durgapur, India ([email protected])
Dr. S. Majumdar, CSIR-CMERI, Durgapur, India ([email protected])
Abstract— This paper presents an experimental study and investigation of errors in 3D reconstruction from views in a canonical stereo camera setup, with point correspondences of a grid known a priori. The objective of the investigation is to find the estimate of P by the Direct Linear Transform (DLT) solution, i.e. by minimizing the algebraic error ||AP|| subject to the normalizing constraint ||P|| = 1, and then further minimizing the total cost function (C) involving the reprojection error and the 3D geometric error through the iterative Levenberg-Marquardt technique, to see whether the second level of minimization of C brings any remarkable improvement in the solution. Here A is the measured-value matrix, with measurement error in the image and space point coordinates, and P is the estimate of the camera projection matrix. The problem of reconstruction from a stereo pair is well documented and well researched for both precisely calibrated and uncalibrated cameras, but little literature is available investigating the optimality of the homography obtained by minimizing a total cost function that considers both the geometric error in the image pair and a 3D geometric error term. We focus our investigation on studying the results of optimizing the total cost function including the 3D geometric error term, and on its effects on the homography and on reprojection.
Keywords— DLT, GSA, 3D reconstruction, camera calibration, algebraic error, geometric error.
I. INTRODUCTION
The 3D reconstruction or space reconstruction problem relates to techniques for recovering information about the structure of a 3D space from direct measurements, from depth computation by stereo image matching, or from multiple-view processing. This gives the positions and dimensions of the sensed object surfaces, and this information can, for instance, be used for robot navigation, guided surgery procedures, reconstruction of terrain from mapped data, etc.
Estimation of the spatial coordinates of object points from stereo images of an environment is a well-documented and much-studied problem, and there are many established approaches to reach the goal. Pioneering work in this field has been done by Hartley, Zisserman [1][2][3] and Beardsley [2]. However, current approaches to 3D reconstruction of a scene from stereo images require accurate and precise intrinsic and extrinsic camera parameters, which are not always available. Reconstruction using uncalibrated cameras is also documented in detail by Hartley [3][4] and others [5][6][7]. The availability of precise extrinsic and intrinsic camera parameters results in full 3D Euclidean reconstruction. If only intrinsic parameters are available, then reconstruction can be done up to a certain scaling factor. Non-availability of both intrinsic and extrinsic parameters results in reconstruction up to a certain projective transformation, although much work has been done by Hartley on Euclidean reconstruction from uncalibrated cameras, both in stereo view and multiview [3][4]. Another established and well-documented general framework for 3D reconstruction from a stereo view is to compute the optimal fundamental matrix F,
from the image correspondences x_i <-> x_i', and then to recover the camera projection matrices P and P' from F. Thereafter, linear triangulation is used to reconstruct the space points X_i from the correspondences x_i <-> x_i', subject to minimization of a cost function. This results in a projective reconstruction of X_i. There are many variations on this approach: for example, if we know the camera calibration matrices K and K' beforehand, we can compute the essential matrix E and obtain a metric reconstruction [1].
The objective of this investigation is to study the effect of minimizing the total cost function, consisting of both the geometric error in the image pair and the 3D geometric error, as suggested by Hartley and Zisserman [1], considering measurement errors in both image and object. The errors in measurement have been introduced by not measuring the object coordinates w.r.t. world coordinates very precisely (1~2 mm deviation). The errors in the image correspondences are in the range of 2~4 pixels, as estimated by line-fitting and corner-detection techniques. Little literature is available on checking the optimality condition considering the above-mentioned total cost function along with measurement errors in image and object, although various other cost functions have been proposed and worked out. In this study the pinhole camera model has been chosen.
II. CAMERA CALIBRATION
Given a set of 3D space points X_i in R^3 and a set of corresponding points x_i in the R^2 image plane, the general finite projective camera maps X_i <-> x_i according to x_i = P X_i, where P is the 3x4 homogeneous central finite camera projection matrix. In homogeneous coordinates, P can be written as P = diag(f, f, 1)[I | 0], where diag(f, f, 1) is a diagonal matrix and [I | 0] represents a matrix divided into a 3x3 block (the identity matrix) plus a column vector, here the zero vector. If (p_x, p_y)^T are the coordinates of the principal point, then the above relation can be expressed as x = K[I | 0] X_cam, where K is the camera calibration matrix. Here X_cam emphasizes that the camera is assumed to be located at the origin of a Euclidean coordinate system with the principal axis of the camera pointing straight down the Z-axis, the point being expressed in the Camera Coordinate Frame (CCF). Considering the camera rotation (R) and translation (T) w.r.t. the World Coordinate Frame (WCF), the equation in homogeneous coordinates can be written as x = KR[I | -C] X, where X is in the WCF; this can be written more compactly as x = K[R | T] X, where T = -RC. The parameters
contained in K are called the camera internal or intrinsic parameters, and the parameters contained in [R | T], which fix the camera position and orientation w.r.t. the WCF, are called the external or extrinsic parameters. Hence, for the simple finite pinhole camera model, P has 9 degrees of freedom: 3 for K (f, p_x and p_y), 3 for R and 3 for T. For a CCD camera the camera matrix has to be modified by introducing the parameters m_x and m_y (the number of pixels per unit distance in image coordinates along the x and y directions) for non-square pixels, and a skew parameter s for added generality (refer to Appendix-A.1). A general finite projective camera has 11 degrees of freedom. This is the same number of degrees of freedom as a 3x4 matrix defined up to an arbitrary scale.
The finite camera projection matrix P can also be expressed as P = K[R | -RC] = [M | -MC] = M_i M_e, where M_i is the intrinsic 3x3 non-singular camera matrix and M_e = [R | -RC] is the 3x4 extrinsic camera matrix. Camera calibration, in this context, is the process of determining the internal camera geometric and optical characteristics (intrinsic parameters, M_i) and/or the 3D position and orientation of the camera frame relative to the WCF (extrinsic parameters, M_e) [1][5].
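As a minimal sketch of the model x = K[R | T]X described above (the focal length, principal point and pose values here are illustrative assumptions, not the parameters of the paper's cameras):

```python
import numpy as np

# Illustrative intrinsics (assumed values, not from the experiment)
f, px, py = 800.0, 640.0, 360.0
K = np.array([[f, 0.0, px],
              [0.0, f, py],
              [0.0, 0.0, 1.0]])

# Extrinsics: no rotation, camera centre C = (0, 0, -1000) in world units
R = np.eye(3)
C = np.array([0.0, 0.0, -1000.0])
T = -R @ C                                   # T = -RC
P = K @ np.hstack([R, T.reshape(3, 1)])      # 3x4 projection matrix

# Project a homogeneous world point X (in the WCF) to pixel coordinates
X = np.array([100.0, 50.0, 0.0, 1.0])
x = P @ X
x = x / x[2]                                 # dehomogenize
print(x[:2])                                 # prints [720. 400.]
```

The same P can then be fed to any of the decomposition or reprojection steps discussed later.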
Several methods for geometric camera calibration are presented in the literature. The classic approach solves the problem by minimizing a nonlinear error function. Due to the slowness and computational burden of this technique, closed-form solutions have also been suggested [6].
In this experiment, the pixel skew factor (s) and the effects of lens distortion (barreling, pincushioning, etc.) are neglected for simplicity, as we primarily focus our investigation on the contribution of the 3D geometric error term to the minimization of the cost function through the iterative Levenberg-Marquardt (LM) technique [8][9], to see whether the second level of minimization of the cost function brings any remarkable improvement in the solution.
The LM method is a blend of the Gradient Descent (GD) method and the Gauss-Newton (GN) method. In the GD method the update is governed by x_{k+1} = x_k - a_k grad f(x_k), and the determination of a_k (the line-search parameter) is a one-dimensional minimization problem, min over a_k of f(x_k - a_k grad f(x_k)). GD will always make progress provided the gradient is non-zero, but it has a linear convergence rate and suffers from several convergence issues: it may not take a large step when the gradient is small (gentle slope) nor a small step when the gradient is large (steep slope). Another issue is that the curvature of the error surface may not be the same in all directions, resulting in long narrow valleys, in which this method has difficulty converging. This situation is improved by using curvature as well as gradient information, namely the second derivative grad^2 f, in the GN method, where the function f(x) is assumed to be quadratic, giving rise to the update rule x_{k+1} = x_k - (grad^2 f(x_k))^-1 grad f(x_k). GN has rapid convergence but is sensitive to the starting location. These two methods are complementary in the advantages they provide, which is exploited in the LM method by setting the update rule as x_{k+1} = x_k - (H + λ diag[H])^-1 grad f(x_k), where H is the Hessian matrix grad^2 f(x_k) and λ is a parameter. If the error goes down following an update, it implies that the quadratic assumption on f(x) is working, and λ is reduced (usually by a factor of 10) to minimize the influence of gradient descent. On the other hand, if the error goes up, the update has to follow the gradient more, and so λ is increased by the same factor. Since the Hessian is proportional to the curvature of f(x), the update rule implies a large step in a direction with low curvature and a small step in a direction with high curvature, which is desired.
III. EXPERIMENTAL SETUP
The cameras considered in this study are Logitech high-definition (HD, 1280x720) webcams, model C910, in fixed-focus mode. No information is available about the focal length, or the size or type of the sensing chip. The canonical stereo setup (Figure-1) is made to reduce the position and angular-orientation measurements of the camera frame. The space points X_i on the Tsai grid are intentionally measured within an error of 2-3 mm w.r.t. the WCF. The image points x_i are measured in pixels and obtained through Canny edge detection, straight-line fitting and intersection calculation to get the grid corners w.r.t. the image origin.
There are errors in the measurements of both object and image points, and hence the measured quantities are represented as X̄_i and x̄_i. We require an estimate of the 3x4 camera matrix P̂ such that x̄_i = P̂ X̄_i for all i. Note that this is an equation involving homogeneous vectors; thus the 3-vectors x̄_i and P̂ X̄_i need not be equal: they have the same direction but may differ in magnitude by a non-zero scale factor. The equation may be expressed in terms of the vector cross product as x̄_i x P̂ X̄_i = 0,
Figure-1: Experimental setup for canonical stereo. (a) Included angle between the Tsai grid planes is θ = 120°. (b) Included angle between the same grid planes is θ = 225°.
which on simplification reduces to (1), where x_i = (x_i, y_i, z_i)^T, x_i^j = P^jT X_i = X_i^T p^j, and P̂ X̄_i = (P^1T X_i, P^2T X_i, P^3T X_i)^T. In (1) the first two rows are linearly independent and the third is linearly dependent on them. Hence, from a set of n point correspondences, we obtain a 2n x 12 matrix by stacking up the equations for each point correspondence.

    [  z_i X_i^T     0^T          -x_i X_i^T ]   [ P^1 ]
    [  0^T          -z_i X_i^T     y_i X_i^T ] . [ P^2 ]  =  0 ......................(1)
    [ -y_i X_i^T     x_i X_i^T     0^T       ]   [ P^3 ]
The projection matrix P is computed by solving the set of equations AP = 0, where P here denotes the vector containing the entries of the matrix P. Rewriting and stacking up gives the expression in Appendix-A.2.
For the minimal solution, the matrix P has 12 entries and (ignoring scale) 11 degrees of freedom, so 11 equations are needed to solve for P exactly. Since each point correspondence leads to two equations, a minimum of 5½ such correspondences is required to solve for P. The ½ indicates that only one of the equations is used from the sixth point, so one needs only to know the x-coordinate (or alternatively the y-coordinate) of the sixth image point. In general A will have rank 11, and the solution vector P is the 1-dimensional right null-space of A. If the data are not exact, because of noise in the point coordinates, and n >= 6 point correspondences are given, then there will not be an exact solution to the equation AP = 0. An estimate of the solution for P may be obtained by minimizing an algebraic or geometric error. In the case of the algebraic error (the residual AP), the approach is to minimize ||AP|| subject to some normalization constraint. Possible constraints are (i) ||P|| = 1 and (ii) ||P^3|| = 1, where P^3 is the vector (p_31, p_32, p_33)^T, namely the first three entries in the last row of P. The constraint ||P|| = 1 has been used in this case. This is equivalent to the solution scheme of finding the SVD of A (A = UDV^T) and then, with the positive diagonal entries of D arranged in descending order, taking P as the last column of V.
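A minimal sketch of this SVD step follows; the stacked matrix A is built here from a hypothetical random camera and noiseless correspondences, not from the experiment's grid data:

```python
import numpy as np

def dlt_solve(A):
    """Minimize ||AP|| subject to ||P|| = 1: P is the right singular
    vector of A associated with the smallest singular value."""
    _, _, Vt = np.linalg.svd(A)   # singular values come sorted descending
    p = Vt[-1]                    # last row of V^T = last column of V
    return p.reshape(3, 4)

# Hypothetical check: build A from a known P and exact correspondences,
# then recover P up to scale.
rng = np.random.default_rng(0)
P_true = rng.standard_normal((3, 4))
X = np.hstack([rng.standard_normal((6, 3)), np.ones((6, 1))])  # 6 world points
x = (P_true @ X.T).T
rows = []
for Xi, xi in zip(X, x):
    u, v, w = xi
    rows.append(np.concatenate([w * Xi, np.zeros(4), -u * Xi]))   # row 1 of (1)
    rows.append(np.concatenate([np.zeros(4), -w * Xi, v * Xi]))   # row 2 of (1)
A = np.array(rows)

P_est = dlt_solve(A)
# P_est matches P_true up to an overall scale factor
scale = P_true.ravel() @ P_est.ravel() / (P_est.ravel() @ P_est.ravel())
print(np.allclose(P_true, scale * P_est, atol=1e-8))
```

With noisy correspondences the same call returns the algebraic-error minimizer rather than an exact solution.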
Normalization has been applied before implementing the algorithm, as suggested in [1], through isotropic scaling: the centroid of the points is translated to the origin, and their coordinates are scaled so that the RMS distance from the origin is √3. This approach is suitable for a compact distribution of points. This solution is widely known as the minimal solution or Direct Linear Transform (DLT) solution. The estimate of P has been further refined by reducing the geometric error through an iterative method such as Levenberg-Marquardt. Assuming X_i is accurately known and x_i has Gaussian noise as measurement error, the Maximum Likelihood estimate of P is

    min_P Σ_i d(x_i, x̂_i)^2 = min_P Σ_i d(x_i, P X_i)^2,

where the geometric error is Σ_i d(x_i, x̂_i)^2. The DLT (minimal) solution may be used as the starting point for the iterative minimization. This procedure is also called the Gold Standard Algorithm (GSA) [1]. If we consider Gaussian measurement error in both the image points and the 3D points, then the Maximum Likelihood estimate of P is the solution of (2). This estimate is then denormalized to the original coordinates as P = T^-1 P̂ U, where T and U are the similarity transforms for x_i and X_i respectively.

    min_P Σ_i ( d(x_i, x̂_i)^2 + d(X_i, X̂_i)^2 ) ..............................(2)
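The isotropic normalization step can be sketched as follows (the √3 target applies to 3D points; the corresponding 2D transform would use √2; the sample points are placeholders):

```python
import numpy as np

def isotropic_normalize(pts):
    """Translate the centroid to the origin and scale so the RMS distance
    from the origin equals sqrt(d) for d-dimensional points.
    Returns the normalized points and the similarity transform."""
    pts = np.asarray(pts, dtype=float)
    d = pts.shape[1]
    centroid = pts.mean(axis=0)
    rms = np.sqrt(np.mean(np.sum((pts - centroid) ** 2, axis=1)))
    s = np.sqrt(d) / rms
    T = np.eye(d + 1)
    T[:d, :d] *= s
    T[:d, d] = -s * centroid
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    return (homog @ T.T)[:, :d], T

# Placeholder 3D points (not the Tsai grid measurements)
pts = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0],
                [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]])
norm_pts, T = isotropic_normalize(pts)
rms = np.sqrt(np.mean(np.sum(norm_pts ** 2, axis=1)))
print(round(rms, 6))   # 1.732051, i.e. sqrt(3)
```

The returned transform T is the similarity transform that is later undone in the denormalization step P = T^-1 P̂ U.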
The general projective finite camera matrix P can be decomposed to extract the internal camera parameters (M_i) and the camera position and orientation parameters (M_e). The camera centre C is the point for which PC = 0, and numerically this right null-vector may be obtained from the SVD of P. Since P = K[R | -RC] = [M | -MC], K and R can be found by RQ decomposition of M as M = KR. If the recovered rotation matrix is not orthogonal, orthogonality can be enforced by taking the SVD of R and replacing the diagonal matrix with the identity matrix, since the singular values of an orthogonal matrix are all one.
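A sketch of this decomposition using SciPy's `rq`; the sign fix forcing a positive diagonal of K is a standard post-step, and the example camera below is hypothetical:

```python
import numpy as np
from scipy.linalg import rq

def decompose_projection(P):
    """Split P = K[R | -RC] = [M | -MC] into K, R and the centre C."""
    M = P[:, :3]
    K, R = rq(M)                        # M = K @ R, K upper triangular
    # Force a positive diagonal of K, absorbing the signs into R
    S = np.diag(np.sign(np.diag(K)))
    K, R = K @ S, S @ R
    # Camera centre: right null-vector of P, from the SVD of P
    _, _, Vt = np.linalg.svd(P)
    C = Vt[-1]
    C = C[:3] / C[3]
    return K / K[2, 2], R, C

# Hypothetical camera: intrinsics K0, rotation R0 about Z, centre C0
K0 = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
c, s = np.cos(0.1), np.sin(0.1)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
C0 = np.array([1.0, 2.0, -5.0])
P = K0 @ np.hstack([R0, (-R0 @ C0).reshape(3, 1)])

K, R, C = decompose_projection(P)
print(np.allclose(K, K0), np.allclose(R, R0), np.allclose(C, C0))
```

With a noisy P the recovered R may additionally need the SVD-based orthogonality enforcement described above.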
IV. RECONSTRUCTION
The reconstruction problem is to compute the position of a point in 3D space given two views (x and x') and the maximum-likelihood camera projection matrices P and P'. Back-projection of two corresponding points x <-> x' does not work directly because, with image measurement error, the back-projected rays will be skew. If we consider the triangulation method X = τ(x, x', P, P'), then we would like τ to be invariant under a projective transformation H, i.e. τ(x, x', P, P') = H^-1 τ(x, x', P H^-1, P' H^-1). If we adopt this goal, minimizing error in the projective space P3 will not work, because distance and perpendicularity relationships are not invariant in P3. Instead of minimizing error in P3, we would like to estimate a 3D point X̂ exactly satisfying x̂ = P X̂ and x̂' = P' X̂, while maximizing the likelihood of the measurements under Gaussian error. As usual, the Maximum Likelihood (ML) estimate under Gaussian errors minimizes the reprojection error. Since the reprojection error only measures distance in the image, the ML estimate will be invariant under projective transformations of 3D space.
First we consider a simple linear estimate minimizing an algebraic error, similar to the DLT, that is not optimal. We use the cross product to eliminate the homogeneous scale factor. For each image we have x x (PX) = 0, giving the expression in Appendix-A.3. Taking two linearly independent equations from each camera, we obtain the system AX = 0, where the expression for A is given in Appendix-A.4. Hence we have four equations in four homogeneous unknowns, which we can solve linearly up to an indeterminate scale factor. This method is also called the Linear Triangulation Method; it is the direct analogue of the DLT method and is likewise suboptimal. As we have assumed Gaussian noise in the measurements, we seek a solution by projecting the estimated world point X̂_i into the two images at x̂_i and x̂'_i, for the right and left cameras respectively, so that the reprojection errors d and d' are minimized, where d(x, x̂) and d'(x', x̂') are the Euclidean distances and hence the geometric errors for the left and right images, e and e' are the epipoles, x̂ and x̂' are the Maximum Likelihood Estimates (MLE) of the true image point correspondences, and C and C' are the camera centres of the right and left cameras respectively (Figure-2).
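A minimal sketch of the linear triangulation step, stacking the rows of Appendix-A.4 and solving by SVD (the canonical camera pair and the image points below are hypothetical):

```python
import numpy as np

def triangulate(x1, x2, P1, P2):
    """Linear (DLT-style) triangulation: solve A X = 0 by SVD, where A
    stacks two independent rows of x x (PX) = 0 per camera."""
    A = np.vstack([x1[0] * P1[2] - P1[0],     # x p3T - p1T
                   x1[1] * P1[2] - P1[1],     # y p3T - p2T
                   x2[0] * P2[2] - P2[0],     # x'p'3T - p'1T
                   x2[1] * P2[2] - P2[1]])    # y'p'3T - p'2T
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Hypothetical canonical pair: identity camera and a 100-unit baseline in x
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-100.0], [0.0], [0.0]])])
X_true = np.array([50.0, 20.0, 500.0, 1.0])
x1 = P1 @ X_true; x1 /= x1[2]
x2 = P2 @ X_true; x2 /= x2[2]

X = triangulate(x1, x2, P1, P2)
print(np.allclose(X, X_true[:3]))
```

With noisy image points the same call returns the algebraic minimizer, which is then refined by the geometric cost functions below.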
The geometric error in the image gives the cost function C1,

    C1(x, x') = Σ ( d(x, x̂)^2 + d'(x', x̂')^2 ) ..............................(3)

Once x̂ and x̂' are found, the world point estimate X̂ can be derived by the triangulation method, since the corresponding rays will then meet precisely in space. This cost function is modified by introducing the 3D geometric error term in order to study its contribution. If X̂ is the estimate such that x̂_i = P X̂ and X̄_i is the noisy measured data, then the 3D geometric error term is

    C2(X̄_i, X̂_i) = Σ d(X̄_i, X̂_i)^2 ..............................(4)

Therefore the total cost function to be minimized is

    C(x, x̂, X̄_i, X̂_i) = α C1(x, x̂) + β C2(X̄_i, X̂_i) ..............................(5)

where α and β are introduced as weights, since the units of measurement of image points and world points are different.
The total cost function C is minimized using the numerical Levenberg-Marquardt method for different values of α and β. This method is used as it has the merits of both the Gradient Descent and Gauss-Newton methods: the update can take a large step in a direction with low curvature and a small step in a direction with high curvature. The minimum can also be obtained non-iteratively by the solution of a sixth-degree polynomial, as suggested by Hartley [4]. The iterations for minimizing the total cost function sweep α over [1000, 0.001] and β over [1000, 0.001], in multiples of 10. The Frobenius norm of the general projection matrix is used as the evaluation criterion. The maximum change of norm occurs at α = 0.01 and β = 1.0, i.e. β/α = 100.0.
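The weighted total cost of (5) and the sweep over α and β can be sketched as follows (the measured and estimated points here are illustrative placeholders, not the experiment's grid data):

```python
import numpy as np

def total_cost(x_meas, x_hat, X_meas, X_hat, alpha, beta):
    """C = alpha*C1 + beta*C2: weighted sum of the squared reprojection
    error (image units) and the squared 3D geometric error (world units)."""
    C1 = np.sum((x_meas - x_hat) ** 2)    # image geometric error, eq. (3)
    C2 = np.sum((X_meas - X_hat) ** 2)    # 3D geometric error, eq. (4)
    return alpha * C1 + beta * C2

# Illustrative measured vs. estimated points (placeholder values)
x_meas = np.array([[100.0, 200.0]]); x_hat = np.array([[101.0, 198.0]])
X_meas = np.array([[10.0, 20.0, 30.0]]); X_hat = np.array([[10.5, 20.0, 29.0]])

# Sweep alpha and beta over [1000, 0.001] in multiples of 10
for alpha in (1000.0, 100.0, 10.0, 1.0, 0.1, 0.01, 0.001):
    for beta in (1000.0, 100.0, 10.0, 1.0, 0.1, 0.01, 0.001):
        c = total_cost(x_meas, x_hat, X_meas, X_hat, alpha, beta)

print(total_cost(x_meas, x_hat, X_meas, X_hat, alpha=0.01, beta=1.0))
```

In the experiment, this scalar cost is what the LM iteration of Section II minimizes over the parameters of P for each (α, β) pair.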
V. RESULTS
The number of image points is n = 56 for both θ = 120° (case-1) and θ = 225° (case-2). The reconstructed space points generated using the camera projection matrix P derived through the linear DLT (P_DLT), named X̂_DLT, and through the iterative GSA (P_GSA), named X̂_GSA, are reprojected into the image for both cases (Figure-3).
The error between the measured space points X and X̂_DLT is named the DLT error, e_DLT = (X - X̂_DLT), and the error between the measured space points X and the optimal reconstructed space points obtained by minimizing the total cost function C(x, x̂, X̄_i, X̂_i), i.e. X̂_GSA, is named the GSA error, e_GSA = (X - X̂_GSA), as shown in Figure-4 and Figure-6 for case-1 and case-2.
The reprojection error, i.e. the difference in pixel coordinates between the measured image points x and the reprojection of the reconstructed space points through the DLT (linear triangulation, i.e. P_DLT X) and the optimal GSA (i.e. P_GSA X), has been computed and is shown in Figure-4c for the case-1 left camera and Figure-6c for the case-2 left camera. It must be mentioned that in this case the minimization is not guaranteed to converge to the global minimum; there is a possibility of convergence to a local minimum between the bounds. Finally, the triangular-mesh reconstructed surfaces are generated and shown in Figure-5 and Figure-7 for case-1 and case-2.
Figure-2: Geometric error d and d' for the left and right images.
Figure-3: Reprojection of reconstructed points for (a) case-1 left camera and (b) case-2 left camera.
Figure-4: For case-1 left camera: (a) DLT error (b) GSA error (c) Reprojection error.
Figure-5: For case-1 left camera, triangular mesh of reconstructed space: (a) X̄_i (b) X̂_DLT (c) X̂_GSA.
Figure-6: For case-2 left camera: (a) DLT error (b) GSA error (c) Reprojection error.
Figure-7: For case-2 left camera, triangular mesh of reconstructed space: (a) X̄_i (b) X̂_DLT (c) X̂_GSA.
VI. DISCUSSION
The effect of minimizing the total cost function, consisting of both the geometric error in the image pair and the 3D geometric error, as suggested by Hartley and Zisserman [1], considering measurement errors in both image and object, has been presented for a Tsai grid in two positions. It has been observed that in reconstruction through the linear DLT, the error in the Z direction (the direction of the camera axis) is significant in comparison with the X and Y directions (Figure-4 and Figure-6). The inclusion of the 3D geometric error term in the total cost function (Eq-5) does not improve the camera projection matrix in the sense of moving the reconstructed points closer to their measured values. On the contrary, it increases the average Z-coordinate error relative to the DLT, from 4.8 mm to 5.0 mm in case-1, and even more in case-2 (Figure-6).
When the measured space points are reprojected onto the image by the projection matrix P_GSA, obtained by minimizing the total cost function, the reprojected points show a variation of ±2 pixels in the image y-axis direction, and the variation is oscillatory with no definite trend. This is true for β/α = 100.0, where the maximum change in the Frobenius norm of the general projection matrix occurs (||P_DLT - P_GSA|| ~ 4193.3). Iteration with β/α > 1000 results in an insignificant change in norms (||P_DLT - P_GSA|| ~ 0.00003), suggesting no significant change in the matrix elements. This is also corroborated by the fact that the error levels e_DLT and e_GSA remain unaltered, apart from suffering from ill-conditioning and very slow convergence. For β/α < 10, the change in norm is lower, indicating no significant change in the matrix elements and also no significant change in the error levels.
The most sensitive region for this problem is β/α = 100.0, but it does not significantly improve the solution offered by the linear DLT method. In least-squares problems, given the Jacobian matrix J, we can essentially obtain the Hessian grad^2 f(x_k) if the residuals r(x_k) themselves are small. In that case the Hessian becomes grad^2 f(x_k) ≈ J(x_k)^T J(x_k), which is what is implemented in the Levenberg-Marquardt method. In this particular nonlinear least-squares minimization problem, the residuals are large, and hence quadratic approximation methods may not be suitable. This area needs further investigation, particularly regarding the minimization strategy for the total cost function C with the added 3D geometric error term.
APPENDIX-A
A.1 Finite Projective Camera Matrix

    K = [ α_x   s    x_0 ]
        [  0   α_y   y_0 ]
        [  0    0     1  ]

where α_x = f m_x, α_y = f m_y, x_0 = p_x m_x and y_0 = p_y m_y.
A.2 Stacked expression of the P matrix.

Each correspondence (X_Wi, Y_Wi, Z_Wi) <-> (x_i, y_i) contributes two rows, giving the 2n x 12 system AP = 0:

    [ X_W1  Y_W1  Z_W1  1   0     0     0    0   -x_1 X_W1  -x_1 Y_W1  -x_1 Z_W1  -x_1 ]   [ P_11 ]
    [ 0     0     0     0   X_W1  Y_W1  Z_W1 1   -y_1 X_W1  -y_1 Y_W1  -y_1 Z_W1  -y_1 ]   [ P_12 ]
    [ ...                                                                              ] . [  ... ]  =  0
    [ X_Wn  Y_Wn  Z_Wn  1   0     0     0    0   -x_n X_Wn  -x_n Y_Wn  -x_n Z_Wn  -x_n ]   [ P_33 ]
    [ 0     0     0     0   X_Wn  Y_Wn  Z_Wn 1   -y_n X_Wn  -y_n Y_Wn  -y_n Z_Wn  -y_n ]   [ P_34 ]
                                                                               (2n x 12)   (12 x 1)

where the vector of unknowns is (P_11, P_12, P_13, P_14, P_21, P_22, P_23, P_24, P_31, P_32, P_33, P_34)^T.
A.3 Expression for x x (PX) = 0.

    [ x(p^3T X) - (p^1T X)  ]
    [ y(p^3T X) - (p^2T X)  ]  =  0
    [ x(p^2T X) - y(p^1T X) ]
A.4 Expression for the A matrix.

    A = [ x p^3T  - p^1T  ]
        [ y p^3T  - p^2T  ]
        [ x'p'^3T - p'^1T ]
        [ y'p'^3T - p'^2T ]
ACKNOWLEDGMENT
The authors would like to thank Prof. Gautam Biswas, Director of the Central Mechanical Engineering Research Institute (CMERI), a constituent establishment of the Council of Scientific and Industrial Research (CSIR), New Delhi, for extending the facilities and infrastructure for carrying out the experiments.
REFERENCES
[1] Richard Hartley, A. Zisserman, Multiple View Geometry In Computer Vision, Second Edition, Cambridge. 2003, Chapter-7.
[2] Paul Beardsley, Phil Torr and Andrew Zisserman, “3D model acquisition from extended image sequences,” Lecture Notes in Computer Science, volume 1065/1996, pp 683-695, 1996.
[3] R.I.Hartley, R. Gupta, and T. Chang, “Stereo from uncalibrated cameras,” In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp 761–764, 1992.
[4] R.I. Hartley, “Euclidean Reconstruction from Uncalibrated Views,” Proc. Conf. Computer Vision and Pattern Recognition, pp. 908-912, 1994.
[5] R.Y Tsai , “A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-shelf TV Cameras and Lenses,” IEEE Journal of Robotics and Automation, vol. RA-3, NO. 4, pp- 323-344, August 1987.
[6] Zhengyou Zhang, “A Flexible New Technique for Camera Calibration,” http://research.microsoft.com/en-us/um/people/zhang/Calib/
[7] Janne Heikkilä , Olli Silvén, “A Four-step Camera Calibration Procedure with Implicit Image Correction,” in Proc. IEEE conference on Computer Vision and Pattern Recognition, 1997.
[8] K. Levenberg, “A method for the solution of certain problems in least squares.”, Quart. Appl. Math., 1944, Vol. 2, pp. 164–168.
[9] D. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters.”, SIAM J. Appl. Math., 1963, Vol. 11, pp. 431–441.