
A Comparative Study of Calibration Methods for Kinect-style Cameras

Aaron Staranowicz
CSE Dept., University of Texas at Arlington, Arlington, TX, USA
[email protected]

Gian-Luca Mariottini
CSE Dept., University of Texas at Arlington, Arlington, TX, USA
[email protected]

ABSTRACT
Kinect-style (or Depth) cameras use both an RGB and a depth sensor, which acquire color and per-pixel depth data (a depth-map), respectively. Due to their affordable price and the rich data they provide, depth cameras are being extensively used in research on assistive environments. Most of the robotic and computer-vision systems that use these Kinect-style cameras require accurate knowledge of the camera-calibration parameters. Traditional calibration methods, e.g., those that use a checkerboard pattern, cannot be straightforwardly used to calibrate Kinect-style cameras, since the depth sensor cannot distinguish such patterns. Several calibration methods have emerged that try to calibrate depth cameras. In this paper, we present a comparative study of some of the most important Kinect-style calibration algorithms. Our work includes an implementation of these methods along with a comparison of their performance in both simulation and real-world experiments.

Categories and Subject Descriptors
A.1 [General Literature]: General–Comparative Study; I.4.1 [Image Processing and Computer Vision]: Digitization and Image Capture–Camera Calibration

Keywords
Depth-Camera, Camera Calibration, Kinect

1. INTRODUCTION
Microsoft's Kinect [1] depth camera consists of both an RGB sensor and a depth sensor, which capture color images and per-pixel depth information (a depth-map), respectively. Due to their low cost, wide availability, and the rich information they provide (even in dark environments), depth cameras are being widely used in many applications, such as robotics [2], tracking of facial expressions [3], gesture recognition [4], and body monitoring in assistive environments [5, 6]. Recent work has also shown how Kinect-style cameras can be integrated to help people with cognitive impairments [7].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PETRA'12, June 6-8, 2012, Crete Island, Greece.
Copyright 2012 ACM 978-1-4503-1300-1/12/06 ...$10.00.

Figure 1: (a) The Kinect camera has a depth and an RGB sensor. (b) A standard calibration checkerboard as observed by the two sensors. Note that correspondences between the RGB image and the depth-map are hard to retrieve.

All of the above applications require exact knowledge of the depth-camera calibration parameters, i.e., the focal lengths and principal point of both the RGB and depth sensors [8], as well as their relative 3-D pose. These parameters can be obtained only if an a-priori depth-camera calibration phase has been performed.

Although color-camera calibration has been extensively investigated in the literature [9], the calibration of depth cameras presents some challenges. As illustrated by the RGB-depth image pairs of Fig. 1, textures cannot be detected in the depth map, so a regular calibration checkerboard cannot be reliably used for this purpose. Furthermore, the corners and edges of a checkerboard in the depth-map are affected by a large amount of noise, which makes it difficult to identify their exact location in correspondence with those in the RGB image.

Several calibration algorithms have been presented in recent years that try to achieve an accurate estimation of the depth-camera parameters. These new approaches share several similarities, but also have important differences. For example, Fuchs and Hirzinger [10] present a multi-spline calibration strategy which requires the camera to be mounted on a robotic arm in order to know its exact position. The work in [11] uses a custom-made calibration panel with circular holes at a specific distance. Other works have tried to develop algorithms that use readily-available materials, e.g., a regular calibration checkerboard [12, 13], or a spherical object (e.g., a soccer ball) of unknown radius [14].

In this paper we present a comparison between different depth-camera calibration methods, in particular [12–14]. We implemented several calibration methods and provide a thorough evaluation of their performance in both simulated and real-world experiments. To the best of our knowledge, this is the first work to review and thoroughly compare the performance of depth-camera calibration algorithms. Our work will help users select the best algorithm for calibrating their own depth camera.

2. OVERVIEW OF DEPTH-CAMERA CALIBRATION
As mentioned in Sect. 1, calibrating a depth camera consists of estimating the following parameters: the RGB-sensor intrinsic calibration matrix, RK, the depth-sensor intrinsic calibration matrix, DK, and the rigid-body motion, (RDR, RDt), between the RGB-sensor reference frame {R} and the depth-sensor reference frame {D}. In this section, we describe different approaches to estimate these depth-camera calibration parameters.
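As a quick reference (our addition; fRu, fRv and (uR0, vR0) denote the RGB focal lengths and principal point, as used in Sect. 3), the standard pinhole relations underlying all of the methods below can be written as

\[
{}^{R}K =
\begin{bmatrix}
  f^{R}_{u} & 0 & u^{R}_{0} \\
  0 & f^{R}_{v} & v^{R}_{0} \\
  0 & 0 & 1
\end{bmatrix},
\qquad
{}^{R}X = {}^{R}_{D}R\,{}^{D}X + {}^{R}_{D}t,
\qquad
\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = {}^{R}K\,{}^{R}X,
\]

where a 3-D point {}^{D}X expressed in {D} is first mapped into {R} and then projected to the pixel (u, v); an analogous intrinsic matrix {}^{D}K holds for the depth sensor.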

2.1 Zhang's Method [12]
Zhang's algorithm first finds a closed-form estimate of (RDR, RDt) and RK from a set of (at least 3) views of a group of (at least 3) calibration checkerboards, observed from {R} and {D}. It then adopts an iterative (non-linear) phase to refine the initial estimates. In this second phase, the authors also refine an initial estimate of the depth-sensor calibration parameters.

The initial estimates of RDR and RDt are obtained by means of the least-squares method described in [15]. This method uses an estimate of the plane parameters (i.e., the plane normal, n, and its distance from the origin, δ) in both {R} and {D}. The plane parameters Dni and Dδi (in {D}) are estimated by means of an LS plane fitting in the depth-map. The Rni and Rδi are computed by using the estimated rotation, PiRR, and translation, PiRt, from each plane, {Pi}, to {R}. The estimated (PiRR, PiRt) are obtained by calculating the homographies as in [9] and decomposing them.

The iterative step refines the initial estimates of RDR, RDt, RK, PiRR and PiRt by minimizing a cost function given by the (weighted) sum of three negative log-likelihood functions. The first represents the average pixel reprojection error, while the second enforces the constraint that, if a 3-D point is transformed from its local coordinate system to a model plane, its third component must be zero (in the plane's local coordinate system). The final log-likelihood function is derived from the corresponding pixel-point reprojection error from {D} to {R}. Note that this calibration method does iterate over the depth-mapping function, but not over DK.
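To illustrate the LS plane-fitting step used to obtain Dni and Dδi, the sketch below (ours, not taken from [12] or [15]; the function name is illustrative) fits a plane to a depth-map point cloud via an SVD of the centered points.

import numpy as np

def fit_plane_lsq(points):
    """Least-squares plane fit to an (N, 3) array of 3-D points.

    Returns (n, delta) such that n . x = delta for points x on the
    plane, with n a unit normal (up to sign). Illustrative sketch only.
    """
    centroid = points.mean(axis=0)
    # The right singular vector associated with the smallest singular
    # value of the centered point cloud is the plane normal.
    _, _, vt = np.linalg.svd(points - centroid)
    n = vt[-1]
    delta = float(n @ centroid)
    return n, delta

# Example: noisy samples of the plane z = 2 (normal ~ [0, 0, 1], delta ~ 2, up to sign).
rng = np.random.default_rng(0)
pts = np.column_stack([rng.uniform(-1, 1, 200),
                       rng.uniform(-1, 1, 200),
                       2.0 + 0.001 * rng.standard_normal(200)])
print(fit_plane_lsq(pts))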

2.2 Herrera's Method [13]
In this work the authors initially estimate DK and RK by means of a homography-based calibration method [9], and assume that corresponding points in each RGB and depth-map image pair have been selected by the user. An initial estimate of RDR and RDt is then calculated by means of the method described in [15]. While this method and the one in [12] are very similar in their first steps, [13] introduces a linear relationship between depth- and disparity-map values, which is based on two coefficients, α and β. The disparity is defined as the difference between the projected pattern (provided by the structured-light projector) and a set of predetermined patterns with pre-set distances. These parameters are initially estimated using a set of manually-selected corners on the calibration checkerboard.

The iterative step refines RDR, RDt, RK, DK, PiRR, PiRt, α and β by using a combination of weighted cost functions. The first cost function represents the pixel reprojection error, i.e., the difference between the user-selected pixel points in the RGB image and the corresponding projected 3-D points from the model plane. The second cost function represents the difference between the estimated disparity and the depth map, where the estimated disparity is obtained from the above linear depth-mapping function. The depth of a pixel point can be calculated using the plane parameters (i.e., n and δ) from the initial step. As weights for these cost functions, the authors use the standard deviations of the errors obtained throughout the (initial) calibration process, and update the weights at each step of the minimization.
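To make the role of α and β concrete, the sketch below (ours, not the authors' code; the exact functional form used in [13] may differ, so we simply assume the linear model d = α + β·z suggested by the description above) fits the two coefficients by least squares from corresponding disparity and depth samples.

import numpy as np

def fit_disparity_depth(disparity, depth):
    """Fit alpha and beta of an assumed linear model d = alpha + beta * z.

    disparity, depth: 1-D arrays of corresponding samples.
    Illustrative least-squares sketch only.
    """
    A = np.column_stack([np.ones_like(depth), depth])
    (alpha, beta), *_ = np.linalg.lstsq(A, disparity, rcond=None)
    return alpha, beta

# Example with synthetic, noisy samples of d = 10 + 3 z.
rng = np.random.default_rng(1)
z = rng.uniform(0.8, 4.0, 100)                         # depths [m]
d = 10.0 + 3.0 * z + 0.01 * rng.standard_normal(100)   # disparities
print(fit_disparity_depth(d, z))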

2.3 DCCT Method [14]
Differently from [12, 13], our recent work in [14] estimates RDR, RDt, RK and DK using a single spherical object and its projection onto the RGB image and the depth map over multiple (at least six) views.

The method uses three phases to estimate the depth-camera calibration parameters. The first phase automatically extracts and processes the features: the pixel points on the contour of the sphere in the RGB image, and the point cloud (i.e., pixel points and associated depth) that represents the sphere in the depth map. The second phase obtains an initial LS estimate of RDR, RDt and RK by using a DLT-like closed-form solution. The DLT requires corresponding points, which come from the projected center of the sphere in the RGB image and the center of the point cloud in the depth map. The projected center of the sphere is computed in closed form from the center of the observed ellipse in {R}; the center of the ellipse can be found by fitting an ellipse to the selected pixel points in the RGB image.

The third phase is the iterative step, which refines the estimates of RDR, RDt, RK and DK. This phase uses a combination of weighted cost functions. The first weighted cost function is the Frobenius norm of the difference between the fitted ellipse in {R} and the projected quadric in {R}. The second weighted cost function is the norm of the pixel reprojection error between the corresponding points, i.e., the projected center of the sphere in {R} and the sphere center in {D}. The cost functions are inversely weighted based on the distance of the sphere center with respect to the camera.
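As an illustration of the DLT-like closed-form step, the sketch below (ours, not the DCCT implementation; all names are illustrative) estimates a 3x4 projection matrix from 3-D sphere centers expressed in {D} and their projected centers in the RGB image; RK and (RDR, RDt) could then be recovered by decomposing this matrix.

import numpy as np

def dlt_projection_matrix(X, x):
    """Estimate a 3x4 projection matrix P such that x ~ P [X; 1].

    X: (N, 3) sphere centers in {D}; x: (N, 2) projected sphere centers
    in the RGB image (N >= 6). Standard DLT resection, sketch only.
    """
    A = []
    for (Xw, Yw, Zw), (u, v) in zip(X, x):
        Xh = [Xw, Yw, Zw, 1.0]
        A.append([0.0] * 4 + [-c for c in Xh] + [v * c for c in Xh])
        A.append(Xh + [0.0] * 4 + [-u * c for c in Xh])
    # The right singular vector of the smallest singular value,
    # reshaped to 3x4, is the least-squares solution (up to scale).
    _, _, vt = np.linalg.svd(np.asarray(A))
    return vt[-1].reshape(3, 4)

# Synthetic check with a known projection matrix (recovered up to scale).
P_true = np.array([[500.0, 0.0, 320.0, 10.0],
                   [0.0, 500.0, 240.0, 20.0],
                   [0.0, 0.0, 1.0, 2.0]])
X = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 2.5], [-0.4, 0.3, 3.0],
              [0.3, 0.4, 3.5], [-0.2, -0.5, 4.0], [0.6, 0.1, 1.8]])
x_h = np.column_stack([X, np.ones(len(X))]) @ P_true.T
P_est = dlt_projection_matrix(X, x_h[:, :2] / x_h[:, 2:])
print(P_est / P_est[2, 3] * P_true[2, 3])   # approximately P_true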

3. EXPERIMENTAL VALIDATION
This section details our simulation and experimental results and compares the performance of the three depth-camera calibration methods [12–14].

3.1 Simulation Results
The simulation scenario adopted for testing [12, 13] consists of 12 planes (each with a 9 × 9 grid) placed as a hemisphere. For testing [12], 9 cameras have been placed in the 3-D scene, approximately 3 meters from the checkerboards. The scenario for [14] consists of a single view of 64 spheres placed within a 3-D cube at a specific distance from each other.

Method  | RDt error [m]   | θr error [deg] | θp error [deg] | θy error [deg] | fDu error [pix] | fDv error [pix] | uD0 error [pix] | vD0 error [pix]
Herrera | 0.17 ± 0.25     | 1.18 ± 2.5     | 0.84 ± 1.8     | 1.09 ± 2.17    | 19.5 ± 51.6     | 19.47 ± 45.7    | 19.94 ± 45.6    | 19.47 ± 43.5
Zhang   | 0.002 ± 0.004   | 0.04 ± 0.09    | 0.03 ± 0.06    | 0.009 ± 0.018  | n/a             | n/a             | n/a             | n/a
DCCT    | 5.47e-4 ± 0.001 | 0.1 ± 0.021    | 0.01 ± 0.021   | 0.018 ± 0.036  | 2.4 ± 5.91      | 2.4 ± 5.55      | 0.46 ± 1.17     | 0.34 ± 0.69

Table 1: Simulation results: mean ± three times the standard deviation of the estimation errors with respect to ground truth.

Figure 2: Simulation results: distribution of the pixel reprojection errors for (a) DCCT [14], (b) Zhang [12], and (c) Herrera [13] (histograms of reprojection error [pix.] vs. number of spheres/points).

Zero-mean Gaussian noise with a standard deviation of 0.5 pixels was added to both the RGB image and the depth map. The depth values extracted from the depth map were perturbed with random Gaussian noise with a standard deviation of 0.001 meters. 50 independent realizations were run.

Due to the number of planes, their size, and the position of the cameras, the pixel noise was not set higher than 0.5 pixels, in order to maintain a realistic distribution of the imaged corners of the checkerboard.

The experiments follow the simulation details described in [12]; we implemented only part of their calibration method, namely the minimization over RDR and RDt, using an estimated RK and a known DK. Similarly, for [13] we used the authors' publicly available code together with our implementation.

Table 1 shows the results of the simulation. Each column reports the mean and three times the standard deviation of the estimation error of a camera-calibration parameter with respect to the ground truth. The first column shows the translation error between the estimated and true RDt (in meters). The second through fourth columns contain the RGB-depth rotation errors in terms of roll, pitch, and yaw angles (θr, θp, θy, in degrees). The fifth through eighth columns contain the estimation errors for the intrinsic parameters in DK (in pixels).

As can be noticed, Zhang and DCCT achieve better results than Herrera in estimating the extrinsic camera parameters; moreover, DCCT has better precision in terms of the translation error between the two sensors. This is mainly because DCCT uses spherical objects, and the fitting processes adopted to estimate the sphere (in 3-D) and the ellipses (in the RGB image) suppress noise and outliers, which has a beneficial influence on the final camera calibration.

Fig. 2 shows the pixel reprojection errors for each of the three methods. DCCT achieves an error distribution between 0 and 0.4 pixels (cf. Fig. 2(a)), while Fig. 2(b) (Zhang's method) shows a larger error distribution (from 0 to 2 pixels), even though more points are used for this estimation. Fig. 2(c) shows the largest pixel reprojection error, obtained with Herrera's method; this is due to the larger sensitivity of this method to image noise.

Figure 3: Kinect RGB images and depth map. (a) DCCT feature extraction: (left) image with estimated ellipse; (right) depth map with selected point cloud. (b) Zhang's feature extraction: (left) image with selected pixel points on the calibration grid; (right) mesh grid composed of the depth of each pixel, with selected pixel points.
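For reference, the pixel reprojection error plotted in Fig. 2 can be computed along the lines of the sketch below (ours; variable names are illustrative): each 3-D point known in {D} is mapped into {R}, projected through RK, and compared against the measured pixel location.

import numpy as np

def reprojection_errors(K, R, t, X_depth, x_meas):
    """Per-point pixel reprojection error.

    K: 3x3 RGB intrinsics; (R, t): rigid-body motion from {D} to {R};
    X_depth: (N, 3) points in {D}; x_meas: (N, 2) measured RGB pixels.
    Illustrative sketch only.
    """
    X_rgb = X_depth @ R.T + t            # map points from {D} to {R}
    x_hom = X_rgb @ K.T                  # pinhole projection
    x_proj = x_hom[:, :2] / x_hom[:, 2:]
    return np.linalg.norm(x_proj - x_meas, axis=1)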

3.2 Experimental Results
We tested the accuracy and precision of each calibration method by calibrating a Kinect camera. The initial RK is given by fRu = 506.52 pixels, fRv = 506.78 pixels, uR0 = 329.7 pixels, and vR0 = 265.89 pixels, and was obtained by using Bouguet's MATLAB calibration toolbox [16]. The real-world scenario is shown in Fig. 3, which shows the RGB image and the depth map generated by the Kinect.

Method | RDt [m]                   | θr [deg] | θp [deg] | θy [deg] | fDu [pix] | fDv [pix] | uD0 [pix] | vD0 [pix]
Zhang  | [-0.58; 0.83; 0.08]       | -0.1164  | -0.1155  | -0.0128  | n/a       | n/a       | n/a       | n/a
DCCT   | [-0.027; -0.0047; -0.028] | 0.0079   | -0.014   | 0.002    | 546.7     | 550.26    | 310.98    | 258.24

Table 2: Experimental results.

For [14], we used 50 images of a basketball placed in front of the Kinect at different distances (1 to 4 meters). The feature extraction was performed automatically once the user selected a box around the sphere in both the RGB images and the depth map (see Fig. 3(a)).

For Zhang's method [12], we used a set of 14 images with 3 checkerboard patterns. The feature extraction consisted of selecting corresponding corners on the checkerboard patterns in each image, as shown in Fig. 3(b). A pre-set DK was chosen, exactly equal to the values provided in [12].

Table 2 shows the estimated parameters. The first column shows the translation vector between the RGB and the depth sensor (in meters). The second to fourth columns indicate the roll, pitch, and yaw angles (θr, θp, θy) constituting RDR. The fifth to eighth columns are the focal lengths and principal point of DK, in pixels. In this case we observe that, even though Zhang's method uses many more features to estimate the depth-camera calibration, its final accuracy is lower than DCCT's: this can be observed, e.g., from the value of the translation vector, which should be around 2.5 cm (the correct value for the Kinect camera). While this is achieved by DCCT, Zhang's method exhibits a large error.

4. CONCLUSION AND FUTURE WORK
In this paper, we presented a comparison of three recent depth-camera calibration methods [12–14]. We implemented these calibration methods and provided an extensive evaluation of their performance in both simulated and real-world experiments. This is the first work that compares the performance of these depth-camera calibration algorithms. Future work will include the experimental evaluation of Herrera's algorithm [13], and more extensive simulation tests with increasing pixel-noise power.

5. ACKNOWLEDGEMENTS
This work was supported by an NSF scholar award to attend the doctoral consortium of PETRA 2012 (IIS-1238660).

6. REFERENCES
[1] Microsoft Kinect Camera. [Web]: http://www.xbox.com/en-US/Kinect/.
[2] V. Castaneda, D. Mateus, and N. Navab. SLAM combining ToF and high-resolution cameras. In Proc. IEEE Workshop Appl. Comp. Vision, pages 672–678, Kona, Hawaii, U.S., January 2011.
[3] Q. Cai, D. Gallup, C. Zhang, and Z. Zhang. 3D deformable face tracking with a commodity depth camera. In Proc. Eur. Conf. Comp. Vis., Lecture Notes in Computer Science, pages 229–242, Crete, Greece, September 2010.
[4] A. Ramey, V. Gonzalez-Pacheco, and M.A. Salichs. Integration of a low-cost RGB-D sensor in a social robot for gesture recognition. In Proc. 6th Int. Conf. Human-Robot Inter., pages 229–230, Lausanne, Switzerland, March 2011.
[5] Y. Chang, S. Chen, and J. Huang. A Kinect-based system for physical rehabilitation: A pilot study for young adults with motor disabilities. Res. in Devel. Disab., 32(6):2566–2570, 2011.
[6] A.P.L. Bo, M. Hayashibe, and P. Poignet. Joint angle estimation in rehabilitation with inertial sensors and its integration with Kinect. In Proc. IEEE Intl. Conf. Eng. in Med. and Bio. Society, pages 3479–3483, September 2011.
[7] Yao-Jen Chang, L. Chou, F. Wang, and S. Chen. A Kinect-based vocational task prompting system for individuals with cognitive impairments. Per. and Ubiq. Comp., pages 1–8, 2011.
[8] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2003.
[9] Z. Zhang. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell., 22(11):1330–1334, November 2000.
[10] S. Fuchs and G. Hirzinger. Extrinsic and depth calibration of ToF-cameras. In Proc. IEEE Conf. Comp. Vis. Pattern Rec., pages 1–6, Anchorage, Alaska, U.S., June 2008.
[11] J. Jung, Y. Jeong, J. Park, H. Ha, D. J. Kim, and I. Kweon. A novel 2.5D pattern for extrinsic calibration of ToF and camera fusion system. In Proc. IEEE/RSJ Intl. Conf. on Intel. Rob. Syst., pages 3290–3296, September 2011.
[12] C. Zhang and Z. Zhang. Calibration between depth and color sensors for commodity depth cameras. In Intl. Workshop on Hot Topics in 3D, in conjunction with ICME, Barcelona, Spain, July 2011.
[13] C.D. Herrera, J. Kannala, and J. Heikkila. Accurate and practical calibration of a depth and color camera pair. In Proc. 14th Int. Conf. Comp. Anal. Images Patt., volume 6855 of Lecture Notes in Computer Science, pages 437–445. Springer, Seville, Spain, August 2011.
[14] A. Staranowicz, F. Morbidi, and G.L. Mariottini. Depth-Camera Calibration Toolbox (DCCT): accurate, robust, and practical calibration of depth cameras. In Proc. of the Brit. Mach. Vision Conf., September 2012. Submitted.
[15] R. Unnikrishnan and M. Hebert. Fast extrinsic calibration of a laser rangefinder to a camera. Technical Report CMU-RI-TR-05-09, Robotics Institute, Pittsburgh, PA, July 2005.
[16] Camera Calibration Toolbox for Matlab. [Web]: http://www.vision.caltech.edu/bouguetj/calib_doc/.