
Employing the Hand as an Interface Device

Afshin Sepehri, Yaser Yacoob, Larry S. Davis
Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
Email: {afshin, yaser, lsd}@umiacs.umd.edu

Abstract— We propose algorithms and applications for using the hand as an interface device in virtual and physical spaces. In virtual drawing, by tracking the hand in 3-D and estimating a virtual plane in space, the intended drawing of the user is recognized. In a virtual marble game, the instantaneous orientation of the hand is simulated to render a graphical scene of the game board. Real-time visual feedback allows the user to navigate a virtual ball in a maze. In 3-D model construction, the system tracks the hand motion in space while the user is traversing edges of a physical object. The object is then rendered virtually by the computer. These applications involve estimating the 3-D absolute position and/or orientation of the hand in space. We propose parametric modelling of the central region of the hand to extract this information. A stereo camera is used to first build a preliminary disparity map of the hand. Then, the best fitting plane to the disparity points is computed using robust estimation. The 3-D hand plane is calculated based on the disparity plane and the position and orientation parameters of the hand. Tracking the hand region over a sequence of frames and coping with noise using robust modelling of the hand motion enables estimating the trajectory of the hand in space. The algorithms are real-time and experiments are presented to demonstrate the proposed applications of using the hand as an interface device.

Index Terms— Virtual Drawing, Virtual Game, 3-D Model Construction, Parametric Hand Tracking, Disparity Modelling, Motion Modelling, Stereopsis

I. INTRODUCTION

The human hand serves a dual purpose as a communication and manipulation device. This paper is focused on employing the hand as an interface device to a computer. It presents applications that require accurate estimation of the position and orientation of the hand in space with respect to a camera system. We describe a real-time stereo system to estimate the position and orientation of the hand in the camera and world coordinate systems and demonstrate its utility in virtual and real spaces using three applications:

1) A virtual drawing application, in which a user can write letters or draw on a virtual plane in space.

2) A 3-D model construction application, in which the user runs his hand along the edges of a physical polyhedral object, and the system constructs a 3-D model of that object, and

3) A 3-D virtual marble game, in which the user controls the inclination of a virtual plane through hand motions to manipulate the movement of a ball through a maze.

The first two applications demonstrate the accuracy of the position and orientation estimation algorithms, while the third demonstrates the real-time capabilities of our algorithms.

A. Virtual Drawing in Space

Employing the hand as a means for human-computer interaction has been explored extensively in the past few years. Using the hand as a 3-D mouse [1], a virtual gun [2], and a remote controller [3] are just a few examples. Communicating alphabets to a computer through hand movements is a powerful way of entering information. Much research has been performed to interpret hand gestures as sign language alphabets [4], a method useful to people with speech disabilities. However, people typically input information through writing natural language and typing at a keyboard, if one is available. Using a keyboard requires a virtual visible keyboard so that the user can move the hand to press a desired letter. However, writing letters does not need such visual feedback. Moreover, shapes other than alphabet symbols can be specified in the same way.

In [5] it is shown how to use paper and the fingertip as a panel and pointer to draw sketches and writings. Tracking the 3-D position and orientation of the panel makes it a flexible tool for writing. Nam and Wohn [6] showed how hidden Markov models can be used to recognize drawings made by moving the hand in space. They used a one-hand VPL Dataglove and an attached Polhemus tracker to record the angles of the fingers as well as the 3-D absolute position of the hand in space. They assume that there is no hand posture or orientation change while the hand is drawing. In [7] two cameras looking at the hand from the top and the side model the back of the hand as a square to estimate the orientation of the hand, and then the index finger tip is tracked in 3-D.

In our approach, we employ parametric models for fitting both disparity in stereo pairs and motion in monocular video for tracking the hand region in 3-D. We do not require the user to maintain her hand in any particular pose (e.g., stretched and separated fingers), but track the hand in natural poses that people typically use while writing, for example. We take advantage of the observation that when a person writes (especially using large fonts such as writing on a board), she usually keeps her hand almost rigid and maintains a constant hand pose throughout the writing. As a result, the transformation between a particular point on the hand and the pen point is almost constant. Hence, we can track a fixed point (in 3-D) on the hand to determine what the person is writing or drawing. We describe a vision-based system for virtual drawing in space without pen and paper (or board).

Figure 1. Sample frames of writing the letter B in space.

Writing is mostly a 2D activity, except when the hand is lifted off the writing plane. Letters or drawings are sketched on a planar piece of paper as a well-connected series of points. However, writing in space and tracking the hand in a frame-based manner using stereo provides a set of unconnected 3-D points. The distance between two consecutive points is determined by the speed of the hand, which is normally not constant. Converting the set of 3-D points to a 2D continuous contour is an important component of the application. We develop uniform sampling and planar modelling that allow us to derive an accurate 2D continuous contour. Writing involves two types of pose and motion of the hand: on-plane when the hand is writing, and transient off-plane motions performed as gaps between letters or figures. Differentiating between these two activities is essential to virtual writing. We use incremental planar modelling to detect on-plane termination. For initialization, cooperation from the user is expected. Figure 1 shows a few frames of a video sequence of writing the letter B in space and the output of our system. More comprehensive experimental results are presented in section IX.

Figure 2 shows the block diagram of the drawing system. The process is as follows:

Figure 2. Block diagram of the system.

1) Images are obtained from the stereo camera set, the central region of the hand is segmented, the disparity map and motion field are estimated and modelled, and the center of the hand region is tracked as a fixed reference point to provide a set of 3-D points which determines the hand trajectory in space.

2) At each time instant, the set of calculated 3-D points is fitted with a plane and the states on-plane (when the hand writes) and off-plane (when the hand is in transition between letters or shapes) are detected. The set of points in the last on-plane state is projected to a plane parallel to the image plane and a 2D point set is constructed. Section VIII explains the algorithms in detail.

3) If the user intends to draw a multi-segment figure, an extra step including some orthographic and perspective projection is required to retain the relative size and displacement of the disjoint segments. Section VIII-D has more details.

Figure 3. Sample frames of a virtual marble game.

B. Virtual Marble Game

Visual tracking of human body parts is being used in the game industry [8], [9]. Also, manipulating virtual objects not only eliminates the need for constructing expensive physical simulators, but can also support more flexibility. A virtual marble game, which resembles a physical toy marble game, is an example of such a virtual object. In this game, the user moves a ball through the hallways of a maze to reach a predefined goal location. The user performs this by moving the hand, making a suitable ramp for the ball, which moves under virtual gravity. In a virtual marble game, the user rotates her hand while the system tracks the hand orientation and simulates the marble board tilts. The system also provides visual feedback of the virtual marble board and the current position of the ball so that the user can adjust her hand orientation to navigate the virtual ball toward the goal. Figure 3 shows different frames of a sample virtual marble game where both the hand images taken by the camera and the visual scene the user sees are shown.

C. 3-D Model Construction

In this application, a user moves her hand over the edges of a physical 3-D object and the system tracks the hand to measure the dimensions of the object and to render the object virtually. We assume that the user's hand is held rigid with respect to the edges of the object and the back of the hand remains visible throughout. Figure 4 shows sample frames in which a user moves his hand along three orthogonal sides of a box. Measurements performed demonstrate the accuracy of the hand tracking method.

Figure 4. Sample frames of the hand traversing three orthogonal sides of a box.

II. HAND TRACKING SYSTEM OVERVIEW

Figure 2 shows the block diagram of the system; its main steps are as follows:

1) Images are grabbed from a stereo camera with a baseline comparable to the distance between the human eyes. Figure 5 shows a sample pair of input images. Input images are rectified to make disparity map estimation faster.


Figure 5. A sample pair of stereo input images: (a) left camera, (b) right camera.

2) Background subtraction and skin color detection are employed to segment the hand. Also, for reliable tracking, the fingers and the arm are removed from the hand area so only the central region of the hand (i.e., palm or back of the hand) remains.

3) A disparity map is estimated from the two images taken at each time instant using a parametric planar model to cope with the nearly textureless surface of the hand.

4) A monocular motion field is estimated from two consecutive frames. It is modelled similarly to the disparity map. Parameters of the motion model are then adjusted to comply with the disparity model. The motion field is used for tracking selected points throughout a sequence.

5) At each time instant, the X, Y and Z coordinates of the position and the orientation angles yaw, pitch, and roll are calculated for a coordinate frame attached to the palm. The 3-D plane parameters are calculated from the disparity plane.

6) For tracking the hand over time, a set of 2D image points is extracted from the images of one of the two cameras (e.g., the left) and its motion model. Then, using disparity models at different times, the points are mapped to the 3-D world to provide the trajectory of the hand in space.

III. REGION OF INTEREST SEGMENTATION

The central region of the hand (i.e., palm or back of the hand depending on the user's preference) is modelled and tracked in 3-D. That region is segmented in two steps: segmenting the entire hand from the image, and then selecting the central region from the segmented hand region. The following two subsections discuss these steps.

A. Hand Region Segmentation

Segmenting the hand from the image is performed by removing the background and moving objects other than the hand. Two cues are used:

• Motion cues, including background subtraction and motion-less region subtraction.

• Color cues, which take advantage of the fact that human skin color is localized in color space.


Figure 6. Hand region segmentation: (a) input image, (b) background image, (c) foreground image, (d) color detector output without background subtraction, (e) color detector output with background subtraction, (f) final segmented hand region.

We use fusion of color and background subtraction to extract the hand, with the color analysis applied to the results of background subtraction. Figures 6(b) and 6(c) show the background and foreground images of sample input image 6(a), and figures 6(d) and 6(e) show the output of the color detector without and with the background subtraction module, respectively. Background subtraction is simply implemented using a unimodal background model, followed by skin color detection and finally a flood-fill step. Figure 6(f) shows the final hand region after flood-fill filtering.

1) Color Detection Module: It is well known that human skin color is localized in color space. In [10], it is shown that the distribution of skin color tones is more localized in HSL color space than in RGB. There are different models used, including parametric and nonparametric skin distribution models as well as explicitly defining the skin color region. We choose the explicit definition of the skin region in color space due to its speed. A survey of different modelling methods can be found in [11].

We divide the process into two steps. In the first step, a superset of the real skin area is selected by limiting the hue component of the color. Thereafter, candidate pixels are analyzed one by one using a neural network [10] already trained with some sample skin colors to rule out spurious pixels.
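As an illustration of the first, hue-limiting step, the following minimal sketch assumes an OpenCV/NumPy environment; the hue bounds are placeholders, and the second-stage neural network classifier of [10] is not reproduced here.

```python
import cv2
import numpy as np

def skin_candidates(bgr_image, hue_range=(0, 25)):
    """Return a boolean mask of candidate skin pixels by limiting hue.

    The hue bounds are illustrative only; the paper's second step passes the
    surviving pixels to a trained neural network [10], which is omitted here.
    """
    hls = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HLS)  # OpenCV hue range is [0, 179]
    hue = hls[:, :, 0]
    lo, hi = hue_range
    return (hue >= lo) & (hue <= hi)
```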

B. Palm Region Segmentation

To extract the palm from the segmented hand region, we rely on the observation that the area of the palm is usually the widest part of the hand, with the exception of some of the upper areas of the arm. Also, due to the presence of the fingers, the number of curvature maxima in the neighborhood of the palm is greater than in the arm areas. These facts allow us to model the area of the palm as a union of a set of intersecting circles.

The following summarizes the estimation process:

1) Segment the area of the hand as explained in section III-A.

2) Find the largest interior circle (LIC) of the segmented area using the distance transform. This circle is likely to be located on the palm. However, to avoid circles in the area of the arm, we find the center of gravity of the curvature maxima of the hand contour and consider only those circles that contain this point. Since the fingers create more curvature maxima than the smooth edges of the straight arm, this tends to place the center point on the palm.

3) Find other large interior circles with a radius larger than a given threshold (e.g., 0.8 of the radius of the LIC). The fingers inherently will not belong to a circle with such a radius even if a few of them are joined. To avoid including circles on the arm, we discard circles that do not intersect the LIC.

4) Compute the union of the area of all the obtained circles and consider it as the estimated area of the palm. We do not expect this area to cover the palm perfectly. Also, the largest interior circles in the two images may not exactly correspond to the same actual hand region. Nevertheless, they will have a high percentage of overlap. Figure 7(a) shows the large interior circles for the image of the left camera shown in figure 5 and figure 7(b) shows the final region of interest as the union of the circles.

An alternative approach for segmenting the central region of the hand is modelling it with a square, as discussed in [7].
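To make step 2 concrete, the following sketch finds the largest interior circle with OpenCV's distance transform, assuming a binary hand mask from section III-A; the curvature-maxima test and the additional circles of steps 3 and 4 are omitted.

```python
import cv2
import numpy as np

def largest_interior_circle(hand_mask):
    """Locate the largest circle inscribed in a binary hand mask.

    The maximum of the distance transform gives the radius of the largest
    interior circle, and its location gives the circle center (a palm
    candidate).
    """
    dist = cv2.distanceTransform(hand_mask.astype(np.uint8), cv2.DIST_L2, 5)
    _, radius, _, center = cv2.minMaxLoc(dist)  # (minVal, maxVal, minLoc, maxLoc)
    return center, radius
```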

IV. PARAMETRIC DISPARITY MAP ESTIMATION

To reconstruct the position of the hand in 3-D, we estimate the disparity map from stereo. There are different sources of noise in the disparity estimation process. The two cameras usually have different levels of brightness, white balance, and contrast, which makes the matching process challenging. Also, the low texture of the hand adds to this problem. The rectification process also causes some deviations in the pixel values. Figure 8 shows the estimated disparity map for the sample image pair before the modelling. To cope with noise issues we introduce a parametric disparity model.

Figure 7. Segmented palm regions and largest interior circles of the input images of Figure 5: (a) largest interior circles, (b) final region of interest.

Figure 8. Disparity map of the pair of images in Figure 5.

A. Disparity Map Modelling

We model the palm as a 3-D plane

Z = C_1 X + C_2 Y + C_3 = C_1 \frac{x}{f} Z + C_2 \frac{y}{f} Z + C_3    (1)

where P = (X, Y, Z) is a point on the plane and p = (x, y, f) is the image of point P on the image plane, with f denoting the focal length of the camera. Since Z is inversely proportional to the disparity value d (i.e., Z = \alpha / d for some value \alpha),

d = \frac{\alpha}{C_3} + \left(-\frac{C_1 \alpha}{f C_3}\right) x + \left(-\frac{C_2 \alpha}{f C_3}\right) y = c_1 x + c_2 y + c_3    (2)

which means that points (x, y, d) obtained from the disparity map should also lie on a plane.

To cope with outliers, we employ robust estimation to find the parameters of the planar model. M-estimation is a robust method of estimating the regression plane which works well in the presence of significant outliers. Considering the plane model

d_i = c_1 x_i + c_2 y_i + c_3 + e_i = \mathbf{x}_i^T \mathbf{c} + e_i    (3)

with \mathbf{x}_i = (x_i, y_i, 1)^T and \mathbf{c} = (c_1, c_2, c_3)^T, the general M-estimator, which corresponds to the maximum-likelihood estimator [12], minimizes the objective function

\sum_{i=1}^{n} \rho(e_i) = \sum_{i=1}^{n} \rho(d_i - \mathbf{x}_i^T \mathbf{c})    (4)

where n is the number of points and \rho is the influence function [13].

Let \psi = \rho' be the derivative of \rho. To minimize (4), we need to solve the system of three equations

\sum_{i=1}^{n} \psi(d_i - \mathbf{x}_i^T \mathbf{c}) \, \mathbf{x}_i^T = 0    (5)

Defining the weight coefficients w_i = \psi(e_i)/e_i, the estimating equations may be rewritten as

\sum_{i=1}^{n} w_i (d_i - \mathbf{x}_i^T \mathbf{c}) \, \mathbf{x}_i^T = 0    (6)

The solution \mathbf{c} to (6) can be found using iteratively reweighted least-squares (IRLS) [13].

For fitting the 3-D plane to our disparity data, we choose the Geman-McClure function for \rho [14]

\rho(x, \sigma) = \frac{x^2}{\sigma + x^2}    (7)

Since this function has a differentiable \psi-function, it provides a more gradual transition between inliers and outliers than other influence functions [15].

To achieve fast convergence as well as to avoid local minima, we initialize the weights w_i^{(0)} with values proportional to the confidence of each point in the disparity calculation process. This confidence can be defined as the reciprocal of the sum of the differences of the pixel values in the correlation windows.
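A compact NumPy sketch of the robust plane fit of equations (3)-(7) is given below; it is an illustration rather than the paper's implementation, the confidence-based weight initialization is reduced to an optional caller-supplied array, and the scale sigma is left as a free parameter.

```python
import numpy as np

def irls_plane_fit(x, y, d, sigma=1.0, init_w=None, iters=20):
    """Fit d ~ c1*x + c2*y + c3 by IRLS with Geman-McClure weights.

    x, y, d: 1-D arrays of image coordinates and disparity values.
    With rho(e) = e**2 / (sigma + e**2) and psi = rho', the IRLS weight is
    w = psi(e)/e = 2*sigma / (sigma + e**2)**2.
    """
    X = np.column_stack([x, y, np.ones_like(x)])     # design matrix, shape (n, 3)
    w = np.ones(len(d)) if init_w is None else np.asarray(init_w, float)
    c = np.zeros(3)
    for _ in range(iters):
        sw = np.sqrt(w)
        # weighted least-squares step: minimize sum_i w_i * (d_i - x_i^T c)^2
        c, *_ = np.linalg.lstsq(X * sw[:, None], d * sw, rcond=None)
        e = d - X @ c                                # residuals
        w = 2.0 * sigma / (sigma + e ** 2) ** 2      # Geman-McClure weights
    return c                                         # (c1, c2, c3)
```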

V. PARAMETRIC MOTION FIELD ESTIMATION

Calculating the disparity map and modelling it at each frame enables us to estimate the hand plane in space instantaneously; however, it does not provide a one-to-one mapping of the points on the planes in consecutive frames, which is required for tracking. Motion analysis is employed to recover this information. The motion field is modelled using a similar approach to that used for disparity modelling. Let \pi be a moving plane in space with translational velocity \mathbf{t} and angular velocity \omega. It is well known [16] that the components of the motion field \mathbf{v} = (u, v)^T can be computed as

u = \frac{1}{fd}\left(a_1 x^2 + a_2 x y + a_3 f x + a_4 f y + a_5 f^2\right)
v = \frac{1}{fd}\left(a_1 x y + a_2 y^2 + a_6 f y + a_7 f x + a_8 f^2\right)    (8)

where f is the focal length, d is the distance between \pi and the origin (the center of projection), and

a_i = g_i(\mathbf{t}, \omega, d, \mathbf{n}),  1 \le i \le 8

where g_i(\cdot) is a known function and \mathbf{n} is the unit vector normal to \pi.

Defining new coefficients b_i as

b_1 = \frac{a_1}{fd}, \; b_2 = \frac{a_2}{fd}, \; b_3 = \frac{a_3}{d}, \; b_4 = \frac{a_4}{d}, \; b_5 = \frac{a_5 f}{d}, \; b_6 = \frac{a_6}{d}, \; b_7 = \frac{a_7}{d}, \; b_8 = \frac{a_8 f}{d}    (9)

equation (8) can be rewritten as

u = b_1 x^2 + b_2 x y + b_3 x + b_4 y + b_5
v = b_1 x y + b_2 y^2 + b_6 y + b_7 x + b_8    (10)

If we define a new matrix X and a new vector \mathbf{b} as

X = \begin{bmatrix} x^2 & xy & x & y & 1 & 0 & 0 & 0 \\ xy & y^2 & 0 & 0 & 0 & y & x & 1 \end{bmatrix}    (11)

\mathbf{b} = (b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8)^T

equation (10) can be rewritten as

\mathbf{v} = X \mathbf{b}    (12)

Now, having a set of n points p_i = (x_i, y_i) and their calculated motion vectors \mathbf{v}_i = (u_i, v_i)^T, we can compute the X_i's, thereafter defining the new vector \mathbf{v}_{all} and the new matrix X_{all} as a combination of the \mathbf{v}_i's and X_i's respectively:

X_{all} = \begin{bmatrix} X_1^T & X_2^T & \dots & X_n^T \end{bmatrix}^T,  \mathbf{v}_{all} = \begin{bmatrix} \mathbf{v}_1^T & \mathbf{v}_2^T & \dots & \mathbf{v}_n^T \end{bmatrix}^T    (13)

Equation (12) can now be generalized as:

\mathbf{v}_{all} = X_{all} \mathbf{b}    (14)

Using M-estimation in a similar way as in section IV-A, we can find the coefficient vector \mathbf{b}.
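The following sketch assembles X_all and v_all as in equations (11)-(13) and solves (14); ordinary least squares is used here for brevity, whereas the paper applies the same M-estimation as in section IV-A.

```python
import numpy as np

def fit_motion_model(points, flows):
    """Fit the 8-parameter planar motion model of equation (10).

    points: (n, 2) array of (x, y) image coordinates.
    flows:  (n, 2) array of measured motion vectors (u, v).
    Returns b = (b1, ..., b8).
    """
    rows = []
    for x, y in points:
        rows.append([x * x, x * y, x, y, 1, 0, 0, 0])   # u-row of X_i, eq. (11)
        rows.append([x * y, y * y, 0, 0, 0, y, x, 1])   # v-row of X_i, eq. (11)
    X_all = np.asarray(rows, float)
    v_all = np.asarray(flows, float).reshape(-1)        # stacked (u1, v1, u2, v2, ...)
    b, *_ = np.linalg.lstsq(X_all, v_all, rcond=None)
    return b
```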

A. Motion Field Adjustment based on the Disparity

If we define image I as a function of spatial variables x and y and temporal integer variable t, the motion field in a stereo system can be written as:

I_l(x, y, t) = I_l(x + u_l, y + v_l, t + 1)
I_r(x, y, t) = I_r(x + u_r, y + v_r, t + 1)    (15)

where indices l and r distinguish the left and right cameras. Meanwhile, from stereo constraints in the rectified image pairs we know:

I_l(x, y, t) = I_r(x - d, y, t)
I_l(x + u_l, y + v_l, t + 1) = I_r(x + u_l - d', y + v_l, t + 1)    (16)

with d and d' denoting disparity values at times t and t + 1 respectively. From (15) and (16) we can deduce:

u_l + d - d' = u_r,  v_l = v_r    (17)
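A small illustration of constraint (17), assuming the disparities of a tracked point at times t and t+1 are available: it returns the right-image motion vector that a left-image motion vector should map to.

```python
def right_flow_from_left(u_l, v_l, d_t, d_t1):
    """Apply constraint (17): u_r = u_l + d - d', v_r = v_l.

    d_t and d_t1 are the disparities of the tracked point at times t and t+1;
    deviations from the returned values indicate mismatches that the
    adjustment step of section V-A tries to reconcile.
    """
    return u_l + d_t - d_t1, v_l
```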

The parameters of the motion model (8) were estimated for each camera individually in section V. Due to mismatching and inherent deviation of the palm from a plane, the conditions in (17) are not exactly satisfied. We modify the motion vectors to satisfy (17) as closely as possible before calculating the motion coefficients, as follows:

We select n sample points p^t_{il} from the region of interest on the left image at time t, find conjugate points p^t_{ir} on the right image using the disparity model, and find the corresponding points p^{t+1}_{il} and p^{t+1}_{ir} at time t + 1 for the left and right images respectively using the modelled motion fields. Then, we compute a new set of points, q^{t+1}_{ir}, using p^{t+1}_{il} and the disparity model at time t + 1. Now, points p^{t+1}_{ir} are replaced by a weighted average of p^{t+1}_{ir} and q^{t+1}_{ir} based on their fitness as measured by window intensity matching. We then repeat all the measurements, exchanging the roles of the left and right cameras, and continue until the points reach stable locations.

Using the motion vectors found from the new point locations, we can estimate more accurate motion coefficient vectors \mathbf{b}_l and \mathbf{b}_r. It is worth noting that even though the above algorithm can enhance the quality of the motion vectors, it requires additional processing time. Therefore, for applications where the accuracy of the initial estimation is sufficient, we skip this optimization step.

VI. ESTIMATING 3-D PALM POSITION AND ORIENTATION

The disparity plane can be mapped onto the palm plane. By locating a coordinate frame on this plane, the position and orientation of the palm can be calculated. Initially, we assume that there is no motion information provided. In section VII, we show how the motion information improves this coordinate frame assignment and palm pose estimation.

We find the palm plane in 3-D using the calibration information and the disparity plane. Having found the coefficients (c_1, c_2, c_3) of the disparity plane, we can use (2) to find (C_1, C_2, C_3), the coefficients of the hand plane in 3-D as defined in (1), when we have rectified images. A simple method for performing this mapping for unrectified images is to find three points lying on this plane in 3-D and then fit a plane to these three points. To find points in 3-D, we identify corresponding points from the disparity plane and use a simple triangulation process with the camera calibration information [16].

We define the palm plane as the transformed plane found after two rotations and one translation applied to the camera X-Y plane. Specifically, we rotate the X-Y plane with equation Z = 0 first about the X axis and then about the Y axis (i.e., yaw and pitch) to transform it to Z = C_1 X + C_2 Y. Then, we translate the plane along the Z axis by a constant value C_3, which makes the plane equation Z = C_1 X + C_2 Y + C_3. Coefficient values for C_1 and C_2, which were already found through the plane fitting process, are used to determine the two rotation angles \psi and \theta corresponding to yaw and pitch respectively, as follows:

\psi = \tan^{-1}\!\left(\frac{C_2}{\sqrt{1 + C_1^2}}\right),  \theta = \tan^{-1}(-C_1)    (18)

Using the two rotation angles \psi and \theta, and the translation vector (0, 0, C_3)^T, we compute the transformation matrix P, which transforms the X-Y plane to the hand plane:

P = \begin{bmatrix} \cos\theta & \sin\theta\sin\psi & \sin\theta\cos\psi & 0 \\ 0 & \cos\psi & -\sin\psi & 0 \\ -\sin\theta & \cos\theta\sin\psi & \cos\theta\cos\psi & C_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}    (19)

This matrix will be used in later stages of processing.
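A short sketch of equations (18) and (19), assuming NumPy and the 3-D plane coefficients (C1, C2, C3) obtained from the disparity plane.

```python
import numpy as np

def palm_plane_transform(C1, C2, C3):
    """Compute yaw, pitch, and the homogeneous transform P of equation (19).

    P maps the camera X-Y plane (Z = 0) onto the palm plane
    Z = C1*X + C2*Y + C3.
    """
    psi = np.arctan(C2 / np.sqrt(1.0 + C1 ** 2))   # yaw, eq. (18)
    theta = np.arctan(-C1)                         # pitch, eq. (18)
    c_t, s_t = np.cos(theta), np.sin(theta)
    c_p, s_p = np.cos(psi), np.sin(psi)
    P = np.array([
        [c_t,  s_t * s_p,  s_t * c_p, 0.0],
        [0.0,  c_p,       -s_p,       0.0],
        [-s_t, c_t * s_p,  c_t * c_p, C3 ],
        [0.0,  0.0,        0.0,       1.0],
    ])
    return psi, theta, P
```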

The next step is to assign a coordinate frame to the palm where the X-Y plane of this frame resides on the model plane. This coordinate frame provides the 6 parameters required to determine the position and orientation of the hand in 3-D. To determine the position of the hand, we need to assign the origin of the frame to a fixed point on the palm. A good point is the center of the palm, which can be approximated by the center of mass of the estimated area of the palm, built as the union of the set of circles as explained in section III-B. The position of the origin O = (O_X, O_Y, O_Z)^T in 3-D is calculated through a simple triangulation process.

The rotation of the hand about the Z axis of the palm frame, the roll, can be computed using the orientation of the 2D silhouette points of the hand in the X-Y plane. Ignoring some infrequent cases where the arm is hidden and all fingers but the thumb are bent, roll can be computed as the angle of the axis of the least moment of inertia [17] and is calculated as

\phi = \frac{1}{2} \tan^{-1}\!\left(\frac{2\mu_{1,1}}{\mu_{2,0} - \mu_{0,2}}\right)    (20)

where

\mu_{p,q} = \sum_{(x,y) \in R} (x - \bar{x})^p (y - \bar{y})^q

and \bar{x} and \bar{y} are average values over the region R, which includes the whole segmented region of the hand. This provides us with the direction of the arm, or the rough direction of the fingers in case the arm is missing, and is a good approximation of the true hand roll.
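A minimal sketch of the roll estimate of equation (20), assuming a boolean image mask of the whole segmented hand region; arctan2 is used to keep the angle well defined when the two second-order moments are nearly equal.

```python
import numpy as np

def hand_roll(mask):
    """Estimate roll as the angle of the axis of least moment of inertia.

    mask: boolean image of the segmented hand region R.
    """
    ys, xs = np.nonzero(mask)
    x_bar, y_bar = xs.mean(), ys.mean()
    dx, dy = xs - x_bar, ys - y_bar
    mu11 = np.sum(dx * dy)
    mu20 = np.sum(dx * dx)
    mu02 = np.sum(dy * dy)
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)   # eq. (20)
```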

VII. TRACKING A REFERENCE POINT IN 3-D

Following the method discussed in the previous section over time, we can track the hand motion in 3-D. However, since the location of the center of the hand is not determined accurately, the two center points in consecutive frames are not necessarily images of the same 3-D point on the hand. This causes jumps in the hand trajectory. Instead, we track a fixed 3-D reference point with the help of the modelled motion field to measure the trajectory of the hand in space. The choice of such a reference point is not critical; however, tracking a point in the center of the segmented region is more reliable than points on the boundary, since we might lose the boundary points as a consequence of hand rotation. Also, given the physical axes of the hand, the effect of rotation on the location of a point is smallest near the hand center. This point is tracked indirectly by employing the parameterized planar hand region tracking. We estimate the motion model of the whole central region of the hand and reduce the impact of outliers using the method discussed in section V. The new position of the reference point is computed from the motion model and is mapped to a 3-D point in space using the parametric model of the disparity map.

VIII. EXTRACTING THE DRAWN SEGMENT IN THE DRAWING APPLICATION

The position and orientation of the hand as calculated so far are essential to the applications we explore. However, for the virtual drawing application, we also need some additional analysis, as explained below.

A. 3-D to 2D Conversion

Tracking the reference point in 3-D over time gives us a set of points:

L_3 = \{P^t = (X^t, Y^t, Z^t) \mid 1 \le t \le T\}    (21)

with no guaranteed connectivity. In fact, the distance between two consecutive points is determined by the speed of the hand, which is not uniform. Also, we cannot expect the user to move exactly on a plane while she is writing virtually in space. Therefore we convert the set of 3-D points L_3 to the best approximated set L_2 in 2D. We use the robust estimation method defined in section IV-A for this. This allows the user's hand to shake or move in an unexpected way.

As mentioned earlier, the distribution of the reference points on the plane is non-uniform due to the variable speed of the hand motion. This might bias the plane toward the locations where the hand is moving more slowly. To cope with this variation, a re-sampling of the points in L_3 is performed. Neighboring points are connected using straight lines and then the resulting edge image is sampled uniformly to construct a set of points L_3^u with uniform distances in 3-D.
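A sketch of this uniform re-sampling, assuming NumPy: the tracked points are connected by straight segments and re-sampled at equal arc-length steps to produce L_3^u.

```python
import numpy as np

def resample_uniform(points_3d, step):
    """Resample a 3-D polyline at (approximately) uniform arc-length spacing.

    points_3d: (T, 3) array of tracked reference points P^t.
    step: desired spacing between consecutive output points.
    """
    pts = np.asarray(points_3d, float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)      # segment lengths
    s = np.concatenate([[0.0], np.cumsum(seg)])             # arc length at each point
    samples = np.arange(0.0, s[-1], step)
    return np.column_stack([
        np.interp(samples, s, pts[:, k]) for k in range(3)  # linear interpolation per axis
    ])
```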

B. On-Plane vs. Off-Plane

An essential part of our virtual writing system is to distinguish between on-plane and off-plane states. The on-plane finishing frame is recognized automatically, whereas the on-plane starting frame requires the cooperation of the user. The user needs to hold her hand still for a few frames so that the system can detect it as a sign of the start of writing. Thereafter, the system starts fitting planes to the point set L_3^u and incrementally fits the plane in subsequent frames. When it detects a significant deviation from the fitted plane for the last few frames, it recognizes it as a sign of drawing termination. The user usually lifts the hand from the board after writing a letter or drawing a shape; however, this action needs to be more conspicuous in virtual writing than when writing on a real plane. To achieve better performance, we fit a planar model to all the points except the last few (to prevent off-plane points from deviating the plane from its true position and causing off-plane detection failure; see figure 9). It is worth noting that a similar test could be used to recognize the on-plane starting point, where the tracked point resides on a plane for a few frames; however, this needs more cooperation from the user with more controlled movements. Informal testing indicated that users found it more natural to remain still for a few frames.

Figure 9. The last N_2 points are supposed to be out of the plane.

C. Drawing Algorithm

To extract a user's drawing from the set of points L_3, the following steps are taken at each time instant:

1) The points in the set L_3 are connected sequentially using straight lines and the resulting edge shape is re-sampled uniformly to produce the point set

L_3^u = \{P_i = (X_i, Y_i, Z_i) \mid 1 \le i \le M\}

where M depends on the sampling rate and is preferably larger than T.

2) If the system is in the off-plane state (which is the initial state), it checks whether there has been any significant displacement in the last N_1 points. For this purpose, a parameter D_1 is calculated:

D_1 = \max(\|P_i - P_{i-1}\|),  M - N_1 \le i \le M

and we switch to the on-plane state and reset the index t to one if D_1 is less than a certain threshold.

3) Otherwise, if the system is in the on-plane state, we fit the best plane to the subset L_3^{on} of points in L_3^u

L_3^{on} = \{P_i = (X_i, Y_i, Z_i) \mid P_i \in L_3^u, \; 1 \le i \le M - N_2\}

using M-estimation as discussed in section IV-A. As mentioned earlier, the reason for excluding the last N_2 points is that we do not want the potential off-plane points to bias the plane (see figure 9). The fitted plane is

Z = \alpha_1 X + \alpha_2 Y + \alpha_3    (22)

4) Parameter D_2 is calculated based on the distance of the last N_2 points to the plane:

D_2 = \min(\|P_i - P_i^{proj}\|),  M - N_2 < i \le M

where P_i^{proj} is the projection of point P_i on the estimated plane, computed as

P_i^{proj} = P_i + \lambda(\alpha_1, \alpha_2, -1)

with \lambda calculated as

\lambda = -\frac{\alpha_1 X_i + \alpha_2 Y_i + \alpha_3 - Z_i}{\alpha_1^2 + \alpha_2^2 + 1}

If parameter D_2 is larger than a threshold, then we switch to the off-plane mode, indicating that the drawing of one segment is completed. The index t is reset to one.

5) If a segment of drawing has just been recognized (i.e., L_3^{on}), rotate it such that it resides on a plane parallel to the image plane and denote the new point set as L_3^{rot}

L_3^{rot} = \{P_i = (X_i^{rot}, Y_i^{rot}, Z_i^{rot}) \mid 1 \le i \le M - N_2\}

The points in this set should satisfy the condition

Z_1^{rot} = Z_2^{rot} = \dots = Z_{M-N_2}^{rot}

6) Define the new set of 2D points L_2 as

L_2^{un} = \{p_i = (X_i^{rot}, Y_i^{rot}) \mid 1 \le i \le M - N_2\}

7) Normalize the size and location of the point set L_2^{un} to obtain the final output point set L_2.
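A condensed sketch of the on-plane/off-plane switching in steps 2-4 is shown below, assuming NumPy; an ordinary least-squares plane fit stands in for the M-estimation of section IV-A, and the thresholds and window sizes N_1, N_2 are illustrative.

```python
import numpy as np

def update_state(state, pts, n1=10, n2=5, d1_thresh=2.0, d2_thresh=10.0):
    """One iteration of the on-plane / off-plane state machine (steps 2-4).

    state: 'on' or 'off'; pts: (M, 3) array of re-sampled points L3^u,
    most recent last. Returns the new state.
    """
    if state == 'off':
        # step 2: switch to on-plane when the hand has been nearly still
        d1 = np.max(np.linalg.norm(np.diff(pts[-n1:], axis=0), axis=1))
        return 'on' if d1 < d1_thresh else 'off'
    # step 3: fit Z = a1*X + a2*Y + a3 to all but the last n2 points
    head = pts[:-n2]
    A = np.column_stack([head[:, 0], head[:, 1], np.ones(len(head))])
    (a1, a2, a3), *_ = np.linalg.lstsq(A, head[:, 2], rcond=None)
    # step 4: distance of the last n2 points to the fitted plane
    tail = pts[-n2:]
    dist = np.abs(a1 * tail[:, 0] + a2 * tail[:, 1] + a3 - tail[:, 2]) \
        / np.sqrt(a1 ** 2 + a2 ** 2 + 1.0)
    return 'off' if np.min(dist) > d2_thresh else 'on'
```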

D. Multi-Segment Drawing with Feedback

Figure 10. Drawing multi-segmented shapes in 3-D: (a) desired shapes and output of the system, (b) the 3-D scene.

Even though the system discussed so far works well for writing alphanumeric characters, where most letters can be drawn using one segment (no off-plane state in the middle), for drawing shapes the hand moves between off-plane and on-plane states. Therefore, disjoint segments need to be positioned and sized correctly relative to each other. In virtual writing in space, the user does not need to have any visual reference and the system does not consider the location and size of the letters, as these are all normalized in the recognition module. However, when we are dealing with more than one segment of drawing, the user needs a display, which we call an output board, where the system provides visual feedback about the current position of the user's hand with respect to the already drawn shapes. Also, the user needs to see all the drawn segments to adjust the size and position. The system works as follows: after drawing each segment and fitting the imaginary plane in 3-D, the drawn segment is rotated to reside on a plane parallel to the image plane. Also, to keep the size of all the segments proportional to the first one, each rotated segment is projected orthographically to the first projected plane so that the same perspective ratios are applied to all segments while the picture shown on the output board is created. As a specific example, assume we would like to draw two overlapping circles with different sizes as shown in figure 10(a). The steps involved in creating this drawing are as follows:

1) The user starts drawing the bigger circle (shown as dashed and outlined in figure 10(b)) on an imaginary plane. According to the algorithm in section VIII-A, the best plane in 3-D is fitted to the points. The drawn circle and the fitted plane are called C_1 and \pi_1 respectively.

2) C_1 is rotated into C'_1, which resides on plane \pi'_1, a plane parallel to an output board we place on the image plane \pi_i.

3) The circle C'_1 is now projected on \pi_i through a perspective transformation.

4) Finishing the drawing of the first circle and moving away from the estimated plane \pi_1, the user switches to the off-plane state, where his hand moves freely in space to prepare for the next segment. During this period, the reference point on the central region of the hand is projected orthographically to the plane \pi'_1 and thereafter perspectively to \pi_i. This gives the user instantaneous feedback of the starting point of the next circle. The reason for using an orthographic projection is that the user does not require perspective projection while drawing in space. On the other hand, the user does not adjust the size of the desired shape with respect to its distance to the camera. Getting real-time feedback from the system, the user moves the hand so that its projection on the board goes to the desired location.

5) The hand stops at point S; its orthographic projection on plane \pi'_1 is denoted as S'. The perspective projection of S' goes to s on the board, where the user observes the result. Thereafter, the user remains still for a few frames to let the system know that a new segment is starting.

6) The user draws the second circle C_2 and the second fitting plane \pi_2 is estimated in the same way as for the previous circle. It is worth noting that there is no relationship between planes \pi_1 and \pi_2, as they are both imaginary planes in space and the user need not keep their positions in mind.

7) Circle C_2 rotates about point S such that the rotated circle C'_2 resides on plane \pi'_2, parallel to \pi'_1 and \pi_i. The reason for using S as the center of rotation is that its position on the rotated shape remains the same, and consequently, its position with respect to circle C'_1 is unaltered.

8) Applying an orthographic projection to circle C'_2, we obtain circle C''_2 on plane \pi'_1.

9) A perspective projection is performed for circle C''_2 to add it to the output board, where the projection of C'_1 already resides. The result is two overlapped circles with the desired size ratio, as shown in figure 10(a).

An example of drawing a multi-segment shape is shown in the next section.

IX. EXPERIMENTS AND RESULTS

A. Hand Position and Orientation Estimation

To measure the accuracy of the proposed technique, we compare it with a hand model computed using a set of markers on the palm, finding their positions on the images manually. We compute the coordinates of those points in 3-D and fit a plane to them. Figure 11 shows a sample image with markers. As depicted in the figure, the positions of the markers are selected so that they cover the area of the palm uniformly. This provides us with a better comparison, as the region-based method picks points uniformly.

Figure 11. A sample image with markers.

Figure 12. Experimental results. Top: marker-based and region-based position values (O_x, O_y, O_z, in millimeters) versus frame number: (a) X, (b) Y, (c) Z. Bottom: marker-based and region-based orientation values (in degrees) versus frame number: (d) yaw, (e) pitch, (f) roll.

Figure 12 shows the position coordinates O_X, O_Y and O_Z and the orientation angles yaw, pitch and roll, denoted as \psi, \theta, and \phi, of this marker-based plane as well as of the region-based plane estimated through disparity analysis. A sequence of 30 frames was used for this experiment. The results are shown in table I.

TABLE I. STATISTICAL RESULTS

Parameter   Mean absolute difference   Std. dev. of the absolute difference
O_X         1.8135 mm                  0.9215 mm
O_Y         1.0514 mm                  0.4740 mm
O_Z         2.0792 mm                  4.0983 mm
\psi        5.1570°                    3.2986°
\theta      6.9515°                    5.3280°
\phi        3.3571°                    1.9242°

Although the marker-based plane passes through a set of reliable points, this plane may not be the optimal plane, as the shape of the palm is not exactly a plane. For this reason, we do not regard the marker-based plane as a ground truth plane; in fact, we believe that the plane estimated through disparity analysis is a better approximation, giving us more reliable position and orientation parameters.

Figure 13. Distribution of the error of disparity values with respect to the disparity plane (population ratio versus disparity error, averaged over 30 frames).

Another useful parameter that assesses the accuracy of our algorithm captures the distribution of the disparity errors, which measures how far the disparity points are from the fitted disparity plane; in other words, how many of the points are outliers. This is an important issue because the M-estimation algorithm breaks down if the percentage of outliers is too high, and then it diverges from the optimal plane drastically. Figure 13 shows the distribution of the errors measured by averaging the corresponding distributions over a 30-frame sequence. It is a normal distribution with mean 0.0050 and standard deviation 0.0237, which gives us a 35% rate of outliers if we define the inlier-outlier threshold as 1.5 and a 9% outlier rate if the threshold is 2.5 levels of disparity. Therefore, the M-estimation algorithm is convergent.

Figure 14 shows sample frames selected from different image sequences showing a hand in motion. The left image of the image pairs, along with the corresponding models built based on the estimated position and orientation of the hand, are depicted in the figures. Different frames show a variety of cases to which the proposed method is applicable. There are frames where the fingers are moving freely, and we still track the palm. Hands from different people in figure 15 also show the applicability of the method in low-textured as well as high-textured cases. It also indicates that our algorithm works on the front of the hand as well as the back of the hand.

Figure 14. Experimental results: sample input frames along with corresponding estimated models.

Figure 15. Experimental results: sample frames showing the front and back of hands, as well as high-textured and low-textured hands.

B. Virtual Drawing Application

To measure the accuracy of the method, sequences were taken where a person was actually writing on paper using a pen. Comparison of the letters extracted from our vision-based system with the real letters written on the paper shows how well our tracking method maps reference points on the hand to the pen-point tracks. In figure 16, the outputs of our system, as well as the letters on the paper scanned as digital images, are shown. Figure 17 shows frames picked from the beginning, middle and end of the writing for the sample letter Z.

The Chamfer distance is used to measure the similarity of the two shapes. It is a well-known method to measure the distance between two edge images X and Y:

c(X, Y) = \frac{1}{|X|} \sum_{x \in X} \min_{y \in Y} \|x - y\|    (23)

To make the measure independent of the size of the image, the distance is normalized by dividing it by the largest dimension of the shape (i.e., the distance of the two furthest points):

c_n(X, Y) = \frac{c(X, Y)}{\max_{x_1, x_2 \in X} \|x_1 - x_2\|}    (24)

A more accurate measure, the bidirectional Chamfer distance, is defined as:

C_n(X, Y) = \frac{c_n(X, Y) + c_n(Y, X)}{2}    (25)

Figure 16. (Top row) Vision-based estimation, (middle row) paper-based output, (bottom row) overlay of vision-based estimation on paper output.

Figure 17. Input and output of some sample frames: (left) beginning, (middle) middle, (right) end.
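A direct sketch of equations (23)-(25), assuming NumPy arrays of 2D edge-point coordinates; the alignment search over translation, rotation and scale in equation (26) is not included.

```python
import numpy as np

def chamfer(X, Y):
    """Directed Chamfer distance c(X, Y) of equation (23)."""
    # pairwise distances between every point of X and every point of Y
    diff = X[:, None, :] - Y[None, :, :]
    return np.mean(np.min(np.linalg.norm(diff, axis=2), axis=1))

def bidirectional_chamfer(X, Y):
    """Normalized bidirectional Chamfer distance of equations (24)-(25)."""
    def normalized(a, b):
        # normalize by the shape diameter (largest pairwise distance in a)
        diam = np.max(np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2))
        return chamfer(a, b) / diam
    return 0.5 * (normalized(X, Y) + normalized(Y, X))
```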

As the two shapes extracted from the two different methods might be in different positions, orientations and scales, we need to find the best translation vector t = (t_x, t_y), rotation angle \theta and scaling factors s = (s_x, s_y) which minimize the distance and thereby maximize the similarity of the two shapes:

C_{ns} = \min_{t, \theta, s} C_n(TX, Y)    (26)

The distance measures C_{ns} for the letters shown in figure 16 are listed in Table II.

TABLE II. CHAMFER DISTANCE RESULTS

Letter   C_{ns}    \Delta\psi (degrees)   \Delta\theta (degrees)
M        0.0081    -3.90                  0.72
R        0.0140    3.19                   0.43
S        0.0147    -1.78                  4.39
Z        0.0062    -1.98                  0.95

We employ a second measure for the accuracy of the estimation: the orientation of the paper in 3-D is estimated through 4 marker points (see Figure 17) drawn on the paper and is then compared with the orientation of the estimated plane (i.e., the final fitted plane to the set of reference points). The orientation angles yaw and pitch for the two planes, denoted as \psi and \theta respectively, are computed from the 3-D plane equation (22) as:

\psi = \tan^{-1}\!\left(\frac{\alpha_3}{\sqrt{1 + \alpha_2^2}}\right),  \theta = \tan^{-1}(-\alpha_2)    (27)

Table II shows the difference of the orientation angles for the two planes in degrees.

We applied our virtual writing method to all of the English alphabet letters as well as digits in a continuous writing process (i.e., writing consecutively from A to Z) in space. The quality of the results was good and we anticipate that an OCR algorithm which recognizes handwritten letters could convert the images into coded characters. Figure 18 shows the output of the program for English letters.

Figure 18. Output of the program for English letters.

Figure 19. A sample frame of the sequence of drawing a face, showing how the user gets real-time visual feedback.

Next, we illustrate drawing a multi-segment face enhanced by the real-time feedback to the user, as explained in section VIII-D. The user is also provided with three virtual buttons so he can choose the pen color by moving his hand to the area of the buttons added to the output board. Figure 19 shows a sample frame and the output drawn face. It also shows how the user obtains real-time feedback from the system through the monitor. In fact, he can see the live images taken by the cameras as well as the current state of the output board. He also observes the color buttons on the left side of the output board so he can select the pen color.

Our virtual drawing system requires minimal cooperation from the user. However, we have not conducted user studies to assess the performance and fatigue that may occur in long-term use.

C. Virtual Marble Game

Figure 20. Sample frames of the virtual marble game.

Our implementation of the virtual marble game estimates the absolute orientation of the hand at each frame and applies it to the model, as shown in figure 20. To make the game more intuitive to the user, the initial frame is considered as a reference, so that at each frame the model is rotated by the difference between the orientations in the current frame and the reference frame. Tracking and visual feedback at a rate of about 10 frames per second enable the user to see the current state, decide, and tilt the hand to navigate the ball comfortably. To make the game more attractive, physical parameters such as bouncing as a result of collision and inertia could also be modelled.
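As an illustration of how the relative hand orientation might drive the ball, the following sketch assumes yaw and pitch differences (in radians) with respect to the reference frame and a simple gravity-plus-friction model; maze walls, collisions and bouncing are omitted.

```python
import numpy as np

def step_ball(pos, vel, d_yaw, d_pitch, dt=0.1, g=9.81, friction=0.2):
    """Advance the virtual ball one frame on a board tilted by the hand.

    pos, vel: 2-D position and velocity of the ball on the board.
    d_yaw, d_pitch: hand orientation relative to the reference frame.
    """
    # board tilt converts gravity into an in-plane acceleration
    acc = g * np.array([np.sin(d_yaw), np.sin(d_pitch)]) - friction * np.asarray(vel)
    vel = np.asarray(vel) + acc * dt
    pos = np.asarray(pos) + vel * dt
    return pos, vel
```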

Figure 21. Sample maze maps for the virtual marble game.

The flexibility of this virtual game comes from the ability to change the map of the maze easily. Our system modifies it using a random maze generator. Figure 21 shows a few sample maze maps. We can also manipulate the coefficient of friction to adjust the level of difficulty of the game and make the navigation more challenging. This friction parameter cannot be easily changed in the physical world.

D. 3-D Construction

We ran an experiment to illustrate the accuracy of the measurements in 3-D drawing where we track the hand as it traverses the edges of a real object (see figure 4). The user traces three orthogonal edges of a box and the system tracks the hand and models its intended motion. A few parameters calculated from hand tracking in the sequence were compared with the ground truth measured from the actual box and the results are summarized in Table III. The measured parameters include the angle between the two planes p_1 and p_2, the angles between the lines l_1 and l_2 and the lines l_2 and l_3, and the lengths of the lines l_1, l_2 and l_3, as defined in figure 22. As indicated, the relative errors are small and are mostly due to camera calibration inaccuracy as well as shaking of the hand holding the box.

TABLE III. PARAMETER COMPARISON BETWEEN THE HAND TRACKING APPROACH AND ACTUAL BOX MEASUREMENTS

Parameter        Nominal Val.   Measured Val.   Rel. Error
Angle(p_1, p_2)  90°            88.25°          1.94%
Angle(l_1, l_2)  90°            93.19°          3.33%
Angle(l_2, l_3)  90°            92.95°          2.22%
Length(l_1)      238 mm         208 mm          12.18%
Length(l_2)      132 mm         127 mm          3.79%
Length(l_3)      120 mm         106 mm          10.83%

Figure 22. The tracked box and the defined measurement parameters.

X. SUMMARY

We proposed a set of applications and associated algorithms for using the hand as an interface for drawing and control. These applications included virtual writing and drawing in space, a virtual marble game, and 3-D object construction. Using stereo cameras, a sequence of image pairs is acquired and analyzed to estimate the position and orientation of the hand in 3-D. We estimate the disparity map and motion field and model them to reduce the impact of the low-textured hand and of noise. Planar modelling of the hand requires the disparity values to reside on a plane too. A plane in motion defines a quadratic model for the motion field, where the model parameters are estimated using robust estimation and adjusted to comply with the disparity model. Tracking the trajectory of the hand in space provides sufficient information for the applications. Many more applications could also be developed based on the method presented.

REFERENCES

[1] L. Bretzner and T. Lindeberg, "Use your hand as a 3-d mouse ...," European Conference on Computer Vision, 1998.
[2] J. J. Kuch and T. S. Huang, "Vision based hand modeling and tracking for virtual teleconferencing and telecollaboration," International Conference on Computer Vision and Pattern Recognition, pp. 666–671, 1995.
[3] N. Jojic, B. Brumitt, B. Meyers, S. Harris, and T. Huang, "Detection and estimation of pointing gestures in dense disparity maps," International Conference on Automatic Face and Gesture Recognition, 2000.
[4] Y. Cui and J. Weng, "A learning-based prediction-and-verification segmentation scheme for hand sign image sequence," IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
[5] Z. Zhang, Y. Wu, Y. Shan, and S. Shafer, "Visual panel: Virtual mouse, keyboard and 3d controller with an ordinary piece of paper," Workshop on Perceptive User Interfaces, 2001.
[6] Y. Nam and K. Wohn, "Recognition of space-time hand-gestures using hidden markov model," ACM Symposium on Virtual Reality Software and Technology, 1996.
[7] K. Abe, H. Saito, and S. Ozawa, "3d drawing system via hand motion recognition from two cameras," Proceedings of the 6th Korea-Japan Joint Workshop on Computer Vision, pp. 138–143, January 2000.
[8] S. Wang, X. Xiong, Y. Xu, C. Wang, W. Zhang, X. Dai, and D. Zhang, "Face-tracking as an augmented input in video games: enhancing presence, role-playing and control," in CHI '06: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM Press, 2006.
[9] T. Konrad, D. Demirdjian, and T. Darrell, "Gesture + play: full-body interaction for virtual environments," in CHI '03: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM Press, 2003.
[10] X. Yin, D. Guo, and M. Xie, "Hand image segmentation using color and rce neural network," IJRAS, vol. 34, pp. 235–250, March 2001.
[11] V. Vezhnevets, V. Sazonov, and A. Andreeva, "A survey on pixel-based skin color detection techniques," Proc. Graphicon-2003, pp. 85–92, September 2003.
[12] P. J. Huber, Robust Statistics. John Wiley and Sons, 1981.
[13] J. Fox, Robust Regression: Appendix to An R and S-PLUS Companion to Applied Regression. SAGE Publications, 2002.
[14] S. Geman and D. E. McClure, "Statistical methods for tomographic image reconstruction," Proc. of the 46th Session of the ISI, Bulletin of the ISI, vol. 52, pp. 5–21, 1987.
[15] M. J. Black and P. Anandan, "The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields," CVIU, vol. 63, no. 1, pp. 75–104, Jan 1996.
[16] E. Trucco and A. Verri, Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[17] A. K. Jain, Fundamentals of Digital Image Processing. Prentice Hall, 1989.

Afshin Sepehri is a Ph.D. candidate of Electrical Engineering at the University of Maryland at College Park, USA. He received his M.S. and B.S. degrees in Machine Intelligence and Robotics, and Computer Engineering from the University of Tehran, Iran, in 1998 and 1996, respectively. He was a faculty member and lecturer at the Azad University and University of Tehran, Iran, 1998-2001. His research interests include multiple-view computer vision and human body tracking.

Yaser Yacoob received his Ph.D. in Computer Science from the University of Maryland in 1994. Since then he has been a research scientist in the Institute for Advanced Computer Studies. He served as a program committee member on several workshops and computer vision conferences in recent years. His research is focused on detection, tracking and recognition of face expressions, lip reading and body part tracking.

Larry S. Davis received his B.A. from Colgate University in 1970 and his M.S. and Ph.D. in Computer Science from the University of Maryland in 1974 and 1976 respectively. From 1977-1981 he was an Assistant Professor in the Department of Computer Science at the University of Texas, Austin. He returned to the University of Maryland as an Associate Professor in 1981. From 1985-1994 he was the Director of the University of Maryland Institute for Advanced Computer Studies. He is currently a Professor in the Institute and the Computer Science Department, as well as Chair of the Computer Science Department. He was named a Fellow of the IEEE in 1997.
