
Estimating Human Body Configurations using Shape Context Matching

Greg Mori and Jitendra Malik
Computer Science Division
University of California at Berkeley
Berkeley, CA 94720
{mori,malik}@cs.berkeley.edu

Abstract

The problem we consider in this paper is to take a single two-dimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in three-dimensional space. The basic approach is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. The test shape is then matched to each stored view, using the technique of shape context matching. Assuming that there is a stored view sufficiently similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then transferred from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are then estimated. We present results of our method on a corpus of human pose data.

1 Introduction

As indicated in Figure 1, the problem we consider in this paper is to take a single two-dimensional image containing a human body, locate the joint positions, and use these to estimate the body configuration and pose in three-dimensional space. Variants include the case of multiple cameras viewing the same human, tracking the body configuration and pose over time from video input, or analogous problems for other articulated objects such as hands, animals or robots. A robust, accurate solution would facilitate many different practical applications; see, e.g., Table 1 in Gavrila’s survey paper [10]. From the perspective of computer vision theory, this problem offers an opportunity to explore a number of different tradeoffs (the role of low-level vs. high-level cues, static vs. dynamic information, 2D vs. 3D analysis, etc.) in a concrete setting where it is relatively easy to quantify success or failure.

There has been considerable previous work on this problem [10].


Figure 1: The goal of this work. (a) Input image. (b) Automatically extracted keypoints. (c) 3D rendering of estimated body configuration. In this paper we present a method to go from (a) to (b) to (c).

Broadly speaking, this work can be categorized into two major classes. The first set of approaches uses a 3D model for estimating the positions of articulated objects. Pioneering work was done by O’Rourke and Badler [18], Hogg [11] and Yamamoto and Koshikawa [25]. Rehg and Kanade [19] track very high-DOF articulated objects such as hands. Bregler and Malik [5] use optical flow measurements from a video sequence to track joint angles of a 3D model of a human, using the product of exponentials representation for the kinematic chain. Kakadiaris and Metaxas [15] use multiple cameras and match occluding contours with projections from a deformable 3D model. Gavrila and Davis [9] is another 3D model-based tracking approach, as is the work of Rohr [20] for tracking walking pedestrians. It should be noted that nearly all of these tracking methods require a hand-initialized first video frame.

The second broad class of approaches does not explicitly work with a 3D model; rather, 2D models trained directly from example images are used. There are several variations on this theme. Baumberg and Hogg [1] use active shape models to track pedestrians. Wren et al. [24] track people as a set of colored blobs. Morris and Rehg [17] describe a 2D scaled prismatic model for human body registration. Ioffe and Forsyth [12] perform low-level processing to


obtain candidate body parts and then use a mixture of trees to infer likely configurations. Song et al. [21] use a similar technique involving feature points and inference on a tree model. Toyama and Blake [23] use 2D exemplars to track people in video sequences. Brand [4] learns a probability distribution over pose and velocity configurations of the moving body and uses it to infer paths in this space.

In this paper we consider the most basic version of the problem: estimating the 3D body configuration based on a single uncalibrated 2D image. The basic idea is to store a number of exemplar 2D views of the human body in a variety of different configurations and viewpoints with respect to the camera. On each of these stored views, the locations of the body joints (left elbow, right knee, etc.) are manually marked and labelled for future use. The test shape is then matched to each stored view, using the shape context matching technique of Belongie, Malik and Puzicha [3]. This technique is based on representing a shape by a set of sample points from the external and internal contours of an object, found using an edge detector. Assuming that there is a stored view “sufficiently” similar in configuration and pose, the correspondence process will succeed. The locations of the body joints are then “transferred” from the exemplar view to the test shape. Given the joint locations, the 3D body configuration and pose are estimated using Taylor’s algorithm [22].

The structure of the paper is as follows. In section 2 we elaborate on the estimation approach described above. We show experimental results in section 3. We discuss the issue of models versus exemplars in section 4. Finally, we conclude in section 5.

2 Estimation Method

In this section we provide the details of the configuration estimation method proposed above. We first obtain a set of boundary sample points from the image. Next, given a set of exemplars extracted from a training set (the method for obtaining exemplars is outlined in Appendix A), we find the best match among the exemplars. We use this match, along with correspondences between boundary points on the test image and the exemplar, to estimate the 2D image positions of 14 keypoints (hands, elbows, shoulders, hips, knees, feet, head and waist) on the test image. These keypoints can then be used to construct an estimate of the 3D body configuration in the test image.
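As an illustration of how these stages fit together, here is a minimal end-to-end sketch in Python. It is ours, not the authors’: the `Exemplar` fields, all function names, and `lift_to_3d` (standing in for Taylor’s method of section 2.3) are assumptions; the helper functions are sketched in the subsections below.

```python
from collections import namedtuple

# A stored exemplar view (field names are our own): its sample points, their
# shape contexts, the 14 labelled keypoints, and per-segment "closer" labels.
Exemplar = namedtuple("Exemplar", "points contexts keypoints closer_labels")

def estimate_configuration(test_image_gray, exemplars):
    """End-to-end sketch: sample edge points, match the test shape against
    every exemplar, transfer keypoints from the best match, lift them to 3D."""
    test_pts = sample_edge_points(test_image_gray)       # section 2.1
    test_sc = shape_contexts(test_pts)
    best_total, best_ex, best_matches = float("inf"), None, None
    for ex in exemplars:
        cost = chi2_cost_matrix(test_sc, ex.contexts)
        matches = match_with_outliers(cost)              # (test_idx, exemplar_idx)
        total = sum(cost[i, j] for i, j in matches)
        if total < best_total:
            best_total, best_ex, best_matches = total, ex, matches
    ex_to_test = [(j, i) for i, j in best_matches]       # exemplar_idx -> test_idx
    keypoints_2d = [transfer_keypoint(k, best_ex.points, test_pts, ex_to_test)
                    for k in best_ex.keypoints]          # section 2.2
    return lift_to_3d(keypoints_2d, best_ex.closer_labels)  # section 2.3 (Taylor [22])
```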

2.1 Matching using Shape Contexts

In our approach, a shape is represented by a discrete set $P = \{p_1, \ldots, p_n\}$, $p_i \in \mathbb{R}^2$, of $n$ points sampled from the internal or external contours on the shape.


Figure 2: Shape contexts. (a,b) Sampled edge points of two shapes. (c) Diagram of the log-polar histogram bins used in computing the shape contexts. We use 5 bins for $\log r$ and 12 bins for $\theta$. (d-f) Example shape contexts for reference samples marked by $\circ$, $\diamond$, $\triangleleft$ in (a,b). Each shape context is a log-polar histogram of the coordinates of the rest of the point set measured using the reference point as the origin. (Dark = large value.) Note the visual similarity of the shape contexts for $\circ$ and $\diamond$, which were computed for relatively similar points on the two shapes. By contrast, the shape context for $\triangleleft$ is quite different.

We first perform Canny edge detection [6] on the image to obtain a set of edge pixels on the contours of the body. We then sample some number of points (around 300 in our experiments) from these edge pixels to use as the sample points for the body. Note that this process will give us not only external, but also internal, contours of the body shape. The internal contours are essential for estimating configurations of self-occluding bodies. See Figure 4(a) for examples of sample points.
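As an illustrative sketch of this sampling step (ours, not the authors’; the Canny thresholds and function names are placeholder choices, assuming OpenCV and NumPy):

```python
import cv2
import numpy as np

def sample_edge_points(image_gray, n_samples=300, low=100, high=200):
    """Sample roughly n_samples points from the Canny edge pixels of an 8-bit
    grayscale image; low/high are illustrative hysteresis thresholds."""
    edges = cv2.Canny(image_gray, low, high)             # binary edge map
    ys, xs = np.nonzero(edges)                           # edge pixel coordinates
    idx = np.random.choice(len(xs), size=min(n_samples, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(float)   # (n, 2) (x, y)
```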

For each point $p_i$ on a given shape, we want to find the “best” matching point $q_j$ on another shape. This is a correspondence problem similar to that in stereopsis. Experience there suggests that matching is easier if one uses a rich local descriptor, since rich descriptors reduce the ambiguity in matching.

The shape context was introduced in [3] to play such a role in shape matching. Consider the set of vectors originating from a point to all other sample points on a shape. These vectors express the configuration of the entire shape relative to the reference point. Obviously, this set of $n - 1$ vectors is a rich description, since as $n$ gets large, the representation of the shape becomes exact.

The full set of vectors as a shape descriptor is much too

2

Page 3: Mori Mecv01

8/13/2019 Mori Mecv01

http://slidepdf.com/reader/full/mori-mecv01 3/8

detailed, since shapes and their sampled representation may vary from one instance to another in a category. The distribution over relative positions is a more robust and compact, yet highly discriminative, descriptor. For a point $p_i$ on the shape, compute a coarse histogram $h_i$ of the relative coordinates of the remaining $n - 1$ points:

$$h_i(k) = \#\{\, q \neq p_i : (q - p_i) \in \text{bin}(k) \,\}$$

This histogram is defined to be the shape context of $p_i$. The descriptor should be more sensitive to differences in nearby pixels, which suggests the use of a log-polar coordinate system. An example is shown in Fig. 2(c). Note that the scale of the bins for $r$ is chosen adaptively, on a per-shape basis. This makes the shape context feature invariant to scaling.

As in [3], we use $\chi^2$ distances between shape contexts as a matching cost between sample points.

We would like a correspondence between sample points on the two shapes that enforces the uniqueness of matches. This leads us to formulate our matching of a test body to an exemplar body as an assignment problem (also known as the weighted bipartite matching problem) [7]. We find an optimal assignment between sample points on the test body and those on the exemplar.

To this end we construct a bipartite graph (Figure 3). The nodes on one side represent sample points on the test body, and those on the other side the sample points on the exemplar. Edge weights between nodes in this bipartite graph represent the costs of matching sample points. Similar sample points will have a low matching cost, dissimilar ones a high matching cost. $\epsilon$-cost outlier nodes are added to the graph to account for occluded points and noise; sample points missing from a shape can be assigned to be outliers for some small cost. We use the assignment problem solver in [14] to find the optimal matching between the sample points of the two bodies.

We compare the test body to all of the exemplars from our training set. The exemplar with the lowest total matching cost is chosen for use in keypoint estimation.
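A minimal sketch of the matching cost and the outlier-padded assignment problem, assuming NumPy and SciPy; SciPy’s Hungarian solver stands in for the Jonker-Volgenant solver of [14], and the outlier cost is an illustrative value:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chi2_cost_matrix(h1, h2, eps=1e-10):
    """Chi-squared distance between every pair of (normalized) histograms."""
    h1 = h1 / (h1.sum(axis=1, keepdims=True) + eps)
    h2 = h2 / (h2.sum(axis=1, keepdims=True) + eps)
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :] + eps
    return 0.5 * (num / den).sum(axis=-1)                # (n1, n2) costs

def match_with_outliers(cost, eps_outlier=0.25):
    """Pad the cost matrix with constant-cost outlier rows/columns so that any
    sample point may be declared an outlier, then solve the assignment."""
    n1, n2 = cost.shape
    padded = np.full((n1 + n2, n1 + n2), eps_outlier)    # outlier entries
    padded[:n1, :n2] = cost                              # real matching costs
    rows, cols = linear_sum_assignment(padded)
    # keep only matches between two real sample points
    return [(i, j) for i, j in zip(rows, cols) if i < n1 and j < n2]
```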

Note that the output of more specific filters, such as face or hand detectors, could easily be incorporated into this framework, since the matching cost between sample points can be measured in many ways.

2.2 Locating Keypoints

The next step is to estimate the 2D image positions of the 14 keypoints (hands, elbows, shoulders, hips, knees, feet, head, waist) on the test body. From the solution to the assignment problem in section 2.1 we have correspondences between sample points (not keypoints) on the test body and the closest exemplar body. In addition, each exemplar from the training set has user-clicked locations of keypoints.


Figure 3: The bipartite graph used to match sample points of two bodies. Only the edges from the first node are shown, for clarity. Each node from $B_1$ is connected to every node from $B_2$. In addition, $\epsilon$-cost outlier nodes are added to either side. These outlier nodes allow us to deal with missing sample points between figures (arising from occlusion and noise).

We would like to use these correspondences and exemplar keypoint locations to estimate the keypoint positions on the test body.

We use a simple method for this estimation process (Figure 4). For each keypoint $k$ on the exemplar we want to estimate its position $\hat{k}$ on the test body. We select a set of sample points $\{e_1, \ldots, e_m\}$ from the exemplar as a support for this keypoint. The solution to the assignment problem gives a corresponding set of points $\{t_1, \ldots, t_m\}$ on the test body. We estimate the best (in the least-squares sense) transformation $T$ that takes $\{e_i\}$ to $\{t_i\}$. We then apply this transformation to the keypoint: $\hat{k} = T(k)$.

In our experiments, the support for a keypoint is defined to be all sample points within a disc of some small radius, and the transformation $T$ is simply a translation. One could alternatively use affine or rigid transformations.
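A minimal sketch of this transfer step under those choices (a disc-shaped support and a pure translation); the radius value and the names are ours:

```python
import numpy as np

def transfer_keypoint(kp, exemplar_pts, test_pts, ex_to_test, radius=20.0):
    """Transfer one exemplar keypoint to the test body. The support is every
    matched exemplar sample point within `radius` (an illustrative value) of
    the keypoint; the least-squares translation is their mean displacement."""
    match_map = dict(ex_to_test)                         # exemplar idx -> test idx
    offsets = [test_pts[match_map[i]] - p
               for i, p in enumerate(exemplar_pts)
               if i in match_map and np.linalg.norm(p - kp) < radius]
    if not offsets:
        return None                                      # no support was matched
    return kp + np.mean(offsets, axis=0)                 # estimated position
```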


2.3 Estimating 3D Configuration

We use Taylor’s method [22] to estimate the 3D configuration of a body given the keypoint position estimates. Taylor’s method works on a single 2D image, taken with an uncalibrated camera.

It assumes that we know:

1. the image coordinates $(u, v)$ of the keypoints;

2. the relative lengths $l$ of the body segments connecting these keypoints;

3. a labelling of the “closer endpoint” for each of these body segments;

4. that we are using a scaled orthographic projection model for the camera.

We can then solve for the 3D configuration of the body up to some ambiguity in scale $s$. The method considers the foreshortening of each body segment to construct the estimate of body configuration. For each pair of body segment endpoints $(X_1, Y_1, Z_1)$ and $(X_2, Y_2, Z_2)$, with relative segment length $l$ and image coordinates $(u_1, v_1)$ and $(u_2, v_2)$, we have the following equations:

$$l^2 = (X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2$$
$$u_1 - u_2 = s(X_1 - X_2), \quad v_1 - v_2 = s(Y_1 - Y_2)$$
$$dZ = Z_1 - Z_2 = \sqrt{l^2 - \frac{(u_1 - u_2)^2 + (v_1 - v_2)^2}{s^2}}$$

To estimate the configuration of a body, we first fix one keypoint as the reference point and then compute the positions of the others with respect to the reference point. Since we are using a scaled orthographic projection model, the $u$ and $v$ coordinates are known up to the scale $s$. All that remains is to compute the relative depths $dZ$ of the segment endpoints. We compute the amount of foreshortening, and use the user-supplied “closer endpoint” labels from the closest matching exemplar to solve for the relative depths.

Moreover, Taylor notes that the minimum scale $s_{min}$ can be estimated from the fact that $dZ$ cannot be complex; each segment requires $s \geq \sqrt{(u_1 - u_2)^2 + (v_1 - v_2)^2}\,/\,l$. This minimum value is a good estimate for the scale, since one of the body segments is often nearly perpendicular to the viewing direction.
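A minimal sketch of this depth computation (ours; the clamping of numerically negative values under the square root is our own safeguard):

```python
import numpy as np

def relative_depth(u1, v1, u2, v2, l, s, first_is_closer):
    """Relative depth of one segment under scaled orthographic projection,
    following Taylor [22]; the sign comes from the "closer endpoint" label."""
    foreshortening = (u1 - u2) ** 2 + (v1 - v2) ** 2
    dZ = np.sqrt(max(l ** 2 - foreshortening / s ** 2, 0.0))  # clamp keeps dZ real
    return dZ if first_is_closer else -dZ

def min_scale(segments):
    """Smallest s for which no segment's dZ is complex, over (u1, v1, u2, v2, l)
    tuples: s >= sqrt((u1 - u2)^2 + (v1 - v2)^2) / l for every segment."""
    return max(np.hypot(u1 - u2, v1 - v2) / l
               for (u1, v1, u2, v2, l) in segments)
```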


Figure 4: Locating keypoints. (a) Sample points on exemplar and test body, with lines showing correspondences. (b) Support for estimating the transformation of the right foot, along with the estimate of its position.


References

[1] A. Baumberg and D. Hogg. Learning flexible models from image sequences. Lecture Notes in Computer Science, 800:299–308, 1994.
[2] S. Belongie, J. Malik, and J. Puzicha. Shape context: A new descriptor for shape matching and object recognition. In NIPS, November 2000.
[3] S. Belongie, J. Malik, and J. Puzicha. Matching shapes. In Eighth IEEE International Conference on Computer Vision, volume 1, pages 454–461, Vancouver, Canada, July 2001.
[4] M. Brand. Shadow puppetry. Proc. 7th Int. Conf. Computer Vision, pages 1237–1244, 1999.
[5] C. Bregler and J. Malik. Tracking people with twists and exponential maps. Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 8–15, 1998.
[6] J. Canny. A computational approach to edge detection. IEEE Trans. PAMI, 8:679–698, 1986.
[7] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1990.
[8] H. Eguchi. Moving Pose 1223. Bijutsu Shuppan-sha, 1995.
[9] D. Gavrila and L. Davis. 3D model-based tracking of humans in action: A multi-view approach. Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 73–80, 1996.
[10] D. M. Gavrila. The visual analysis of human movement: A survey. Computer Vision and Image Understanding: CVIU, 73(1):82–98, 1999.
[11] D. Hogg. Model-based vision: A program to see a walking person. Image and Vision Computing, 1(1):5–20, 1983.
[12] S. Ioffe and D. Forsyth. Human tracking with mixtures of trees. In Proc. 8th Int. Conf. Computer Vision, volume 1, pages 690–695, 2001.
[13] N. Jojic, P. Simard, B. Frey, and D. Heckerman. Separating appearance from deformation. Proc. 8th Int. Conf. Computer Vision, 2:288–294, 2001.
[14] R. Jonker and A. Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing, 38:325–340, 1987.
[15] I. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion. IEEE Trans. PAMI, 22(12):1453–1459, 2000.
[16] G. Mori, S. Belongie, and J. Malik. Shape contexts enable efficient retrieval of similar shapes. To appear at CVPR 2001.
[17] D. Morris and J. Rehg. Singularity analysis for articulated object tracking. In Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 289–296, 1998.
[18] J. O’Rourke and N. Badler. Model-based image analysis of human motion using constraint propagation. IEEE Trans. PAMI, 2(6):522–536, 1980.
[19] J. M. Rehg and T. Kanade. Visual tracking of high DOF articulated structures: An application to human hand tracking. Lecture Notes in Computer Science, 800:35–46, 1994.
[20] K. Rohr. Incremental recognition of pedestrians from image sequences. In CVPR93, pages 8–13, 1993.
[21] Y. Song, L. Goncalves, and P. Perona. Monocular perception of biological motion: clutter and partial occlusion. In Proc. 6th Europ. Conf. Comput. Vision, 2000.
[22] C. J. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding (CVIU), 80:349–363, 2000.
[23] K. Toyama and A. Blake. Probabilistic exemplar-based tracking in a metric space. In Proc. 8th Int. Conf. Computer Vision, volume 2, pages 50–57, 2001.
[24] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. PAMI, 19(7):780–785, July 1997.
[25] M. Yamamoto and K. Koshikawa. Human motion analysis based on a robot arm model. Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recogn., pages 664–665, 1991.



Figure 5: Example renderings. (a) Original image with located keypoints. (b) 3D rendering (green is left, red is right).


Figure 6: Example renderings. (a) Original image with located keypoints. (b) 3D rendering (green is left, red is right).



Figure 7: Example renderings. (a) Original image with located keypoints. (b) 3D rendering (green is left, red is right).


Figure 8: Distributions of error in 2D location of keypoints. (a) Hands, (b) Feet, (c) Elbows, (d) Knees, (e) Shoulders, (f) Hips, (g) Head, (h) Waist. Error (x-axis) is measured in pixels; the y-axis shows the fraction of keypoints in each bin. The average image size is 380 by 205 pixels. Large errors in position are due to ambiguities regarding left and right limbs.
