
Exemplar Fine-Tuning for 3D Human Pose FittingTowards In-the-Wild 3D Human Pose Estimation

Hanbyul Joo Natalia Neverova Andrea Vedaldi

Facebook AI Research

Abstract

We propose a method for building large collections of human poses with full 3D annotations captured ‘in the wild’, for which specialized capture equipment cannot be used. We start with a dataset with 2D keypoint annotations, such as COCO and MPII, and generate corresponding 3D poses. This is done via Exemplar Fine-Tuning (EFT), a new method to fit a 3D parametric model to 2D keypoints. EFT is accurate and can exploit a data-driven pose prior to resolve the depth reconstruction ambiguity that comes from using only 2D observations as input. We use EFT to augment these large in-the-wild datasets with plausible and accurate 3D pose annotations. We then use this data to strongly supervise a 3D pose regression network, achieving state-of-the-art results on standard benchmarks, including those collected outdoors. This network also achieves unprecedented 3D pose estimation quality on extremely challenging Internet videos.

1. Introduction

In order to fully understand human actions and behaviors, machines should reconstruct the pose of humans in 3D. Recently, there have been noticeable advances in 2D human pose recognition [9, 8, 52, 36, 55] by means of convolutional neural networks supervised with large, realistic datasets annotated with 2D keypoints [29, 4, 19, 3]. However, detecting 2D keypoints is insufficient to fully understand human motion; in fact, the resulting representation depends on the camera viewpoint and fails to capture motions that occur in the depth direction. Estimating the pose of humans in 3D from a single image has also advanced significantly [49, 50, 33, 41, 40, 37, 43, 22, 26, 53, 10] by means of datasets with full 3D annotations. However, such datasets are mostly limited to indoor/lab conditions; thus, authors have combined them with in-the-wild¹ datasets that contain 2D keypoint annotations. The latter only provide a

¹ By ‘in the wild’ we mean that images are obtained from unconstrained sources such as the Internet.

Figure 1: Our goal is to fit a parametric 3D human model to existing 2D annotations to build a large-scale dataset of 3D human poses in the wild. (Left) An example 2D annotation [19]; (Middle) 3D fitting result by our Exemplar Fine-Tuning (EFT) method; (Right) The same fit from a side view.

weak supervisory signal for 3D pose estimation, but enhance the diversity and realism of the training images, improving the robustness and generality of the learned models. Still, the expectation is that better performance could be obtained by means of a large dataset of images collected in the wild with full 3D pose annotations.

In this paper, we thus consider the problem of building such a dataset. Specifically, there already exist large-scale in-the-wild datasets with 2D pose annotations (e.g. COCO [29], MPII [4], LSP [18, 19], and Posetrack [3]). Extending these annotations to 3D can provide an important breakthrough in 3D pose estimation. However, even for an expert human annotator it is often difficult to estimate the pose of a human in 3D from a single image; furthermore, inputting this information in an annotation tool is difficult and time consuming. Hence, we develop an algorithm that can automatically lift the existing 2D keypoint annotations to full 3D ones, as shown in Fig. 1. Even though the 2D keypoints are known, this is a difficult problem due to the depth ambiguity [6, 27]. We wish our algorithm to achieve two goals: (1) it must predict 3D poses that feel natural and realistic despite the depth ambiguity, and (2) the 3D poses must be consistent with the 2D keypoint annotations, re-projecting accurately onto them.

We solve this problem by introducing a new, effective technique, Exemplar Fine-Tuning (EFT), that can robustly


arXiv:2004.03686v1 [cs.CV] 7 Apr 2020


fit a 3D parametric model of humans [30] to 2D keypoint annotations. EFT takes a fully-trained 3D pose regressor, learned to map 2D inputs (raw images or 2D keypoints) to the parameters of the 3D model, and fine-tunes the regressor on each test example separately. This approach differs from the traditional method of fitting the parametric model [30, 38, 21] to the 2D annotations via optimization [6, 27]. In optimization, in fact, the algorithm directly updates the parameters of the 3D model to achieve a better fit of the 2D data, minimizing the re-projection error. EFT also minimizes the re-projection error, but does so by fine-tuning the neural network regressor that generates the parameters of the 3D model instead of changing the parameters directly.

EFT also differs from the way neural network regressors are usually learned for the purpose of 3D pose estimation, since its goal is to obtain the best possible fit of the 3D model to each input example rather than tuning the neural network to generalize well to unseen data. In fact, EFT uses a pre-trained 3D pose regressor neural network as a pose prior for the purpose of fitting the parameters of the 3D model to a single example. The motivation for doing so is that the neural network regressor is trained first by using a large dataset of poses with 3D ground truth supervision [50, 44, 37, 22, 23, 25, 53] and, in the process, it can learn a better pose prior than the one captured by the parametric models themselves and used in traditional optimization-based approaches. An important advantage of the neural network is that, by definition, it learns a prior conditional on the observation of the 2D inputs; hence, this conditional prior should be more useful in predicting relevant 3D poses than the generic pose distributions [2] used in previous approaches [6, 27]. We support this hypothesis empirically, via numerous experiments, showing that EFT is much more robust in fitting a 3D model to challenging 2D pose annotations than traditional methods [6, 27], producing more plausible results while being less sensitive to the initialization and more robust to missing 2D keypoints.

Finally, we use EFT to augment standard in-the-wild 2D human pose datasets such as COCO and MPII with full 3D annotations, obtaining about 100K samples paired with corresponding 3D pose annotations. We use this data and these annotations to train a 3D pose regressor that outperforms the state of the art on standard outdoor 3D human pose benchmarks [51]. We also test on challenging real-world scenarios containing multiple people who are potentially cropped and heavily occluded. Furthermore, we use our 3D dataset together with augmentations that simulate occlusions to learn a pose regressor. We obtain in this manner unprecedented 3D pose estimation performance on extremely challenging Internet videos. Our dataset will be made publicly available.

2. Related Work

There has been significant progress in human pose estimation in the last few years. Initial advances focused on 2D pose recognition [9, 8, 52, 36, 55, 48], learning deep neural networks from large in-the-wild datasets that contain 2D keypoint annotations. Improving 2D pose recognition has in turn facilitated the more challenging task of 3D human pose estimation [49, 50, 33, 41, 40, 37, 43, 22, 26, 53, 10]. In this section, we focus on the latter task.

Single-Image 3D Human Pose Estimation. Reconstructing the pose of humans in 3D from single views is an ill-posed problem, as depth cannot be recovered uniquely. In order to reduce this ambiguity, algorithms must use a prior model of the human body and of its likely poses. Methods differ in how they incorporate the prior and in how they perform the prediction.

Optimization-based methods assume a 3D body model such as SMPL [30] or SCAPE [5], and use an optimization algorithm to fit it to the 2D observations. While early approaches [11, 46] required manual input, starting with SMPLify [6] the process has been fully automated, then improved in [27] to use silhouette annotations, and eventually extended to multiple views and multiple people [59, 57].

Regression-based methods, on the other hand, predict the 3D pose directly. The work of [45] uses sparse linear regression that incorporates a tractable but somewhat weak pose prior. Later approaches use deep neural networks instead, and differ mainly in the nature of their inputs and outputs [49, 50, 33, 41, 40, 37, 43, 22, 26, 53, 10]. Some works start from a pre-detected 2D skeleton (e.g. [33]), while others start from raw images (e.g. [22]). Using a 2D skeleton relies on the quality of the underlying 2D keypoint detector and discards appearance details that could help fit the 3D model to the image. Using raw images can potentially make use of this information, but models trained on current 3D indoor datasets might fail to generalize to unconstrained images. Hence, several papers combine 3D indoor datasets with 2D in-the-wild ones [22, 54, 26, 50, 44, 37]. Methods also differ in their output: some predict 3D keypoints directly [33], some predict the parameters of a 3D human body model [22, 54], and others predict volumetric heatmaps for the body joints [42].

Finally, hybrid methods such as SPIN [38] or MTC [53] combine optimization and deep regression approaches.

3D Without 3D Image Annotations. Among the various reconstruction approaches, of particular interest for our work are methods that can perform 3D reconstruction without having to rely on images annotated in 3D. Optimization-based methods such as SMPLify rely on parametric human models and only need to predict a few model parameters.


As a downside, these methods cannot guarantee that the estimated pose is valid, as they are only concerned with the minimization of the 2D re-projection error. In practice, the output quality depends largely on the initialization, and the methods often fail for challenging poses (e.g., see Fig. 3, left panel).

The manifold of human body poses can be described empirically, by collecting a large number of samples in laboratory conditions [15, 2, 38], but this may fail to properly model the plausibility of different poses.

Regression-based methods [22, 59] can also be learned without requiring images with 3D annotations, by combining 2D datasets with a parametric 3D model and empirical pose samples, integrating them into the neural network regressor by means of adversarial training. However, while the predictions obtained by such methods are plausible, they often do not fit the 2D data very accurately. Fitting could be improved by refining this initial solution by means of an optimization-based method, as in SPIN [25], but empirically we have found that this distorts the pose once again, leading to solutions that are no longer plausible.

Human Pose Datasets. There are several in-the-wild datasets with sparse 2D pose annotations, including COCO [29], MPII [4], the Leeds Sports Pose Dataset (LSP) [18, 19], PennAction [58], and Posetrack [3]. Furthermore, DensePose [13] has introduced a dataset with dense surface point annotations, mapping images to a UV representation of a parametric 3D human model [30]. Compared to annotating 2D keypoints, annotating 3D human poses is much more challenging, as there are no easy-to-use or intuitive tools to input the annotations. Hence, current 3D annotations are mostly obtained by means of motion capture systems in indoor environments. Examples include the Human3.6M dataset [17], HumanEva [47], Panoptic Studio [20], and MPI-INF-3DHP [35]. These datasets provide 3D motion capture data paired with 2D images, but the images are very controlled. There exists an approach to produce a dataset with 3D pose annotations on Internet photos [27]; however, the traditional optimization-based fitting method used in this work limits the quality and size of the dataset. There are also several large-scale motion capture datasets that have no corresponding images at all (e.g. CMU Mocap [1] and KIT [32]). These motion capture datasets have recently been reissued in a unified format in the AMASS dataset [31].

3. Preliminaries

3.1. Parametric 3D Models of Humans

Parametric human models [5, 30, 21, 39] can represent a variety of human shapes and poses by means of a small number of parameters while capturing important properties such as left-right symmetry and limb proportions. In this paper, we consider in particular the SMPL model [30] and the problem of fitting it to an image of a target human subject. The SMPL parameters Θ = (θ, β) comprise the pose parameters θ ∈ R^(24×3), which control the rotations of the 24 body joints with respect to their parent joints, and the shape parameters β ∈ R^10, which control the body shape via 10 principal directions of variation learned using PCA. The joint locations in the rest pose of the SMPL model are regressed from the vertices after applying the shape deformation by β, and the final joint locations and the posed mesh vertices are obtained via the transformations computed following the skeletal hierarchy. As we focus on the location of the 3D body joints, we write SMPL as the function

J = M(Θ), (1)

where J ∈ R^(24×3) are the 3D locations of the 24 joints.
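The interface above can be sketched as a black-box function. The stub below is NOT the real SMPL implementation (which poses a template mesh via learned blend shapes, a joint regressor, and linear blend skinning); it only mirrors the parameter and output shapes stated in the text: θ has 24×3 axis-angle values, β has 10 PCA shape coefficients, and the output J has 24 3D joint locations.

```python
# Stub mirroring the SMPL interface of Eq. (1), J = M(Theta).
# Placeholder only: illustrates shapes, not the actual mesh deformation.

NUM_JOINTS = 24
POSE_DIM = NUM_JOINTS * 3   # theta: one axis-angle rotation per joint
SHAPE_DIM = 10              # beta: PCA shape coefficients

def smpl_joints(theta, beta):
    """Stand-in for M(Theta): maps (theta, beta) to 24 3D joint locations."""
    assert len(theta) == POSE_DIM and len(beta) == SHAPE_DIM
    # A real model would pose the template mesh and regress joints from it.
    return [(0.0, 0.0, 0.0) for _ in range(NUM_JOINTS)]

joints = smpl_joints([0.0] * POSE_DIM, [0.0] * SHAPE_DIM)
print(len(joints))  # 24
```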

3.2. Optimizing vs Regressing 3D Pose

Given an image I of a human, the goal is to find the parameters Θ of SMPL that match the pose of the subject. Next, we discuss two classes of methods to solve this problem: optimization-based and regression-based.

Optimization-based approaches [6, 27] extract 2D cues from the image, such as joints, silhouettes, and part labels, and optimize the model parameters to fit them. In particular, given the 2D locations j ∈ R^(24×2) of the body joints, one solves the problem

Θ*, π* = argmin_{Θ,π} L2D(π(M(Θ)), j) + Lprior(Θ), (2)

where M(·) is the SMPL model function of Eq. 1, π is the camera projection function (often jointly optimized) that maps the 3D joints to their 2D locations, and L2D is the re-projection error between these and the input 2D locations. Due to the depth ambiguity, optimizing L2D is insufficient to recover the pose parameters Θ uniquely; to reduce this ambiguity, one adds the prior term Lprior to the loss, favoring plausible solutions. In SMPL, priors for both the shape and pose parameters are provided. The prior is often learned by means of a separate 3D dataset [1, 2].
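The mechanics of this objective can be illustrated with a dependency-free toy: gradient descent directly on the model parameters, trading off a re-projection term against a prior pulling the parameters toward a mean pose. The "model" and camera here are deliberately trivial stand-ins, not SMPL or a learned pose prior.

```python
# Toy version of Eq. (2): minimize L2D + Lprior by direct gradient descent
# on the model parameters. The "model" is a 3D point whose x,y coordinates
# ARE the parameters; real methods optimize SMPL pose/shape instead.

def project(p3d, scale=1.0):
    """Weak-perspective camera pi: drop depth, apply a scale."""
    return (scale * p3d[0], scale * p3d[1])

def loss(params, target2d, prior_weight=0.1):
    p3d = (params[0], params[1], 1.0)                       # toy "M(Theta)"
    u, v = project(p3d)
    l2d = (u - target2d[0]) ** 2 + (v - target2d[1]) ** 2   # L2D
    lprior = prior_weight * (params[0] ** 2 + params[1] ** 2)  # Lprior
    return l2d + lprior

def grad(f, params, eps=1e-5):
    """Finite-difference gradient, to keep the sketch dependency-free."""
    g = []
    for i in range(len(params)):
        hi = list(params); hi[i] += eps
        lo = list(params); lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

params = [0.0, 0.0]
target = (2.0, -1.0)
for _ in range(200):                        # direct parameter updates
    g = grad(lambda p: loss(p, target), params)
    params = [p - 0.1 * gi for p, gi in zip(params, g)]
print(params)
```

Note how the prior biases the answer: the fit converges near target/(1 + prior_weight) rather than exactly on the 2D target, which is precisely the "too strong a prior pulls toward a mean pose" trade-off discussed below.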

3D models such as SMPL have enough degrees of freedom that it is generally possible to fit the 2D annotations accurately.² However, since the optimization is only local, the final 3D output depends strongly on the quality of the initialization. Furthermore, it is difficult to balance the re-projection loss with the prior: too strong a prior may force the model to output a “mean” pose ignoring the 2D evidence, and too weak a prior may result in implausible or distorted poses. Local minima, furthermore, may require breaking the optimization into multiple steps [6, 21], first aligning the torso by a rigid transform and then optimizing the limbs.

² This may not be true for children or heavily-clothed people, as these are not represented in the dataset used to construct SMPL.


Failures in the first step lead to catastrophic failure, as shown in the left panel of Fig. 3.

Regression-based approaches predict the SMPL parameters Θ directly from the 2D cues I, which may be raw images [22], 2D joints [33], or even dense 2D points [56]. The mapping is implemented by a neural network Θ = Φ(I) trained by means of a large dataset, often obtained by combining 3D indoor datasets [17, 34] and 2D in-the-wild ones [29, 4]. Training optimizes the loss function:

Φ* = argmin_Φ (1/N) Σ_{i=1}^{N} [ L2D(π(M(Φ(Ii))), ji) + µi LJ(M(Φ(Ii)), Ji) + τi LΘ(Φ(Ii), Θi) ]. (3)

This loss combines the 2D re-projection loss L2D, the 3D joint reconstruction loss LJ, and the SMPL parameter reconstruction loss LΘ, where ji, Ji, and Θi are, respectively, the ground-truth 2D joints, 3D joints, and SMPL parameters for the i-th training sample Ii. µi and τi are loss-balancing coefficients, and they can be set to zero for samples that do not have 3D annotations. The parameters of the camera projection function π can be predicted as additional outputs of the neural network Φ [22, 23, 56, 25].
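The role of the per-sample coefficients can be made concrete with a small sketch. The per-sample loss values below are placeholders, not real data; the point is only that µi and τi switch the 3D terms off for 2D-only samples.

```python
# Sketch of the mixed-supervision loss in Eq. (3): every sample contributes
# a 2D re-projection term, while the 3D joint and SMPL-parameter terms are
# zeroed out (mu_i = tau_i = 0) for samples carrying only 2D annotations.

def total_loss(samples):
    """samples: list of dicts with precomputed l2d/lj/ltheta loss terms."""
    total = 0.0
    for s in samples:
        mu = 1.0 if s["has_3d"] else 0.0    # gates the 3D joint loss
        tau = 1.0 if s["has_3d"] else 0.0   # gates the SMPL-parameter loss
        total += s["l2d"] + mu * s["lj"] + tau * s["ltheta"]
    return total / len(samples)

batch = [
    {"l2d": 0.5, "lj": 0.2, "ltheta": 0.1, "has_3d": True},   # e.g. indoor 3D data
    {"l2d": 0.8, "lj": 9.9, "ltheta": 9.9, "has_3d": False},  # e.g. 2D-only COCO
]
print(total_loss(batch))  # (0.5 + 0.2 + 0.1 + 0.8) / 2 = 0.8
```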

In most cases, the goal of learning-based approaches is to train a model that generalizes well to unseen data, requiring a strict separation of training and test data for their evaluation. In our case, however, we wish to lift 2D annotations to 3D, so the two sets coincide. Specifically, we learn a pose prior by combining 3D and 2D data, and this process is the same as learning a neural network regressor by combining 3D datasets with 2D in-the-wild ones [22, 23, 59, 25]. Here, however, the 2D in-the-wild datasets contain the samples that we wish to annotate in 3D. By optimizing the neural network on the combined data, we do obtain an estimate of the 3D pose for all samples, including the ones annotated in 2D; however, these 3D poses may not align with the 2D annotations very accurately, as they are generated by a feed-forward neural network regressor. This problem is solved in the next section.

4. Exemplar Fine-Tuning

We introduce Exemplar Fine-Tuning, a new approach to fit the parameters of a 3D human model to the 2D locations of its joints. Unlike traditional optimization-based fitting methods [7] that change the model parameters directly, fitting is obtained by fine-tuning a standard 3D pose regressor network [22, 25] on individual test samples (see Fig. 2). The aim is to build on the strong 3D pose prior that the network implicitly learns as it is trained on a combination of datasets containing 3D and 2D annotations, including the samples whose annotations we wish to lift from 2D to 3D.

Figure 2: Our EFT fine-tunes a 3D pose regression neural network to fit the parametric 3D model to the target image, while traditional fitting methods directly optimize the parameters of the parametric 3D model. EFT leverages the conditional pose prior learned by the pose regressor, while traditional fitting methods use a fixed pose prior produced from a separate 3D dataset.

In more detail, EFT optimizes the neural network model Φ by minimizing the exemplar loss:

Φ*t = argmin_{Φt} L2D(π(M(Φt(It))), jt) + λ‖β‖₂², (4)

where It is the current target image and the second term is a regularizer to control the shape parameters. Eq. (4) assumes that the only cues available are the 2D joint locations, but EFT can be applied to any 2D or 3D cues, including DensePose, 2D face keypoints, or even 3D keypoints when fitting the SMPL model to a 3D skeleton [21]. The final parametric fit Θ*t to the target image It is obtained by evaluating the fine-tuned network Φ*t on the target sample as Θ*t = Φ*t(It).

Like traditional optimization-based methods such as SMPLify, EFT solves an optimization problem for each sample independently, but with a crucial difference: the neural network regressor, and the pose prior embedded in it, are retained during the optimization; in contrast, traditional methods use a network only for initialization [25, 12, 56]. During EFT, we found that the output of the network maintains the plausibility of the 3D human pose while minimizing the re-projection errors, even for samples with bad initialization due to occlusions or unusual body poses (in such cases the standard 3D pose regressor tends to produce less precise outputs).

Implementation Details: As the regressor, we use the state-of-the-art SPIN network of [25]. For each sample, EFT restarts the network from its initial pre-trained state and then optimizes Eq. (4) for 20 iterations using Adam [24] with the default PyTorch parameters and a small learning rate of 10⁻⁶. For fine-tuning, batch normalization and dropout layers are removed.
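The per-sample procedure can be sketched with a toy one-parameter "regressor" (the real method fine-tunes the full SPIN network with Adam at lr = 10⁻⁶ for about 20 iterations; the map, learning rate, and iteration count below are illustrative only). The two essentials are that the weights are restored to the pretrained state before every sample, and that the update is applied to the network, not to Θ directly.

```python
# Toy sketch of Exemplar Fine-Tuning: fine-tune the weights of the regressor
# that PREDICTS Theta, restarting from the pretrained weights per sample.

PRETRAINED_W = 0.5  # stands in for the pretrained regressor weights

def regressor(w, x):
    return w * x  # toy Phi(I): predicts "Theta" from an input feature x

def eft_fit(x, target_theta, lr=0.05, iters=200):
    w = PRETRAINED_W  # restart from the pretrained state for every sample
    for _ in range(iters):
        pred = regressor(w, x)
        dw = 2.0 * (pred - target_theta) * x  # d/dw of the squared 2D error
        w -= lr * dw                          # update the NETWORK, not Theta
    return regressor(w, x)                    # Theta*_t = Phi*_t(I_t)

theta_a = eft_fit(x=1.0, target_theta=2.0)
theta_b = eft_fit(x=1.0, target_theta=-1.0)  # fresh weights, unaffected by a
print(round(theta_a, 3), round(theta_b, 3))  # 2.0 -1.0
```

Because each fit restarts from the same pretrained weights, the two calls are fully independent, mirroring how EFT treats every test example in isolation.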

We found that human annotators are often inaccurate when annotating hips and ankles, and may also confuse the left and right sides of the body. This noise can adversely affect


the quality of our 3D fits: for instance, if a limb appears shorter than it should, the predictor may tilt limbs in 3D space to compensate. Thus, Eq. 4 is modified to ignore hips and ankles, and a term is added to match the 2D orientation of the lower legs instead.
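The exact form of this orientation term is not spelled out here, but one plausible formulation is a cosine penalty between the predicted and annotated lower-leg directions; the sketch below uses that assumption (the unit-vector formulation and joint names are ours, not the paper's).

```python
# Sketch of a 2D limb-orientation loss: penalize direction mismatch of the
# knee -> ankle segment via unit vectors, so noisy ankle POSITIONS (and
# hence limb length) do not distort the fit.
import math

def unit(v):
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def limb_orientation_loss(knee_pred, ankle_pred, knee_gt, ankle_gt):
    """1 - cos(angle) between predicted and annotated lower-leg directions."""
    d_pred = unit((ankle_pred[0] - knee_pred[0], ankle_pred[1] - knee_pred[1]))
    d_gt = unit((ankle_gt[0] - knee_gt[0], ankle_gt[1] - knee_gt[1]))
    cos = d_pred[0] * d_gt[0] + d_pred[1] * d_gt[1]
    return 1.0 - cos

# Same direction but different limb LENGTH incurs no penalty, which is the
# desired behavior when annotated ankle positions are unreliable.
same_dir = limb_orientation_loss((0, 0), (0, 2), (0, 0), (0, 5))
perp = limb_orientation_loss((0, 0), (0, 2), (0, 0), (3, 0))
print(same_dir, perp)  # 0.0 1.0
```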

5. Learning a robust 3D regressor

By using EFT to lift 2D annotations to 3D, we can obtain a large dataset of images in the wild with full 3D annotations, and use the latter to strongly supervise a state-of-the-art neural network regressor for 3D pose estimation such as [25]. An alternative is to use the original 2D annotations for weak supervision, as is done in many papers including [22, 23, 59]. However, we show empirically that using the strong 3D supervisory signal from EFT is much more effective, which is in line with similar observations in the work of [25].

Augmentation by Extreme Cropping: A shortcoming of previous pose estimation methods is that they assume that most of the body is visible in the input image [56, 12, 22, 25]. However, humans captured in real-world videos are often cropped or occluded, so that only the upper body or even just the face may be visible (see Fig. 7). Occlusions dramatically increase the ambiguity of pose reconstruction, as in this case not just the depth, but the whole position of several keypoints is not observable at all. Hence, the quality of the reconstructions depends even more strongly on the quality of the underlying pose prior. Here, we wish to train a model that can handle such difficult cases. We propose to do so by augmenting the training data with extreme cropping. Since we already have full 3D annotations, doing so is straightforward: we only need to randomly crop training samples. We do so by first cropping either the upper body up to the hips, or the face and shoulders up to the elbows, and then further cropping the result using a random bounding box of size equal to 80%-120% of that of the first crop. While the input image is cropped, we retain the full 2D/3D body joint supervision to allow the network to learn to reconstruct the occluded body parts in a plausible manner.
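The two-stage cropping described above can be sketched as follows. The keypoint names and the box arithmetic are illustrative assumptions; the paper specifies only the two crop types (up to the hips, or up to the elbows) and the final random rescaling to 80%-120% of the first crop.

```python
# Sketch of the extreme-cropping augmentation: pick one of two body-part
# crops, then jitter the box size to 80%-120% of the first crop.
import random

def bbox(points):
    xs = [p[0] for p in points]; ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)

def extreme_crop(kp, rng):
    """kp: dict of 2D keypoints. Returns a crop box (x0, y0, x1, y1)."""
    if rng.random() < 0.5:   # upper body, down to the hips
        keep = ["head", "shoulder_l", "shoulder_r", "elbow_l", "elbow_r",
                "hip_l", "hip_r"]
    else:                    # face and shoulders, down to the elbows
        keep = ["head", "shoulder_l", "shoulder_r", "elbow_l", "elbow_r"]
    x0, y0, x1, y1 = bbox([kp[k] for k in keep])
    # further crop: random box sized 80%-120% of the first crop
    s = rng.uniform(0.8, 1.2)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = (x1 - x0) * s / 2, (y1 - y0) * s / 2
    return (cx - w, cy - h, cx + w, cy + h)

rng = random.Random(0)
kp = {"head": (50, 10), "shoulder_l": (30, 40), "shoulder_r": (70, 40),
      "elbow_l": (20, 80), "elbow_r": (80, 80), "hip_l": (40, 150),
      "hip_r": (60, 150)}
box = extreme_crop(kp, rng)
print(box)
```

Only the image is cropped by this box; the full 2D/3D joint annotations are kept as supervision targets, as the paragraph above stresses.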

Learning from Difficult In-the-Wild Samples: Current approaches [22, 25] avoid using difficult samples to learn their neural networks, even if 2D annotations are available for them. In standard datasets such as COCO, samples can be discarded due to heavy occlusion or low resolution. We argue that this data is instead very valuable in order to learn to handle similarly difficult cases at test time. Given our newly-trained model, which is highly robust to occlusions, we can use our EFT method to augment the difficult examples with corresponding 3D annotations. Once the 3D annotations are available, we can retrain our model to improve its robustness, and eventually repeat the process.

However, since the 3D reconstruction of joints that are not visible in the image is very ambiguous, these particular joints (rather than the samples as a whole) are ignored in the training losses. Note also that this is not the case for the cropping augmentation strategy described above, because in that case the 3D reconstructions are obtained before cropping is applied, when all joints are visible.
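The per-joint masking above amounts to averaging the loss over visible joints only, which can be sketched in a few lines (scalar placeholder values; real joints are 2D/3D coordinates):

```python
# Sketch of per-joint loss masking: invisible joints contribute nothing to
# the training loss, while visible joints are supervised normally.

def masked_joint_loss(pred, gt, visible):
    """Mean squared error over visible joints only."""
    terms = [(p - g) ** 2 for p, g, v in zip(pred, gt, visible) if v]
    return sum(terms) / len(terms)

pred = [1.0, 2.0, 3.0]
gt = [1.0, 0.0, 5.0]
visible = [True, False, True]                # joint 1 unobservable -> ignored
print(masked_joint_loss(pred, gt, visible))  # ((0)^2 + (2)^2) / 2 = 2.0
```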

6. Results

We study the ability of EFT to augment existing in-the-wild datasets containing 2D keypoint annotations with corresponding 3D pose annotations. Since by definition this data does not come with ground-truth 3D poses to compare against, we assess EFT in other ways. First, we compare EFT to a traditional optimization-based method qualitatively, by asking human annotators on Amazon Mechanical Turk (AMT) to choose the best fit between EFT and SMPLify [6]. Second, we investigate the effect of fine-tuning on individual samples in EFT, re-evaluating the performance of the fine-tuned networks on a validation set. Third, we apply EFT to public datasets with 2D annotations to generate a large-scale in-the-wild 3D pose dataset and demonstrate the benefit of using this data to train a 3D human pose regressor with full 3D supervision. In particular, we achieve state-of-the-art performance on a standard outdoor 3D human pose benchmark [51]. Finally, we test the performance of this state-of-the-art regressor on challenging real-world videos.

6.1. Data

We briefly summarize the human pose datasets used in our experiments (see also Table 1). There are in-the-wild datasets with manual 2D pose annotations, including COCO [29], MPII [4], and LSP [18, 19]. We apply EFT to all these 2D datasets in order to generate 3D annotations in the form of SMPL model parameters for all their images. There are a few datasets with accurate 3D pose annotations, including H36M [17, 16] and MPI-INF-3DHP [34]. Since a multi-view setup is often required to capture this kind of ground truth, these datasets are collected inside a studio. An exception is the video dataset 3DPW [51], which has 3D ground truth but is captured outdoors. We use only the 3DPW test set, and only for evaluation. To train our final 3D pose regressor, we use the outputs of our EFT fitting on the in-the-wild 2D datasets, together with the “moshed” versions of two 3D datasets, H36M and MPI-INF-3DHP (following [22, 25]). We use all datasets listed above for training the final pose regressor.

6.2. EFT vs. Traditional Model Fitting

We compare EFT’s output with a state-of-the-art model fitting approach, namely SPIN [25]. SPIN applies SMPLify [6] to an initial 3D pose fit obtained via a regression method similar to HMR [22]. During training, SPIN alternates model fitting and network training. Implicitly, this amounts to


3D Dataset     Samples         2D Dataset              Samples
3DPW [51]      35K (Testing)   COCO [29]               79K
H36M [17]      312K            MPII [4]                21K
MPI-INF [34]   96K             LSP (+ Ext.) [18, 19]   10K

Table 1: Summary of public datasets. We use our EFT to generate corresponding 3D poses for all 2D pose datasets.

producing 3D annotations for the data that only has 2D annotations, which is similar to our approach. As we show below, however, EFT is a more effective solution.

Qualitative comparison via AMT: Since there is no ground-truth 3D data available for in-the-wild images, we evaluate the methods qualitatively by means of Amazon Mechanical Turk (AMT). We show human annotators the 3D pose fittings obtained by EFT and SPIN for 500 randomly-chosen images from the MPII, COCO, and LSPet datasets.³

We only use images where all 2D keypoints are visible, as is done in the original SPIN paper. To demonstrate the robustness of EFT in fitting challenging samples, we consider two different sets: (1) the “easy” set with 500 samples where the traditional fitting method tends to be successful, and (2) the “hard” set where such methods tend to fail. The easy and hard samples are determined as in [25], by looking at the shape parameters estimated by SPIN.

Each sample is shown to three different voters on AMT, displaying the input image and 3D renderings of the pose fitted by both methods twice: from the same viewpoint as the image, and from the side. Examples are shown in Fig. 3. The annotators are asked to choose the best fit. Our method obtained 58.4% favorable votes on the easy sample set, and 81.6% on the hard sample set. We found that SPIN suffers from bad initialization, especially for occlusions, challenging body poses, and low image resolution.

EFT on challenging samples: We run EFT on all available samples in the 2D databases, including the challenging samples ignored in previous work [25]. Examples are shown in Fig. 4, where some body parts are not annotated due to severe occlusions or extremely low resolution. The initial output of the off-the-shelf pose regressor [22, 25] tends to be incorrect, providing bad initialization for traditional optimization methods, while EFT can effectively minimize the re-projection errors, maintaining the plausibility of the 3D pose estimate while also producing a reasonable guess for the unconstrained body parts. More examples are shown in our sup. mat.
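The masked re-projection objective that makes this possible can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: un-annotated joints receive zero confidence, so they contribute nothing to the loss and are left to the regressor's learned prior.

```python
import numpy as np

def reprojection_loss(pred_2d, gt_2d, conf):
    """Confidence-weighted 2D keypoint re-projection error.

    pred_2d: (J, 2) projected model joints
    gt_2d:   (J, 2) annotated keypoints
    conf:    (J,) per-joint confidence; 0 marks joints that were not
             annotated (occluded, out of frame), so they are ignored.
    """
    per_joint = np.linalg.norm(pred_2d - gt_2d, axis=1)   # (J,) pixel errors
    return float((conf * per_joint).sum() / max(conf.sum(), 1e-8))
```

Because the loss simply omits missing joints, the only signal constraining them is the pose prior baked into the pretrained regressor, which is why EFT produces plausible guesses for unconstrained parts.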

3 For SPIN, we use the fits published by the authors at https://github.com/nkolot/SPIN.

Figure 3: Examples used for the AMT study. Traditional methods (red) fail for challenging examples, while EFT (blue) fits them well.

Figure 4: EFT results on challenging samples in the COCO dataset with low resolution or occlusions. EFT can still fit them by predicting the unconstrained parts based on the prior learned by the pose regressor.

[Figure 5 plots: Avg. Recon. Error (mm) against Sample Index (0–500), comparing “After EFT (init. by SPIN)”, SPIN, and HMR.]

Figure 5: Testing error changes (blue curves) on 3DPW after applying EFT to each sample, after 20 iterations (left) and 100 iterations (right); (Orange line) Testing error of SPIN [25]; (Green line) Testing error of HMR [22].

6.3. Effect of Fine-Tuning

The core idea of EFT is to fine-tune the neural network regressor on individual samples to improve their 2D fit. However, fine-tuning a neural network with just a single sample may change or break the normal behavior of the network due to overfitting.
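The per-sample procedure can be sketched with a toy stand-in. The real EFT deep-copies a pretrained HMR/SPIN-style network and runs a small number of Adam steps on one sample; here a linear map plays the role of the network and plain gradient descent replaces Adam, so all names and the update rule are illustrative simplifications.

```python
import numpy as np

def exemplar_fine_tune(W, x, gt_2d, conf, num_iters=20, lr=0.25):
    """Toy sketch of Exemplar Fine-Tuning.

    The 'regressor' is a linear map: pred_2d = (W @ x).reshape(J, 2).
    We clone its weights, fine-tune them on ONE sample by minimizing the
    confidence-weighted squared 2D re-projection error, and return the
    final fit; the clone is then discarded, as in EFT.
    """
    W = W.copy()                                  # clone: original untouched
    num_joints = gt_2d.shape[0]
    for _ in range(num_iters):
        pred = (W @ x).reshape(num_joints, 2)
        grad_pred = 2.0 * conf[:, None] * (pred - gt_2d)   # d(loss)/d(pred)
        W -= lr * np.outer(grad_pred.reshape(-1), x)       # chain rule to W
    return (W @ x).reshape(num_joints, 2), W
```

Joints with conf = 0 never generate a gradient, so their prediction is left entirely to the (pretrained) regressor, mirroring how EFT handles occluded annotations.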


Figure 6: Example samples that cause significant changes in the 3DPW testing error of the network after EFT. The left two examples have annotations on occluded body parts, and the rightmost example has incorrect annotations (left-right swap).

Methods                                3DPW (mm)   H36M P-1 (mm)
HMR [22]                               81.3        58.1
Kanazawa et al. [23]                   72.6        56.9
HoloPose [12]                          –           50.5
HoloPose (w/ post-opt.) [12]           –           46.5
SPIN [25]                              59.2        44.3
Ours (our COCO only)                   59.3        69.8
Ours (3D + our wild DB)                56.1        44.0
Ours (3D + our wild DB + crop aug.)    55.7        45.2

Table 2: Quantitative evaluation on 3DPW and H36M Protocol-1. Reconstruction errors are reported in mm after rigid-transform alignment.
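The reconstruction error after rigid alignment is commonly computed via a Procrustes (similarity-transform) fit before averaging per-joint distances; a minimal sketch follows, which may differ in detail from the paper's exact evaluation code.

```python
import numpy as np

def recon_error(pred, gt):
    """Mean per-joint error after optimal rigid (similarity) alignment.

    pred, gt: (J, 3) joint positions; if inputs are in mm, the result is
    in mm. Finds the scale, rotation, and translation that best align
    pred to gt (Umeyama/Procrustes), then averages joint distances.
    """
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g                 # center both point sets
    U, S, Vt = np.linalg.svd(P.T @ G)
    D = np.eye(3)
    D[2, 2] = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflection
    R = Vt.T @ D @ U.T                            # optimal rotation
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

By construction the metric is invariant to global rotation, translation, and scale, so it isolates errors in the articulated pose itself.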

To test this, Fig. 5 shows the change in reconstruction accuracy on the 3DPW dataset for 500 different neural network regressors, each of which is obtained by fine-tuning the original regressor on one of 500 different in-the-wild samples. The performance of the model before fine-tuning is shown as an orange line (the publicly-available SPIN model [25]), and the performance of the HMR baseline as a green line for comparison. The left plot shows EFT applied for 20 iterations (the default), and the right plot for 100 iterations, which completely overfits the network to each target sample.

As can be seen, overfitting the network to individual samples usually has a small effect on the overall regression performance, suggesting that the network retains its good properties despite exemplar fine-tuning. In particular, the performance remains at least as good as the HMR baseline [22] (green lines), and occasionally there is even a slight overall improvement. The effect differs across samples: by inspection, we found that samples with a strong effect contain significant occlusions or annotation errors, as shown in Fig. 6.

Finally, note that after EFT is used to predict the pose of a single example, the fine-tuned network is discarded; this analysis is only meant to illustrate the effect of fine-tuning and has no direct implication on the effectiveness of EFT.

6.4. Learning a robust pose regressor

The main objective in generating 3D annotations for in-the-wild data is to train a robust 3D pose estimation regressor in a fully supervised manner. To demonstrate the effectiveness of the annotations produced by EFT, we train the state-of-the-art SPIN regressor of [25] on our newly-generated 3D dataset and assess its 3D pose estimation performance.

Methods                                Recon. Error (mm)
SPIN [25]                              133.2
Ours (3D + our wild 3D)                140.1
Ours (3D + our wild 3D + crop aug.)    76.2

Table 3: Reconstruction errors (in mm) after rigid alignment, using the cropped upper body as input on the 3DPW test set.

Performance on 3DPW and H36M. We train the SPIN network [25] using the same 3D and 2D datasets as in the original paper, but we replace the 2D annotations with our 3D versions of the same samples, obtained by EFT on the COCO, MPII, and LSP datasets. We evaluate the performance of our newly trained network on two public test benchmarks: 3DPW (outdoor) and H36M (indoor). We report the reconstruction errors in mm after rigid alignment, following [25]. The results are summarized in Table 2.

Our method (denoted Ours, 3D + our wild DB) improves performance on the outdoor dataset (3DPW) thanks to the improved 3D annotations of the in-the-wild data used for training. However, the performance on H36M changes little, presumably because this dataset comes with ground-truth 3D annotations from the outset. With crop augmentation, as shown in the last row of Table 2, performance becomes slightly better on 3DPW but slightly worse on H36M. This is understandable because 3DPW contains scenes where the legs are not observable, whereas H36M does not contain such examples.

As another meaningful test, we train the model using only the COCO dataset with the 3D annotations that EFT produces, without using any of the 3D datasets (denoted Ours, our COCO only); here we also include the occluded samples in the COCO training set that were excluded in [25]. Training on this dataset without any indoor data achieves approximately comparable performance on the 3DPW benchmark, but a higher error on the H36M benchmark. This seemingly contradicts the finding of other works that H36M is an “easy” dataset where high performance can be achieved; instead, high performance on the H36M test set may be possible only by overfitting a model to the H36M training set.

Performance on Upper Body Crops: Since none of the existing datasets with 3D annotations contain significant occlusions, we assess robustness to occlusions by cropping upper bodies from 3DPW and feeding them to the network as input. We use the same metric as in the previous experiment. The results are shown in Table 3. While both SPIN


Figure 7: 3D pose estimation results by the model trained with our 3D pose dataset on challenging in-the-wild video sequences. Bounding boxes are provided by an off-the-shelf detector [14].

and our network without crop augmentation perform poorly, our model trained with crop augmentation performs much better (76.2 mm); its key advantage is correctly guessing the non-visible leg parts. Remarkably, the performance of this model is better than that achieved by HMR when observing the whole body (81.3 mm).
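The crop augmentation used above can be sketched as follows; the cropping policy, function name, and parameters are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def upper_body_crop(bbox, keypoints_2d, crop_ratio=0.5):
    """Toy sketch of crop augmentation simulating upper-body truncation.

    bbox:         full-body box (x0, y0, x1, y1), y increasing downward
    keypoints_2d: (J, 2) annotated keypoints
    crop_ratio:   fraction of the box height to keep from the top

    Returns the cropped box and per-joint confidences, where keypoints
    below the cut are marked invisible (0), so the network is trained to
    hallucinate plausible configurations for the missing lower body.
    """
    x0, y0, x1, y1 = bbox
    y1_new = y0 + crop_ratio * (y1 - y0)                  # keep upper portion
    conf = (keypoints_2d[:, 1] <= y1_new).astype(float)   # below cut -> 0
    return (x0, y0, x1, y1_new), conf
```

Training on such crops exposes the regressor to truncation it would otherwise never see in 3D-annotated data, which matches the large gap between the models with and without crop augmentation in Table 3.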

Qualitative Evaluation on Internet Videos: We demonstrate the performance of our robust model on various challenging real-world Internet videos containing cropping, blur, fast motion, multiple people, and other challenging effects; data of this complexity was rarely considered in prior work. Examples are shown in Fig. 7, and more videos can be found in the sup. mat.

7. Discussion

We have presented Exemplar Fine-Tuning (EFT), a new technique that fits a parametric 3D human body model to 2D keypoint annotations more plausibly and accurately than existing methods. We have used EFT to augment existing in-the-wild 2D pose datasets with 3D annotations, obtaining about 100K in-the-wild annotated 3D pose samples. Using our annotations, we train a deep pose regression network that outperforms the state of the art on standard benchmarks. Furthermore, this regressor achieves compelling results on challenging Internet videos.

We will release the new 3D annotations to the community, opening the possibility of using them in many other tasks, including dense keypoint detection [13] or depth estimation [28].


References
[1] CMU Graphics Lab Motion Capture Database.
[2] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, 2015.
[3] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele. PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, 2018.
[4] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[5] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape completion and animation of people. TOG, 2005.
[6] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
[7] Federica Bogo, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Dynamic FAUST: Registering human bodies in motion. In CVPR, 2017.
[8] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. In CVPR, 2018.
[9] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, 2017.
[10] Yu Cheng, Bo Yang, Bo Wang, Wending Yan, and Robby T. Tan. Occlusion-aware networks for 3D human pose estimation in video. In ICCV, 2019.
[11] P. Guan, A. Weiss, A. O. Balan, and M. J. Black. Estimating human shape and pose from a single image. In ICCV, 2009.
[12] Riza Alp Guler and Iasonas Kokkinos. HoloPose: Holistic 3D human reconstruction in-the-wild. In CVPR, 2019.
[13] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.
[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[15] Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia Technical Briefs, 2015.
[16] Catalin Ionescu, Fuxin Li, and Cristian Sminchisescu. Latent structured models for human pose estimation. In ICCV, 2011.
[17] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI, 2014.
[18] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
[19] Sam Johnson and Mark Everingham. Learning effective human pose estimation from inaccurate annotation. In CVPR, 2011.
[20] Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, et al. Panoptic Studio: A massively multiview system for social interaction capture. TPAMI, 2017.
[21] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total Capture: A 3D deformation model for tracking faces, hands, and bodies. In CVPR, 2018.
[22] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[23] Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. Learning 3D human dynamics from video. In CVPR, 2019.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, 2019.
[26] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, 2019.
[27] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the People: Closing the loop between 3D and 2D human representations. In CVPR, 2017.
[28] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In CVPR, 2019.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. TOG, 2015.
[31] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In ICCV, 2019.
[32] Christian Mandery, Ömer Terlemez, Martin Do, Nikolaus Vahrenkamp, and Tamim Asfour. The KIT whole-body human motion database. In ICAR, 2015.
[33] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[34] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV, 2017.
[35] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Srinath Sridhar, Gerard Pons-Moll, and Christian Theobalt. Single-shot multi-person 3D pose estimation from monocular RGB. In 3DV, 2018.


[36] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[37] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 3DV, 2018.
[38] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[39] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In CVPR, 2019.
[40] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Ordinal depth supervision for 3D human pose estimation. In CVPR, 2018.
[41] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.
[42] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.
[43] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In CVPR, 2018.
[44] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In CVPR, 2018.
[45] Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Reconstructing 3D human pose from 2D image landmarks. In CVPR, 2012.
[46] Leonid Sigal, Alexandru Balan, and Michael J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2008.
[47] Leonid Sigal, Alexandru O. Balan, and Michael J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 2010.
[48] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212, 2019.
[49] V. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3D human body shape and pose prediction. In BMVC, 2017.
[50] Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In NIPS, 2017.
[51] Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV, 2018.
[52] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In CVPR, 2016.
[53] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular Total Capture: Posing face, body, and hands in the wild. In CVPR, 2019.
[54] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular Total Capture: Posing face, body, and hands in the wild. In CVPR, 2019.
[55] Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for human pose estimation and tracking. In ECCV, 2018.
[56] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. DenseRaC: Joint 3D pose and shape estimation by dense render-and-compare. In ICCV, 2019.
[57] A. Zanfir, E. Marinoiu, and C. Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes: the importance of multiple scene constraints. In CVPR, 2018.
[58] Weiyu Zhang, Menglong Zhu, and Konstantinos G. Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV, 2013.
[59] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In ICCV, 2017.
