

Learning to Dress 3D People in Generative Clothing

Qianli Ma1, Jinlong Yang1, Anurag Ranjan1,2, Sergi Pujades4, Gerard Pons-Moll5, Siyu Tang1,2,3, and Michael J. Black1

1Max Planck Institute for Intelligent Systems, Tübingen   2University of Tübingen   3ETH Zürich   4Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK

5Max Planck Institute for Informatics, Saarland Informatics Campus


Figure 1: CAPE model for clothed humans. Given a SMPL body shape (a), CAPE varies clothes by randomly sampling from the learned model (b, c), and can generate different clothing types: long pants in (d) vs. shorts in (b, c). The generated clothed humans can generalize to diverse body shapes (e) and body poses (f). Best viewed zoomed-in on screen.

Abstract

Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.

1. Introduction

Existing generative human models [6, 23, 34, 38] are becoming more and more expressive and successfully capture the statistics of human shape and pose deformation, but still miss an important component: the clothing. This leads to several problems in various applications. For example, when body models are used to generate synthetic training data [21, 44, 45, 51], the minimal body geometry results in a significant domain gap between synthetic and real images of humans. Deep learning methods reconstruct human shape from images based on minimally dressed human models [5, 24, 27, 28, 31, 37, 38, 39]. Although the body pose matches the image observation, the minimal body geometry does not match clothed humans in most cases. These problems motivate the need for a parametric clothed human model.



Our goal is to create a generative model of clothed human bodies that is low-dimensional, easy to pose, differentiable, can represent different clothing types on different body shapes and poses, and produces visually plausible results. To achieve this, we extend SMPL and factorize clothing shape from the undressed body, treating clothing as an additive displacement on the canonical pose (see Fig. 2). The learned clothing layer is compatible with the SMPL body model by design, enabling easy re-posing and animation. We formulate the problem of modeling the clothing layer as a conditional generative task. Since the clothing shape is intrinsically stochastic, multiple clothing deformations can be sampled from our model for a single pose and body shape. Our model, called CAPE for "Clothed Auto Person Encoding", is conditioned on clothing types and body poses, so that it captures different types of clothing and can generate pose-dependent deformations, which are important for realistically modeling clothing.

We illustrate the key elements of our model in Fig. 1. Given a SMPL body shape, pose and clothing type, CAPE can generate different structures of clothing by sampling a learned latent space. The resulting clothing layer plausibly adapts to different body shapes and poses.

Technical approach. We represent clothing as a displacement layer using a graph that inherits the topology of SMPL. Each node in this graph represents the 3-dimensional offset vector from its corresponding vertex on the underlying body. To learn a generative model for such graphs, we build a graph convolutional neural network (Sec. 4) under the framework of a VAE-GAN [7, 30], using graph convolutions [11] and mesh sampling [43] as the backbone layers. This addresses the problem with existing generative models designed for 3D meshes of human bodies [33, 52] or faces [43], which tend to produce over-smoothed results that are not suitable for clothing, where high-frequency wrinkles matter. Specifically, the GAN [16] module in our system encourages visually plausible wrinkles. We model the GAN using a patch-wise discriminator for mesh-like graphs, and show that it effectively improves the quality of the generated fine structures.

Dataset. We introduce a dataset of 4D captured people performing a variety of pose sequences in different types of clothing (Sec. 5). Our dataset consists of over 80K frames of 8 male and 3 female subjects captured using a 4D scanner. We use this dataset to train our network, resulting in a parametric generative model of the clothing layer.

Versatility. CAPE is designed to be "plug-and-play" for many applications that already use SMPL. Dressing SMPL with CAPE yields 3D meshes of people in clothing, which can be used for several applications such as generating training data, parametrizing body pose in a deep network, having a clothing "prior", or as part of a generative analysis-by-synthesis approach [21, 45, 51]. We demonstrate this on the task of image fitting by extending SMPLify [9] with our model. We show that using CAPE together with SMPLify can significantly improve the quality of reconstructed human bodies in clothing.

In summary, our key contributions are: (1) We formulate the problem of 3D clothing / clothed human modeling as a conditional generative task. (2) For this task, we learn a conditional Mesh-VAE-GAN that captures both global shape and local detail of a mesh, with controlled conditioning based on human pose and clothing types. (3) The learned model can generate pose-dependent deformations of clothing, and generalizes to a variety of garments. (4) We augment the 3D human body model SMPL with our clothing model, and show an application of the enhanced "clothed-SMPL". (5) We contribute a dataset of 4D scans of clothed humans performing a variety of motion sequences.

2. Related Work

The capture, reconstruction and modeling of clothing has been widely studied. There are two main classes of methods: (1) reconstruction and capture approaches, which predict 3D clothing from images or point clouds, and (2) parametric models of clothing, which predict how clothing deforms as a function of pose and body shape. Table 1 shows recent methods classified according to these criteria.

Reconstructing 3D humans. Reconstruction of 3D humans from 2D images and videos is a classical computer vision problem. Most approaches [9, 18, 24, 27, 28, 31, 37, 38, 48] output 3D body meshes from images, but not clothing. This ignores image evidence that may be useful. To reconstruct clothed bodies, methods use volumetric [35, 46, 50, 59] or bi-planar depth representations [12] to model the body and garments as a whole. We refer to these as Group 1 in Table 1. While these methods deal with arbitrary clothing topology and preserve a high level of detail, the reconstructed clothed body is not parametric, which means the pose, shape, and clothing of the reconstruction cannot be controlled or animated.

Another group of methods is based on SMPL [1, 2, 3, 4, 8, 60]. They represent clothing as an offset layer from the underlying body, as proposed in ClothCap [41]. We refer to these approaches as Group 2 in Table 1. These methods can change the pose and shape of the reconstruction using the deformation model of SMPL. This assumes clothing deforms like an undressed human body, i.e. that clothing shape and wrinkles do not change as a function of pose. We also use a body-to-cloth offset representation to learn our model, but critically, we learn a neural function mapping from pose to multi-modal clothing offset deformations. Hence, our work differs from these methods in that we learn a parametric model of how clothing deforms with pose.


Table 1: Selection of related methods.

Method Class | Methods | Parametric Model | Pose-dep. Clothing | Full-body Clothing | Clothing Wrinkles | Captured Data* | Code Public | Probabilistic Sampling
Image Reconstruction & Capture | Group 1 † | No | No | Yes | Yes | Yes | Yes | No
 | Group 2 ‡ | Yes | No | Yes | Yes | Yes | Yes | No
 | ClothCap [41] | Yes | No | Yes | Yes | Yes | No | No
 | DeepWrinkles [29] | Yes | Yes | No | Yes | Yes | No | No
 | Yang et al. [57] | Yes | Yes | Yes | No | Yes | No | No
Clothing Models | Wang et al. [54] | Yes | No | No | No | No | Yes | Yes
 | DRAPE [17] | Yes | Yes | Yes | Yes | No | No | No
 | Santesteban et al. [47] | Yes | Yes | Yes | Yes | No | No | No
 | GarNet [19] | Yes | Yes | Yes | Yes | No | No | No
 | Ours | Yes | Yes | Yes | Yes | Yes | Yes | Yes

* As opposed to simulated / synthetic data.
† Group 1: BodyNet [50], DeepHuman [59], SiCloPe [35], PIFu [46], MouldingHumans [12].
‡ Group 2: Octopus [1], MGN [8], Tex2Shape [4].
Two main 3D clothing method classes exist: (1) image reconstruction and capture methods, and (2) clothing models which predict deformation as a function of pose. Within each class, methods differ according to the criteria in the columns.


Parametric models for 3D bodies and clothes. Statistical 3D human body models learned from 3D body scans [6, 23, 34, 38] capture body shape and pose and are an important building block for multiple applications. Most of the time, however, people are dressed, and these models do not represent clothing. In addition, clothes deform as we move, producing changing wrinkles at multiple spatial scales. While clothing models learned from real data exist, few generalize to new poses. For example, Neophytou and Hilton [36] learn a layered garment model on top of SCAPE [6] from dynamic sequences, but generalization to novel poses is not demonstrated. Yang et al. [57] train a neural network to regress a PCA-based representation of clothing, but show generalization only on the same sequence or the same subject. Lähner et al. [29] learn a garment-specific pose-deformation model by regressing low-frequency PCA components and high-frequency normal maps. While the visual quality is good, the model is garment-specific and does not provide a solution for full-body clothing. Similarly, Alldieck et al. [4] use displacement maps with a UV-parametrization to represent surface geometry, but the result is only static. Wang et al. [54] allow manipulation of clothing with sketches in a static pose. The Adam model [23] can be considered clothed, but the shape is very smooth and not pose-dependent. Clothing models have also been learned from physics simulation of clothing [17, 19, 47], but visual fidelity is limited by the quality of the simulations. Furthermore, the above methods are regressors that produce single point estimates. In contrast, our model is generative, which allows us to sample clothing.

A conceptually different approach infers the parameters of a physical clothing model from 3D scan sequences [49]. This generalizes to novel poses, but the inference problem is difficult and, unlike our model, the resulting physics simulator is not differentiable with respect to the parameters.

Generative models on 3D meshes. Our generative model predicts clothing displacements on the graph defined by the SMPL mesh using graph convolutions [10]. There is an extensive recent literature on methods and applications of graph convolutions [11, 26, 33, 43, 52]. Most relevant here, Ranjan et al. [43] learn a convolutional autoencoder using graph convolutions [11] with mesh down- and up-sampling layers [13]. Although it works well for faces, the mesh sampling layer makes it difficult to capture the high-frequency wrinkles which are key in clothing. In our work, we capture high-frequency wrinkles by extending the PatchGAN [22] architecture to 3D meshes.

3. Additive Clothed Human Model

To model clothed human bodies, we factorize them into the minimally-clothed body and a clothing layer represented as displacements from the body. This enables us to naturally extend SMPL to a class of clothing types by treating clothing as an additional additive shape term. Since SMPL is in wide use, our goal is to extend it in a way that is consistent with current uses, making it effectively a "drop-in" replacement for SMPL.

3.1. Dressing SMPL

SMPL [34] is a generative model of human bodies that factors the surface of the body into shape (β) and pose (θ) parameters. As shown in Fig. 2 (a), (b), the architecture of SMPL starts with a triangulated template mesh, T̄, in rest pose, defined by N = 6890 vertices. Given shape and pose parameters (β, θ), 3D offsets are added to the template, corresponding to shape-dependent deformations (B_S(β)) and pose-dependent deformations (B_P(θ)).


(a) T̄   (b) T(β) = T̄ + B_S(β)   (c) T(β, θ) = T(β) + B_P(θ)   (d) T_clo = T(β, θ) + S_clo(z, θ, c)   (e) W(T_clo, J(β), θ, W)

Figure 2: Additive clothed human model. Our contribution is highlighted with a yellow background. Following SMPL, our model (a) starts from a template mesh and linearly adds offsets contributed by (b) individual body shape β and (c) pose θ; note the deformation on the hips and feet caused by the ballet pose. (d) We further add a clothing layer parametrized by pose θ, clothing type c and a clothing shape variable z. (e) The vertices are then posed using the skinning equation of SMPL.

The resulting mesh is then posed using the skinning function W. Formally:

T(β, θ) = T̄ + B_S(β) + B_P(θ)                    (1)
M(β, θ) = W(T(β, θ), J(β), θ, W)                  (2)

where the blend skinning function W(·) rotates the rest-pose vertices T(β, θ) around the 3D joints J (computed from β), linearly smoothes them with the blend weights W, and returns the posed vertices M. The pose θ ∈ R^{3×(23+1)} is represented by a vector of relative 3D rotations of the 23 joints and the global rotation in axis-angle representation.

SMPL adds linear deformation layers to an initial body shape. Following this, we define clothing as an additional offset layer from the body and add it on top of the SMPL mesh, Fig. 2 (d). In this work, we parametrize the clothing layer by the body pose θ, the clothing type c, and a low-dimensional latent variable z that encodes clothing shape and structure.

Let S_clo(z, θ, c) be the clothing displacement layer. We extend Eq. (1) to a clothed body template in the rest pose:

T_clo(β, θ, c, z) = T(β, θ) + S_clo(z, θ, c).                    (3)

Note that the clothing displacements, S_clo(z, θ, c), are pose-dependent. The final clothed template is then posed with the SMPL skinning function, Eq. (2):

M(β, θ, c, z) = W(T_clo(β, θ, c, z), J(β), θ, W).                    (4)

This differs from simply applying blend skinning with fixed displacements, as done in e.g. [1, 8]. Here, we train the model such that pose-dependent clothing displacements in the template pose are correct once posed by blend skinning.
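The additive construction of Eqs. (1)-(4) can be summarized in a short sketch. The helpers below (shape_blendshapes, pose_blendshapes, clothing_displacements, linear_blend_skinning, joints) are hypothetical stand-ins for the corresponding SMPL and CAPE components, so this illustrates the data flow rather than the released implementation.

```python
import numpy as np

def clothed_smpl(template, beta, theta, z, c,
                 shape_blendshapes, pose_blendshapes,
                 clothing_displacements, linear_blend_skinning, joints):
    """Additive clothed-body model, Eqs. (1)-(4). All vertex arrays are (N, 3)."""
    # Eq. (1): minimally-clothed body in the rest pose
    t_shaped = template + shape_blendshapes(beta) + pose_blendshapes(theta)
    # Eq. (3): add the pose- and clothing-dependent displacement layer
    t_clothed = t_shaped + clothing_displacements(z, theta, c)
    # Eq. (4): pose the clothed template with the SMPL skinning function
    return linear_blend_skinning(t_clothed, joints(beta), theta)
```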

3.2. Clothing representation

Our representation of clothing as vertex displacements is not a physical model and cannot represent all types of clothing, but this approach achieves a balance between expressiveness and simplicity, and has been widely used in deformation modeling [17], 3D clothing capture [41] and recent work that reconstructs clothed humans from images [1, 8, 60].

Our displacement layer is a graph G_d = (V_d, E_d) that inherits the SMPL topology: the edges E_d = E_SMPL. V_d ∈ R^{3×N} is the set of vertices, and the feature on each vertex is the 3-dimensional offset vector, (dx, dy, dz), from its corresponding vertex on the underlying body mesh.

We train our model using 3D scans of people in clothing. To acquire data for the displacements, pairs of (V_clothed, V_minimal) are needed, where V_clothed stands for the vertices of a clothed human mesh, and V_minimal the vertices of a minimally-clothed mesh. To do so, we first scan subjects in both clothed and minimally-clothed conditions, and then use the SMPL model plus free deformation [1, 58] to register the scans. As a result, we obtain SMPL meshes capturing the geometry of the scans, the corresponding pose parameters, and vertices of the unposed meshes that live in the zero-pose space (following SMPL, we use the T-pose as the zero-pose; for the mathematical details of registration and unposing, we refer the reader to [58]). For each (V_clothed, V_minimal) pair, the displacements are then calculated as V_d = V_clothed − V_minimal, where the subtraction is performed per-vertex along the feature dimension. Ideally, V_d has non-zero values only on body parts covered with clothes.
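A minimal sketch of this displacement computation, assuming both meshes are already registered to the SMPL topology and unposed; the optional clothing_mask argument is our addition and zeroes offsets on body parts not covered by clothing.

```python
import numpy as np

def clothing_displacements(v_clothed, v_minimal, clothing_mask=None):
    """Per-vertex offsets V_d = V_clothed - V_minimal in the unposed (zero-pose) space.

    v_clothed, v_minimal: (N, 3) vertex arrays of registered, unposed meshes.
    clothing_mask: optional boolean (N,) array marking vertices covered by clothing;
                   offsets outside the mask are zeroed (head, hands, feet, ...).
    """
    v_d = v_clothed - v_minimal
    if clothing_mask is not None:
        v_d = v_d * clothing_mask[:, None]
    return v_d
```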

To summarize, our method extends the SMPL body model to clothed bodies. Compared to volumetric representations of clothed people [35, 46, 50, 59], our combination of the body model and the garment layer is superior in ease of re-posing and garment retargeting: the former uses the same blend skinning as the body model, while the latter is a simple addition of the displacements to a minimally-clothed body shape. In contrast to similar models that also dress SMPL with offsets [1, 8], our garment layer is parametrized, low-dimensional, and pose-dependent.

4. CAPE



Figure 3: Overview of our method. (a) Computing the displacement graph from scan data (Sec. 3.2) by subtracting the minimal body from the clothed body mesh in the unposed space. (b) Schematic of our network architecture. Details are provided in Appendix A.1.

Our clothing term S_clo(z, θ, c) in Eq. (3) is a function of z, a code in a learned low-dimensional latent space that encodes the shape and structure of clothing, body pose θ, and clothing type c. The function outputs the clothing displacement graph G_d as described in Sec. 3.2. We parametrize this function using a graph convolutional neural network (Graph-CNN) within a VAE-GAN framework [16, 25, 30].

4.1. Network architecture

As shown in Fig. 3, our model consists of three major components: a generator G with an encoder-decoder architecture, a discriminator D, and auxiliary networks C1, C2 that handle the conditioning. The network is differentiable and is trained end-to-end.

For simplicity, we use the following notation in this section. x: the vertices V_d of the input displacement graph; x̂: vertices of the reconstructed graph; θ and c: the pose and clothing type condition vectors; z: the latent code.

Graph generator. We build the graph generator following the VAE-GAN framework. At training time, an encoder Enc(·) takes in the displacements x, extracts their features through multiple graph convolutional layers, and maps them to the low-dimensional latent code z. A decoder is trained to reconstruct the input graph, x̂ = Dec(z), from z. Both the encoder and decoder are feed-forward neural networks built with mesh convolutional layers. Fully-connected layers are used at the end of the encoder and the beginning of the decoder. The full architecture is shown in Appendix A.1.

Stacking graph convolution layers causes a loss of local features [32] in the deeper layers. This is undesirable for clothing generation because fine details, corresponding to wrinkles, are likely to disappear. Therefore, we improve the standard graph convolution layers with residual connections, which enable the use of low-level features from the layer input if necessary. Within a residual block, a 1 × 1 convolution is performed, aggregating features from the vertex itself.

At test time, the encoder is not needed. Instead, z is sampled from the Gaussian prior distribution, and the decoder serves as the graph generator: G(z) = Dec(z). We detail different use cases below.

Patchwise discriminator. To further enhance fine details in the reconstructions, we introduce a patchwise discriminator D for graphs, which has shown success in the image domain [22, 61]. Instead of looking at the entire generated graph, the discriminator only classifies whether a graph patch is real or fake based on its local structure. Intuitively, this encourages the discriminator to focus on fine details, while the global shape is taken care of by the reconstruction loss.

We implement the graph patchwise discriminator using four graph convolution-downsampling blocks [43]. We add a discriminative real/fake loss for each of the output vertices. This enables the discriminator to capture a patch of neighboring nodes in the reconstructed graph and classify them as real or fake (see Fig. 3).

Conditional model. We condition the network with body pose θ and clothing type c. The SMPL pose parameters are in axis-angle representation and are difficult for the neural network to learn [28, 31]. Therefore, following previous work [28, 31], we transform the pose parameters into rotation matrices using the Rodrigues equation. The clothing types are discrete by nature, and we represent them using one-hot labels.
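A small sketch of this condition preprocessing: axis-angle joint rotations are converted to rotation matrices with the Rodrigues formula and the outfit label to a one-hot vector. The 14 clothing-relevant joints and 4 outfit types follow Appendix A.1; the function names are ours.

```python
import numpy as np

def rodrigues(axis_angle, eps=1e-8):
    """Axis-angle vector (3,) -> rotation matrix (3, 3) via the Rodrigues formula."""
    angle = np.linalg.norm(axis_angle)
    if angle < eps:
        return np.eye(3)
    k = axis_angle / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose_condition(pose_axis_angle):
    """(num_joints, 3) axis-angle rotations -> flattened rotation matrices, (num_joints*9,)."""
    return np.concatenate([rodrigues(aa).ravel() for aa in pose_axis_angle])

def clothing_condition(clothing_type, num_types=4):
    """Discrete outfit label -> one-hot vector of length num_types."""
    one_hot = np.zeros(num_types)
    one_hot[clothing_type] = 1.0
    return one_hot
```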

Both conditions are first passed through small fully-connected embedding networks, C1(θ) and C2(c) respectively, so as to balance the dimensionality of the learned graph features and of the condition features. We also experiment with different ways of conditioning the mesh generator: concatenation in the latent space; appending the condition features to the graph features at all nodes in the generator; and the combination of the two. We find that the combined strategy works better in terms of network capability and the effect of conditioning.

4.2. Losses and learning

For reconstruction, we use an L1 loss over the vertices of the mesh x, because it encourages less smoothing compared to L2, given by

L_recon = E_{x∼p(x), z∼q(z|x)} [ ‖G(z, θ, c) − x‖_1 ].                    (5)

Furthermore, we apply a loss on the mesh edges to encourage the generation of wrinkles instead of smooth surfaces. Let e be an edge in the set of edges, E, of the ground truth graph, and ê the corresponding edge in the set of edges, Ê, of the generated graph. We penalize the mismatch of all corresponding edges by

L_edge = E_{e∈E, ê∈Ê} [ ‖e − ê‖_2 ].                    (6)

We also apply a KL divergence loss between the distribution of latent codes and the Gaussian prior:

L_KL = E_{x∼p(x)} [ KL(q(z|x) ‖ N(0, I)) ].                    (7)

Moreover, the generator and the discriminator are trained using an adversarial loss

L_GAN = E_{x∼p(x)} [ log D(x, θ, c) ] + E_{z∼q(z|x)} [ log(1 − D(G(z, θ, c))) ],                    (8)

where G tries to minimize this loss against the discriminator D, which aims to maximize it.

The overall objective is a weighted sum of these loss terms, given by

L = L_recon + γ_edge L_edge + γ_KL L_KL + γ_GAN L_GAN.                    (9)
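The following PyTorch-style sketch assembles the terms of Eqs. (5)-(9). The tensor shapes, the VAE reparameterization outputs (mu, logvar), and the non-saturating form of the generator's adversarial term are our assumptions; the weights γ are training hyper-parameters not specified here.

```python
import torch

def cape_losses(x, x_hat, mu, logvar, d_real, d_fake, edges,
                w_edge=1.0, w_kl=1e-3, w_gan=1e-2):
    """x, x_hat: (B, N, 3) ground-truth and reconstructed displacements.
    mu, logvar: (B, nz) encoder outputs; d_real, d_fake: patchwise discriminator
    outputs in (0, 1); edges: (E, 2) long tensor of vertex indices per mesh edge."""
    # Eq. (5): L1 reconstruction loss on vertex offsets
    l_recon = (x_hat - x).abs().mean()
    # Eq. (6): L2 mismatch of corresponding mesh edges
    e_gt = x[:, edges[:, 0]] - x[:, edges[:, 1]]
    e_gen = x_hat[:, edges[:, 0]] - x_hat[:, edges[:, 1]]
    l_edge = (e_gt - e_gen).norm(dim=-1).mean()
    # Eq. (7): KL divergence to the standard Gaussian prior
    l_kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Eq. (8): adversarial terms (discriminator loss and non-saturating generator loss)
    l_d = -(torch.log(d_real + 1e-8).mean() + torch.log(1 - d_fake + 1e-8).mean())
    l_g_adv = -torch.log(d_fake + 1e-8).mean()
    # Eq. (9): weighted sum for the generator
    l_gen = l_recon + w_edge * l_edge + w_kl * l_kl + w_gan * l_g_adv
    return l_gen, l_d
```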

Network training details are provided in Appendix A.2.

5. CAPE Dataset

We build a dataset of 3D clothing by capturing temporal sequences of 3D human body scans with a high-resolution body scanner (3dMD LLC, Atlanta, GA). Approximately 80K 3D scan frames are captured at 60 FPS, and a mesh with SMPL model topology is registered to each scan to obtain surface correspondences. We also scanned the subjects in a minimally-clothed condition to obtain an accurate estimate of their body shape under clothing. We extract the clothing as displacements from the minimally-clothed body as described in Sec. 3.2. Noisy frames and failed registrations are removed through manual inspection.

The dataset consists of 8 male subjects and 3 female subjects performing a wide range of motions. The subjects gave informed written consent to participate and to release the data for research purposes. There are four types of outfits, which cover a wide range of common garments: T-shirts, long-sleeve shirts, long jerseys, shorts, long pants, etc.; see Appendix D for details and examples from the dataset.

Compared to existing datasets of 3D clothed humans, our dataset provides captured data and alignments of SMPL to the scans, separates the clothing from the body, and provides accurate, captured ground truth body shape under clothing. For each subject and outfit, our dataset contains large pose variations, which induce a wide variety of wrinkle patterns. Since our dataset of 3D meshes has a consistent topology, it can be used for quantitative evaluation of different Graph-CNN architectures.

6. Experiments

We first show the representation capability of our model and then demonstrate the model's ability to generate new examples by probabilistic sampling. We then show an application to human pose and shape estimation.

6.1. Representation power

3D mesh auto-encoding errors. As our network is based on a VAE, we can use the reconstruction accuracy to measure the capability of the model to encode while preserving the original geometry. We compare with a recent convolutional mesh autoencoder, CoMA [43], and a linear (PCA) model. We compare to both the original CoMA with 4× downsampling (denoted "CoMA-4") and a variant without downsampling (denoted "CoMA-1") to study the effect of downsampling on over-smoothing. We use the same latent space dimension n_z = 18 (number of principal components in the case of PCA) and the same hyper-parameter settings, where applicable, for all models.

Table 2 shows the per-vertex Euclidean error when using our network to reconstruct the clothing displacement graphs from a held-out test set of 5,852 examples from our CAPE dataset. Vertices from the head, fingers, toes, hands and feet are excluded from the accuracy computation as they are not covered with clothing.
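Written out, the reported metric is simply the mean Euclidean distance between corresponding vertices, restricted to a clothing-relevant vertex mask; the mask itself (excluding head, fingers, toes, hands and feet) is assumed to be given.

```python
import numpy as np

def per_vertex_error(v_gt, v_rec, eval_mask):
    """Mean per-vertex Euclidean error, in the same units as the meshes.

    v_gt, v_rec: (N, 3) ground-truth and reconstructed vertex positions (or offsets).
    eval_mask: boolean (N,) array; True for vertices covered by clothing.
    """
    dists = np.linalg.norm(v_gt - v_rec, axis=-1)
    return dists[eval_mask].mean()
```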

Our model outperforms the baselines in the auto-encoding task, while the reconstructed shape from our model is probabilistic and pose-dependent. Note that CoMA here is a deterministic auto-encoder with a focus on reconstruction. Although PCA performs comparably to our method, PCA cannot be used directly in the inference phase with a pose parameter as input. Furthermore, PCA cannot produce probabilistic samples of shapes without knowing the data distribution. Our method handles both of these scenarios.

Ablation study. We remove key components from our model while keeping all the others, and evaluate the model performance; see Table 2. We observe that the discriminator, residual block and edge loss all play important roles in the model performance. Comparing the performance of CoMA-4 and CoMA-1, we find that the mesh downsampling layer causes a loss of fidelity. However, even without any spatial downsampling, CoMA-1 still underperforms our model. This shows the benefit of adding the discriminator, residual block and edge loss to our model.

Table 2: Model capacity in per-vertex auto-encoding error.

Baseline Comparison:

Method | Error (mm)
PCA | 5.65 ± 4.81
CoMA-1 | 6.23 ± 5.45
CoMA-4 | 6.87 ± 5.62
Ours | 5.54 ± 5.09

Ablation Study:

Removed Component | Error (mm)
Discriminator | 5.65 ± 5.18
Res-block | 5.60 ± 5.21
Edge loss | 5.93 ± 5.40


Figure 4: Example reconstructions by the baselines, ablated versions of our model, and our full model (panels, left to right: ground truth, PCA, CoMA-4, CoMA-1, without discriminator, without edge loss, CAPE full model). Pay attention to the hem and the wrinkles on the upper back. Our model is able to recover both long-range correlations and local details. Best viewed zoomed-in on screen.


Fig. 4 shows a qualitative comparison of the methods. PCA keeps wrinkles and boundaries, but long-range correlations are missing: the rising hem on the left side disappears. CoMA-1 and CoMA-4 are able to capture global correlations, but the wrinkles tend to be smoothed. By incorporating all the key components, our model manages to model both local structures and global correlations.

6.2. Conditional generation of clothing

As a generative model, CAPE can be sampled to generate new data. The model has three parameters: z, c, θ (see Eq. (3)). By sampling one of them while keeping the other two fixed, we show how the conditioning affects the generated clothing shape.

Sampling. Fig. 5 presents sampled clothing dressed on unseen bodies, in a variety of poses that are not used in training. For each subject, we fix the pose θ and clothing type c, and sample z several times to generate varied clothing shapes. The sampling trick in [14] is used. Here we only show untextured renderings to highlight the variation in the generated geometry. As CAPE inherits the SMPL topology, the generated clothed body meshes are compatible with all existing SMPL texture maps. See Appendix C.2 for a comparison between a CAPE sample and a SMPL sample with the same texture applied.

As shown in the figure, our model manages to capture long-range correlations within a mesh, such as the hem that rises and the lateral wrinkles that appear on the back as the subject raises the arms. The model also synthesizes local details such as wrinkles in the armpit area and boundaries at cuffs and collars.

Pose-dependent clothing deformation. Another practical use case of CAPE is to animate an existing clothed body. This corresponds to fixing the clothing shape variable z and clothing type c, and reposing the body by changing the pose θ. The challenge here is to have a clothing shape that is consistent across poses, yet deforms plausibly. We demonstrate the pose-dependent effect on a test pose in Fig. 6. The difference of the clothing layer between the two poses is calculated in the zero-pose space, and shown in color coding. The result shows that the clothing type is consistent while local deformation changes along with pose.

User study of generated examples. To test the realism of the conditionally generated results of our method, we performed a user study on Amazon Mechanical Turk (AMT). In the study, virtual avatars are dressed in 3D and rendered into front-view images. Following the protocol from [22], raters are presented with a series of "real vs. fake" trials. On each trial, the rater is presented with a "real" mesh render (randomly picked from our dataset, i.e. alignments of the scan data) and a "fake" render (a mesh generated by our model). Both images are shown side-by-side, and the raters are asked to pick the one that they think is real. Each pair of renderings is evaluated by 10 raters. More strictly than [22], we present both real and fake renderings simultaneously, do not set a time limit for the raters, and allow zoom-in for detailed comparison. In this setting, the best score that a method can obtain is 50%, meaning that the real and fake examples are indistinguishable.

We carry out the AMT evaluation with two test cases. In Test Case 1, we fix the clothing type to "shortlong" (the most common clothing type in training) and generate 300 clothed body meshes with various poses for the evaluation. In Test Case 2, we fix the pose to an A-pose (the most frequent pose in training) and sample 100 examples per clothing type for evaluation. The percentage of raters who label our synthesized data as "real" is shown in Table 3. On average, in the direct comparison with real data, our synthesized data "fools" 35.1% and 38.7% of the participants for the two test cases, respectively.

Table 3: Percentage of raters who labeled our results as "real" in the AMT test.

Test Case 1 | Test Case 2
35.1% ± 15.7% | 38.7% ± 16.5%

6.3. Image fitting

CAPE is fully differentiable with respect to the clothing shape variable z, body pose θ and clothing type c. Therefore, it can also be used in optimization frameworks. We show an application of CAPE to the task of reconstructing a body mesh from a single image, by enhancing a popular optimization-based method, SMPLify [9]. Specifically, we dress the minimally-clothed output mesh from SMPLify using CAPE, project it back to the image using a differentiable renderer [20], and optimize β, θ, z with respect to the silhouette discrepancy.


Figure 5: Clothing sampled and generated from our CAPE model, applied to four unseen body shapes (green colored), under different poses. Each subject is dressed with a unique clothing type (respectively: "longshort", "longlong", "shortlong" and "shortshort"). For a fixed pose and clothing type, we sample the latent space multiple times, generating different clothing shapes. Best viewed zoomed-in on screen.

Figure 6: Pose-dependent clothing shape. (a) and (b): two clothing shapes generated from CAPE with different pose parameters. (c): color-coded difference of the offset clothing layers in (a) and (b), in the zero-pose space (color scale from 0 cm to 3.1 cm).

Table 4: Per-vertex mean squared error of the image fitting results, measured in meters.

Method | Per-vertex MSE
SMPLify [9] | 0.0223
Ours | 0.0189

We evaluate our image fitting pipeline on renderings of 120 meshes from the CAPE dataset. To compare, we measure the reconstruction error of SMPLify and of our results against the ground truth mesh using the mean squared vertex error. To eliminate the error introduced by the ambiguity of human scale and distance to the camera, we optimize the global scaling and translation of the predictions for both methods on each test sample. A mask is applied to exclude error in the non-clothed regions such as the head, hands and feet. We report the errors of both methods in Table 4. Our model performs 18% better than SMPLify due to its ability to capture clothing shape. More details about the objective function, experimental setup and qualitative results of the image fitting experiment are provided in Appendix B.

Furthermore, once a clothed human is reconstructed from an image, our model can repose and animate it, as well as change the subject's clothes by re-sampling z or the clothing type c. This shows the potential for a wide range of applications.

7. Conclusions, Limitations, Future Work

We have introduced a novel graph-CNN-based generative shape model that enables us to condition, sample, and preserve fine shape detail in 3D meshes. We use this to model clothing deformations from a 3D body mesh and condition the latent space on body pose and clothing type. The training data represents 3D displacements from the SMPL body model for varied clothing and poses. This design means that our generative model is compatible with SMPL in that clothing is an additional additive term applied to the SMPL template mesh. This makes it possible to sample clothing, dress SMPL with it, and then animate the body with pose-dependent clothing wrinkles. A clothed version of SMPL has wide applicability in computer vision. As shown, we can apply it to fitting the body to images of clothed humans. Another application would use the model to generate training data of 3D clothed people to train regression-based pose-estimation methods.

There are a few limitations of our approach that point to future work. As with recent methods that use a displacement-based clothing representation [1, 8], our model only handles garments that have a geometry similar to the human body. A different representation is needed for skirts and open jackets. While our generated clothing depends on pose, it does not depend on dynamics. This is fine for most slow motions but does not generalize to faster motions like those in sports. Future work will address modeling clothing deformation conditioned on the state of previous time steps. Learning a model of clothing texture (albedo) is another promising direction.

Acknowledgements: We thank Daniel Scharstein for several revisions of the manuscript. We thank Joachim Tesch for rendering the results, Pavel Karasik for help with Amazon Mechanical Turk, and Tsvetelina Alexiadis and Andrea Keller for building the dataset. We thank Partha Ghosh, Timo Bolkart and Yan Zhang for useful discussions. Qianli Ma and Siyu Tang acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Projektnummer 276693517 SFB 1233. Gerard Pons-Moll is funded by the Emmy Noether Programme, Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180.

Disclosure: Michael J. Black has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. While he is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH, his research was performed solely at, and funded solely by, MPI.

References

[1] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019. 2, 3, 4, 9

[2] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In International Conference on 3D Vision (3DV), Sep 2018. 2

[3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3D people models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2

[4] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2Shape: Detailed full human body geometry from a single image. In The IEEE International Conference on Computer Vision (ICCV). IEEE, Oct 2019. 2, 3

[5] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7297–7306, 2018. 1

[6] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005. 1, 3

[7] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2745–2754, 2017. 2

[8] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-Garment Net: Learning to dress 3D people from images. In The IEEE International Conference on Computer Vision (ICCV). IEEE, Oct 2019. 2, 3, 4, 9

[9] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In The European Conference on Computer Vision (ECCV), pages 561–578. Springer, 2016. 2, 7, 8, 13, 14

[10] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014. 3

[11] Michael Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016. 2, 3

[12] Valentin Gabeur, Jean-Sébastien Franco, Xavier Martin, Cordelia Schmid, and Grégory Rogez. Moulding humans: Non-parametric 3D human shape estimation from single images. In The IEEE International Conference on Computer Vision (ICCV), October 2019. 2, 3

[13] Michael Garland and Paul S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997. 3

[14] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436, 2019. 7

[15] Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. Instance-level human parsing via part grouping network. In The European Conference on Computer Vision (ECCV), pages 770–785, 2018. 13

[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014. 2, 5

[17] Peng Guan, Loretta Reiss, David A. Hirshberg, Alexander Weiss, and Michael J. Black. DRAPE: DRessing Any PErson. ACM Trans. Graph., 31(4):35–1, 2012. 3, 4

[18] Peng Guan, Alexander Weiss, Alexandru O. Balan, and Michael J. Black. Estimating human shape and pose from a single image. In The IEEE International Conference on Computer Vision (ICCV), pages 1381–1388. IEEE, 2009. 2

[19] Erhan Gundogdu, Victor Constantin, Amrollah Seifoddini, Minh Dang, Mathieu Salzmann, and Pascal Fua. GarNet: A two-stream network for fast and accurate 3D cloth draping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8739–8748, 2019. 3

[20] Paul Henderson and Vittorio Ferrari. Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, 2019. 7

[21] David T. Hoffmann, Dimitrios Tzionas, Michael J. Black, and Siyu Tang. Learning to train with synthetic humans. In German Conference on Pattern Recognition (GCPR), Sept. 2019. 1, 2

[22] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017. 3, 5, 7

[23] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8320–8329, 2018. 1, 3

[24] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018. 1, 2, 13

[25] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013. 5

[26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 3

[27] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In The IEEE International Conference on Computer Vision (ICCV), Oct. 2019. 1, 2, 13

[28] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4501–4510, 2019. 1, 2, 5, 12, 13

[29] Zorah Lähner, Daniel Cremers, and Tony Tung. DeepWrinkles: Accurate and realistic clothing modeling. In The European Conference on Computer Vision (ECCV), pages 698–715. Springer, 2018. 3

[30] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), pages 1558–1566, 2016. 2, 5

[31] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the People: Closing the loop between 3D and 2D human representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6050–6059, 2017. 1, 2, 5, 13

[32] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018. 5

[33] Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. Deformable shape completion with graph convolutional autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1895, 2018. 2, 3

[34] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015. 1, 3

[35] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. SiCloPe: Silhouette-based clothed people. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 2, 3, 4

[36] Alexandros Neophytou and Adrian Hilton. A layered model of human body and garment deformation. In International Conference on 3D Vision (3DV), 2014. 3

[37] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018. 1, 2

[38] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1, 2, 3

[39] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2018. 1

[40] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V. Gehler, and Bernt Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2016. 13

[41] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J. Black. ClothCap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017. 2, 3, 4

[42] Albert Pumarola, Jordi Sanchez-Riera, Gary P. T. Choi, Alberto Sanfeliu, and Francesc Moreno-Noguer. 3DPeople: Modeling the geometry of dressed humans. In The IEEE International Conference on Computer Vision (ICCV), October 2019. 15

[43] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J. Black. Generating 3D faces using convolutional mesh autoencoders. In The European Conference on Computer Vision (ECCV), pages 704–720, 2018. 2, 3, 5, 6, 14

[44] Anurag Ranjan, David T. Hoffmann, Dimitrios Tzionas, Siyu Tang, Javier Romero, and Michael J. Black. Learning multi-human optical flow. International Journal of Computer Vision (IJCV), 12 2019. 1


[45] Anurag Ranjan, Javier Romero, and Michael J. Black. Learning human optical flow. In British Machine Vision Conference (BMVC), 2018. 1, 2

[46] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In The IEEE International Conference on Computer Vision (ICCV), October 2019. 2, 3, 4

[47] Igor Santesteban, Miguel A. Otaduy, and Dan Casas. Learning-based animation of clothing for virtual try-on. arXiv preprint arXiv:1903.07190, 2019. 3

[48] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. FACSIMILE: Fast and accurate scans from an image in less than a second. In The IEEE International Conference on Computer Vision (ICCV), October 2019. 2

[49] Carsten Stoll, Jürgen Gall, Edilson de Aguiar, Sebastian Thrun, and Christian Theobalt. Video-based reconstruction of animatable human characters. In ACM SIGGRAPH ASIA, 2010. 3

[50] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In The European Conference on Computer Vision (ECCV), pages 20–36, 2018. 2, 3, 4

[51] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 109–117, 2017. 1, 2, 14

[52] Nitika Verma, Edmond Boyer, and Jakob Verbeek. FeaStNet: Feature-steered graph convolutions for 3D shape analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 2, 3

[53] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics (TOG), volume 27, page 97. ACM, 2008. 15

[54] Tuanfeng Y. Wang, Duygu Ceylan, Jovan Popović, and Niloy J. Mitra. Learning a shared shape space for multimodal garment design. In SIGGRAPH Asia 2018 Technical Papers, page 203. ACM, 2018. 3

[55] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), pages 3–19, 2018. 12

[56] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Estimation of human body shape in motion with wide clothing. In The European Conference on Computer Vision (ECCV), pages 439–454. Springer, 2016. 15

[57] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Analyzing clothing layer deformation statistics of 3D human motions. In The European Conference on Computer Vision (ECCV), pages 237–253, 2018. 3

[58] Chao Zhang, Sergi Pujades, Michael J. Black, and Gerard Pons-Moll. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017. 4, 15

[59] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. DeepHuman: 3D human reconstruction from a single image. In The IEEE International Conference on Computer Vision (ICCV), 2019. 2, 3, 4

[60] Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4491–4500, 2019. 2, 4

[61] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5


Appendix

A. Implementation Details

A.1. CAPE network architecture

Here we provide the details of the CAPE architecture, as described in Sec. 4.1. We use the following notation:

• x: data; x̂: output (reconstruction) from the decoder; z: latent code; p: the prediction map from the discriminator;

• CONV_n: Chebyshev graph convolution layer with n filters;

• CResBlock_n: conditional residual block that uses CONV_n as filters;

• DS_n: linear graph downsampling layer with a spatial downsample rate n;

• US_n: linear graph upsampling layer with a spatial upsample rate n;

• FC: fully connected layer;

• ReLU: rectified linear units.

Condition module: for pose θ, we remove the parameters that are not related to clothing, e.g. head, hands, fingers, feet and toes, resulting in 14 valid joints from the body. The pose parameters of each joint are represented by the flattened rotation matrix (see Sec. 4.1, "Conditional model"). This results in the overall pose parameter θ ∈ R^{9×14}. We feed this into a small fully-connected network:

θ ∈ R^{9×14} → FC → ReLU → FC → z_θ ∈ R^{24}

The clothing type c refers to the type of "outfit", i.e. a combination of upper body clothing and lower body clothing. There are four types of outfits in our training data: longlong: long sleeve shirt / T-shirt / jersey with long pants; shortlong: short sleeve shirt / T-shirt / jersey with long pants; and their opposites, shortshort and longshort. As the types of clothing are discrete by nature, we represent them using a one-hot vector, c ∈ R^4, and feed it into a linear layer:

c ∈ R^4 → FC → z_c ∈ R^8
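A PyTorch sketch of these two embedding networks; the layer sizes follow the listing above, while the hidden width of the pose MLP (not specified in the text) and the module names are our assumptions.

```python
import torch.nn as nn

class PoseEmbedding(nn.Module):
    """theta in R^{9x14} (flattened rotation matrices of 14 joints) -> z_theta in R^24."""
    def __init__(self, n_joints=14, hidden=64, out_dim=24):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(9 * n_joints, hidden),  # FC
            nn.ReLU(),                        # ReLU
            nn.Linear(hidden, out_dim),       # FC -> z_theta
        )

    def forward(self, theta):
        return self.net(theta.flatten(start_dim=1))

class ClothingEmbedding(nn.Module):
    """One-hot clothing type c in R^4 -> z_c in R^8."""
    def __init__(self, n_types=4, out_dim=8):
        super().__init__()
        self.fc = nn.Linear(n_types, out_dim)

    def forward(self, c):
        return self.fc(c)
```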

Encoder:

x ∈ R^{3×6890} → CONV_64 → DS_2 → CONV_64
→ CONV_128 → DS_2 → CONV_128
→ CONV_256 → DS_2 → CONV_256
→ CONV_512 → CONV_512
→ CONV^{1×1}_64
→ FC → z_µ ∈ R^{18}, FC → z_σ ∈ R^{18}

Decoder:

z ∈ R^{18} --concat(z_θ, z_c)--> z′ ∈ R^{18+24+8}
→ FC → CONV^{1×1}_512
→ CResBlock_512 → CResBlock_512 → US_2
→ CResBlock_256 → CResBlock_256 → US_2
→ CResBlock_128 → CResBlock_128 → US_2
→ CResBlock_64 → CResBlock_64
→ CONV_3 → x̂ ∈ R^{3×6890}

Discriminator:

x ∈ R^{3×6890} --concat, tile{z_θ, z_c}--> x′ ∈ R^{(3+24+8)×6890}
→ CONV_64 → CONV_64
→ CONV_128 → CONV_128
→ CONV_1 → p ∈ R^{1×431}

Conditional residual block: We adopt the graph residual block from Kolotouros et al. [28], which includes Group Normalization [55], a non-linearity, a graph convolutional layer and a graph linear layer (i.e. a Chebyshev convolution with polynomial order 0). Right after the input to the residual block, we append the condition vector to every input node along the feature channel. Our CResBlock is given by

x_in ∈ R^{i×P} → (concat tiled {z_θ, z_c}) → x′ ∈ R^{(i+24+8)×P} → ResBlock_j → x_out ∈ R^{j×P}

where x_in is the input to the CResBlock, with P nodes and i features on each node, and ResBlock_j is the graph residual block from [28] that outputs j features on each node.
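
A rough sketch of this block, reusing the ChebConv module sketched earlier in this section, is given below. The residual branch here is simplified to two ChebConv layers with a ReLU, so it illustrates only the condition-tiling mechanism; the exact layer ordering (GroupNorm, non-linearity, graph convolution, graph linear) of the block from [28] is not reproduced.

import torch
import torch.nn as nn

class CResBlock(nn.Module):
    """Conditional graph residual block: tile the condition vector onto every
    vertex, then map i -> j features per vertex with a residual graph block."""
    def __init__(self, in_ch, out_ch, cond_dim=24 + 8, K=2):
        super().__init__()
        self.conv1 = ChebConv(in_ch + cond_dim, out_ch, K)
        self.conv2 = ChebConv(out_ch, out_ch, K)
        self.skip = ChebConv(in_ch + cond_dim, out_ch, K=0)  # per-vertex linear

    def forward(self, x, z_cond, L):
        # x: (B, P, in_ch) vertex features; z_cond: (B, 32) = concat(z_theta, z_c)
        cond = z_cond.unsqueeze(1).expand(-1, x.shape[1], -1)  # tile to all P nodes
        h = torch.cat([x, cond], dim=-1)                       # (B, P, in_ch + 32)
        out = self.conv2(torch.relu(self.conv1(h, L)), L)
        return out + self.skip(h, L)                           # residual connection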

A.2. Training details

The model is trained for 60 epochs, with a batch size of 16, using stochastic gradient descent with a momentum of 0.9. The learning rate starts from an initial value of 2 × 10⁻³, increases by a warm-up step of 2 × 10⁻³ per epoch for 4 epochs, and then decays by a factor of 0.99 after every epoch.
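
This schedule can be written as a small helper (a sketch of our reading of the text; whether the decay starts from the peak warm-up value, as assumed below, or from some other value is not stated explicitly):

def learning_rate(epoch, base_lr=2e-3, warmup_step=2e-3, warmup_epochs=4, decay=0.99):
    """Warm-up followed by per-epoch exponential decay; epoch is 0-based,
    so epoch 0 uses the initial value base_lr."""
    if epoch <= warmup_epochs:
        return base_lr + warmup_step * epoch
    peak = base_lr + warmup_step * warmup_epochs  # 1e-2 after 4 warm-up epochs
    return peak * decay ** (epoch - warmup_epochs)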

The convolutions use Chebyshev polynomials of order 2 in the generator and of order 3 in the discriminator. An L2 weight decay with strength 2 × 10⁻³ is used for regularization.

We train our model for males and females separately. We split the male dataset into a training set of 26,574 examples and 5,852 test examples. The female dataset is split into a training set of 41,472 examples and a test set of 12,656 examples. Training takes approximately 15 minutes per epoch on the male dataset and 20 minutes per epoch on the female dataset. The results reported in Table 2 are based on the male dataset due to its higher variance in the number of subjects and poses. We also provide results on the female dataset in Table 5.

B. Image Fitting

Here we detail the objective function, experimental setup and extended results of the image fitting experiments, as described in Sec. 6.3.

B.1. Objective function

Similar to [31], we introduce a silhouette term to encourage the shape of the clothed body to match the image evidence. The silhouette is the set of all pixels that belong to the body's projection onto the image. Let Ŝ(β, θ, c, z) be the rendered silhouette of a clothed body mesh M(β, θ, c, z) (see Eq. (4)), and S be the ground truth silhouette. The silhouette objective is defined by the bi-directional distance between S and Ŝ(·):

E_S(β, θ, z; c, S, K) = Σ_{x ∈ Ŝ} l(x, S) + Σ_{x ∈ S} l(x, Ŝ(β, θ, c, z))    (10)

where l(x, S) is the L1 distance from a point x to the closest point in the silhouette S; the distance is zero if the point lies inside S. K denotes the camera parameters used to render the mesh to a silhouette on the image plane. The clothing type c is provided by the upstream pipeline and is therefore not optimized.
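
A common way to implement such a term is via distance transforms of the two binary masks, as in the NumPy/SciPy sketch below. Note that Eq. (10) specifies an L1 point-to-set distance, whereas the Euclidean distance transform used here is a convenient stand-in, not necessarily the authors' exact implementation.

import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_distance(sil_rendered, sil_gt):
    """Bi-directional silhouette distance in the spirit of Eq. (10).
    sil_rendered, sil_gt: boolean (H, W) masks."""
    # For every pixel, the distance to the nearest pixel inside each
    # silhouette (zero for pixels already inside it).
    dist_to_gt = distance_transform_edt(~sil_gt)
    dist_to_rendered = distance_transform_edt(~sil_rendered)
    # Rendered pixels are penalized by their distance to the GT silhouette,
    # and GT pixels by their distance to the rendered silhouette.
    return dist_to_gt[sil_rendered].sum() + dist_to_rendered[sil_gt].sum()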

For our rendered scan data, the ground truth silhouette and clothing type are acquired for free during rendering. For in-the-wild images, this information can be acquired using human-parsing networks, e.g. [15].

After the standard SMPLify optimization pipeline, we apply the clothing layer to the body, and apply an additional optimization step on body shape β, pose θ and clothing structure z, with respect to the overall objective:

E_total = E_J(β, θ; K, J_est) + λ_S E_S(β, θ, z; c, S, K) + λ_θ E_θ(θ) + λ_β E_β(β) + λ_a E_a(θ) + λ_z E_z(z)    (11)

The overall objective is a weighted sum of the silhouette loss and the other standard SMPLify energy terms. E_J is a weighted 2D distance between the projected SMPL joints and the detected 2D keypoints J_est. E_θ(θ) is the mixture-of-Gaussians pose prior term, E_β(β) the shape prior term, E_a(θ) the penalty term that discourages unnatural joint bending, and E_z(z) the L2 regularizer on z that prevents extreme clothing deformations. For more details about these terms, please refer to Bogo et al. [9].
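
In code, the additional optimization step simply evaluates a weighted sum of these energies at each iteration; the sketch below assumes each term is available as a callable and leaves the λ weights (whose values are not listed in this appendix) as inputs.

def total_objective(beta, theta, z, c, terms, w):
    """Weighted sum of Eq. (11). `terms` maps names to callables for the
    individual energies; `w` holds the lambda weights."""
    return (terms['E_J'](beta, theta)
            + w['S'] * terms['E_S'](beta, theta, z, c)
            + w['theta'] * terms['E_theta'](theta)
            + w['beta'] * terms['E_beta'](beta)
            + w['a'] * terms['E_a'](theta)
            + w['z'] * terms['E_z'](z))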

B.2. Data

We render 120 textured meshes (aligned to the SMPL topology) from the test set of the CAPE dataset, with variations in gender, pose and clothing type, at a resolution of 512×512. The ground truth meshes are used for evaluation. Examples of the renderings are shown in Fig. 7.

B.3. Setup

We re-implement the SMPLify method of Bogo et al. [9] in TensorFlow, using the gender-neutral SMPL body model. Compared to the original SMPLify, there are two major changes. First, we do not include the interpenetration error term, as it slows down the fitting but brings little performance gain [27]. Second, we use OpenPose instead of DeepCut [40] for the 2D keypoint detection.

B.4. Evaluation

We measure the mean squared error (MSE) between the ground truth vertices V_GT and the reconstructed vertices from SMPLify, V_SMPLify, and from our pipeline (Eq. (11)), V_CAPE, respectively. As discussed in Sec. 6.3, to eliminate the influence of the ambiguity caused by focal length, camera translation and body scale, we estimate the body scale s and camera translation t for both V_SMPLify and V_CAPE. Specifically, we optimize the following energy function for V = V_SMPLify and V = V_CAPE respectively:

(s, t) = argmin_{s,t} (1/N) Σ_{i ∈ C} ||s(V_i + t) − V_GT,i||²    (12)

where i is the vertex index, C the set of clothing vertex indices, and N the number of elements in C. The MSE is then computed with the estimated scale s and translation t:

MSE = (1/N) Σ_{i ∈ C} ||s(V_i + t) − V_GT,i||²    (13)
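
Because only a scale and a translation are estimated, Eq. (12) has a closed-form least-squares solution, which the sketch below uses; the paper states only that the energy is optimized, so the closed-form treatment is our simplification.

import numpy as np

def aligned_mse(V, V_gt, cloth_idx):
    """Solve Eq. (12) for scale s and translation t in closed form, then
    return the MSE of Eq. (13). V, V_gt: (num_vertices, 3) arrays;
    cloth_idx: indices of the clothing vertex set C."""
    P, G = V[cloth_idx], V_gt[cloth_idx]
    Pc, Gc = P - P.mean(0), G - G.mean(0)
    s = (Pc * Gc).sum() / (Pc ** 2).sum()   # optimal scale
    t = G.mean(0) / s - P.mean(0)           # optimal translation
    return np.mean(np.sum((s * (P + t) - G) ** 2, axis=1))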

B.5. Extended image fitting results

Qualitative results. Fig. 7 shows the reconstruction results of SMPLify [9] and of our method on rendered meshes from the CAPE dataset. Quantitative results can be found in the main manuscript, Table 4.

Limitations. As our image fitting pipeline is based on SMPLify, it fails when SMPLify fails to predict the correct pose. Recent regression-based methods have achieved improved performance on this task [24, 27, 28], and integrating CAPE with them is an interesting direction for future work.

C. Extended Experimental Results

C.1. Reconstruction error on the female dataset

We train and evaluate our model on the female dataset, with the same setup and metric as for the male dataset (Sec. A.2). As in Sec. 6.1 of the main manuscript, we compare the result with a PCA model and two variations of the Convolutional Mesh Autoencoder (CoMA) [43], as shown in Table 5. Again, our model outperforms all the baselines.

Figure 7: Qualitative results on the rendered meshes from the CAPE dataset. Minimally-clothed fitting results from SMPLify [9] are shown in green; results from our method are shown in blue.

Table 5: Model representation power on the female dataset, measured by per-vertex auto-encoding error.

Method    Error (mm)
PCA       4.82 ± 3.82
CoMA-1    4.69 ± 3.85
CoMA-4    4.86 ± 3.96
Ours      4.21 ± 3.76

C.2. CAPE with SMPL texture

As our model has the same topology as SMPL, it is compatible with all existing SMPL texture maps, which are mostly of clothed bodies. Fig. 8 shows an example texture applied to the standard minimally-clothed SMPL model (as done in the SURREAL dataset [51]) and to our clothed body model, respectively. Although the texture creates an illusion of clothing on the SMPL body, the overall shape remains skinny and oversmoothed, and hence unrealistic. In contrast, our model, with its improved clothing geometry, matches the clothing texture more naturally if the correct clothing type is given. This visual contrast becomes even stronger when the texture map contains no shading information (albedo map), and when the object is viewed in a 3D setting.

As a future line of research, one could model the alignment between the clothing texture boundaries and the underlying geometry by learning a texture model that is coupled to shape.

Figure 8: Front row: a clothing texture applied to the SMPL body ("SMPL") and to one of our generated clothed bodies ("Ours"). Back row: the respective underlying geometry. Best viewed zoomed-in on screen.

D. CAPE Dataset Details

Elaborating on Sec. 5 of the main manuscript, our dataset consists of:

• 40K registered 3D meshes of clothed human scans for each gender.

• 8 male and 3 female subjects.


Table 6: Comparison with other datasets of clothed humans.

Dataset             Captured   Body Shape   Registered   Large Pose   Motion      High Quality
                               Available                 Variation    Sequences   Geometry
Inria dataset [56]  Yes        Yes          No           No           Yes         No
BUFF [58]           Yes        Yes          No           No           Yes         Yes
Adobe dataset [53]  Yes        No           Yes*         No           Yes         No
RenderPeople        Yes        No           No           Yes          No          Yes
3D People [42]      No         Yes          Yes*         Yes          Yes         Yes
Ours                Yes        Yes          Yes          Yes          Yes         Yes

* Registered per-subject, i.e. mesh topology is consistent only within the instances from the same subject.

Figure 9: Examples from the CAPE dataset: we provide an accurate minimally-clothed body shape (green) and clothed body scans with large pose and clothing wrinkle variations, all registered to the SMPL mesh topology (blue).

• 4 different types of outfits, covering more than 20 pieces of clothing from common clothing types.

• Large variations in pose.

• Precise, captured minimally clothed body shape.

Table 6 shows a comparison with public 3D clothed human datasets. Our dataset is distinguished by accurate alignment, consistent mesh topology, ground truth body shape scans, and a large variation of poses. These features make it suitable not only for studies on the human body and clothing, but also for the evaluation of various Graph-CNNs. See Fig. 9 for examples from the dataset.
