
5464 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 28, NO. 11, NOVEMBER 2019

AttGAN: Facial Attribute Editing by Only Changing What You Want

Zhenliang He, Wangmeng Zuo, Senior Member, IEEE, Meina Kan, Member, IEEE,
Shiguang Shan, Senior Member, IEEE, and Xilin Chen, Fellow, IEEE

Abstract—Facial attribute editing aims to manipulate single or multiple attributes of a given face image, i.e., to generate a new face image with the desired attributes while preserving other details. Recently, the generative adversarial net (GAN) and the encoder-decoder architecture are usually incorporated to handle this task with promising results. Based on the encoder-decoder architecture, facial attribute editing is achieved by decoding the latent representation of a given face conditioned on the desired attributes. Some existing methods attempt to establish an attribute-independent latent representation for further attribute editing. However, such an attribute-independent constraint on the latent representation is excessive because it restricts the capacity of the latent representation and may result in information loss, leading to over-smooth or distorted generation. Instead of imposing constraints on the latent representation, in this work we apply an attribute classification constraint to the generated image to just guarantee the correct change of the desired attributes, i.e., to change what you want. Meanwhile, reconstruction learning is introduced to preserve attribute-excluding details, in other words, to only change what you want. Besides, adversarial learning is employed for visually realistic editing. These three components cooperate with each other, forming an effective framework for high quality facial attribute editing, referred to as AttGAN. Furthermore, the proposed method is extended to attribute style manipulation in an unsupervised manner. Experiments on two wild datasets, CelebA and LFW, show that the proposed method outperforms the state of the art in realistic attribute editing with other facial details well preserved.

Manuscript received July 25, 2018; revised January 12, 2019; accepted May 2, 2019. Date of publication May 20, 2019; date of current version August 22, 2019. This work was supported in part by the National Key R&D Program of China under Contract 2017YFA0700800, and in part by the Natural Science Foundation of China under Contract 61671182, Contract 61772496, and Contract 61732004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sen-Ching Samson Cheung. (Corresponding author: Shiguang Shan.)

Z. He, M. Kan, and X. Chen are with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]; [email protected]).

W. Zuo is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China (e-mail: [email protected]).

S. Shan is with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (CAS), Beijing 100190, China, also with the University of Chinese Academy of Sciences, Beijing 100049, China, also with the CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai 200031, China, and also with the Peng Cheng Laboratory, Shenzhen 518055, China (e-mail: [email protected]).

This article has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the authors. The material includes a PDF containing additional results of facial attribute editing.

Digital Object Identifier 10.1109/TIP.2019.2916751

Index Terms—Facial attribute editing, attribute style manipulation, adversarial learning.

I. INTRODUCTION

THIS work investigates the facial attribute editing task, which aims to edit a face image by manipulating single or multiple attributes of interest (e.g., hair color, expression, mustache and age). For conventional face recognition [1], [2] and facial attribute prediction [3], [4] tasks, significant advances have been made along with the development of deep convolutional neural networks (CNNs) and large scale labeled datasets. However, it is difficult or even impossible to collect labeled images of the same person with varying attributes, so supervised learning is generally inapplicable to facial attribute editing. Therefore, researchers turn to generative models such as the variational autoencoder (VAE) [5] and the generative adversarial network (GAN) [6], and make considerable progress on facial attribute editing [7]-[16].

Some existing methods [9]-[12] use different editing models for different attributes, so one has to train numerous models to handle the various attribute editing subtasks, which is impractical for real deployment. For this problem, the encoder-decoder architecture [7], [8], [13]-[15] is an effective solution that uses a single model for multiple attribute manipulation. Therefore, we also focus on the encoder-decoder architecture and develop an effective method for high quality facial attribute editing.

With the encoder-decoder architecture, facial attribute editing is achieved by decoding the latent representation from the encoder conditioned on the expected attributes. Within such a framework, the key issue of facial attribute editing is how to model the relation between the attributes and the face latent representation. For this issue, VAE/GAN [7] represents each attribute as a vector, defined as the difference between the mean latent representations of the faces with and without that attribute. Then, by adding a single or multiple attribute vectors to a face latent representation, the face image decoded from the modified representation is expected to own those attributes. However, such an attribute vector entangles highly correlated attributes, inevitably leading to unexpected changes of other attributes, e.g., adding blond hair always makes a male become a female because most blond hair subjects in the training set are female. In IcGAN [8], the latent representation is sampled from a normal distribution independent of the attributes.


In Fader Networks [13], an adversarial process is introduced to force the latent representation of an autoencoder to be invariant to the attributes. However, the attributes portray the characteristics of a face image, which implies that the relation between the attributes and the face latent representation is highly complex and closely dependent. Therefore, simply imposing the attribute-independent constraint on the latent representation not only restricts its capacity but also may result in information loss, which is harmful to the attribute editing. From a theoretical perspective, 1) the attribute-independent constraint amounts to minimizing the mutual information between the attributes and the latent representation; 2) in contrast, the autoencoder objective amounts to maximizing the mutual information between the input image (including its attributes) and the latent representation [17]. Therefore, the attribute-independent constraint and the autoencoder objective conflict with each other, resulting in a compromised performance.

With the above limitation of existing methods in mind, we argue that the invariance of the latent representation to the attributes is excessive, and what we need is just the correct editing of attributes, no matter whether the latent representation is invariant to the attributes or not. To this end, instead of imposing the attribute-independence constraint on the latent representation [8], [13], we apply an attribute classification constraint to the generated image, just requiring the correct attribute manipulations, i.e., to "change what you want". Therefore, in comparison with IcGAN [8] and Fader Networks [13], the latent representation in our method is constraint free, which guarantees its capacity and flexibility for further attribute editing. Besides, we introduce reconstruction learning for the preservation of the attribute-excluding details1, i.e., we aim to "only change" the expected attributes while keeping the other details unchanged. Moreover, adversarial learning is employed for visually realistic editing. In our design, the classification constraint and the reconstruction learning are applied on two separate branches. Therefore, unlike the methods with the attribute-independent constraint, the classification constraint (for correct editing) and the reconstruction learning (for detail preservation) are not necessarily in conflict, but work collaboratively with each other.

Our method, referred to as AttGAN, can generate visually more pleasing results with fine facial details (see Fig. 1) in comparison with the state of the art. Moreover, our AttGAN is naturally extended to attribute style manipulation. To sum up, the contributions of this work are threefold:

• Properly considering the relation between the attributes and the face latent representation under the principle of just satisfying the correct editing objective. Our AttGAN removes the strict attribute-independent constraint from the latent representation, and just applies the attribute classification constraint to the generated image to guarantee the correct change of the attributes.

• Incorporating the attribute classification constraint, the reconstruction learning and the adversarial learning into a unified framework for high quality facial attribute editing, i.e., the attributes are correctly edited, the attribute-excluding details are well preserved and the whole image is visually realistic.

• Promising results of multiple facial attribute editing using a single model. AttGAN outperforms the state of the art with better perceptual quality for facial attribute editing. Moreover, our method is naturally extended to attribute style manipulation.

1Attribute-excluding details mean the other details of a face image except for the expected attributes, such as face identity, illumination and background.

Fig. 1. Facial attribute editing results from our AttGAN. Best viewed in color and at higher resolution (by zooming in).

II. RELATED WORK

A. Facial Attribute Editing

There are two types of methods for facial attribute editing: optimization based ones [18], [19] and learning based ones [7]-[14], [16]. Optimization based methods include CNAI [18] and DFI [19]. To change a given face to a new face with the expected attributes, CNAI [18] defines an attribute loss as the CNN feature difference between the given face and a set of faces with the expected attributes, and then minimizes this loss with respect to the given face. Based on the assumption that a CNN linearizes the manifold of natural images into a Euclidean feature subspace [20], DFI [19] first linearly moves the deep feature of the input face along the direction vector between the faces with and without the expected attributes. Then facial attribute editing is achieved by optimizing the input face to match its deep feature with the moved feature. Optimization based methods need to conduct several or even many optimization iterations for each testing image, which is usually time-consuming and unfriendly for real world applications.

More popular methods are learning based ones.


Li et al. [9] propose to train a deep identity-aware attribute transfer model to add/remove an attribute to/from a face image by employing an adversarial attribute loss and a deep identity feature loss. Shen and Liu [10] adopt a dual residual learning strategy to simultaneously train two networks for respectively adding and removing a specific attribute. Specializing in the makeup attribute, PairedCycleGAN [21] employs a pair of asymmetric networks for applying and removing makeup. GeneGAN [12] swaps a specific attribute between two given images by recombining the information of their latent representations. These methods [9]-[12], however, train different models for different attributes (or attribute combinations), leading to a large number of models, which is also unfriendly for real world applications.

Moreover, several learning based methods for multiple facial attribute editing with one model have been proposed. In VAE/GAN [7], GAN [6] and VAE [5] are combined to learn a latent representation and a decoder. Attribute editing is then achieved by modifying the latent representation to carry the information of the expected attributes and then decoding it. Specializing in hair style editing, H-GAN [22] is another hybrid model of VAE and GAN with a latent recognizer and a hair style recognizer. The hair style editing is achieved by first adding the mean hair style vector to the face latent representation, and then decoding the modified latent representation conditioned on the given hair style. IcGAN [8] separately trains a cGAN [23] and an encoder, requiring that the latent representation is sampled from a uniform distribution and therefore independent of the attributes. Attribute editing is then performed by first encoding an image into the latent representation and then decoding the representation conditioned on the given attributes. Fader Networks [13] employs an adversarial process on the latent representation of an autoencoder to learn an attribute-invariant representation. Then, the decoder takes such a representation and an arbitrary attribute vector as input to generate the edited result. However, the attribute-independent constraint on the latent representation in IcGAN and Fader Networks is excessive, because it harms the representation ability and may result in information loss, leading to unexpected distortion in the generated images (e.g., over smoothing). Kim et al. [14] define different blocks of the latent code as the representations of different attributes, and swap several latent code blocks between two given images to achieve multiple attribute swapping. DNA-GAN [15] also swaps attribute-relevant latent blocks between a given pair of images to make "crossbreed" images. Both Kim et al. [14] and DNA-GAN [15] can be viewed as extensions of GeneGAN [12] to multiple attributes. StarGAN [16] trains a conditional attribute transfer network via an attribute classification loss and a cycle consistency loss. StarGAN and our AttGAN were concurrently and independently proposed2 and share some similar objective functions. The main differences between StarGAN and AttGAN are twofold: 1) StarGAN uses a cycle consistency loss while our AttGAN uses reconstruction learning as a direct constraint without any cyclic process; 2) StarGAN trains a conditional attribute transfer network and does not involve any latent representation, while AttGAN uses an encoder-decoder architecture and models the relation between the latent representation and the attributes.

The image translation task is closely related to facial attribute editing; it focuses on transforming images among different domains, e.g., converting a real scene photo to a Monet style painting [24]. The main difference between the image translation task and the facial attribute editing task is that image translation specializes in domain level manipulation, while facial attribute editing specializes in attribute level manipulation within the face domain. For image translation, CycleGAN [24] trains two bidirectional transfer models between two image domains by employing the cycle consistency loss and two domain specific adversarial learning processes. Lu et al. [25] employ a conditional CycleGAN to generate a high resolution face image for a low resolution input that satisfies the given attributes. UNIT [11] learns to encode the images of two different domains into a common latent space, and then decodes the latent representation to the expected domain via the domain specific decoder. Although the image translation methods do not specialize in facial attribute editing, one can adapt some of them for our task, e.g., CycleGAN can be used for facial attribute editing by regarding face images with and without the expected attributes as two different domains. Another closely related task is attribute-to-image, i.e., to generate an image with specific attributes from random noise. Different from facial attribute editing, the attribute-to-image task focuses on generating images from random noise rather than from given images. For this task, Attribute2Image [26] employs a disentangling conditional VAE with a layered representation.

Our AttGAN is a learning based method for single or multiple facial attribute editing, which is mostly motivated by the encoder-decoder based methods VAE/GAN [7], IcGAN [8] and Fader Networks [13]. We mainly focus on the disadvantages of these three methods in modeling the relation between the latent representation and the attributes, and propose a novel method to solve this problem.

2StarGAN first appeared on 2017.11.24 - http://arxiv.org/abs/1711.09020, and our AttGAN first appeared on 2017.11.29 - http://arxiv.org/abs/1711.10678.

B. Generative Adversarial Networks

Denote by $p_{data}(x)$ the distribution of the real images $x$, and by $p_z(z)$ the distribution of the input noise. The generative adversarial net (GAN) [6] is a special generative model that learns a generator $G(z)$ to capture the distribution $p_{data}$ via an adversarial process. Specifically, a discriminator $D$ is introduced to distinguish the generated images from the real ones, while the generator $G(z)$ is updated to confuse the discriminator. The adversarial process is formulated as a minimax game:

$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$  (1)

Theoretically, when the adversarial process reaches the Nash equilibrium, the minimax game attains its global optimum $p_{G(z)} = p_{data}$ [6].

GAN is notorious for its unstable training and mode collapse. DCGAN [27] uses CNNs and batch normalization [28] for stable training.


Fig. 2. Overview of our AttGAN, which contains three main components at training: the attribute classification constraint, the reconstruction learning and the adversarial learning. The attribute classification constraint guarantees the correct attribute manipulation on the generated image. The reconstruction learning aims at preserving the attribute-excluding details. The adversarial learning is employed for visually realistic generation.

Subsequently, to avoid mode collapse and further enhance the training stability, WGAN [29] minimizes the Wasserstein-1 distance between the generated distribution and the real distribution:

$\min_G \max_{\|D\|_L \le 1} \; \mathbb{E}_{x \sim p_{data}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]$  (2)

where $D$ is constrained to be a 1-Lipschitz function, implemented by weight clipping. Furthermore, WGAN-GP [30] improves the implementation of the Lipschitz constraint in WGAN by imposing a gradient penalty on the discriminator instead of weight clipping. In this work, we adopt WGAN-GP for the adversarial learning.

Several works have been developed for conditional generation with given attributes or class labels [23], [31]-[33]. Employing an auxiliary classifier or regressor, both AC-GAN [32] and InfoGAN [33] learn conditional generation by mapping the generated images back to the conditional signals. Since AC-GAN generates images from random noise, it cannot be directly used for the attribute editing task, and it is not trivial to design an AC-GAN variant with satisfactory attribute editing performance. In this work, we incorporate the AC-GAN spirit to form the attribute classification constraint, which cooperates with the other components for effective facial attribute editing.

III. ATTRIBUTE GAN (ATTGAN)

This section introduces the AttGAN approach for editing binary facial attributes, i.e., each attribute is represented by 1/0 for with/without it, and all attributes together are represented by a 1/0 sequence. As shown in Fig. 2, our AttGAN is comprised of two basic subnetworks, an encoder $G_{enc}$ and a decoder $G_{dec}$, together with an attribute classifier $C$ and a discriminator $D$. In the following, we describe the design principles of AttGAN and introduce the objectives for training these modules. Then we present an extension of AttGAN for attribute style manipulation, e.g., to edit the "Eyeglasses" attribute into sunglasses, black rim glasses or thin rim glasses.

A. Testing Formulation

Given a face image $x^a$ with $n$ binary attributes $a = [a_1, \ldots, a_n]$, the encoder $G_{enc}$ is used to encode $x^a$ into a latent representation, denoted as

$z = G_{enc}(x^a)$  (3)

Then the process of editing the attributes of $x^a$ to another set of attributes $b = [b_1, \ldots, b_n]$ is achieved by decoding $z$ conditioned on $b$, i.e.,

$x^{\hat{b}} = G_{dec}(z, b)$  (4)


where $x^{\hat{b}}$ is the edited image expected to own the attributes $b$. Thus the whole editing process is formulated as

$x^{\hat{b}} = G_{dec}(G_{enc}(x^a), b)$  (5)

B. Training Formulation

It can be seen from Eq. (5) that the attribute editing problem can be formally defined as the learning of the encoder $G_{enc}$ and the decoder $G_{dec}$. This learning problem is unsupervised, because the ground truth of the editing, i.e., $x^b$, is unavailable.

On the one hand, the editing of the given face image $x^a$ is expected to produce a realistic image with attributes $b$. For this purpose, an attribute classifier is used to constrain the generated image $x^{\hat{b}}$ to correctly own the desired attributes, i.e., the attribute prediction of $x^{\hat{b}}$ should be $b$. Meanwhile, adversarial learning is employed on $x^{\hat{b}}$ to ensure that it is visually realistic.

On the other hand, an eligible attribute editing should only change the desired attributes while keeping the other details unchanged. To this end, reconstruction learning is introduced to 1) make the latent representation $z$ conserve enough information for the later recovery of the attribute-excluding details, and 2) enable the decoder $G_{dec}$ to restore the attribute-excluding details from $z$. Specifically, for the given $x^a$, the image generated conditioned on its own attributes $a$, i.e.,

$x^{\hat{a}} = G_{dec}(z, a)$  (6)

should approximate $x^a$ itself, i.e., $x^{\hat{a}} \to x^a$.

In summary, the relation between the attributes $a$/$b$ and the latent representation $z$ is implicitly modeled in two aspects: 1) the interaction between $z$ and $b$ in the decoder should produce a realistic image $x^{\hat{b}}$ with the correct attributes, and 2) the interaction between $z$ and $a$ in the decoder should produce an image $x^{\hat{a}}$ approximating the input $x^a$ itself.

1) Attribute Classification Constraint: As mentioned above, the generated image $x^{\hat{b}}$ should correctly own the new attributes $b$. Therefore, we employ an attribute classifier $C$ to constrain the generated image $x^{\hat{b}}$ to own the desired attributes, i.e., $C(x^{\hat{b}}) \to b$, formulated as follows,

$\min_{G_{enc}, G_{dec}} \mathcal{L}_{cls_g} = \mathbb{E}_{x^a \sim p_{data},\, b \sim p_{attr}}[\ell_g(x^a, b)]$  (7)

$\ell_g(x^a, b) = \sum_{i=1}^{n} -b_i \log C_i(x^{\hat{b}}) - (1 - b_i)\log(1 - C_i(x^{\hat{b}}))$  (8)

where $p_{data}$ and $p_{attr}$ indicate the distribution of real images and the distribution of attributes, $C_i(x^{\hat{b}})$ indicates the prediction of the $i$-th attribute, and $\ell_g(x^a, b)$ is the summation of the binary cross entropy losses of all attributes.

The attribute classifier $C$ is trained on the input images with their original attributes, using the following objective,

$\min_{C} \mathcal{L}_{cls_c} = \mathbb{E}_{x^a \sim p_{data}}[\ell_r(x^a, a)]$  (9)

$\ell_r(x^a, a) = \sum_{i=1}^{n} -a_i \log C_i(x^a) - (1 - a_i)\log(1 - C_i(x^a))$  (10)
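Both classification terms reduce to a sum of per-attribute binary cross entropies (Eqs. (8) and (10)). A minimal sketch, assuming the classifier outputs one logit per attribute; the helper name is illustrative.

```python
import tensorflow as tf

def attr_cls_loss(logits, targets):
    """Sum over attributes of binary cross entropy, averaged over the batch.

    logits:  (N, n_att) raw outputs of the classifier C.
    targets: (N, n_att) binary attribute labels (a for real images, b for edited ones).
    """
    targets = tf.cast(targets, logits.dtype)
    bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits)
    return tf.reduce_mean(tf.reduce_sum(bce, axis=1))

# L_cls_g (Eq. (7)): push the generator so that C(x_b_hat) matches the target b:
#   l_cls_g = attr_cls_loss(C(x_b_hat), b)
# L_cls_c (Eq. (9)): train the classifier C on real images with their own labels a:
#   l_cls_c = attr_cls_loss(C(x_a), a)
```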

2) Reconstruction Loss: The reconstruction learning aims at the preservation of the attribute-excluding details. To this end, the decoder should learn to reconstruct the input image $x^a$ by decoding the latent representation $z$ conditioned on the original attributes $a$. The learning objective is formulated as

$\min_{G_{enc}, G_{dec}} \mathcal{L}_{rec} = \mathbb{E}_{x^a \sim p_{data}}[\|x^a - x^{\hat{a}}\|_1]$  (11)

where we use the $\ell_1$ loss rather than the $\ell_2$ loss to suppress blurriness.

3) Adversarial Loss: Adversarial learning between the generator (comprising the encoder and decoder) and the discriminator is introduced to make the generated image $x^{\hat{b}}$ visually realistic. Following WGAN [29], the adversarial losses for the discriminator and the generator are formulated as below,

$\min_{\|D\|_L \le 1} \mathcal{L}_{adv_d} = -\mathbb{E}_{x^a \sim p_{data}}[D(x^a)] + \mathbb{E}_{x^a \sim p_{data},\, b \sim p_{attr}}[D(x^{\hat{b}})]$  (12)

$\min_{G_{enc}, G_{dec}} \mathcal{L}_{adv_g} = -\mathbb{E}_{x^a \sim p_{data},\, b \sim p_{attr}}[D(x^{\hat{b}})]$  (13)

where $D$ is the discriminator described in Eq. (2). The adversarial losses are optimized via WGAN-GP [30].
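For reference, a sketch of the adversarial losses of Eqs. (12)-(13) together with the WGAN-GP gradient penalty [30]; the penalty weight of 10 follows the WGAN-GP paper and is an assumption here, not a value stated in this section.

```python
import tensorflow as tf

def d_adv_loss(D, x_a, x_b_hat, gp_weight=10.0):
    """Discriminator loss of Eq. (12) plus the WGAN-GP gradient penalty."""
    loss = -tf.reduce_mean(D(x_a)) + tf.reduce_mean(D(x_b_hat))

    # Gradient penalty on random interpolations between real and generated images.
    eps = tf.random.uniform([tf.shape(x_a)[0], 1, 1, 1], 0.0, 1.0)
    x_hat = eps * x_a + (1.0 - eps) * x_b_hat
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = D(x_hat)
    grads = tape.gradient(d_hat, x_hat)
    grad_norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    gp = tf.reduce_mean(tf.square(grad_norm - 1.0))
    return loss + gp_weight * gp

def g_adv_loss(D, x_b_hat):
    """Generator adversarial loss of Eq. (13)."""
    return -tf.reduce_mean(D(x_b_hat))
```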

4) Overall Objective: By combining the attribute classification constraint, the reconstruction loss and the adversarial loss, a unified attribute GAN (AttGAN) is obtained, which can edit the desired attributes while keeping the attribute-excluding details well preserved. Overall, the objective for the encoder and decoder is formulated as below,

$\min_{G_{enc}, G_{dec}} \mathcal{L}_{enc,dec} = \lambda_1 \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{cls_g} + \mathcal{L}_{adv_g}$  (14)

and the objective for the discriminator and the attribute classifier is formulated as below,

$\min_{D, C} \mathcal{L}_{dis,cls} = \lambda_3 \mathcal{L}_{cls_c} + \mathcal{L}_{adv_d}$  (15)

where the discriminator and the attribute classifier share most layers, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyperparameters balancing the losses.
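One training iteration then alternates between the two objectives of Eqs. (14)-(15). A condensed sketch, assuming the attr_cls_loss helper sketched above and a shared network DC that returns both the critic score and the attribute logits; the gradient penalty and the exact update ratio are left out for brevity, and all handles are assumptions rather than the released interface.

```python
import tensorflow as tf

lambda_1, lambda_2, lambda_3 = 100.0, 10.0, 1.0   # loss weights (see Sec. IV-B)

def train_step(G_enc, G_dec, DC, opt_g, opt_dc, x_a, a, b):
    """One alternating AttGAN update. DC(x) returns (critic score, attribute logits)."""
    # --- update the discriminator and classifier, Eq. (15) ---
    with tf.GradientTape() as tape:
        z = G_enc(x_a, training=True)
        x_b_hat = G_dec([z, b], training=True)
        d_real, c_real = DC(x_a, training=True)
        d_fake, _ = DC(x_b_hat, training=True)
        l_adv_d = -tf.reduce_mean(d_real) + tf.reduce_mean(d_fake)  # Eq. (12); add WGAN-GP in practice
        l_cls_c = attr_cls_loss(c_real, a)                          # Eq. (9)
        l_dc = lambda_3 * l_cls_c + l_adv_d
    grads = tape.gradient(l_dc, DC.trainable_variables)
    opt_dc.apply_gradients(zip(grads, DC.trainable_variables))

    # --- update the encoder and decoder, Eq. (14) ---
    with tf.GradientTape() as tape:
        z = G_enc(x_a, training=True)
        x_b_hat = G_dec([z, b], training=True)
        x_a_hat = G_dec([z, a], training=True)
        d_fake, c_fake = DC(x_b_hat, training=True)
        l_rec = tf.reduce_mean(tf.abs(x_a - x_a_hat))               # Eq. (11)
        l_cls_g = attr_cls_loss(c_fake, b)                          # Eq. (7)
        l_adv_g = -tf.reduce_mean(d_fake)                           # Eq. (13)
        l_g = lambda_1 * l_rec + lambda_2 * l_cls_g + l_adv_g
    g_vars = G_enc.trainable_variables + G_dec.trainable_variables
    grads = tape.gradient(l_g, g_vars)
    opt_g.apply_gradients(zip(grads, g_vars))
    return l_dc, l_g
```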

C. Why Are Attribute-Excluding Details Preserved?

The above AttGAN design can be viewed as multi-task learning of the attribute editing task (with the classification loss) and the face reconstruction task (with the reconstruction loss), which share the entire encoder-decoder network. However, AttGAN only conducts the reconstruction learning on the image generated conditioned on the original attributes $a$; why can the ability to preserve attribute-excluding details generalize to generation conditioned on other attributes $b$? We suggest the reason is that AttGAN transfers the detail preservation ability from the face reconstruction task to the attribute editing task. Since these two tasks share the same input domain and output domain, they are very similar tasks with a tiny transferability gap [36] between them. Therefore, the detail preservation ability learned from the face reconstruction task can be easily transferred to the attribute editing task. Besides, these two tasks are learned simultaneously, so the transfer is dynamic, and the attribute editing learning does not wash out the ability of facial detail reconstruction.

Fig. 3. Illustration of the AttGAN extension for attribute style manipulation. (a) shows the extended framework based on the original AttGAN. θ denotes the style controllers and Q denotes the style predictor. (b) shows the visual effect of changing the attribute style by varying θ.

D. Extension for Attribute Style Manipulation

In Sec. III-A and III-B, the attributes are binary, i.e., "with" or "without", which is rigid for real world applications. For example, in most cases what one is interested in is adding a certain style of eyeglasses, such as sunglasses or thin rim glasses, rather than just with/without eyeglasses. This problem is more difficult because labeled data with attribute styles is unavailable. Therefore, we seek an unsupervised way to make the generative model $G = (G_{enc}, G_{dec})$ be controlled by a style controller $\theta$ in an interpretable manner, e.g., $\theta$ controls the "Eyeglasses" attribute to be thin rim, black rim or sunglasses. The solution used in this work is to maximize the mutual information between $\theta$ and the generated image $x^{\theta} \sim G_{dec}(G_{enc}(x^a), \theta * b)$ in order to make them highly correlated [31], [33], i.e.,

$\max_{G = (G_{enc}, G_{dec})} I(\theta; x^{\theta})$  (16)

According to [33], the mutual information can be obtained by

$I(\theta; x^{\theta}) = \max_{Q} \mathbb{E}_{\theta \sim p(\theta),\, x^{\theta} \sim p_G(x^{\theta}|\theta)}[\log Q(\theta \mid x^{\theta})] + H(\theta)$  (17)

where $H(\theta)$ is a constant and $Q(\theta \mid x^{\theta})$ is an auxiliary posterior distribution, which can be viewed as a style predictor here. Substituting Eq. (17) into Eq. (16) and abandoning the constant $H(\theta)$, we get the objective below,

$\max_{G} \max_{Q} \mathbb{E}_{\theta \sim p(\theta),\, x^{\theta} \sim p_G(x^{\theta}|\theta)}[\log Q(\theta \mid x^{\theta})]$  (18)

where the inner $\max_Q$ estimates the mutual information, while the outer $\max_G$ maximizes it. In our setting, $\theta$ indicates the category of the style, therefore $Q(\theta \mid x^{\theta})$ is a categorical distribution implemented by the softmax output of a neural network, i.e.,

$Q(\theta = i \mid x^{\theta}) = \dfrac{\exp(w_i^T f(x^{\theta}) + b_i)}{\sum_j \exp(w_j^T f(x^{\theta}) + b_j)}$  (19)

Thus, substituting Eq. (19) into Eq. (18) and rewriting the optimization problem in the form of a loss function, we finally get the objective below,

$\min_{G} \min_{w, b, f} \mathcal{L} = \mathbb{E}_{\theta \sim p(\theta),\, x^{\theta} \sim p_G(x^{\theta}|\theta)}\Big[-\sum_i \delta(\theta = i) \cdot \log \dfrac{\exp(w_i^T f(x^{\theta}) + b_i)}{\sum_j \exp(w_j^T f(x^{\theta}) + b_j)}\Big]$  (20)

where the two minimizations are carried out iteratively. Fig. 3 shows our AttGAN extension with the style controller $\theta$ and the style predictor $Q$.

IV. IMPLEMENTATION DETAILS

Our AttGAN is implemented with the machine learning system TensorFlow [37], and the code is publicly available at https://github.com/LynnHo/AttGAN-Tensorflow.

A. Network Architecture

Table I and Table II show the detailed network architectures of our AttGAN. The discriminator $D$ is a stack of convolutional layers followed by fully connected layers, and the classifier $C$ has a similar architecture and shares all convolutional layers with $D$. The encoder $G_{enc}$ is a stack of convolutional layers and the decoder $G_{dec}$ is a stack of transposed convolutional layers. We also employ U-Net [38] like symmetric skip connections between the encoder and decoder, which have been shown to produce high quality results on the image translation task [39]. Architectures for 64 × 64 images are used in the comparisons with VAE/GAN [7], IcGAN [8] and StarGAN [16], and architectures for 128 × 128 images are used in the comparisons with Fader Networks [13], Shen et al. [10] and CycleGAN [24]. 384 × 384 images are shown in the other experiments for better visual effect.
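A compact Keras sketch of such an encoder-decoder pair with U-Net-like skip connections is given below. Layer counts and channel widths are illustrative, not the exact architectures of Tables I and II, and, unlike the simplified sketches above that treat the latent code as a single tensor, the encoder here returns all intermediate feature maps so the decoder can consume them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder_decoder(img_size=128, n_att=13, base=64, n_layers=5):
    """Encoder: stride-2 convolutions; decoder: transposed convolutions conditioned on
    the attribute vector, with U-Net-like skip connections to mirrored encoder layers."""
    s = img_size // 2 ** n_layers          # spatial size of the innermost feature map

    # ----- encoder -----
    x_in = layers.Input((img_size, img_size, 3))
    feats, h = [], x_in
    for i in range(n_layers):
        h = layers.Conv2D(base * 2 ** i, 4, strides=2, padding='same')(h)
        h = layers.BatchNormalization()(h)
        h = layers.LeakyReLU(0.2)(h)
        feats.append(h)
    G_enc = tf.keras.Model(x_in, feats, name='G_enc')

    # ----- decoder -----
    z_ins = [layers.Input(tuple(f.shape[1:])) for f in feats]
    att_in = layers.Input((n_att,))
    # tile the attribute vector over the innermost feature map and concatenate it
    a_map = layers.Lambda(
        lambda t: tf.tile(tf.reshape(t, [-1, 1, 1, n_att]), [1, s, s, 1]))(att_in)
    h = layers.Concatenate()([z_ins[-1], a_map])
    for i in reversed(range(n_layers)):
        out_ch = 3 if i == 0 else base * 2 ** (i - 1)
        h = layers.Conv2DTranspose(out_ch, 4, strides=2, padding='same')(h)
        if i > 0:
            h = layers.BatchNormalization()(h)
            h = layers.ReLU()(h)
            h = layers.Concatenate()([h, z_ins[i - 1]])   # U-Net-like skip connection
        else:
            h = layers.Activation('tanh')(h)              # output image in [-1, 1]
    G_dec = tf.keras.Model(z_ins + [att_in], h, name='G_dec')
    return G_enc, G_dec

# Usage: feats = G_enc(x); x_edit = G_dec(feats + [b_target])
```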


TABLE I

NETWORK ARCHITECTURES OF ATTGAN FOR 128 × 128 IMAGES

TABLE II

NETWORK ARCHITECTURES OF ATTGAN FOR 64 × 64 IMAGES

B. Training Details

The model is trained with the Adam optimizer [41] ($\beta_1 = 0.5$, $\beta_2 = 0.999$), a batch size of 32 and a learning rate of 0.0002. The coefficients for the losses in Eq. (14) and Eq. (15) are set as $\lambda_1 = 100$, $\lambda_2 = 10$, and $\lambda_3 = 1$, which aims to keep the loss values in the same order of magnitude.
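As a configuration sketch, the corresponding optimizer setup is shown below (values taken from this subsection; the data pipeline and training loop are omitted).

```python
import tensorflow as tf

# Hyperparameters from Sec. IV-B
batch_size = 32
learning_rate = 2e-4
lambda_1, lambda_2, lambda_3 = 100.0, 10.0, 1.0   # weights in Eqs. (14)-(15)

# Adam with beta_1 = 0.5, beta_2 = 0.999 for both alternating updates
opt_enc_dec = tf.keras.optimizers.Adam(learning_rate, beta_1=0.5, beta_2=0.999)
opt_dis_cls = tf.keras.optimizers.Adam(learning_rate, beta_1=0.5, beta_2=0.999)
```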

V. EXPERIMENTS

Dataset: We evaluate the proposed AttGAN on the CelebA [3] and LFW [40] datasets. CelebA contains two hundred thousand images, each of which is annotated with 40 binary attributes (with/without). Thirteen attributes with strong visual impact are chosen in all our experiments, including "Bald", "Bangs", "Black Hair", "Blond Hair", "Brown Hair", "Bushy Eyebrows", "Eyeglasses", "Gender", "Mouth Open", "Mustache", "No Beard", "Pale Skin" and "Age", which cover most attributes used in the existing works. Officially, CelebA is separated into a training set, a validation set and a testing set. We use the training set and validation set together to train our model, while the testing set is used for evaluation. Besides, for more comprehensive comparisons illustrating the robustness of our AttGAN, the entire LFW dataset with 13,233 images is used as another testing set.
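For concreteness, a sketch of how these thirteen labels can be selected from CelebA's 40 annotation columns and turned into target attribute vectors; the column names follow the official CelebA annotation file, where "Gender", "Mouth Open" and "Age" correspond to the Male, Mouth_Slightly_Open and Young columns (this mapping is our reading of the attribute list, not stated explicitly in the text).

```python
import numpy as np

# The 13 CelebA annotation columns corresponding to the attributes used in the paper.
ATT_NAMES = ['Bald', 'Bangs', 'Black_Hair', 'Blond_Hair', 'Brown_Hair',
             'Bushy_Eyebrows', 'Eyeglasses', 'Male', 'Mouth_Slightly_Open',
             'Mustache', 'No_Beard', 'Pale_Skin', 'Young']

def target_attributes(a, att_to_flip):
    """Build the target vector b from the source annotations a (shape (N, 13), entries 0/1)
    by inverting the chosen attributes, as in the single-attribute editing experiments."""
    b = a.copy()
    for name in att_to_flip:
        i = ATT_NAMES.index(name)
        b[:, i] = 1 - b[:, i]
    return b

# Example: invert "Eyeglasses" for a batch of annotations.
#   b = target_attributes(a, ['Eyeglasses'])
```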

Methods: Under the same experimental settings, we compare our AttGAN with VAE/GAN [7], IcGAN [8] and StarGAN [16], all of which, including our AttGAN, are trained to handle all thirteen attributes with a single model. Besides, we compare our AttGAN with Fader Networks [13], Shen et al. [10] and CycleGAN [24]. Shen et al. and CycleGAN can handle only one attribute with one model. Although Fader Networks is capable of multiple attribute editing with one model, in practice the multiple attribute setting makes its results blurry. Therefore, for these three baselines, each attribute has its own specific model. VAE/GAN,3 IcGAN,4 StarGAN5 and Fader Networks6 are trained with their official code, while Shen et al. and CycleGAN are implemented by ourselves.

A. Visual Analysis

1) Single Facial Attribute Editing: Firstly, on the CelebA [3] testing set, we compare the proposed AttGAN with VAE/GAN [7], IcGAN [8] and StarGAN [16] in terms of single facial attribute editing, as shown in Fig. 4a. As can be seen, in some cases VAE/GAN produces unexpected changes of other attributes; for example, all three male inputs become female in VAE/GAN when editing the blond hair attribute. This phenomenon happens because the attribute vectors used for editing in VAE/GAN contain highly correlated attributes such as blond hair and female. Therefore, some other unexpected but highly correlated attributes are also involved when using such attribute vectors for editing. IcGAN performs better at accurately editing attributes; however, it seriously changes other attribute-excluding details, especially the face identity.

3VAE/GAN: https://github.com/andersbll/autoencoding_beyond_pixels
4IcGAN: https://github.com/Guim3/IcGAN
5StarGAN: https://github.com/yunjey/StarGAN
6Fader Networks: https://github.com/facebookresearch/FaderNetworks


Fig. 4. Results on CelebA [3] of single facial attribute editing. For each specified attribute, the facial attribute editing here is to invert it, e.g., to edit female to male, male to female, mouth open to mouth close, and mouth close to mouth open, etc. (a) Comparisons with VAE/GAN [7], IcGAN [8] and StarGAN [16] on editing (inverting) the specified attributes. These methods, including our AttGAN, employ a single model for all attributes. (b) Comparisons with CycleGAN [24], Shen et al. [10] and Fader Networks [13] on editing (inverting) the specified attributes. These three methods employ different models for different attributes. Best viewed in color and at higher resolution (by zooming in).

This is mainly because IcGAN imposes an attribute-independent constraint and a normal distribution constraint on the latent representation, which harms its capacity and results in loss of attribute-excluding information. Compared to VAE/GAN and IcGAN, our AttGAN accurately edits both local attributes (bangs, eyeglasses and mouth open) and global attributes (gender), credited to the attribute classification constraint which guarantees the correct change of the attributes. Although StarGAN also accurately edits the attributes, its results contain undesired changes, e.g., the skin colors of the first two subjects change. In contrast, AttGAN well preserves the attribute-excluding details such as face identity, illumination, and background, because 1) the latent representation is constraint free, which guarantees its capacity for conserving the attribute-excluding information, and 2) the reconstruction learning explicitly enables the encoder-decoder to preserve the attribute-excluding details in the generated images.

Comparisons on CelebA [3] with CycleGAN [24], Shen et al. [10] and Fader Networks [13] are shown in Fig. 4b. The results of Fader Networks, especially on adding eyeglasses, are blurry, which is very likely caused by the strict attribute-independent constraint on the latent representation. The results of Shen et al. and CycleGAN contain noise and artifacts. Another observation is that adding "Mustache" makes the females (the second and fourth inputs in Fig. 4b) become male in Shen et al. and CycleGAN.


Fig. 5. Comparisons on LFW [40] of single facial attribute editing with CycleGAN [24], Shen et al. [10], Fader Networks [13] and StarGAN [16]. Best viewed in color and at higher resolution (by zooming in).

In contrast, our AttGAN naturally adds the mustache while keeping the female characteristics well, even though the model rarely (or never) sees a female with a mustache in the training set, which reflects AttGAN's superior ability to disentangle attributes (such as male and mustache) and preserve details.

Comparisons on LFW [40] are shown in Fig. 5. As shown, the competitors produce artifacts to some extent, e.g., females become males when editing the "Mustache" attribute in CycleGAN and Shen et al., and Fader Networks produces blurriness on the "Blond Hair" and "Eyeglasses" attributes. As for our AttGAN, more accurate attribute editing results with very few artifacts are achieved, benefiting from the elaborate design of our mechanism for facial attribute editing.

2) Multiple Facial Attribute Editing: VAE/GAN [7], IcGAN [8], StarGAN [16] and our AttGAN can all simultaneously edit multiple attributes with a single model, and thus we further compare these methods in terms of multiple facial attribute editing for a more comprehensive evaluation. Fig. 6 shows the results of simultaneously editing two or three attributes.

Similar to single attribute editing, some generated images from VAE/GAN contain undesired changes of other attributes, since VAE/GAN cannot decorrelate highly correlated attributes. As for IcGAN, the distortion of face details and over smoothing become even more severe, because its constrained latent representation leads to worse performance in the more complex multiple attribute editing task. By contrast, our method still performs well under complex combinations of attributes, benefiting from the appropriate modeling of the relation between the attributes and the latent representation.

Fig. 6. Comparisons on CelebA [3] of multiple facial attribute editing among VAE/GAN [7], IcGAN [8], StarGAN [16] and our AttGAN. For each specified attribute combination, the facial attribute editing here is to invert each attribute in that combination.

Fig. 7. Illustration of attribute intensity control. Best viewed in color and at higher resolution (by zooming in). (a) Female to male. (b) No pale skin to pale skin. (c) Old to young. (d) No bangs to bangs.

Fig. 8. Exemplar results of attribute style manipulation using our extended AttGAN.

3) Attribute Intensity Control: Attribute intensity control comes as a side benefit of our AttGAN, directly applicable without any modification to its original design. Although AttGAN is trained with binary attribute values (0/1), we find that it generalizes to continuous attribute values in the testing phase. As shown in Fig. 7, with continuous values in [0, 1] as input, the gradual change of the generated images is smooth and natural.
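Intensity control therefore only changes the test-time conditioning: the target entry is swept over [0, 1] instead of being set to a hard 0/1 value. A sketch reusing the hypothetical edit_attributes helper from Sec. III-A:

```python
import numpy as np

def intensity_sweep(G_enc, G_dec, x_a, a, att_idx, steps=5):
    """Generate a sequence of edits with the att_idx-th attribute value swept
    continuously from 0 to 1 (the model itself is still trained on binary values)."""
    outputs = []
    for v in np.linspace(0.0, 1.0, steps):
        b = a.copy().astype('float32')
        b[:, att_idx] = v                      # continuous intensity instead of 0/1
        outputs.append(edit_attributes(G_enc, G_dec, x_a, b))
    return outputs
```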

4) Attribute Style Manipulation: Fig. 8 shows the results of the AttGAN extension for attribute style manipulation. As can be seen, different styles of each attribute are discovered, such as different sides of bangs: left, right or middle. The extension is quite flexible and allows one to select the style of interest, rather than a fixed one.

5) High Quality Results and Failures: Figs. 14-16 in the supplementary material show additional high quality results at 384 × 384 resolution. Fig. 17 in the supplementary material shows some failure cases.


Fig. 9. Comparisons on CelebA [3] among IcGAN [8], VAE/GAN [7], StarGAN [16] and our AttGAN in terms of (a) facial attribute editing accuracy on the target attribute and (b) preservation error on the rest of the attributes. (a) Attribute editing accuracy (higher is better). (b) Attribute preservation error (lower is better).

Fig. 10. Comparisons on CelebA [3] among CycleGAN [24], Shen et al. [10], Fader Networks [13] and our AttGAN in terms of (a) facial attribute editing accuracy on the target attribute and (b) preservation error on the rest of the attributes. (a) Attribute editing accuracy (higher is better). (b) Attribute preservation error (lower is better).

From observation, these failures are often caused by the need for large appearance modification, such as editing a face with plenty of hair to "Bald" or adding "Bangs" to a bald face. Besides, another possible cause of the failures may be the imbalanced data distribution, e.g., only 2.2% of the images in the training set have the "Bald" attribute.

B. Quantitative Analysis

Facial Attribute Editing Accuracy/Error: To evaluate the facial attribute editing accuracy of our AttGAN, an attribute classifier independent of all methods is used to judge the attributes of the generated faces. This attribute classifier is trained on the CelebA [3] dataset and achieves an average accuracy of 90.89% per attribute on the CelebA testing set. If the attribute of a generated image is predicted by the classifier to be the same as the desired one, it is considered a correct generation, otherwise an incorrect one. Besides, we also evaluate the average preservation error of the other attributes when editing each single attribute.

For the evaluations on CelebA [3], Fig. 9a shows the attribute editing accuracy of VAE/GAN [7], IcGAN [8], StarGAN [16] and our AttGAN, all of which employ a single model for multiple attributes. As can be seen, both AttGAN and StarGAN achieve much better accuracy than VAE/GAN and IcGAN, especially on "No Beard", "Pale Skin" and "Age". Moreover, the preservation errors of the rest of the attributes for AttGAN and StarGAN are much lower than those of VAE/GAN and IcGAN, as shown in Fig. 9b. As for the comparison between AttGAN and StarGAN, their attribute editing accuracies are comparable, but the attribute preservation error of AttGAN is a bit higher. However, the generated images of our AttGAN are more natural and realistic than those of StarGAN (see Fig. 4a and Fig. 6).

Furthermore, Fig. 10a and Fig. 10b show the attribute editing accuracy and preservation error of Fader Networks [13], Shen et al. [10] and CycleGAN [24], which employ one specific model for each attribute. As can be seen, all three baselines edit the attributes well, comparably to AttGAN, but their preservation errors on the rest of the attributes are higher than AttGAN's.

For the evaluations on LFW [40], Fig. 11a and Fig. 11b show the attribute editing accuracy and preservation error, which are similar to the results on CelebA [3]. Both StarGAN [16] and our AttGAN achieve higher attribute editing accuracy on the target attribute, with lower attribute preservation error on the rest of the attributes.
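A sketch of the two metrics as described at the beginning of this subsection, assuming an independently trained attribute classifier eval_C that returns per-attribute probabilities (all names are illustrative):

```python
import numpy as np

def editing_metrics(eval_C, x_edited, a, b, target_idx):
    """Editing accuracy on the target attribute and preservation error on the rest.

    eval_C:     independent attribute classifier; returns (N, n_att) probabilities.
    x_edited:   images generated with the target attributes b.
    a, b:       source and target attribute arrays with entries in {0, 1}.
    target_idx: index of the attribute being edited.
    """
    pred = (np.asarray(eval_C(x_edited)) > 0.5).astype(np.int32)
    # editing accuracy: the edited attribute should be predicted as the desired value
    acc = np.mean(pred[:, target_idx] == b[:, target_idx])
    # preservation error: the remaining attributes should still match the source labels
    rest = [i for i in range(pred.shape[1]) if i != target_idx]
    pres_err = np.mean(pred[:, rest] != a[:, rest])
    return acc, pres_err
```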

C. Ablation Study: Effect of Each Component

In this part, we evaluate the necessity of the three main components: the attribute classification constraint, the reconstruction loss and the adversarial loss. Besides, we also evaluate the disadvantage of the attribute-independent constraint. In Fig. 12, we show the results of different combinations of these components, where all experiments are based on models that learn to handle multiple attributes with one network. Row (1) contains the results of our AttGAN's original setting, which are natural and well preserve the attribute-excluding details.

Without the attribute classification constraint (row (2) of Fig. 12), the network just outputs the reconstructed images, since there is no signal to force the network to generate the correct attributes.


Fig. 11. Comparisons on LFW [40] among CycleGAN [24], Shen et al. [10], Fader Networks [13], StarGAN [16] and our AttGAN in terms of (a) facial attribute editing accuracy on the target attribute and (b) preservation error on the rest of the attributes. (a) Attribute editing accuracy (higher is better). (b) Attribute preservation error (lower is better).

Fig. 12. Effect of different combinations of the four components.

A similar phenomenon (but with some noise) happens when we remove the adversarial loss while keeping the classification constraint (row (3)). One possible reason is that training with the classification constraint but without the adversarial loss is similar to mounting an adversarial attack [42]. Therefore, although the classification constraint exists, adversarial examples with incorrect attributes still fool the classifier (by the noise). In conclusion, the classification constraint does not work without the adversarial learning; in other words, the adversarial learning helps to avoid adversarial examples. However, this is another topic needing more theoretical analysis and experiments, which is far beyond this paper.

In row (4) of Fig. 12, we present the results of AttGAN without the reconstruction loss. As shown, although the resulting attributes are correct, the face identities change a lot, accompanied by many artifacts. Therefore, the reconstruction loss is vital for preserving the attribute-excluding details.

Row (5) of Fig. 12 presents the results of a Fader Networks [13] like setting (attribute-independent constraint + reconstruction learning), and row (6) is AttGAN with the attribute-independent constraint. As we can see in row (5), the Fader Networks like setting works only on the eyeglasses, gender and mouth open attributes, with unsatisfactory performance. When we combine the AttGAN losses with the Fader Networks losses (row (6)), the attributes are correctly edited but the results contain artifacts and the attribute-excluding details change (e.g., the shape of the nose and mouth).


Fig. 13. Exploration of AttGAN on image style translation. The diagonal ones are the inputs. (a) Season translation. (b) Painting translation.

These experiments demonstrate that the attribute-independent constraint on the latent representation is not a favorable solution for facial attribute editing, since it constrains the capacity of the latent code, resulting in information loss and degraded output images.

D. Exploration of Image Translation

Since facial attribute editing is closely related to image translation, we also try our AttGAN on the image style translation task, where we regard the style as a kind of attribute. We apply AttGAN to a season dataset [43] and a painting dataset [24], and the results are shown in Fig. 13. As we can see, the season results are acceptable, but the style translation of paintings is not as good, accompanied by artifacts and blurriness. Compared to facial attribute editing, image style translation needs more variation in texture and color, and a single model might have difficulty handling all styles with such large variation simultaneously. Nevertheless, AttGAN is a potential framework that deserves more exploration and extension.

VI. CONCLUSION AND FUTURE WORK

From the perspective of facial attribute editing, we reveal and validate the disadvantage of the attribute-independent constraint on the latent representation. Further, we properly consider the relation between the attributes and the latent representation and propose a facial attribute editing method, AttGAN, which incorporates the attribute classification constraint, reconstruction learning, and adversarial learning into an effective framework for high quality facial attribute editing. Experiments demonstrate that our AttGAN can accurately edit facial attributes while well preserving the attribute-excluding details, with better visual effect, higher editing accuracy and lower editing error than the competing methods. Moreover, AttGAN is directly applicable for attribute intensity control and can be extended to attribute style manipulation, which shows its potential for further exploration.

Although this work specializes in facial attribute editing, it has potential for attribute editing on general objects, such as person attributes. However, different from facial attributes, person attributes have much larger variations, which makes editing the attributes of a person more challenging. For example, for the "carrying handbag" attribute annotated in the Market [44] and Duke [45] datasets, the "handbag" may hang on the left arm or be carried in the right hand. In future work, it is worthwhile to investigate attribute editing of general objects.

REFERENCES

[1] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 1891–1898.

[2] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 815–823.

[3] Z. Liu, P. Luo, X. Wang, and X. Tang, "Deep learning face attributes in the wild," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 3730–3738.

[4] M. Ehrlich, T. J. Shields, T. Almaev, and M. R. Amer, "Facial attributes classification using multi-task representation learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun./Jul. 2016, pp. 752–760.

[5] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Represent. (ICLR), 2014. [Online]. Available: https://arxiv.org/abs/1312.6114

[6] I. J. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2014, pp. 1–9.

[7] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, "Autoencoding beyond pixels using a learned similarity metric," in Proc. Int. Conf. Mach. Learn. (ICML), 2016, pp. 1558–1566.

[8] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez, "Invertible conditional GANs for image editing," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) Workshops, 2016. [Online]. Available: https://arxiv.org/abs/1611.06355

[9] M. Li, W. Zuo, and D. Zhang. (2016). "Deep identity-aware transfer of facial attributes." [Online]. Available: https://arxiv.org/abs/1610.05586


[10] W. Shen and R. Liu, "Learning residual images for face attribute manipulation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1225–1233.

[11] M.-Y. Liu, T. Breuel, and J. Kautz, "Unsupervised image-to-image translation networks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 700–708.

[12] S. Zhou, T. Xiao, Y. Yang, D. Feng, Q. He, and W. He, "GeneGAN: Learning object transfiguration and attribute subspace from unpaired data," in Proc. Brit. Mach. Vis. Conf. (BMVC), 2017. [Online]. Available: http://arxiv.org/abs/1705.04932

[13] G. Lample, N. Zeghidour, N. Usunier, A. Bordes, L. Denoyer, and M. Ranzato, "Fader networks: Manipulating images by sliding attributes," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5967–5976.

[14] T. Kim, B. Kim, M. Cha, and J. Kim. (2017). "Unsupervised visual attribute transfer with reconfigurable generative adversarial networks." [Online]. Available: https://arxiv.org/abs/1707.09798

[15] T. Xiao, J. Hong, and J. Ma, "DNA-GAN: Learning disentangled representations from multi-attribute images," in Proc. Int. Conf. Learn. Represent. (ICLR) Workshops, 2018. [Online]. Available: https://arxiv.org/abs/1711.05415

[16] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8789–8797.

[17] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010.

[18] M. Li, W. Zuo, and D. Zhang. (2016). "Convolutional network for attribute-driven and identity-preserving human face generation." [Online]. Available: https://arxiv.org/abs/1608.06434

[19] P. Upchurch et al., "Deep feature interpolation for image content changes," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6090–6099.

[20] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai, "Better mixing via deep representations," in Proc. Int. Conf. Mach. Learn. (ICML), 2013, pp. 552–560.

[21] H. Chang, J. Lu, F. Yu, and A. Finkelstein, "PairedCycleGAN: Asymmetric style transfer for applying and removing makeup," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 40–48.

[22] W. Yin, Y. Fu, Y. Ma, Y.-G. Jiang, T. Xiang, and X. Xue, "Learning to generate and edit hairstyles," in Proc. 25th ACM Int. Conf. Multimedia, 2017, pp. 1627–1635.

[23] M. Mirza and S. Osindero. (2014). "Conditional generative adversarial nets." [Online]. Available: https://arxiv.org/abs/1411.1784

[24] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2242–2251.

[25] Y. Lu, Y.-W. Tai, and C.-K. Tang, "Attribute-guided face generation using conditional CycleGAN," in Proc. Eur. Conf. Comput. Vis. (ECCV), Jun. 2018, pp. 282–297.

[26] X. Yan, J. Yang, K. Sohn, and H. Lee, "Attribute2Image: Conditional image generation from visual attributes," in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 776–791.

[27] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016. [Online]. Available: https://arxiv.org/abs/1511.06434

[28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 448–456.

[29] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," in Proc. Int. Conf. Mach. Learn. (ICML), Aug. 2017, pp. 214–223.

[30] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5767–5777.

[31] T. Kaneko, K. Hiramatsu, and K. Kashino, "Generative attribute controller with conditional filtered generative adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7006–7015.

[32] A. Odena, C. Olah, and J. Shlens, "Conditional image synthesis with auxiliary classifier GANs," in Proc. Adv. Neural Inf. Process. Syst. Workshops (NeurIPS), 2016, pp. 2642–2651.

[33] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2016, pp. 2172–2180.

[34] J. L. Ba, J. R. Kiros, and G. E. Hinton. (2016). "Layer normalization." [Online]. Available: https://arxiv.org/abs/1607.06450

[35] D. Ulyanov, A. Vedaldi, and V. Lempitsky. (2016). "Instance normalization: The missing ingredient for fast stylization." [Online]. Available: https://arxiv.org/abs/1607.08022

[36] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2014, pp. 3320–3328.

[37] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Oper. Syst. Design Implement. (OSDI), 2016, pp. 265–283.

[38] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent. (MICCAI), 2015, pp. 234–241.

[39] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5967–5976.

[40] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," in Proc. Workshop Faces Real-Life Images, Detection, Alignment, Recognit., 2008, pp. 1–15.

[41] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015. [Online]. Available: https://arxiv.org/abs/1412.6980

[42] C. Szegedy et al., "Intriguing properties of neural networks," in Proc. Int. Conf. Learn. Represent. (ICLR), 2014. [Online]. Available: https://arxiv.org/abs/1312.6199

[43] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool. (2017). "ComboGAN: Unrestrained scalability for image domain translation." [Online]. Available: https://arxiv.org/abs/1712.06909

[44] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1116–1124.

[45] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang. (2017). "Improving person re-identification by attribute and identity learning." [Online]. Available: https://arxiv.org/abs/1703.07220

Zhenliang He received the B.E. degree from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2011. He is currently pursuing the Ph.D. degree with the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing. His research interests include facial landmark detection, facial attribute editing, and generative adversarial networks.

Wangmeng Zuo (M'09–SM'14) received the Ph.D. degree in computer application technology from the Harbin Institute of Technology, Harbin, China, in 2007. He is currently a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. His current research interests include image enhancement and restoration, image and face editing, object detection, visual tracking, and image classification. He has published over 70 papers in top-tier academic journals and conferences. He has served as the Tutorial Organizer in ECCV 2016, and an Associate Editor for the IET Biometrics and the Journal of Electronic Imaging.


Meina Kan (M'14) received the B.S. degree in computer science from Shandong University in 2007, and the Ph.D. degree in computer vision from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS) in 2013. In 2011, she studied at the Centre for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University. She is currently an Associate Professor with ICT, CAS. Her research interests mainly focus on computer vision and pattern recognition, especially on face recognition, transfer learning, deep learning, and weakly supervised learning. She has served as the Co-Chair for the ICPR18 Workshop on Deep Learning for Pattern Recognition, and the ACCV14 Workshop on Human Identification for Surveillance.

Shiguang Shan (M'04–SM'15) received the Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), Beijing, China, in 2004, where he has been a Full Professor since 2010, and is currently the Deputy Director of the CAS Key Lab of Intelligent Information Processing. His research interests cover computer vision, pattern recognition, and machine learning. He has published over 300 papers, with a total of over 15000 Google Scholar citations. He served as the Area Chair for many international conferences, including ICCV11, ICASSP14, ICPR12/14/19, ACCV12/16/18, FG13/18, BTAS18, and CVPR19. He was a recipient of China's State Natural Science Award in 2015, and China's State S&T Progress Award in 2005 for his research work. He was/is an Associate Editor of several journals, including the IEEE T-IP, Neurocomputing, CVIU, and PRL.

Xilin Chen (F'16) is currently a Professor with the Institute of Computing Technology, Chinese Academy of Sciences (CAS). He has authored one book and more than 200 papers in refereed journals and proceedings in the areas of computer vision, pattern recognition, image processing, and multimodal interfaces. He is a fellow of the IAPR and CCF. He served as an Organizing Committee Member for many conferences, including the General Co-Chair of FG13/FG18 and the Program Co-Chair of ICMI 2010. He is/was an Area Chair of CVPR 2017/2019 and ICCV 2019. He is currently an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, a Senior Editor of the Journal of Visual Communication and Image Representation, a Leading Editor of the Journal of Computer Science and Technology, and an Associate Editor-in-Chief of the Chinese Journal of Computers and the Chinese Journal of Pattern Recognition and Artificial Intelligence.