


DA-GAN: Supplementary Materials

Anonymous CVPR submission

Paper ID 1877

Implementation Details

The experimental settings for each task are listed in Table 1. ']' denotes the number of attention regions pre-defined for each task, and 'Instances' denotes the level at which instances are attended. The label Y and the distance metric d(·) are adopted in the optimization of the Deep Attention Encoder (DAE) and the instance-level translation. Note that d(·) is jointly trained from scratch with DA-GAN, where 'ResBlock' denotes a small classifier consisting of 9 residual blocks. The learned attention regions are adaptively controlled by the selection of Y, ], and d(·). For example, the instances learned for the tasks conducted on CUB-200-2011 are at the parts level (the birds' four parts), while for the colorization and domain adaptation tasks the attended instances are objects (flowers and characters).
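As a concrete illustration of the 'ResBlock' choice of d(·) in Table 1, the following PyTorch sketch shows one way a small classifier of 9 residual blocks could be built, supervised by the label Y, and used as a feature-space distance. The layer widths, normalization, and the exact form of the distance are our assumptions for illustration; they are not details specified in the paper.

import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))

class ResBlockMetric(nn.Module):
    # Small classifier (9 residual blocks) whose features define d(.).
    def __init__(self, num_classes, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 7, stride=2, padding=3)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(9)])
        self.head = nn.Linear(channels, num_classes)  # supervised by the label Y

    def features(self, x):
        h = self.blocks(F.relu(self.stem(x)))
        return F.adaptive_avg_pool2d(h, 1).flatten(1)

    def forward(self, x):
        return self.head(self.features(x))

    def distance(self, a, b):
        # d(a, b): squared L2 distance in the learned feature space (assumed form).
        return (self.features(a) - self.features(b)).pow(2).sum(dim=1)

The classification head is what allows d(·) to be trained from scratch jointly with DA-GAN, while the pooled features provide the distance used by the instance-level translation.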

Experiments on CUB-200-2011

More results generated by DA-GAN are shown in Figure 2. It can be seen that, given one description, the proposed DA-GAN is capable of generating diverse images that match that specific description. In contrast to existing text-to-image synthesis works, we train DA-GAN on unpaired text-image data. In particular, thanks to the proposed instance-level translation, we obtain high-resolution (256 × 256) images directly, which is more practical than StackGAN (which needs two stages to reach the same resolution). We also show more results for pose morphing in Figure 4. Note that the targets should be bird breeds (image collections); here we simply select one random image to represent each bird breed for reference.

Human Face to Animation Face Translation

In this experiment, we randomly select 80 celebrities, comprising 12k images, as the source human face images. We also show fine-grained translation results in Figure 1. We can see that, for the same person, DA-GAN is capable of generating diverse images while still retaining that person's identity attributes, e.g., big round eyes, dark brown hair, etc.

Datasets           Label Y   ]   Instances   d(·)
MNIST & SVHN       10        1   object      ResBlock
CUB-200-2011       200       4   parts       VGG
FaceScrub          80        4   parts       Inception
Skeleton-cartoon   20        4   parts       VGG
CMP [2]            None      4   parts       L2
Colorization [3]   Binary    1   object      ResBlock

Table 1: Implementation Details.
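For reference, the remaining d(·) choices in Table 1 (VGG, Inception, and plain L2) can likewise be read as feature-space distances. The sketch below shows one hedged way to instantiate them with torchvision (>= 0.13) backbones; the specific models and layers are our assumptions, not necessarily the exact networks used in the paper.

import torch
import torchvision.models as models

def make_feature_fn(name):
    # Returns a function mapping a batch of images to feature vectors.
    if name == "vgg":
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()
        return lambda x: vgg(x).flatten(1)
    if name == "inception":
        net = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1).eval()
        net.fc = torch.nn.Identity()          # expose the pooled 2048-d features
        return lambda x: net(x)
    if name == "l2":                          # CMP: plain pixel-space L2
        return lambda x: x.flatten(1)
    raise ValueError(f"unknown metric: {name}")

def instance_distance(feature_fn, a, b):
    # d(a, b) as a mean squared difference between feature vectors.
    return (feature_fn(a) - feature_fn(b)).pow(2).mean(dim=1)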

Figure 1: Fine-grained translation results.

Translation on Paired Datasets

We also conduct experiments on paired datasets. The image quality of our results is comparable to that of fully supervised approaches, while our method learns the mapping without paired supervision. For the task of skeleton-to-cartoon-figure translation, we retrieved about 20 cartoon figures, comprising 1,200 images, from websites, and adopted the pose estimator of [1] to generate a skeleton for each image. DA-GAN is trained by feeding in the skeletons and generating cartoon images.
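The training procedure for this task can be summarized by the following sketch, assuming the skeleton maps have been precomputed offline with the pose estimator of [1]. Here generator, discriminator, and metric are placeholders for the DA-GAN components, and the loss form and weight lambda_inst are illustrative assumptions rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def train_step(generator, discriminator, metric, skeletons, cartoons,
               g_opt, d_opt, lambda_inst=10.0):
    # Discriminator update: real cartoon images vs. generated ones.
    fake = generator(skeletons).detach()
    real_logits = discriminator(cartoons)
    fake_logits = discriminator(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
              F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator update: adversarial term plus an instance-level consistency
    # term under the learned metric d(.) between the input skeleton and its
    # translation (a simplification of the paper's instance-level objective).
    fake = generator(skeletons)
    fake_logits = discriminator(fake)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    inst = metric(skeletons, fake).mean()
    g_loss = adv + lambda_inst * inst
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()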

References

[1] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.

[2] R. Tyleček and R. Šara. Spatial Pattern Templates for Recognition of Objects with Regular Structure, pages 364–374. Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[3] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.


Figure 2: Experimental results of text-to-image synthesis.

Figure 3: Results of pose morphing. In each group, the first column is the source image and the second row is the target images. The red dashed boxes mark the generated images, which possess the target object's pose while retaining the source object's appearance.


Figure 4: The first row is the source images, and the second row is the target images. The translated images are placed in the third row, marked by the red dashed box.

Figure 5: Results of human-to-animation face translation. In each group, the first row is human faces, and the second row is the translated animation faces.

Figure 6: Results of architectural labels-to-photos translation. In each group, from left to right are the input labels, the translated architecture photos, and the ground truth.

Figure 7: Results of image colorization. In each group, the inputs are grayscale images, and the results are the translated color images.
