
Supplementary Material for
Multimodal Prediction and Personalization of Photo Edits with Deep Generative Models

Ardavan Saeedi (CSAIL, MIT), Matthew D. Hoffman (Google Brain), Stephen J. DiVerdi (Adobe Research)

Asma Ghandeharioun (Media Lab, MIT), Matthew J. Johnson (Google Brain), Ryan P. Adams (Princeton and Google Brain)

1 Details of Experiments

Hyperparameter settings For training all the models, we use two possible learning rates, 0.001 and 0.0001. For the MDN, MLP, and CGM-VAE, we use 4 hidden layers with the same number of hidden nodes (500 or 1000) in all layers. For the LBN, we use two deterministic hidden layers with linear activation functions, as in Dauphin and Grangier (2015), with 500 or 1000 nodes; we also use two stochastic layers with the same number of nodes and sigmoid activation functions, again following Dauphin and Grangier (2015). We try two possible minibatch sizes, 100 and 200. For the models that require a number of mixture components (i.e., CGM-VAE and MDN), we select this number from the set {1, 3, 5, 10}. Finally, for the CGM-VAE model, we choose the dimension of the latent variable from {2, 20}. We choose the best hyperparameter setting based on the variational lower bound on the held-out dataset.
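For concreteness, the grid above is small enough to enumerate exhaustively. The following Python sketch is illustrative only; `train_and_evaluate` is a hypothetical stand-in for training a model under a given configuration and returning its held-out variational lower bound:

```python
from itertools import product

# Hyperparameter grid described above (the CGM-VAE uses all of it;
# the other models use the applicable subsets).
learning_rates = [1e-3, 1e-4]
hidden_sizes = [500, 1000]       # 4 hidden layers, same width in each
minibatch_sizes = [100, 200]
num_components = [1, 3, 5, 10]   # mixture components (CGM-VAE, MDN)
latent_dims = [2, 20]            # CGM-VAE latent-variable dimension

def select_best(train_and_evaluate):
    """Return the config with the highest held-out variational lower bound."""
    best_config, best_elbo = None, float("-inf")
    for lr, width, batch, k, d in product(learning_rates, hidden_sizes,
                                          minibatch_sizes, num_components,
                                          latent_dims):
        config = dict(lr=lr, hidden=width, batch=batch,
                      components=k, latent_dim=d)
        elbo = train_and_evaluate(config)   # hypothetical training routine
        if elbo > best_elbo:
            best_config, best_elbo = config, elbo
    return best_config, best_elbo
```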

Baselines We choose a set of reasonable baselines that cover related models in both domains of multimodal prediction and automatic photo enhancement. As mentioned in Section 2, the literature on automatic photo enhancement can be divided into two main categories of models: 1) parametric methods, which typically minimize an MSE loss; the MLP and MDN baselines capture these methods; and 2) nonparametric methods, which are not reasonable baselines for us since their proposed edits are destructive (e.g., Lee et al., 2015) or do not benefit from other users' information (e.g., Koyama et al., 2016). We also add the LBN as a strong baseline, since it has been shown to outperform the MDN and other standard multimodal prediction baselines (Dauphin and Grangier, 2015).
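To make the distinction between the two parametric baselines concrete: the MLP minimizes MSE (a unimodal, Gaussian-like loss), while the MDN minimizes the negative log-likelihood of a Gaussian mixture whose parameters the network outputs. Below is a minimal NumPy sketch of that mixture loss, assuming diagonal covariances; this is the generic MDN loss, not the paper's implementation:

```python
import numpy as np
from scipy.special import logsumexp

def mdn_nll(log_weights, means, log_stds, y):
    """Negative log-likelihood of y under a diagonal-Gaussian mixture.

    log_weights: (K,) log mixture weights, assumed normalized
    means, log_stds: (K, D) per-component Gaussian parameters
    y: (D,) observed slider values
    """
    # Per-component diagonal-Gaussian log densities, summed over dimensions.
    log_probs = -0.5 * (((y - means) * np.exp(-log_stds)) ** 2
                        + 2.0 * log_stds + np.log(2.0 * np.pi)).sum(axis=1)
    # Mixture log density via log-sum-exp over the K components.
    return -logsumexp(log_weights + log_probs)
```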

Splitting the datasets We split each dataset into random train, validation, and test subsets such that for all three datasets (i.e., casual, frequent, and expert users) the subsets have reasonable sizes. A larger training set would risk non-representative validation and test sets for our smallest dataset (i.e., expert users), and larger test and validation sets would risk a non-representative training set for the same dataset. The ratio we used is just one way of obtaining reasonably sized subsets; however, we showed that for these random subsets, across all three datasets and for three different evaluation metrics, our approach significantly outperforms all the other baselines.
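A random split of this kind can be produced with a single permutation of the image indices; the sketch below uses illustrative fractions, since the exact ratio is not stated here and is not critical to the argument:

```python
import numpy as np

def split_dataset(num_images, train_frac=0.8, val_frac=0.1, seed=0):
    """Randomly partition image indices into train/validation/test.

    The fractions are illustrative placeholders; the text above only
    requires that all three subsets have reasonable sizes.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_images)
    n_train = int(train_frac * num_images)
    n_val = int(val_frac * num_images)
    return (perm[:n_train],                  # train
            perm[n_train:n_train + n_val],   # validation
            perm[n_train + n_val:])          # test
```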

To apply the P-VAE model to the experts dataset, we split the image-slider combinations from each of the 5 experts into groups of 50 image-sliders and pretend that each group belongs to a different user. This way we have more users with which to train the P-VAE model. However, it means the same expert may have image-sliders in both the train and test datasets. The significant advantage gained on the experts dataset might be due in part to this way of splitting the experts. Note that there are still no images shared across the train and test sets.
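This pseudo-user construction amounts to chunking each expert's edit list. A small sketch, assuming `expert_edits` maps expert ids to lists of (image, sliders) pairs (an illustrative representation, not the paper's data format):

```python
def make_pseudo_users(expert_edits, group_size=50):
    """Split each expert's (image, sliders) pairs into pseudo-users.

    expert_edits: dict mapping expert id -> list of (image_id, sliders).
    Returns a dict mapping pseudo-user id -> list of (image_id, sliders);
    each group of `group_size` pairs is treated as a distinct user.
    """
    pseudo_users = {}
    for expert_id, edits in expert_edits.items():
        for start in range(0, len(edits), group_size):
            user_id = (expert_id, start // group_size)
            pseudo_users[user_id] = edits[start:start + group_size]
    return pseudo_users
```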


2 Sample edits from CGM-VAE and sample user categorization from P-VAE

[Heatmap: predicted vs. expert adjustments for the twelve sliders (Temperature, Tint, Exposure, Contrast, Highlights, Shadows, Whites, Blacks, Clarity, Vibrance, Saturation, Brightness), shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 4.75.]

Figure 1: Image 4876 from Adobe-MIT5k dataset.


[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 3.22.]

Figure 2: Image 4855 from Adobe-MIT5k dataset.

[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 2.51.]

Figure 3: Image 4889 from Adobe-MIT5k dataset.


[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 2.62.]

Figure 4: Image 4910 from Adobe-MIT5k dataset.

[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 3.11.]

Figure 5: Image 4902 from Adobe-MIT5k dataset.


[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 3.54.]

Figure 6: Image 4873 from Adobe-MIT5k dataset.

[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 6.41.]

Figure 7: Image 4882 from Adobe-MIT5k dataset.


[Heatmap: predicted vs. expert adjustments for the same twelve sliders, shown alongside the original image; color scale from -1.5 to 1.5; Avg. LAB error: 6.99.]

Figure 8: Image 4872 from Adobe-MIT5k dataset.

Figure 9: A sample user categorization from the P-VAE model (car image).


Figure 10: A sample user categorization from the P-VAE model (similar photos with dominant blue colors).

Figure 11: A sample user categorization from the P-VAE model (flowers).


References

Y. N. Dauphin and D. Grangier. Predicting distributions with linearizing belief networks. arXiv preprint arXiv:1511.05622, 2015.

Y. Koyama, D. Sakamoto, and T. Igarashi. SelPh: Progressive learning and support of manual photo color enhancement. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 2520–2532. ACM, 2016.

J.-Y. Lee, K. Sunkavalli, Z. Lin, X. Shen, and I. S. Kweon. Automatic content-aware color and tone stylization. arXiv preprint arXiv:1511.03748, 2015.