

Appears in: Proc. International Conference on Computer Vision, October 2003.

Unsupervised Image Translation

Romer Rosales, Kannan Achan, and Brendan Frey
Probabilistic and Statistical Inference Laboratory

University of Toronto, Toronto, ON, Canada
{romer,kannan,frey}@psi.toronto.edu

Abstract

An interesting and potentially useful vision/graphics task is to render an input image in an enhanced form or in an unusual style, for example with increased sharpness or with some artistic qualities. In previous work [10, 5], researchers showed that by estimating the mapping from an input image to a registered (aligned) image of the same scene in a different style or resolution, the mapping could be used to render a new input image in that style or resolution. Frequently a registered pair is not available, but instead the user may have only a source image of an unrelated scene that contains the desired style. In this case, the task of inferring the output image is much more difficult, since the algorithm must both infer correspondences between features in the input image and the source image, and infer the unknown mapping between the images. We describe a Bayesian technique for inferring the most likely output image. The prior on the output image P(X) is a patch-based Markov random field obtained from the source image. The likelihood of the input P(Y|X) is a Bayesian network that can represent different rendering styles. We describe a computationally efficient, probabilistic inference and learning algorithm for inferring the most likely output image and learning the rendering style. We also show that current techniques for image restoration or reconstruction proposed in the vision literature (e.g., image super-resolution or de-noising) and image-based non-photorealistic rendering can be seen as special cases of our model. We demonstrate our technique on several tasks, including rendering a photograph in the artistic style of an unrelated scene, de-noising, and texture transfer.

1 Introduction and Related Work

We pursue a formal method for modifying image statistics while maintaining image content. By appropriately manipulating image statistics, one could, for example, improve image clarity or modify the image appearance into a more convenient, preferable, or useful one. For instance, one application of our approach is demonstrated in Fig. 1. We use a known painting (top) to specify the attributes that we would like to transfer to our input image (middle). Our algorithm infers the latent image (bottom), which displays the style specified by the painting.

Many fundamental problems in image processing are specific cases of the above problem. In image de-noising one seeks to remove unwanted noise from a given image to achieve a visually clearer one. In image super-resolution, given a low-resolution image, the goal is to estimate a high-resolution version of that same image. More generally, in image restoration we seek to discover what the original image looked like before it underwent the effect of some unknown (or partially known) process.

In this paper, we keep the scope of this problem general. Thus, we refer to our method as image translation, suggesting a process similar to language translation (despite their clear conceptual differences). Besides addressing the above restoration problems, our method allows one, in general, to intervene in the image properties.

Figure 1: Image µ, Starry Night by van Gogh, is used to specify the source patches (top), input image Y (center), inferred latent image X (bottom).



For example, one could try to increase the image photorealism (e.g., relative to an originally non-photorealistic version) or change the artistic appearance of images (e.g., change a painting style into another).

A variety of recent and past research efforts have approached specific problems from those mentioned above, e.g., [9, 26, 5, 10, 16, 23, 28, 24, 21]. Among these, our approach is most closely related to [5, 26, 9], since the joint distribution of the latent image (i.e., the image to be estimated) is represented as a Markov Random Field (MRF) with pairwise potentials. However, there are significant differences in our work (including the latent image representation). The primary differences pertain to:

(1) the inclusion of unknown transformation functions that relate (probabilistically) latent and observed image patches. Thus, the joint distribution of the model is different from that of [5, 26, 9], where we can think of these functions as known, as explained in Sec. 5. Moreover, here we allow for multiple transformation functions. We also show how these transformations can be estimated, instead of assuming that they are given.

(2) the conditional probability distribution of a patch in the latent image given a patch in the observed image cannot be estimated directly. In our case this distribution is unknown. In previous work, it was assumed that for training, an original and a modified version of the source or example images were given (supervised learning). In our work we drop this assumption: we only have patches from a source image with different properties than the input image. This difference is related to the first one, since knowledge of the transformations would make this conditional probability partially (or fully) known.

(3) the derivation of new algorithms for inference. The estimation of new parameters, the use of different conditional independence assumptions, and the incorporation of new hidden random variables make estimation and inference a more complex task.

In the graphics literature, our approach is also related to [10], in the area of non-photorealistic rendering (also related to [3, 4]). In [10], a user presents two (source) images with the same content and aligned, but with two different styles. Given a new (input) image in one of the above styles, the system then transforms it into a new image displaying the other image style. A nearest neighbor algorithm is then used to match image features/pixels locally (similar to [3, 21]) and globally. Excellent results were obtained using this method. However, full supervision is required, since the user has to present two well aligned images displaying the desired relationship.

One common disadvantage of previous methods is that frequently a registered image pair is not available, but instead the user may only have the input image and a source image of an unrelated scene that contains the appropriate style. In this case, the task of inferring the output image is much more difficult, since the algorithm must both infer correspondences between features in the input image and the source image, and infer the unknown mapping between the images. We propose a novel approach to solving this problem.

We formalize the problem and solution using a probabilistic approach, in the context of graphical models [20, 6]. The full joint distribution in our model is represented by a chain graph [15], a class of graph-represented distributions that is a superset of Bayes networks and Markov random fields [20]. In our chain graph, part of the nodes are associated with image patches which are to be inferred. Also, a set of patch transformations is used to enforce consistency in the transformation used across the image. These transformations are estimated by our algorithm, thus enabling us to discover transformations that relate the observed and estimated image.

We cast this problem as an approximate probabilistic inference/learning problem and show how it can be approached using belief propagation and expectation maximization (EM). Our image translation method is also appealing because of its generality. Most of the above approaches for image reconstruction or transformation can be seen as instances of the approach presented here, as will be discussed later.

2 Image Translation as Probabilistic Inference

We will now introduce the image translation problem. First, the problem is presented at an intuitive level. Despite several simplifications, this informal description may be helpful later in the paper. We then make these ideas more precise by introducing a formal mathematical formulation.

2.1 Overview: Informal Problem Description

An intuitive way to summarize the image translation problem in the context proposed in this paper is to think of the task of finding an image X that satisfies certain inter-patch (e.g., smoothness) and within-patch (e.g., contrast) properties and that produces the observed image Y after every patch undergoes one of several transformations.

In contrast with previous work, we do not want to assume that we know these patch transformations in advance. Also, we will most likely not be able to satisfy exactly all of the above properties for the new image X. Thus, we consider a probabilistic approach where, given an original image Y, we try to construct a probable image X (1) which is formed by taking sample patches from a patch data-set (or dictionary), (2) whose patches are constrained to approximately satisfy certain local inter-patch properties, and (3) whose patches are related to corresponding patches in the original image Y by one or more unknown but limited patch transformations. We would also like to estimate these patch transformations, the degree of certainty we should have in these transformations for each patch, and how probable a reconstructed image is with respect to another, e.g., a posterior marginal probability distribution over X.

2.2 Definitions and Setup

We consider image representations based on a set of local, possibly overlapping, patches, each defined simply as a set of pixels. Let Y ∈ I_Y be an image formed by a set of patches y_p, with p ∈ P, P = {1, ..., P} and y_p ∈ ℝ^S. In this paper, S is the number of pixels in each patch (this assumes one real value per pixel; however, S could also account for representations using multiple channels)¹. We will call Y the input or observed image. Consider also a latent image X ∈ I_X, with patches x_p ∈ ℝ^T; X will be the image to be estimated, and therefore it is unknown.

¹ In general, y_p does not have to represent pixels; it could, for example, contain filter responses.
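As a concrete (and purely illustrative) reading of this patch representation, the sketch below extracts overlapping square patches from a grayscale image and flattens each one into a vector of length S; the function name and stride convention are ours, not the paper's.

```python
import numpy as np

def extract_patches(image, patch_size, stride):
    """Decompose a grayscale image into overlapping square patches.

    Each patch is flattened to a vector of length S = patch_size**2,
    matching the y_p in R^S representation used in the text.
    """
    H, W = image.shape
    patches = []
    for r in range(0, H - patch_size + 1, stride):
        for c in range(0, W - patch_size + 1, stride):
            patches.append(image[r:r + patch_size, c:c + patch_size].ravel())
    return np.stack(patches)  # shape (P, S)

# e.g., 15 x 15 patches overlapping by 4 pixels (stride 11), as in the experiments of Sec. 4:
# patches = extract_patches(img, patch_size=15, stride=11)
```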


Let µ ∈ I_X denote a known image (or concatenation of images), here simply referred to as the source or dictionary image. Assume that the set of patches in µ is a representative sample which possesses the patch statistics that we wish X to display.

Consider a set of patch transformations (i.e., patch translators) Λ = {Λ_l}_{l=1,...,L}, where Λ_l : ℝ^T → ℝ^S. In our model, the task of a single Λ_l is to transform a latent patch into an observed patch. These transformations are initially unknown, and we will try to estimate them. Estimating them intuitively accounts for discovering the set of 'painters' (or styles) that were used to paint image Y from an image X (note that we would like to achieve the inverse process). There can be multiple painters, perhaps each of them specializing in a particular image transformation. The finite random variable l_p will represent the index of the transformation employed to transform patch x_p; l = (l_1, ..., l_P) is thus the vector with the indices of the patch transformations for every patch in X.

We will consider another class of transformations, topological transformations, that can perform, for example, horizontal and vertical patch translation. These will be used as follows: a patch from µ will be moved horizontally and vertically to be positioned at a new location in image X. This class of 2D transformations (2D translations) is defined to be the finite set of sparse permutation matrices Γ, in a manner similar to [7]. In our case, each element is a matrix Γ_k of dimensions S × |µ|, with |µ| the number of pixels in the dictionary image µ. Thus, for simplicity, in the following the dictionary image µ is represented as a long 1D vector formed by stacking all the patches. We then restrict Γ_k to be a binary sparse permutation matrix of the form:

[ 0 … 0 1 0 0 0 … 0
  0 … 0 0 1 0 0 … 0
  0 … 0 0 0 1 0 … 0 ],  (1)

which only accounts for copying and translating a group of neighboring pixels by an integer number of pixels horizontally and vertically. This restriction is not necessary; it is straightforward to consider other classes of transformations, such as rotation or shearing, using the same representation. We use the random variable t to denote the t-th element of the set Γ, and t = (t_1, ..., t_P) to represent the topological transformations for all patches in the image.
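To make the role of Γ concrete, here is a minimal sketch (ours, not the authors') showing that multiplying the flattened dictionary by one such sparse binary matrix Γ_k is equivalent to copying an S-pixel block at a given vertical/horizontal offset; it assumes µ is a single H × W image flattened row-major.

```python
import numpy as np

def apply_gamma(mu_flat, img_width, top, left, patch_size):
    """Equivalent of one sparse binary matrix Gamma_k acting on mu:
    copy the patch_size x patch_size block of the row-major flattened
    dictionary image whose top-left corner is at (top, left)."""
    rows = np.arange(top, top + patch_size)
    cols = np.arange(left, left + patch_size)
    # column indices of the single 1-entry in each row of Gamma_k
    idx = (rows[:, None] * img_width + cols[None, :]).ravel()
    return mu_flat[idx]  # vector of length S = patch_size**2
```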

2.3 Probabilistic Model

We define our model to have a joint probability distribution represented by the chain graph of Fig. 2, which can be factorized as the product of local conditional probabilities associated with the directed edges and positive potential functions associated with the undirected edges [15, 20, 6, 14]:

p(Y, X, l, t | Γ, Θ, µ) = ∏_{p∈P} p(y_p | l_p, t_p, Γ, Θ, µ) ∏_{p∈P} P(t_p | x_p) ∏_{p∈P} P(l_p) (1/Z) ∏_{c∈C} ψ_c(x_{p∈c}),  (2)

with {c ∈ C} denoting the set of latent image patches that belong to clique c in the MRF at the upper layer in Fig. 2, C the set of cliques in the MRF sub-graph, and ψ_c the clique potential functions.

Figure 2: Chain graph representing the model joint probability distribution (nodes x_p, t_p, y_p, l_p for each patch p).

In this paper, every image patch y_p follows a Gaussian distribution given the patch transformation index l_p and the topological transformation t_p, with the conditional independence properties shown in the graph in Fig. 3:

p(y_p | l_p, t_p, Γ, Θ, µ) = N(y_p; Λ_{l_p}(Γ_{t_p}µ), Ψ_p),  (3)

where Θ = {Λ, Ψ} is used to succinctly denote the distribution parameters and Ψ = {Ψ_p}_{p∈P}. This is represented in Fig. 3, a sub-graph taken from the full graph in Fig. 2, representing local conditional independences for each observed image patch. For this paper, we set Λ_i to be a linear transformation². This is not a restriction of our framework; it may be advantageous to also allow non-linear transformations, and they could be incorporated as well. However, even with linear transformations, the probabilistic model as a whole is non-linear.
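For illustration, a short sketch (not from the paper) of evaluating the log of this Gaussian likelihood for a single candidate pair (l, t); Lambda_l, Psi_p and the patch vectors are assumed to be plain numpy arrays, with the dictionary patch Γ_t µ already extracted.

```python
import numpy as np

def log_likelihood_patch(y_p, Lambda_l, mu_patch, Psi_p):
    """log N(y_p; Lambda_l (Gamma_t mu), Psi_p) for one candidate (l, t), as in Eq. 3."""
    mean = Lambda_l @ mu_patch              # predicted observed patch
    diff = y_p - mean
    _, logdet = np.linalg.slogdet(Psi_p)    # log |Psi_p|
    S = y_p.size
    return -0.5 * (diff @ np.linalg.solve(Psi_p, diff) + logdet + S * np.log(2 * np.pi))
```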

In our case, we consider p(x) to be a pairwise MRF, thus every clique c contains only two patches, c_1 and c_2. Thus we simply have:

ψ(x_c) ∝ e^{−d(x_{c_1}, x_{c_2})/2σ²}  (4)

We define d as a square distance function; thus ψ defines a Gaussian MRF. However, d is computed only on overlapping areas in the associated patches, in a way similar to [5].
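As an illustration, the sketch below (our own, assuming patches sit on a regular grid and overlap by a fixed number of columns) evaluates the log of the potential in Eq. 4 for two horizontally adjacent latent patches, computing the square distance only on the overlapping strip.

```python
import numpy as np

def log_psi_horizontal(x_left, x_right, patch_size, overlap, sigma):
    """log psi(x_c) = -d(x_c1, x_c2) / (2 sigma^2), with d the square distance
    between the overlapping columns of two horizontally adjacent patches."""
    xl = x_left.reshape(patch_size, patch_size)[:, -overlap:]   # right strip of left patch
    xr = x_right.reshape(patch_size, patch_size)[:, :overlap]   # left strip of right patch
    d = np.sum((xl - xr) ** 2)
    return -d / (2.0 * sigma ** 2)
```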

In order to link our discrete transformation random variable with the continuous latent random variable X, we use the deterministic relationship:

p(t_p | x_p) = 1 if x_p = Γ_{t_p}µ, and 0 otherwise.  (5)

This is equivalent to saying that a transformation t_p will have conditional probability one only if its patch x_p is equal to the patch taken from the dictionary µ using transformation t_p (here we assume that the set Γ is such that it produces unique patches).

² Since the inferred patches are not the result of a linear function of the observed image patches, this is different than simply transforming the image patches linearly. Instead, by this we are imposing a constraint on the flexibility that each of the L transformations is allowed to have.


Figure 3: Sub-graph representing local conditional independences for each observed image patch. The variable µ is fixed for all patches and it is usually given in the form of the source image(s).

3 Algorithms for Learning/Inference

Let us analyze the system associated with the chain graph in Fig. 2. First we can see that it contains an undirected sub-graph with loops. Even if all the model parameters Θ were known³, inference (computing the conditional distribution over the image X, given the image Y) is still computationally intractable in general (in the same way as MAP estimation is). More specifically, this problem has complexity O(|K|^{|P|}), with |K| the number of possible states that each patch x_p can take. Learning the model parameters is also intractable, as can be seen from computing the derivatives of Eq. 2 with respect to the variables of interest. However, there exist approximation algorithms; one of them, based on alternating optimizations [1], is Expectation Maximization (EM). EM requires computing posterior distributions over l and t, which in turn requires computing conditional marginal distributions for each node of the undirected portion of our chain graph. As we have seen, this is computationally intractable. Thus, it seems the key problem is to compute the conditional marginal distribution over the latent patches x_p. In the following section we explain how this is done using an approximation.

Even though learning can be seen as an instance of inference, we divide this section into (1) inferring the latent image (usually referred to as inference) and (2) estimating the model parameters (usually referred to as learning).

³ More importantly, if the transformation indices t_p were known.

3.1 Inference and Approximate E-step

We assume the reader has some familiarity with the EM algorithm (see e.g., [18, 2]). The intractability of computing the E-step exactly can be seen as follows. In our model, the E-step is equivalent to computing the posterior:

P(l, t | Y) ∝ ∫_X ∏_{p∈P} p(y_p | l_p, t_p) P(t_p | x_p) p(X) dX,  (6)

but we cannot solve this integral; thus we are forced to find an approximate way to perform the E-step.

Approximating patch conditional distributions. Let us say that Λ_l and Ψ_p have some value (we could initialize Λ_l and Ψ_p to a random matrix for each l and p). Then, for every p we can select the set K_p of the K most likely topological transformations given the dictionary µ. This can be easily done for each patch in X once P(t_p, l_p | y_p) (see below) is computed, by taking those topological transformations with highest probability per hidden image patch. This takes O(TL) probability evaluations per patch and a sorting operation among T elements. The approximation is necessary for computational reasons and can be made as exact as desired if we are willing to pay the associated computational cost. This approximation was also used in a similar way in [5]; it accounts for cutting off the long tails of the distribution P(t_p | y_p) (or the approximation to P(t_p | Y)) by ignoring very unlikely topological transformations.

Using this succinct representation, we can then compute an approximate E-step. This is equivalent to inferring the image patches x_p given the parameters estimated so far and the current posterior distributions over t_p. Note that (1) X is conditionally independent of the rest of the model variables given t, and (2) we can compute the marginal-conditional distributions over t_p alone by simply using P(t_p | y_p) = Σ_{l_p} P(t_p, l_p | y_p).
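A hedged sketch of this pruning step for a single patch: given a precomputed table of log-likelihoods log p(y_p | l, t) over all candidate pairs (our assumed input format, computable with a routine like log_likelihood_patch above), it keeps the K topological transformations with the highest unnormalized posterior and renormalizes over the retained candidates.

```python
import numpy as np

def select_candidates(log_lik, log_prior_l, K):
    """Approximate E-step pruning for one patch.

    log_lik     : (T, L) array of log p(y_p | l, t) for all candidate pairs
    log_prior_l : (L,) array of log P(l)
    Returns the indices of the K retained transformations t and the
    posterior P(t, l | y_p) normalized over the retained candidates.
    """
    log_joint = log_lik + log_prior_l[None, :]        # log p(y_p | l, t) + log P(l)
    per_t = np.logaddexp.reduce(log_joint, axis=1)    # marginalize over l
    keep = np.argsort(per_t)[-K:]                     # K most likely topological transformations
    pruned = log_joint[keep]
    pruned -= np.logaddexp.reduce(pruned.ravel())     # normalize over retained (t, l) pairs
    return keep, np.exp(pruned)
```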

Inferring the latent image. We have reduced our problem to that of inferring a distribution over the states of X given a distribution over t_p for all p. One way to perform this computation is by performing loopy belief propagation in the MRF for X to compute the posterior marginals over each p(x_p). Loopy belief propagation accounts for approximating p(x_p) by using the belief propagation message passing updates m_{i→j} [20] for several iterations, which in our model can be formally written as follows:

m_{i→j}(x_j) = Σ_{x_i = s_i} P(x_i | t_i) ψ_{ij}(x_i, x_j) ∏_{k∈N(i)\j} m_{k→i}(x_i)

b_i(x_i) = P(x_i | t_i) ∏_{k∈N(i)} m_{k→i}(x_i),

with N(i) the neighbors and s_i the candidates for x_i. These updates guarantee that upon convergence the marginal probabilities p(x_i | Y), obtained by simply normalizing b_i(x_i), would be at least at a local minimum of the corresponding Bethe free energy [27] of the (conditional) MRF. The domain of x_p is in practice discrete, since the probability distribution is concentrated only at the candidate patches given by K_p. Thus, every full iteration has complexity O(K²). With knowledge of the (approximate) conditional marginals, the E-step from Eq. 6 is then given by:

P(t_p, l_p | y_p) ∝ Σ_{x_p} p(x_p | Y) P(t_p | x_p) P(l_p) p(y_p | l_p, t_p).  (7)

Once this is done, we can then perform the M-step (as explained next) and iterate the EM algorithm as usual. Even though loopy belief propagation is not exact (clearly, since this problem cannot be solved exactly) and not guaranteed to converge, some recent work supports this approximation [8, 13, 17]. Other approximations have also been suggested (e.g., [25, 12, 9]). In our case, loopy belief propagation seems to provide accurate posterior marginals for the experiments performed next.
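The sketch below (ours, with assumed data structures) runs these message-passing updates over a graph whose nodes take one of K discrete candidate states; phi[i] plays the role of P(x_i | t_i) and psi[(i, j)] is a precomputed K × K table of pairwise potentials, one entry per ordered pair of neighbors.

```python
import numpy as np

def loopy_bp(phi, psi, neighbors, n_iters=20):
    """Sum-product loopy belief propagation over discrete candidate states.

    phi       : dict node -> (K,) float array, nonnegative local term P(x_i | t_i)
    psi       : dict (i, j) -> (K, K) array, pairwise potential psi_ij(x_i, x_j)
    neighbors : dict node -> list of neighboring nodes
    Returns normalized beliefs b_i(x_i) for every node.
    """
    msgs = {(i, j): np.ones_like(phi[j], dtype=float)
            for i in neighbors for j in neighbors[i]}
    for _ in range(n_iters):
        new_msgs = {}
        for (i, j) in msgs:
            prod = phi[i].copy()                  # P(x_i | t_i) times incoming messages...
            for k in neighbors[i]:
                if k != j:                        # ...from all neighbors except j
                    prod = prod * msgs[(k, i)]
            m = psi[(i, j)].T @ prod              # sum over x_i of psi_ij(x_i, x_j) * prod(x_i)
            new_msgs[(i, j)] = m / m.sum()        # normalize for numerical stability
        msgs = new_msgs
    beliefs = {}
    for i in neighbors:
        b = phi[i].copy()
        for k in neighbors[i]:
            b = b * msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```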

3.2 Learning the Translation Parameters

Once we have performed the E-step, the M-step optimizes the expected value of the model joint distribution under P(t_p, l_p | y_p) with respect to the model parameters Λ_l and Ψ_p. This can be done by computing first derivatives, as in a MAP estimate setting. For linear transformation functions, we can obtain a closed-form solution for the optimal values of the parameters:

Λ_l = [ Σ_p Σ_{t_p} P(t_p, l_p = l | y_p) y_p (Γ_{t_p}µ)^⊤ ] [ Σ_p Σ_{t_p, l_p} P(t_p, l_p | y_p) (Γ_{t_p}µ)(Γ_{t_p}µ)^⊤ ]^{−1}  (8)

Ψ_p = (1/Z_p) Σ_{t_p} Σ_{l_p} P(t_p, l_p | y_p) (y_p − Λ_{l_p}(Γ_{t_p}µ))(y_p − Λ_{l_p}(Γ_{t_p}µ))^⊤,  (9)

with Z_p = Σ_{t_p} Σ_{l_p} P(t_p, l_p | y_p).
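A sketch of the update in Eq. 8 (our own vectorized reading, where the ratio of weighted sums is interpreted as right-multiplication by the inverse of the second sum); the Ψ_p update of Eq. 9 follows the same weighted-sum pattern.

```python
import numpy as np

def m_step_lambda(post, y, dict_patches, l):
    """Closed-form M-step update for Lambda_l (Eq. 8).

    post         : (P, T, L) array of E-step posteriors P(t_p, l_p | y_p)
    y            : (P, S) array of observed patches y_p
    dict_patches : (T, D) array whose row t is the dictionary patch Gamma_t mu
    Returns Lambda_l of shape (S, D).
    """
    w_l = post[:, :, l]                      # P(t_p, l_p = l | y_p), shape (P, T)
    # numerator: sum_p sum_t  w * y_p (Gamma_t mu)^T
    num = np.einsum('pt,ps,tq->sq', w_l, y, dict_patches)
    w_all = post.sum(axis=2)                 # sum over l_p, shape (P, T)
    # denominator: sum_p sum_{t,l}  P(t, l | y_p) (Gamma_t mu)(Gamma_t mu)^T
    den = np.einsum('pt,ts,tq->sq', w_all, dict_patches, dict_patches)
    return num @ np.linalg.pinv(den)         # pseudo-inverse for numerical safety
```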

For non-linear transformation functions, we need to use non-linear function optimization, such as gradient methods. In any case, the gradient can be computed efficiently.

In summary, learning/inference can be done exactly only up to inferring the marginal distribution of the latent patches (a problem which is in NP). One way around this problem is to approximate these marginal distributions and use them to update the model parameters.

4 Experimental Results

We illustrate the image translation method on diverse tasks. Using our method, these tasks can all be seen as instances of the same problem. In each task, between four and seven transformations L were chosen. The dimensionality of the latent and observed image patches was reduced by 25% using Principal Component Analysis. This is mainly a numerical concern, since even a 20 × 20 patch will generate a vector of dimensionality 400, for which it is hard to compute statistics using finite precision computations and small datasets. All the patches considered in these experiments are square patches. The overlap between neighboring patches used in Eq. 4 to compute the clique potentials is set to four pixels deep. We use the luminance value (from the YIQ color space) as our representation for each pixel (instead of its RGB values) [10]. For color images, the color components (IQ) are then simply copied to the final estimated image. Most images can be (much) better perceived directly on a computer screen due to resolution/space limitations (these and more tests are also available at http://www.psi.toronto.edu).
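A small sketch (ours) of the dimensionality reduction step described above: project the flattened patch vectors onto the leading principal components, keeping 75% of the original dimensionality.

```python
import numpy as np

def pca_reduce(patches, keep_frac=0.75):
    """Reduce flattened patches to keep_frac of their dimensionality via PCA
    (a 25% reduction, as used in the experiments)."""
    mean = patches.mean(axis=0)
    X = patches - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # principal directions in rows of Vt
    k = int(round(keep_frac * patches.shape[1]))
    basis = Vt[:k]                                     # (k, S)
    return X @ basis.T, basis, mean                    # reduced patches, basis, mean
```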

4.1 De-blurring / De-noising

In our first experiment, we strongly down-sampled a photographic image, as seen in Fig. 4 (left). This will be our observed image Y. We used four transformations and 15 × 15 patches for both the observed and latent image. Our goal is to obtain a higher-resolution version of the observed image. Thus, as our dictionary, we used photographic patches with the desired resolution (female face photos). The result is shown in Fig. 4 (right). The system was able to infer the high-resolution patches well given the information present in the considerably degraded input image (also, see video for MAP estimates at each iteration). In previous work's experimental evaluation, it is rare to see tests performed with this level of degradation in the input image. Note that our model is meant to solve the more general task of learning arbitrary convolution kernels, i.e., not just a de-blurring kernel.

4.2 Photo-to-Line-Art Translation

Now the goal is to make our input image acquire the line-art attributes of an unrelated source image.

Figure 4: Input image (left) and inferred image (MAP) (right).

Figure 5: Source line art examples: engraving from Gustave Doré's illustrations of Don Quixote, stippling by Claudia Nice [19], and Cupid [11].

For this task we used patches of size 15 × 15 and L = 5. We chose three source examples, shown in Fig. 5, and applied them to the image in Fig. 6 (left). The results show that our method is suitable for performing this task as well. On the positive side, note that even larger-scale properties are accurately displayed by the corresponding inferred images; this provides excellent line consistency across the image. On the negative side, the algorithm has some difficulty in regions with strong depth discontinuities, such as the outer face contour, perhaps because the source image does not contain patches with enough detail.

4.3 Photo-to-Paint Translation

We now apply renowned artistic styles to the input images. We use an image of a well-known painting by van Gogh, shown in Fig. 1 (top), as the image with the desired attributes, and the photo in Fig. 1 (center) as the input. We used L = 4 and a 30 × 30 patch size. The inferred image in Fig. 1 (bottom) inherited the local patch statistics and also global features found in the source painting. We repeated the experiment using the same painting but a different input image; results are shown in Fig. 7. This image also acquired the source style. However, Lena's eyes could not be inferred with enough accuracy (or pleasing artistic detail), perhaps also because of the lack of patches with the correct properties in the original painting. We also employed paintings with different properties; one example is shown in Fig. 8. In order to more carefully observe the details in the image translation, zoomed-in areas of previous examples are shown in Fig. 9.


Figure 6: Input image Y (left) and inferred images using the three source line art example images in Fig. 5 (in the same order).

Figure 8: Source image µ, Cistern in the Park at Chateau Noir by Cézanne (left); input image Y, shore photo by John Shaw [22] (center); inferred image X (right).

5 Relation to Previous Methods

We now briefly compare our model with several relevant approaches proposed to solve more specific problems. The model in [5] was proposed to perform image super-resolution. Using our framework, this approach can be obtained by (1) setting L = 1, i.e., using only one transformation, and (2) setting Λ_1 equal to a fixed and known low-pass filter; in [5] there is no need to estimate Λ because it is given. Another important difference is that the input to this algorithm is different from the input to our algorithm. This model assumes that there exists a set of image pairs for training. Each pair consists of the high- and the low-resolution version of an image (the low-res version can be simply generated using the low-pass filter transformation). Our algorithm does not assume that we have access to these image pairs for training, but that we only have access to one image with the desired statistics. This is a much more difficult computational and modeling task, which also has important practical implications: it is more restrictive to have to find a pair of perfectly aligned images, one with the desired statistics (e.g., style) and the other one obtained in the same mode as the input image we hold.

The model presented in [10], proposed for non-photorealistic rendering, is also, like [5], a model where it is assumed that there exists a set of image pairs for training, with the statistics of Y and X. Λ is not given, but in the supervised approach taken in this work, Λ can be easily found. More specifically, Λ is the equivalent of a nearest neighbor algorithm, which is easy to compute (given the training image pair). Large scale consistency was achieved also using a nearest neighbor approach, which in our model is equivalent to learning the distribution over X from our sample images and using it to define the MRF energy function (i.e., effectively replacing Eq. 4).

Inference in [9] can be seen as inference in our model using only the MRF sub-graph, with observations y_p directly linked to unobserved nodes x_p. Also, instead of belief propagation, in [9] Monte Carlo techniques were proposed to infer the hidden state of the image patches x_p.

6 Conclusions

The image translation approach proposed here provides a general formalism for the analysis of a variety of problems in image processing where the goal is to estimate an unobserved image from another (observed) one. These include a number of fundamental problems previously approached using separate methods. In this sense, the image translation approach can be of practical and theoretical interest. Several practical extensions are possible, for example in the choice of a different approximate inference method, in the choice of clique potential functions used in the MRF, or in the form of the patch transformations. These changes are likely to be application dependent and can be easily incorporated within the framework presented here.

Acknowledgments

We thank Aaron Hertzmann for helpful discussions on non-photorealistic rendering and for evaluation of results.


Figure 7: Input image Y (top), inferred latent image X (bottom).

Figure 9: Two zoomed-in translations from previous examples: source image (real painting) (left) and inferred image (right); input images (not shown here) were the example landscape photos.

References

[1] I. Csiszár and G. Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, 1:205–237, 1984.

[2] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood estimation from incomplete data. Journal of the Royal Statistical Society (B), 39(1), 1977.

[3] A. Efros and W. Freeman. Quilting for texture synthesis and transfer. In International Conference on Computer Vision, 1999.

[4] A. Efros and W. Freeman. Texture synthesis by non-parametric sampling. In SIGGRAPH, 2001.

[5] B. Freeman, E. Pasztor, and O. Carmichael. Learning low-level vision. IJCV, 2000.

[6] B. Frey. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.

[7] B. Frey and N. Jojic. Learning graphical models of images, videos and their spatial transformations. In Uncertainty in Artificial Intelligence, 2000.

[8] B. Frey and D. MacKay. A revolution: Belief propagation in graphs with cycles. In Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.

[9] D. Geman and S. Geman. Stochastic relaxation, Gibbs distributions and Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721–741, 1984.

[10] A. Hertzmann, C. Jacobs, N. Oliver, B. Curless, and D. Salesin. Image analogies. In SIGGRAPH 2001, pages 327–340, 2001. http://www.mrl.nyu.edu/projects/image-analogies/.

[11] A. Hertzmann and D. Zorin. Illustrating smooth surfaces. In SIGGRAPH 2000, pages 517–526, 2001.

[12] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Learning in Graphical Models, M. Jordan (editor), 1998.

[13] F. Kschischang and B. Frey. Iterative decoding of compound codes by probability propagation in graphical models. IEEE J. Selected Areas in Communications, 16:219–230, 1998.

[14] F. Kschischang, B. Frey, and H. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 2001.

[15] S. L. Lauritzen and N. Wermuth. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics, 17:31–57, 1989.

[16] S. Li. Markov Random Field Modeling in Computer Vision. Springer-Verlag, 1995.

[17] R. McEliece, D. MacKay, and J. Cheng. Turbo decoding as an instance of Pearl's belief propagation algorithm. IEEE J. Selected Areas in Communications, 16, 1998.

[18] R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, M. Jordan (editor), 1998.

[19] C. Nice. Sketching Your Favorite Subjects in Pen and Ink. F-W Publications, 1993.

[20] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[21] A. Pentland and B. Horowitz. A practical approach to fractal-based image compression. Digital Images and Human Vision, 1993.

[22] F. Petrie and J. Shaw. The Big Book of Painting Nature in Watercolor. Watson-Guptill, 1990.

[23] J. Portilla and E. Simoncelli. Statistics of complex wavelet coefficients. International Journal of Computer Vision, 40(1), 2000.

[24] R. Schultz and R. Stevenson. A Bayesian approach to image expansion for improved definition. IEEE Transactions on Image Processing, 3(3):233–242, May 1994.

[25] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. In Uncertainty in Artificial Intelligence, 2002.

[26] Y. Weiss. Interpreting images by propagating Bayesian beliefs. In Advances in Neural Information Processing Systems, volume 9, page 908, 1997.

[27] J. Yedidia. An idiosyncratic journey beyond mean field theory. Advanced Mean Field Methods, pages 21–36, 2001.

[28] S. Zhu and D. Mumford. Prior learning and Gibbs reaction-diffusion. IEEE Trans. PAMI, 1997.