
DualSR: Zero-Shot Dual Learning for Real-World Super-Resolution

    Mohammad Emad

    Eindhoven University of Technology

    Eindhoven, The Netherlands

    [email protected]

    Maurice Peemen

    Thermo Fisher Scientific

    Eindhoven, The Netherlands

    [email protected]

    Henk Corporaal

    Eindhoven University of Technology

    Eindhoven, The Netherlands

    [email protected]

    Abstract

Advanced methods for single image super-resolution (SISR) based upon deep learning have demonstrated remarkable reconstruction performance on downscaled images. However, for real-world low-resolution images (e.g. images captured straight from the camera) they often generate blurry images and highlight unpleasant artifacts. The main reason is training data that does not reflect the real-world super-resolution problem: the networks are trained on images downsampled with an ideal (usually bicubic) kernel, whereas for real-world images the degradation process is more complex and can vary from image to image. This paper proposes a new dual-path architecture (DualSR) that learns an image-specific low-to-high resolution mapping using only patches of the input test image. For every image, a downsampler learns the degradation process using a generative adversarial network, and an upsampler learns to super-resolve that specific image. In the DualSR architecture, the upsampler and downsampler are trained simultaneously and improve each other using cycle consistency losses. For better visual quality and elimination of undesired artifacts, the upsampler is constrained by a masked interpolation loss. On standard benchmarks with unknown degradation kernels, DualSR outperforms recent blind and non-blind super-resolution methods in terms of SSIM and generates images with higher perceptual quality. On real-world LR images it generates visually pleasing and artifact-free results.

    1. Introduction

The aim of Single Image Super-Resolution (SISR) is to upsample a low-resolution (LR) image and reconstruct the high-resolution (HR) details. Recently, these super-resolution (SR) methods have entered our daily life by aiding low-end smartphone cameras [31, 23]. Furthermore, the restoration of historical LR photos to clean HR results is performed by novel SISR methods. Even old movies are converted to high-definition video quality.

Figure 1: Example of 2x SR applied to a real-world image from the RealSR [5] dataset. Compared methods: Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], DualSR (ours).


Next to the media industry, these SR techniques have other important applications in medical imaging [33, 28], remote sensing [12], microscopy [10], surveillance [22] and so on.

The introduction of Convolutional Neural Networks (CNNs) has revolutionized computer vision and image processing techniques such as super-resolution. Many recently introduced SR methods based upon deep learning, e.g. [8, 18, 24, 19, 30, 37, 7], learn the complicated LR-HR upsampling relations on huge datasets. Compared to earlier traditional methods, these provide a significantly improved HR result. However, these pre-trained DL methods often perform much worse on images captured straight from a camera. They are trained on clean, noise-free, synthetically generated LR images, while the degradation process for real-world LR images differs from these ideal conditions. To a large extent, this is due to the supervised training scheme that is not representative of the real-world problem. Additionally, each camera differs in acquisition parameters such as the Point Spread Function (PSF) of the sensor. Even images captured by the same camera will differ because of different light conditions, depth of field, blur due to shaking, and so on. These conditions make it intractable to train a single CNN that performs well under all different image degradation conditions.

Blind SR methods solve the super-resolution problem with fewer assumptions on the degradation process. Often these assume that the LR image is the result of the HR image being convolved with a blur kernel k, downsampled by a factor s, with noise n added:

$I_{LR} = (I_{HR} \ast k)\downarrow_s + n$   (1)
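As a rough illustration, the degradation in equation 1 could be simulated with the PyTorch sketch below; the function name degrade, the reflect padding and the direct subsampling are our own illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F

def degrade(hr, kernel, scale=2, noise_std=0.0):
    """Eq. (1): blur the HR image with kernel k, subsample by s, add noise n."""
    c = hr.shape[1]
    k = kernel.expand(c, 1, *kernel.shape).contiguous()   # one copy of the kernel per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(hr, (pad,) * 4, mode="reflect"), k, groups=c)
    lr = blurred[..., ::scale, ::scale]                    # direct subsampling by the scale factor
    return lr + noise_std * torch.randn_like(lr)
```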

Many blind SR methods estimate the degradation process before they perform a parameterized super-resolution operation. The state-of-the-art techniques [2, 11, 3] use deep learning to learn an image-specific downsampler (degradation model parameters) that is used by the upsampler to super-resolve the input LR image. However, estimating a proper downsampler from a single input image is complicated. Especially in the presence of noise or other acquisition artifacts, these methods often fail to estimate good degradation parameters. A wrong degradation estimate severely reduces the effectiveness of the upsampler, and with it the SR performance.

With the true downsampler one can determine the upsampler more accurately. On the other hand, with the true upsampler, one can correctly estimate the downsampler. In other words, the upsampler and downsampler are the inverse of each other and improving one can also improve the other. This relation motivated us to simultaneously train both the upsampler and downsampler in a single pipeline. Inspired by recent unsupervised methods such as CycleGAN [38] and DualGAN [32], we introduce DualSR, a dual-path architecture for super-resolution on real-world LR images. In DualSR, the downsampler learns the patch distribution of the input image using a KernelGAN-based kernel estimation [2]. However, unlike KernelGAN, the upsampler and downsampler are trained simultaneously and improve each other using cycle consistency losses. The upsampler is constrained by a novel masked interpolation loss that gives better visual quality and eliminates undesired artifacts in the output result. On every new input image, the networks are trained from scratch using only patches of that input image. We evaluate our method on existing synthesized and real-world benchmarks, which shows how DualSR outperforms recent SR methods. Figure 1 compares the output of different methods on a real-world LR image.

In summary, the contributions of this work are threefold:

• Inspired by [38, 32], we propose a dual-path architecture optimized for blind super-resolution that can be trained in reasonable time using only the input LR image.

• We introduce a new masked interpolation loss that substantially reduces oversharpening and suppresses unwanted artifacts in the output image.

• We evaluate our method on existing synthetically generated and real-world benchmarks and we compare to the state-of-the-art blind and non-blind SR methods.

    2. Related work

Super-resolution on LR images with an unknown degradation process is not completely new. Before the advent of deep learning, several learning-based methods [21, 29, 13, 14] were introduced to address this problem. Very recently, some deep-learning-based methods for blind and non-blind SR have been proposed. Non-blind SR assumes the degradation process is known beforehand. For example, ZSSR [26] trains an image-specific CNN by using the recurrence of small patches across different scales in a single image. SRMD [36] uses dimensionality stretching that enables a convolutional SR network to take degradation parameters (i.e. blur kernel and noise level) as input. A grid search strategy is used to find the best configuration of these parameters. USRNet [35] is another non-blind SR method that alternately solves a data subproblem and a prior subproblem under the MAP (maximum a posteriori) framework.

On the other hand, blind methods try to estimate the degradation process before upsampling. IKC [11] proposes an iterative correction scheme for blur kernel estimation and uses Spatial Feature Transform (SFT) layers to handle multiple blur kernels. KernelGAN [2] estimates the image-specific blur kernel by internal learning of a Generative Adversarial Network (GAN) using only the LR test image.


Figure 2: The network architecture of the proposed DualSR. $G_{UP}$ is the upsampler, $G_{DN}$ is the downsampler and $D_{DN}$ is the discriminator. The top dataflow represents the forward cycle, where the upsampler is applied before the downsampler ($G_{DN}(G_{UP}(x)) \approx x$); the bottom part represents the backward cycle, where the upsampler is applied after the downsampler ($G_{UP}(G_{DN}(y)) \approx y$).

The estimated kernel can then be used by non-blind methods such as ZSSR and USRNet to obtain a super-resolved output image. Cornillère et al. [6] propose BlindSR, which estimates the degradation setting by analysing the artifacts generated in the super-resolved HR output. They train a kernel discriminator that predicts errors present due to wrong kernel estimation and then recover the correct kernel by minimizing the discriminator error. Very recently, based upon the generalized sampling theory, Hussein et al. [16] propose a closed-form derivation of a correction filter that transforms the degraded LR image such that it matches the bicubic downsampling result. The modified LR image (similar to bicubic downsampling) is then upsampled using existing state-of-the-art DNNs trained on bicubically downsampled images. In the case of isotropic degradations, the transformation to bicubic gives acceptable results. However, for more complicated (non-isotropic) degradations no results are reported.

Two similar works, CycleGAN [38] and DualGAN [32], introduce an effective architecture for image-to-image translation where paired images are not available. They connect the main translator (generator) G: X→Y with its inverse translator F: Y→X. In addition, to reduce the solution space, they introduce a cycle consistency loss that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Recently, new works have proposed similar structures for blind SR. Bulat et al. [3] propose a two-stage pipeline with High-to-Low and Low-to-High GANs. A cycle-in-cycle network structure is introduced in [34]. This model first maps the noisy and blurry input to a cleaned-up LR space; next, a pre-trained SR network is used to upsample the clean LR image. GAN-CIRCLE [33] uses a similar structure as CycleGAN, however the application is super-resolution for Computed Tomography (CT) images. All these CycleGAN-based methods need unpaired LR and HR datasets for training. As explained above, even for images from a single camera, the degradation process may be different and collecting these unpaired datasets is not always possible. This is in contrast to our method, which does not require any data other than the test image itself.

    3. Proposed method

We propose DualSR, a new blind SR method that super-resolves real-world LR images with different acquisition processes (different downsampling kernels). It is a dual-path pipeline inspired by CycleGAN and DualGAN that can be trained in reasonable time using only patches of the LR input image. Figure 2 illustrates the proposed DualSR architecture. In this figure, $G_{UP}$ is the upsampler that is trained to upsample the specific LR image. $G_{DN}$ is the downsampler that learns the degradation process. It aims to downsample the input LR image such that, at patch level, the distribution of the downsampled image is as close as possible to that of the input image itself. It is shown in [2] that this internal distribution of patches can be learned using an internal GAN [25] trained on patches of the input image ($\mathcal{L}_{GAN}$ in equation 2). $D_{DN}$ is the discriminator that learns to distinguish between real patches (patches of the input LR image) and output patches generated by $G_{DN}$.


Ideally, the upsampler and downsampler should be the inverse of each other. This implies that $G_{DN}(G_{UP}(x)) = x$ and $G_{UP}(G_{DN}(y)) = y$ hold. These conditions are shown in figure 2 as the forward and backward cycles, and we refer to their loss functions as the forward and backward cycle consistency losses ($\mathcal{L}_{cycle}$). In the forward cycle, we first apply the upsampler to generate a 2x upsampled image. Then the downsampler is applied and converts the upsampled image back to 1x. Similarly, in the backward cycle, first a 1/2x downsampled version of the input is generated by $G_{DN}$, and then $G_{UP}$ upsamples the image back to the original scale. These two cycle consistency losses ensure that $G_{UP}$ and $G_{DN}$ can revert the operation done by the other.

In addition to the cycle consistency losses, the upsampler is constrained by a novel masked interpolation loss ($\mathcal{L}_{interp}$) that is applied between the input patch x and its 2x upsampled image $G_{UP}(x)$. This loss function encourages the upsampler to preserve the color composition of its input and eliminates unpleasant artifacts without making the output image blurry. The full loss function for training the upsampler and downsampler is:

$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_{cycle}\mathcal{L}_{cycle} + \lambda_{interp}\mathcal{L}_{interp}$   (2)

where the $\lambda$ values control the importance of the different loss terms.
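For concreteness, equation 2 could be assembled as in the short sketch below; loss_gan, loss_cycle and loss_interp stand for the terms defined in Sections 3.1-3.3, and the λ values follow the grid search reported in Section 4.

```python
# Hedged sketch of Eq. (2); the three loss terms are assumed to be scalar tensors
# computed as in Secs. 3.1-3.3, and the lambdas follow the grid search of Sec. 4.
lambda_cycle, lambda_interp = 5.0, 2.0

def total_loss(loss_gan, loss_cycle, loss_interp):
    return loss_gan + lambda_cycle * loss_cycle + lambda_interp * loss_interp
```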

    3.1. Adversarial loss

Accurate degradation model estimation is essential for blind SR. The importance of an accurate estimate of the blur kernel is significantly larger than that of a sophisticated prior (pre-training on an external dataset) [9]. To find the blur kernel, we use the idea of KernelGAN and embed it in our dual-path architecture. It estimates the image-specific degradation kernel using a Generative Adversarial Network (GAN) that tries to preserve the distribution of patches across scales of the LR image. The generator $G_{DN}$ is a model of the degradation process and downsamples the input patch y such that the output $G_{DN}(y)$ is indistinguishable by the discriminator from small input patches. We define the adversarial loss for the generator as:

$\mathcal{L}_{GAN} = \mathbb{E}_y\big[(D_{DN}(G_{DN}(y)) - 1)^2\big] + \mathcal{R}$   (3)

where the regularization term $\mathcal{R}$ applies realistic explicit priors on the estimated kernel (explained in [2]). On the other hand, the discriminator tries to distinguish fake images generated by $G_{DN}$ from real patches of the input LR image and its objective is:

$\mathcal{L}_D = \mathbb{E}_x\big[(D_{DN}(x) - 1)^2\big] + \mathbb{E}_y\big[D_{DN}(G_{DN}(y))^2\big]$   (4)

We also experimented with an adversarial loss for the upsampled image $G_{UP}(x)$ by adding a $D_{UP}$ discriminator. However, without example HR images, it was not helpful and resulted in undesired artifacts in the output image.
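A least-squares GAN reading of equations 3 and 4 is sketched below, assuming d_dn and g_dn are callables implementing $D_{DN}$ and $G_{DN}$, x and y are the real LR patches, and kernel_reg is a hypothetical stand-in for the regularization term $\mathcal{R}$ of [2].

```python
def generator_adv_loss(d_dn, g_dn, y, kernel_reg):
    """Eq. (3): push D_DN to score patches generated by G_DN as real."""
    fake = g_dn(y)                                        # downsampled patch generated by G_DN
    return ((d_dn(fake) - 1.0) ** 2).mean() + kernel_reg()

def discriminator_loss(d_dn, g_dn, x, y):
    """Eq. (4): score real LR patches as 1 and generated patches as 0."""
    real_term = ((d_dn(x) - 1.0) ** 2).mean()
    fake_term = (d_dn(g_dn(y).detach()) ** 2).mean()      # detach: do not update G_DN here
    return real_term + fake_term
```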

Figure 3: Effect of the interpolation loss on the output. (a) SR result without interpolation loss. (b) SR result with uniform interpolation loss: $\|G_{UP}(x) - \mathrm{Bicubic}(x)\|_1$. (c) SR result with the masked interpolation loss of equation 7. (d) Frequency mask generated by equation 6.

    3.2. Cycle consistency loss

Both the forward and backward cycle consistency losses play an essential role in the training of DualSR. The forward cycle facilitates the training convergence of the downsampler $G_{DN}$ and also forces the upsampler $G_{UP}$ to generate images invertible by $G_{DN}$. In the backward cycle, the upsampler learns to reconstruct the input LR image from its downsampled version generated by $G_{DN}$. Learning LR-HR relations from the downsampled version of the input image was first introduced by ZSSR [26], but in DualSR it is implicitly part of the backward cycle loss. The final cycle consistency loss is the sum of the forward and backward losses:

$\mathcal{L}_{cycle} = \mathbb{E}_x\big\|G_{DN}(G_{UP}(x)) - x\big\|_1 + \mathbb{E}_y\big\|G_{UP}(G_{DN}(y)) - y\big\|_1$   (5)
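As a minimal sketch of equation 5, assuming g_up and g_dn are the upsampler and downsampler modules and x, y are LR patches of the two sizes shown in figure 2:

```python
import torch.nn.functional as F

def cycle_loss(g_up, g_dn, x, y):
    """Eq. (5): forward cycle G_DN(G_UP(x)) ~ x and backward cycle G_UP(G_DN(y)) ~ y."""
    forward = F.l1_loss(g_dn(g_up(x)), x)
    backward = F.l1_loss(g_up(g_dn(y)), y)
    return forward + backward
```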

    3.3. Masked interpolation loss

Because there is no direct supervision for training $G_{UP}$, ringing effects often occur around sharp edges of the output image. In addition, unwanted artifacts show up, especially in the low-frequency areas of the output image. This problem is even more severe when $G_{DN}$ does not represent an accurate estimate of the degradation model. Figure 3(a) shows the output of DualSR without interpolation loss when the estimated degradation kernel slightly differs from the ground-truth kernel. To eliminate these artifacts, we introduce a weighted cost function that minimizes the difference between a bicubically upsampled image and the output of $G_{UP}$.

It is well known that bicubic interpolation correctly upsamples low-frequency areas; however, it does not reconstruct high-frequency details. Hence, applying a uniform interpolation cost to all pixels generates an artifact-free but blurry result (see figure 3(b)). To avoid blurriness, we only apply an interpolation cost to the low-frequency parts of the image. For this purpose, we generate a frequency mask ($f_{mask}$) by applying the Sobel operator to the bicubically upsampled image:

$f_{mask} = 1 - \mathrm{Sobel}(\mathrm{Bicubic}(x))$   (6)

This mask has higher values for pixels in low-frequency areas and lower values for pixels in high-frequency areas of the image (see figure 3(d)). We define the masked interpolation loss as:

$\mathcal{L}_{interp} = \mathbb{E}_x\big\|[G_{UP}(x) - \mathrm{Bicubic}(x)] \times f_{mask}\big\|_1$   (7)

It encourages $G_{UP}$ to follow bicubic interpolation only in the low-frequency areas of the image. As shown in figure 3(c), it generates images with sharp edges without adding artifacts.
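Equations 6 and 7 could be realized roughly as follows; the grayscale conversion and the clamping of the Sobel response to [0, 1] are our own assumptions about the mask normalization, not details given in the paper.

```python
import torch
import torch.nn.functional as F

def sobel_magnitude(img):
    """Gradient magnitude of a (N, C, H, W) image, averaged over channels."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(F.pad(gray, (1, 1, 1, 1), mode="replicate"), kx)
    gy = F.conv2d(F.pad(gray, (1, 1, 1, 1), mode="replicate"), ky)
    return (gx ** 2 + gy ** 2).sqrt().clamp(0.0, 1.0)

def masked_interp_loss(g_up, x):
    """Eqs. (6) and (7): L1 distance to the bicubic upsampling, weighted towards flat regions."""
    bicubic = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
    f_mask = 1.0 - sobel_magnitude(bicubic)        # Eq. (6): high where the image is smooth
    return (torch.abs(g_up(x) - bicubic) * f_mask).mean()
```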

    4. Implementation

Network configuration Unlike CycleGAN, our DualSR generators ($G_{UP}$ and $G_{DN}$) have different network architectures. For a single image, the LR-HR conversions can be performed by small networks, unlike the huge networks that train on large datasets. For the upsampler, similar to ZSSR, a simple 8-layer fully convolutional network with ReLU activations is employed. There is a global residual connection between the input and output [18, 26]. We upscale the LR image to the output size before feeding it into the network. The network architecture from KernelGAN is used for the downsampler and the corresponding discriminator. The generator used for downsampling is a deep linear network (without any activations) and the discriminator is a fully convolutional PatchGAN [17] with a receptive field of 7x7. The small receptive field forces the discriminator to use only local features (e.g. edges) of the LR image, instead of relying on high-level global features. Hence, the generator $G_{DN}$ learns the kernel that generates images with a patch distribution similar to the input LR image.
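The following is an illustrative sketch of the two generators described above. The feature widths, kernel sizes of the linear downsampler and the use of a stride-2 convolution for the 2x downscale are assumptions consistent with the KernelGAN-style design, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    """8-layer fully convolutional network with ReLU activations and a global residual."""
    def __init__(self, channels=3, features=64, layers=8):
        super().__init__()
        convs = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(layers - 2):
            convs += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        convs += [nn.Conv2d(features, channels, 3, padding=1)]
        self.body = nn.Sequential(*convs)

    def forward(self, x):
        # the LR input is upscaled to the output size before entering the network
        up = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
        return up + self.body(up)                 # global residual connection

class Downsampler(nn.Module):
    """Deep linear network (no activations) that downsamples by a factor of 2."""
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 7, padding=3),
            nn.Conv2d(64, 64, 5, padding=2),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.Conv2d(64, channels, 1, stride=2),  # stride-2 conv performs the 2x downscale
        )

    def forward(self, y):
        return self.body(y)
```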

Training details We train all networks $G_{UP}$, $G_{DN}$ and $D_{DN}$ from scratch for every input image. In each iteration, we train the generators and the discriminator successively, each time with a batch of two patches from the LR input. The patches are 64x64 and 128x128 (patches x and y in figure 2). Since SR kernels are not always symmetric, no geometric transformation is applied during training or test time. We tuned the hyperparameters in equation 2 with a grid search strategy: we changed $\lambda_{cycle}$ from 0 to 7.5 and $\lambda_{interp}$ from 0 to 4 and calculated the average PSNR for the first 10 images in the DIV2KRK [2] dataset. The results are shown in figure 4; the best performance is obtained for $\lambda_{cycle} = 5$ and $\lambda_{interp} = 2$. We train the networks for 3000 iterations with the ADAM optimizer. The initial learning rate is 0.001 for $G_{UP}$ and 0.0002 for $G_{DN}$ and $D_{DN}$, and both are decreased every 750 iterations. The final super-resolved image is obtained by running the trained upsampler on the LR input image.
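A condensed training-loop sketch following this schedule is given below. Upsampler and Downsampler are the earlier sketches; PatchDiscriminator, crop_patches, kernel_reg and lr_image are hypothetical helpers assumed to exist, and the decay factor 0.1 is an assumption (the paper only states that the rates are decreased every 750 iterations).

```python
import torch

g_up, g_dn, d_dn = Upsampler(), Downsampler(), PatchDiscriminator()

opt_g = torch.optim.Adam([{"params": g_up.parameters(), "lr": 1e-3},
                          {"params": g_dn.parameters(), "lr": 2e-4}])
opt_d = torch.optim.Adam(d_dn.parameters(), lr=2e-4)
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=750, gamma=0.1)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=750, gamma=0.1)

for it in range(3000):
    x = crop_patches(lr_image, size=64, batch=2)      # small patches (forward cycle)
    y = crop_patches(lr_image, size=128, batch=2)     # larger patches (backward cycle)

    # generator step: adversarial + cycle + masked interpolation terms (Eq. 2)
    opt_g.zero_grad()
    loss_g = (generator_adv_loss(d_dn, g_dn, y, kernel_reg)
              + 5.0 * cycle_loss(g_up, g_dn, x, y)
              + 2.0 * masked_interp_loss(g_up, x))
    loss_g.backward()
    opt_g.step()

    # discriminator step (Eq. 4)
    opt_d.zero_grad()
    discriminator_loss(d_dn, g_dn, x, y).backward()
    opt_d.step()

    sched_g.step()
    sched_d.step()

sr_image = g_up(lr_image)   # final output: trained upsampler applied to the full LR input
```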

Figure 4: Performance analysis of DualSR with different values for $\lambda_{cycle}$ and $\lambda_{interp}$. For the left plot, we set $\lambda_{interp} = 2$ and change $\lambda_{cycle}$ from 0 to 7.5. For the right plot, we set $\lambda_{cycle} = 5$ and change $\lambda_{interp}$ from 0 to 4. PSNR values (dB) are calculated on the first 10 images in the DIV2KRK [2] dataset.

Run-time Since we train DualSR on fixed-size patches cropped from the input image, the training time is almost independent of the image size. The average training+inference time for our method is 233 seconds on an RTX 2080 Ti GPU. For the combination of KernelGAN [2] + ZSSR [26] the run-time is 281 seconds and for BlindSR [6] it is 370 seconds. Supervised deep learning SR methods like SAN [7] have a very long training time, and the image size significantly influences their inference time. SAN+ takes 298 seconds, on average, to super-resolve each image of the DIV2KRK benchmark [2].

    5. Experiments and results

DualSR is designed to super-resolve real-world LR images. However, these images have no ground truth, so they do not enable a quantitative evaluation. Hence, we use synthetically generated LR images with unknown degradation settings (from the DIV2KRK [2], Urban100 [15] and NTIRE2017 [27] benchmarks) to compare DualSR against state-of-the-art methods. In addition, we experiment on real-world LR images (from the RealSR [5] dataset) and compare the results qualitatively. Finally, we perform an ablation study to investigate the influence of each loss term in the overall loss function.

    5.1. Evaluation on synthesized images

In this section, we evaluate DualSR on three benchmarks that simulate real-world LR images. The first one is the DIV2KRK [2] benchmark, which uses the DIV2K [1] validation set (100 high-quality images) as HR ground truth. It constructs the LR images by convolving the image with an 11x11 anisotropic Gaussian blur kernel before downsampling. Kernels vary for every image and have different shapes and orientations. A uniform multiplicative noise is also applied to each kernel. We use the same degradation approach to generate LR images from the Urban100 [15] dataset.
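To make this setup concrete, an anisotropic Gaussian degradation of this kind could be simulated roughly as follows, reusing the degrade() sketch from the introduction; the sigma values, rotation angle and noise amplitude are illustrative assumptions, not the exact sampling protocol of [2].

```python
import math
import torch

def anisotropic_gaussian_kernel(size=11, sigma_x=2.0, sigma_y=0.8, theta=0.6):
    """11x11 anisotropic Gaussian with orientation theta and uniform multiplicative noise."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    xr = xx * math.cos(theta) + yy * math.sin(theta)      # rotate the coordinate grid
    yr = -xx * math.sin(theta) + yy * math.cos(theta)
    k = torch.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    k = k * (1.0 + 0.25 * (torch.rand_like(k) - 0.5))     # uniform multiplicative noise
    return k / k.sum()

hr = torch.rand(1, 3, 256, 256)                           # stand-in HR image
lr = degrade(hr, anisotropic_gaussian_kernel(), scale=2)  # degrade() from the sketch in Sec. 1
```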


Category                Method                             DIV2KRK [2]     Urban100 [15]   NTIRE2017 [27]
1st                     Bicubic Interpolation              28.73 / 0.8040  23.32 / 0.6859  27.72 / 0.7689
(Bicubically trained)   Bicubic kernel + ZSSR [26]         29.10 / 0.8215  23.78 / 0.7143  27.80 / 0.7726
                        EDSR+ [20]                         29.17 / 0.8216  23.01 / 0.6857  27.78 / 0.7720
                        SAN+ [7]                           29.21 / 0.8232  23.81 / 0.7153  27.78 / 0.7721
2nd                     KernelGAN [2] + ZSSR               30.36 / 0.8669  24.66 / 0.7706  27.53 / 0.7572
(Blind methods)         KernelGAN + ZSSR (masked interp)   30.40 / 0.8595  24.71 / 0.7692  28.04 / 0.7808
                        KernelGAN + USRNet [35]            27.94 / 0.8084  21.81 / 0.6827  23.84 / 0.6662
                        BlindSR [6]                        31.36 / 0.8720  25.18 / 0.7742  28.35 / 0.7931
                        DualSR (ours)                      30.92 / 0.8728  25.04 / 0.7803  28.82 / 0.8045
3rd                     GT kernel + ZSSR                   32.44 / 0.8955  26.38 / 0.8252  -
(Oracle kernel)         GT kernel + USRNet                 32.23 / 0.8981  24.89 / 0.7821  -
                        GT kernel + DualSR                 32.72 / 0.9030  26.04 / 0.8122  -

Table 1: Quantitative results (PSNR / SSIM) for 2x SR on different datasets. For comparison to other works, the PSNR and SSIM values of the Y channel are reported. The best performance numbers are highlighted in red and the second best numbers are highlighted in blue.

Figure 5: Qualitative comparison for 2x SR on synthetically generated datasets (DIV2KRK [2], NTIRE2017 [27] and Urban100 [15] examples). Compared methods: LR input, Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], KernelGAN [2] + USRNet [35], BlindSR [6], DualSR (ours), HR.


Figure 6: Qualitative comparison for 2x SR on real-world images (RealSR dataset). Compared methods: LR input, Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], KernelGAN [2] + USRNet [35], BlindSR [6], DualSR (ours), HR.

As the third benchmark, we use the track 2 dataset of the NTIRE2017 SISR challenge [27]. Similar to DIV2KRK, this benchmark uses the DIV2K validation set as HR ground truth but differs in how the LR counterparts are synthesized: the downsampling operator is more complex and unknown for this benchmark.

Table 1 reports the PSNR and SSIM values. There are three categories of SR results in this table. The first category contains SR methods that are trained on images downsampled with bicubic interpolation. To the best of our knowledge, SAN+ [7] is the state of the art in this category. For comparison, we also report the result of ZSSR [26] without kernel estimation (ZSSR using the bicubic kernel as downsampler). Methods in this category are evaluated on a degradation process different from bicubic and, due to their inflexibility, they perform poorly. The second category contains blind SR results. For non-blind methods like ZSSR and USRNet [35], we use the blur kernel estimated by KernelGAN [2]. In contrast, BlindSR [6] and our DualSR method have an integrated degradation kernel estimator. Finally, in the third category, the potential of the previous methods is evaluated by providing the ground-truth blur kernels. There are no NTIRE2017 numbers, since the ground-truth kernel is not available.

According to table 1, the combination of KernelGAN and ZSSR performs better on DIV2KRK and Urban100 than the methods from the first category. However, for the NTIRE2017 benchmark the combination of KernelGAN and ZSSR provides lower quality. This is mainly because the degradation process is more complex and is not well estimated by KernelGAN. Integrating our novel masked interpolation loss into the ZSSR architecture improves the results for NTIRE2017 substantially, even if the estimated kernel is not accurate. We observe that USRNet performs well when a GT kernel is provided, but for realistic kernel estimation scenarios (kernel estimated by KernelGAN) its performance is inferior to other SR methods. Finally, BlindSR and DualSR provide the best performance. BlindSR assumes that the blur kernel is a convolution of classic filters with an anisotropic Gaussian kernel. As a result, it provides the best PSNR numbers for the DIV2KRK and Urban100 benchmarks, where the blur kernel is Gaussian. However, for the NTIRE2017 benchmark the degradation process is not Gaussian, so here DualSR outperforms BlindSR in both metrics. In addition, DualSR produces images with better perceptual quality and it outperforms the other methods on all datasets in terms of SSIM.

A qualitative comparison of the different SR methods is shown in Figure 5. Note that bicubic interpolation and SAN+ tend to produce blurry and oversmoothed images. On the other hand, the results of KernelGAN+ZSSR and KernelGAN+USRNet are oversharpened and contain severe ringing artifacts around edges and in smooth areas of the image. Thanks to the masked interpolation loss and improved kernel estimation, these artifacts are not present in the DualSR results. In comparison with BlindSR, DualSR generates sharper images with better visual quality.


5.2. Evaluation on real data

After the experiments on synthesized images, we evaluate our method on real-world LR images. For this purpose, we use the new RealSR [5] dataset, which was used in the NTIRE2019 competition [4]. This dataset contains raw images captured by DSLR cameras. Multiple images of the same scene have been captured with different focal lengths. Images taken at longer focal lengths contain finer details and can be considered as HR counterparts for images taken at shorter focal lengths. Although RealSR provides images at different scales, it is hard to obtain image pairs that are perfectly aligned, because of complicated misalignment between images and changes in the imaging system introduced by adjusting the focal length. As a result, we only consider a visual comparison of the SR results. We use images captured at a 28mm focal length as LR inputs and images taken at a 50mm focal length as HR counterparts.

Figure 1 and figure 6 show the visual comparison of 2x upsampling on the RealSR dataset. The degradation process is unknown and complicated for these LR images. Therefore, bicubic and SAN+ produce blurry results. Note that the ringing artifacts produced by KernelGAN+ZSSR and KernelGAN+USRNet are even more severe than in figure 5, which shows that for real-world images KernelGAN does not find the correct kernel. Even BlindSR cannot estimate the degradation process correctly and produces blurry results. That is because, for real-world images, the degradation process is more complicated than a simple anisotropic Gaussian kernel. In contrast, DualSR produces artifact-free, photo-realistic natural images.

    5.3. Ablation study

To study the contribution of each term in the loss function of our dual-path architecture, we compare DualSR with ablations of the full version. Figure 7 shows a visual comparison of the different variants on two real-world images and table 2 reports PSNR and SSIM values on the DIV2KRK benchmark. First, we evaluate our method in the absence of the masked interpolation loss. This model suffers from oversharpening and intense artifacts in the output image. Next, we remove the forward and backward cycle consistency losses in separate experiments (in the presence of the interpolation loss). In both cases, the results are not as sharp as the proposed full DualSR output. Having the cycle consistency losses and the interpolation loss combined in the objective function makes DualSR capable of generating realistic natural images without unwanted artifacts.

    6. Conclusion

Supervised deep learning methods cannot perform well when there is a mismatch between training and test data.

Figure 7: 2x SR results on real-world images (RealSR) generated by different variations of DualSR: w/o masked interp loss, w/o forward cycle loss, w/o backward cycle loss, and the DualSR full loss.

    Method PSNR SSIM

    DualSR w/o masked interp loss 30.15 0.8658

    DualSR w/o forward cycle loss 30.33 0.8505

    DualSR w/o backward cycle loss 30.16 0.8495

    DualSR (final) 30.92 0.8728

    Table 2: Quantitative comparison between different variations of

    DualSR on DIV2KRK dataset.

This is very problematic for the real-world SR problem, where the acquisition process varies for every image. We proposed DualSR, a small dual-path architecture that learns the image-specific LR-HR relations per image. It consists of a downsampler and an upsampler that improve each other during training using cycle consistency losses. In addition, we introduced a new masked interpolation loss that removes artifacts from low-frequency areas of the image without smoothing the edges. Experimental results demonstrate a significant improvement over state-of-the-art SR methods. Not relying on external datasets makes DualSR very adaptive to varying conditions: for example, there is no need to retrain on a large dataset when the camera lens settings change. This flexibility is very important when the degradation process varies a lot. Our future work will aim to extend DualSR to larger scale factors and harder use cases with more extreme degradation settings. Example applications could be in the domain of medical imaging or electron microscopy. In these domains, a super-resolution model should not hallucinate but still generate a good high-resolution image.

    References

    [1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge

    on single image super-resolution: Dataset and study. In Pro-

    ceedings of the IEEE Conference on Computer Vision and

    Pattern Recognition Workshops, pages 126–135, 2017.

    [2] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind

    super-resolution kernel estimation using an internal-gan. In

    Advances in Neural Information Processing Systems, pages

    284–293, 2019.


[3] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To

    learn image super-resolution, use a gan to learn how to do

    image degradation first. In Proceedings of the European con-

    ference on computer vision (ECCV), pages 185–200, 2018.

    [4] Jianrui Cai, Shuhang Gu, Radu Timofte, and Lei Zhang.

    Ntire 2019 challenge on real image super-resolution: Meth-

    ods and results. In Proceedings of the IEEE Conference on

    Computer Vision and Pattern Recognition Workshops, 2019.

    [5] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei

    Zhang. Toward real-world single image super-resolution: A

    new benchmark and a new model. In Proceedings of the

    IEEE International Conference on Computer Vision, pages

    3086–3095, 2019.

    [6] Victor Cornillère, Abdelaziz Djelouah, Wang Yifan, Olga

    Sorkine-Hornung, and Christopher Schroers. Blind image

    super resolution with spatially variant degradations. ACM

    Transactions on Graphics (proceedings of ACM SIGGRAPH

    ASIA), 38(6), 2019.

    [7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and

    Lei Zhang. Second-order attention network for single im-

    age super-resolution. In Proceedings of the IEEE Conference

    on Computer Vision and Pattern Recognition, pages 11065–

    11074, 2019.

    [8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou

    Tang. Image super-resolution using deep convolutional net-

    works. IEEE transactions on pattern analysis and machine

    intelligence, 38(2):295–307, 2015.

    [9] Netalee Efrat, Daniel Glasner, Alexander Apartsin, Boaz

    Nadler, and Anat Levin. Accurate blur models vs. image pri-

    ors in single image super-resolution. In Proceedings of the

    IEEE International Conference on Computer Vision, pages

    2832–2839, 2013.

    [10] Linjing Fang, Fred Monroe, Sammy Weiser Novak, Lyn-

    dsey Kirk, Cara R Schiavon, B Yu Seungyoon, Tong

    Zhang, Melissa Wu, Kyle Kastner, Yoshiyuki Kubota, et al.

    Deep learning-based point-scanning super-resolution imag-

    ing. bioRxiv, page 740548, 2019.

    [11] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong.

    Blind super-resolution with iterative kernel correction. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 1604–1613, 2019.

    [12] Juan Mario Haut, Ruben Fernandez-Beltran, Mercedes E

    Paoletti, Javier Plaza, Antonio Plaza, and Filiberto Pla. A

    new deep generative network for unsupervised remote sens-

    ing single-image super-resolution. IEEE Transactions on

    Geoscience and Remote Sensing, 56(11):6792–6810, 2018.

    [13] He He and Wan-Chi Siu. Single image super-resolution us-

    ing gaussian process regression. In CVPR 2011, pages 449–

    456. IEEE, 2011.

    [14] Yu He, Kim-Hui Yap, Li Chen, and Lap-Pui Chau. A soft

    map framework for blind super-resolution image reconstruc-

    tion. Image and Vision Computing, 27(4):364–373, 2009.

    [15] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single

    image super-resolution from transformed self-exemplars. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 5197–5206, 2015.

    [16] Shady Abu Hussein, Tom Tirer, and Raja Giryes. Correction

    filter for single image super-resolution: Robustifying off-the-

    shelf deep super-resolvers. In Proceedings of the IEEE/CVF

    Conference on Computer Vision and Pattern Recognition,

    pages 1428–1437, 2020.

    [17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A

    Efros. Image-to-image translation with conditional adver-

    sarial networks. In Proceedings of the IEEE conference on

    computer vision and pattern recognition, pages 1125–1134,

    2017.

    [18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate

    image super-resolution using very deep convolutional net-

    works. In Proceedings of the IEEE conference on computer

    vision and pattern recognition, pages 1646–1654, 2016.

    [19] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero,

    Andrew Cunningham, Alejandro Acosta, Andrew Aitken,

    Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-

    realistic single image super-resolution using a generative ad-

    versarial network. In Proceedings of the IEEE conference on

    computer vision and pattern recognition, pages 4681–4690,

    2017.

    [20] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and

    Kyoung Mu Lee. Enhanced deep residual networks for single

    image super-resolution. In Proceedings of the IEEE confer-

    ence on computer vision and pattern recognition workshops,

    pages 136–144, 2017.

    [21] Tomer Michaeli and Michal Irani. Nonparametric blind

    super-resolution. In Proceedings of the IEEE International

    Conference on Computer Vision, pages 945–952, 2013.

    [22] Pejman Rasti, Tonis Uiboupin, Sergio Escalera, and Gholam-

    reza Anbarjafari. Convolutional neural network super resolu-

    tion for face recognition in surveillance monitoring. In Inter-

    national conference on articulated motion and deformable

    objects, pages 175–184. Springer, 2016.

    [23] Yaniv Romano, John Isidoro, and Peyman Milanfar. Raisr:

    Rapid and accurate image super resolution. IEEE Transac-

    tions on Computational Imaging, 3(1):110–125, 2016.

    [24] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz,

    Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan

    Wang. Real-time single image and video super-resolution

    using an efficient sub-pixel convolutional neural network. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 1874–1883, 2016.

    [25] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani.

    Ingan: Capturing and retargeting the ”dna” of a natural im-

    age. In The IEEE International Conference on Computer

    Vision (ICCV), 2019.

    [26] Assaf Shocher, Nadav Cohen, and Michal Irani. Zero-shot

    super-resolution using deep internal learning. In Proceed-

    ings of the IEEE Conference on Computer Vision and Pattern

    Recognition, pages 3118–3126, 2018.

    [27] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-

    Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single

    image super-resolution: Methods and results. In Proceed-

    ings of the IEEE conference on computer vision and pattern

    recognition workshops, pages 114–125, 2017.

    [28] Kensuke Umehara, Junko Ota, and Takayuki Ishida. Appli-

    cation of super-resolution convolutional neural network for


enhancing image resolution in chest ct. Journal of digital

    imaging, 31(4):441–450, 2018.

    [29] Qiang Wang, Xiaoou Tang, and Harry Shum. Patch based

    blind image super resolution. In Tenth IEEE International

    Conference on Computer Vision (ICCV’05) Volume 1, vol-

    ume 1, pages 709–716. IEEE, 2005.

    [30] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu,

    Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En-

    hanced super-resolution generative adversarial networks. In

    Proceedings of the European Conference on Computer Vi-

    sion (ECCV), 2018.

    [31] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst,

    Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc

    Levoy, and Peyman Milanfar. Handheld multi-frame super-

    resolution. ACM Transactions on Graphics (TOG), 38(4):1–

    18, 2019.

    [32] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dual-

    gan: Unsupervised dual learning for image-to-image trans-

    lation. In Proceedings of the IEEE international conference

    on computer vision, pages 2849–2857, 2017.

    [33] Chenyu You, Guang Li, Yi Zhang, Xiaoliu Zhang, Hong-

    ming Shan, Mengzhou Li, Shenghong Ju, Zhen Zhao,

    Zhuiyang Zhang, Wenxiang Cong, et al. Ct super-resolution

    gan constrained by the identical, residual, and cycle learn-

    ing ensemble (gan-circle). IEEE Transactions on Medical

    Imaging, 39(1):188–203, 2019.

    [34] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang,

    Chao Dong, and Liang Lin. Unsupervised image super-

    resolution using cycle-in-cycle generative adversarial net-

    works. In Proceedings of the IEEE Conference on Computer

    Vision and Pattern Recognition Workshops, pages 701–710,

    2018.

    [35] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfold-

    ing network for image super-resolution. In Proceedings of

    the IEEE/CVF Conference on Computer Vision and Pattern

    Recognition, pages 3217–3226, 2020.

    [36] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning

    a single convolutional super-resolution network for multi-

    ple degradations. In Proceedings of the IEEE Conference

    on Computer Vision and Pattern Recognition, pages 3262–

    3271, 2018.

    [37] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng

    Zhong, and Yun Fu. Image super-resolution using very deep

    residual channel attention networks. In Proceedings of the

    European Conference on Computer Vision (ECCV), pages

    286–301, 2018.

    [38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A

    Efros. Unpaired image-to-image translation using cycle-

    consistent adversarial networks. In Proceedings of the IEEE

    international conference on computer vision, pages 2223–

    2232, 2017.
