
DualSR: Zero-Shot Dual Learning for Real-World Super-Resolution

    Mohammad Emad

    Eindhoven University of Technology

    Eindhoven, The Netherlands

    [email protected]

    Maurice Peemen

    Thermo Fisher Scientific

    Eindhoven, The Netherlands

    [email protected]

    Henk Corporaal

    Eindhoven University of Technology

    Eindhoven, The Netherlands

    [email protected]

    Abstract

Advanced methods for single image super-resolution (SISR) based upon deep learning have demonstrated remarkable reconstruction performance on downscaled images. However, for real-world low-resolution images (e.g. images captured straight from the camera) they often generate blurry images and highlight unpleasant artifacts. The main reason is training data that does not reflect the real-world super-resolution problem: the networks are trained on images downsampled with an ideal (usually bicubic) kernel, whereas for real-world images the degradation process is more complex and can vary from image to image. This paper proposes a new dual-path architecture (DualSR) that learns an image-specific low-to-high resolution mapping using only patches of the input test image. For every image, a downsampler learns the degradation process using a generative adversarial network, and an upsampler learns to super-resolve that specific image. In the DualSR architecture, the upsampler and downsampler are trained simultaneously and improve each other using cycle consistency losses. For better visual quality and elimination of undesired artifacts, the upsampler is constrained by a masked interpolation loss. On standard benchmarks with unknown degradation kernels, DualSR outperforms recent blind and non-blind super-resolution methods in terms of SSIM and generates images with higher perceptual quality. On real-world LR images it generates visually pleasing and artifact-free results.

    1. Introduction

The aim of Single Image Super-Resolution (SISR) is to upsample a low-resolution (LR) image and reconstruct the high-resolution (HR) details. Recently, these super-resolution (SR) methods have entered our daily life by aiding low-end smartphone cameras [31, 23]. Furthermore, the restoration of historical LR photos to clean HR results is performed by novel SISR methods. Even old movies are converted to high-definition video quality.

Figure 1: Example of 2x SR applied to a real-world image from the RealSR [5] dataset. Compared methods: Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], DualSR (ours).


Next to the media industry, these SR techniques have other important applications in medical imaging [33, 28], remote sensing [12], microscopy [10], surveillance [22] and so on.

The introduction of Convolutional Neural Networks (CNNs) has revolutionized computer vision and image processing techniques such as super-resolution. Many recently introduced SR methods based upon deep learning, e.g. [8, 18, 24, 19, 30, 37, 7], learn the complicated LR-HR upsampling relations on huge datasets. Compared to earlier traditional methods, these provide a significantly improved HR result. However, these pre-trained DL methods often perform much worse on images captured straight from a camera. They are trained on clean, noise-free, synthetically generated LR images, while the degradation process for real-world LR images differs from these ideal conditions. To a large extent, this is due to the supervised training scheme that is not representative of the real-world problem. Additionally, each camera differs in acquisition parameters such as the Point Spread Function (PSF) of the sensor. Even images captured by the same camera will differ because of different light conditions, depth of field, blur due to shaking, and so on. These conditions make it intractable to train a single CNN that performs well under all different image degradation conditions.

Blind SR methods solve the super-resolution problem with fewer assumptions on the degradation process. Often these assume that the LR image is the result of the HR image being convolved with a blur kernel k, downsampled by a factor s, with noise n added:

$I_{LR} = (I_{HR} \ast k)\downarrow_s + n$   (1)
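As a rough illustration, the degradation in equation 1 could be simulated with the PyTorch sketch below; the function name degrade, the reflect padding and the direct subsampling are our own illustrative choices, not part of the paper.

```python
import torch
import torch.nn.functional as F

def degrade(hr, kernel, scale=2, noise_std=0.0):
    """Eq. (1): blur the HR image with kernel k, subsample by s, add noise n."""
    c = hr.shape[1]
    k = kernel.expand(c, 1, *kernel.shape).contiguous()   # one copy of the kernel per channel
    pad = kernel.shape[-1] // 2
    blurred = F.conv2d(F.pad(hr, (pad,) * 4, mode="reflect"), k, groups=c)
    lr = blurred[..., ::scale, ::scale]                    # direct subsampling by the scale factor
    return lr + noise_std * torch.randn_like(lr)
```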

Many blind SR methods estimate the degradation process before they perform a parameterized super-resolution operation. The state-of-the-art techniques [2, 11, 3] use deep learning to learn an image-specific downsampler (degradation model parameters) that is used by the upsampler to super-resolve the input LR image. However, estimating a proper downsampler from a single input image is complicated. Especially in the presence of noise or other acquisition artifacts, these methods often fail to estimate good degradation parameters. A wrong degradation estimate severely reduces the effectiveness of the upsampler, and with it the SR performance.

With the true downsampler one can determine the upsampler more accurately. On the other hand, with the true upsampler, one can correctly estimate the downsampler. In other words, the upsampler and downsampler are the inverse of each other and improving one can also improve the other. This relation motivated us to simultaneously train both the upsampler and downsampler in a single pipeline. Inspired by recent unsupervised methods such as CycleGAN [38] and DualGAN [32], we introduce DualSR, a dual-path architecture for super-resolution on real-world LR images. In DualSR, the downsampler learns the patch distribution of the input image using a KernelGAN-based kernel estimation [2]. However, unlike KernelGAN, the upsampler and downsampler are trained simultaneously and improve each other using cycle consistency losses. The upsampler is constrained by a novel masked interpolation loss that gives better visual quality and eliminates undesired artifacts in the output result. On every new input image, the networks are trained from scratch using only patches of that input image. We evaluate our method on existing synthesized and real-world benchmarks, which shows how DualSR outperforms recent SR methods. Figure 1 compares the output of different methods on a real-world LR image.

In summary, the contributions of this work are threefold:

• Inspired by [38, 32], we propose a dual-path architecture optimized for blind super-resolution that can be trained in reasonable time using only the input LR image.

• We introduce a new masked interpolation loss that substantially reduces oversharpening and suppresses unwanted artifacts in the output image.

• We evaluate our method on existing synthetically generated and real-world benchmarks and we compare to the state-of-the-art blind and non-blind SR methods.

    2. Related work

Super-resolution on LR images with an unknown degradation process is not completely new. Before the advent of deep learning, several learning-based methods [21, 29, 13, 14] were introduced to address this problem. Very recently, some deep-learning-based methods for blind and non-blind SR have been proposed. Non-blind SR assumes the degradation process is known beforehand. For example, ZSSR [26] trains an image-specific CNN by using the recurrence of small patches across different scales in a single image. SRMD [36] uses dimensionality stretching that enables a convolutional SR network to take degradation parameters (i.e. blur kernel and noise level) as input. A grid search strategy is used to find the best configuration of these parameters. USRNet [35] is another non-blind SR method that alternately solves a data subproblem and a prior subproblem under the MAP (maximum a posteriori) framework.

On the other hand, blind methods try to estimate the degradation process before upsampling. IKC [11] proposes an iterative correction scheme for blur kernel estimation and uses Spatial Feature Transform (SFT) layers to handle multiple blur kernels. KernelGAN [2] estimates the image-specific blur kernel by internal learning of a Generative Adversarial Network (GAN) using only the LR test image.


Figure 2: The network architecture of the proposed DualSR. $G_{UP}$ is the upsampler, $G_{DN}$ is the downsampler and $D_{DN}$ is the discriminator. The top dataflow represents the forward cycle, where the upsampler is applied before the downsampler ($G_{DN}(G_{UP}(x)) \approx x$); the bottom part represents the backward cycle, where the upsampler is applied after the downsampler ($G_{UP}(G_{DN}(y)) \approx y$).

The estimated kernel can then be used by non-blind methods such as ZSSR and USRNet to obtain a super-resolved output image. Cornillère et al. [6] propose BlindSR, which estimates the degradation setting by analysing the artifacts generated in the super-resolved HR output. They train a kernel discriminator that predicts errors present due to wrong kernel estimation and then recover the correct kernel by minimizing the discriminator error. Very recently, based upon the generalized sampling theory, Hussein et al. [16] propose a closed-form derivation of a correction filter that transforms the degraded LR image such that it matches the bicubic downsampling result. The modified LR image (similar to bicubic downsampling) is then upsampled using existing state-of-the-art DNNs trained on bicubically downsampled images. In the case of isotropic degradations, the transformation to bicubic gives acceptable results. However, for more complicated (non-isotropic) degradations no results are reported.

Two similar works, CycleGAN [38] and DualGAN [32], introduce an effective architecture for image-to-image translation where paired images are not available. They connect the main translator (generator) G: X→Y with its inverse translator F: Y→X. In addition, to reduce the solution space, they introduce a cycle consistency loss that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Recently, new works have proposed similar structures for blind SR. Bulat et al. [3] propose a two-stage pipeline with High-to-Low and Low-to-High GANs. A cycle-in-cycle network structure is introduced in [34]. This model first maps the noisy and blurry input to a cleaned-up LR space; next, a pre-trained SR network is used to upsample the clean LR image. GAN-CIRCLE [33] uses a similar structure as CycleGAN, however the application is super-resolution for Computed Tomography (CT) images. All these CycleGAN-based methods need unpaired LR and HR datasets for training. As explained above, even for images from a single camera, the degradation process may be different and collecting these unpaired datasets is not always possible. This is in contrast to our method, which does not require any data other than the test image itself.

    3. Proposed method

We propose DualSR, a new blind SR method that super-resolves real-world LR images with different acquisition processes (different downsampling kernels). It is a dual-path pipeline inspired by CycleGAN and DualGAN that can be trained in reasonable time using only patches of the LR input image. Figure 2 illustrates the proposed DualSR architecture. In this figure, $G_{UP}$ is the upsampler that is trained to upsample the specific LR image. $G_{DN}$ is the downsampler that learns the degradation process. It aims to downsample the input LR image such that, at patch level, the distribution of the downsampled image is as close as possible to that of the input image itself. It is shown in [2] that this internal distribution of patches can be learned using an internal GAN [25] trained on patches of the input image ($\mathcal{L}_{GAN}$ in equation 2). $D_{DN}$ is the discriminator that learns to distinguish between real patches (patches of the input LR image) and output patches generated by $G_{DN}$.


Ideally, the upsampler and downsampler should be the inverse of each other. This implies that $G_{DN}(G_{UP}(x)) = x$ and $G_{UP}(G_{DN}(y)) = y$ hold. These conditions are shown in figure 2 as the forward and backward cycles, and we refer to their loss functions as the forward and backward cycle consistency losses ($\mathcal{L}_{cycle}$). In the forward cycle, we first apply the upsampler to generate a 2x upsampled image. Then the downsampler is applied and converts the upsampled image back to 1x. Similarly, in the backward cycle, first a 1/2x downsampled version of the input is generated by $G_{DN}$, and then $G_{UP}$ upsamples the image back to the original scale. These two cycle consistency losses ensure that $G_{UP}$ and $G_{DN}$ can revert the operation done by the other.

In addition to the cycle consistency losses, the upsampler is constrained by a novel masked interpolation loss ($\mathcal{L}_{interp}$) that is applied between the input patch x and its 2x upsampled image $G_{UP}(x)$. This loss function encourages the upsampler to preserve the color composition of its input and eliminates unpleasant artifacts without making the output image blurry. The full loss function for training the upsampler and downsampler is:

$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_{cycle}\mathcal{L}_{cycle} + \lambda_{interp}\mathcal{L}_{interp}$   (2)

where the $\lambda$ values control the importance of the different loss terms.
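For concreteness, equation 2 could be assembled as in the short sketch below; loss_gan, loss_cycle and loss_interp stand for the terms defined in Sections 3.1-3.3, and the λ values follow the grid search reported in Section 4.

```python
# Hedged sketch of Eq. (2); the three loss terms are assumed to be scalar tensors
# computed as in Secs. 3.1-3.3, and the lambdas follow the grid search of Sec. 4.
lambda_cycle, lambda_interp = 5.0, 2.0

def total_loss(loss_gan, loss_cycle, loss_interp):
    return loss_gan + lambda_cycle * loss_cycle + lambda_interp * loss_interp
```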

    3.1. Adversarial loss

Accurate degradation model estimation is essential for blind SR. The importance of an accurate estimate of the blur kernel is significantly larger than that of a sophisticated prior (pre-training on an external dataset) [9]. To find the blur kernel, we use the idea of KernelGAN and embed it in our dual-path architecture. It estimates the image-specific degradation kernel using a Generative Adversarial Network (GAN) that tries to preserve the distribution of patches across scales of the LR image. The generator $G_{DN}$ is a model of the degradation process and downsamples the input patch y such that the output $G_{DN}(y)$ is indistinguishable by the discriminator from small input patches. We define the adversarial loss for the generator as:

$\mathcal{L}_{GAN} = \mathbb{E}_y\big[(D_{DN}(G_{DN}(y)) - 1)^2\big] + \mathcal{R}$   (3)

where the regularization term $\mathcal{R}$ applies realistic explicit priors on the estimated kernel (explained in [2]). On the other hand, the discriminator tries to distinguish fake images generated by $G_{DN}$ from real patches of the input LR image and its objective is:

$\mathcal{L}_D = \mathbb{E}_x\big[(D_{DN}(x) - 1)^2\big] + \mathbb{E}_y\big[D_{DN}(G_{DN}(y))^2\big]$   (4)

We also experimented with an adversarial loss for the upsampled image $G_{UP}(x)$ by adding a $D_{UP}$ discriminator. However, without example HR images, it was not helpful and resulted in undesired artifacts in the output image.
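A least-squares GAN reading of equations 3 and 4 is sketched below, assuming d_dn and g_dn are callables implementing $D_{DN}$ and $G_{DN}$, x and y are the real LR patches, and kernel_reg is a hypothetical stand-in for the regularization term $\mathcal{R}$ of [2].

```python
def generator_adv_loss(d_dn, g_dn, y, kernel_reg):
    """Eq. (3): push D_DN to score patches generated by G_DN as real."""
    fake = g_dn(y)                                        # downsampled patch generated by G_DN
    return ((d_dn(fake) - 1.0) ** 2).mean() + kernel_reg()

def discriminator_loss(d_dn, g_dn, x, y):
    """Eq. (4): score real LR patches as 1 and generated patches as 0."""
    real_term = ((d_dn(x) - 1.0) ** 2).mean()
    fake_term = (d_dn(g_dn(y).detach()) ** 2).mean()      # detach: do not update G_DN here
    return real_term + fake_term
```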

Figure 3: Effect of the interpolation loss on the output. (a) SR result without interpolation loss. (b) SR result with uniform interpolation loss: $\|G_{UP}(x) - \mathrm{Bicubic}(x)\|_1$. (c) SR result with the masked interpolation loss of equation 7. (d) Frequency mask generated by equation 6.

    3.2. Cycle consistency loss

Both the forward and backward cycle consistency losses play an essential role in the training of DualSR. The forward cycle facilitates the training convergence of the downsampler $G_{DN}$ and also forces the upsampler $G_{UP}$ to generate images invertible by $G_{DN}$. In the backward cycle, the upsampler learns to reconstruct the input LR image from its downsampled version generated by $G_{DN}$. Learning LR-HR relations from the downsampled version of the input image was first introduced by ZSSR [26], but in DualSR it is implicitly part of the backward cycle loss. The final cycle consistency loss is the sum of the forward and backward losses:

$\mathcal{L}_{cycle} = \mathbb{E}_x\big\|G_{DN}(G_{UP}(x)) - x\big\|_1 + \mathbb{E}_y\big\|G_{UP}(G_{DN}(y)) - y\big\|_1$   (5)
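As a minimal sketch of equation 5, assuming g_up and g_dn are the upsampler and downsampler modules and x, y are LR patches of the two sizes shown in figure 2:

```python
import torch.nn.functional as F

def cycle_loss(g_up, g_dn, x, y):
    """Eq. (5): forward cycle G_DN(G_UP(x)) ~ x and backward cycle G_UP(G_DN(y)) ~ y."""
    forward = F.l1_loss(g_dn(g_up(x)), x)
    backward = F.l1_loss(g_up(g_dn(y)), y)
    return forward + backward
```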

    3.3. Masked interpolation loss

Because there is no direct supervision for training $G_{UP}$, ringing effects often occur around sharp edges of the output image. In addition, unwanted artifacts show up, especially in the low-frequency areas of the output image. This problem is even more severe when $G_{DN}$ does not represent an accurate estimate of the degradation model. Figure 3(a) shows the output of DualSR without interpolation loss when the estimated degradation kernel slightly differs from the ground-truth kernel. To eliminate these artifacts, we introduce a weighted cost function that minimizes the difference between a bicubically upsampled image and the output of $G_{UP}$.

It is well known that bicubic interpolation correctly upsamples low-frequency areas; however, it does not reconstruct high-frequency details. Hence, applying a uniform interpolation cost to all pixels generates an artifact-free but blurry result (see figure 3(b)). To avoid blurriness, we only apply an interpolation cost to the low-frequency parts of the image. For this purpose, we generate a frequency mask ($f_{mask}$) by applying the Sobel operator to the bicubically upsampled image:

$f_{mask} = 1 - \mathrm{Sobel}(\mathrm{Bicubic}(x))$   (6)

This mask has higher values for pixels in low-frequency areas and lower values for pixels in high-frequency areas of the image (see figure 3(d)). We define the masked interpolation loss as:

$\mathcal{L}_{interp} = \mathbb{E}_x\big\|[G_{UP}(x) - \mathrm{Bicubic}(x)] \times f_{mask}\big\|_1$   (7)

It encourages $G_{UP}$ to follow bicubic interpolation only in the low-frequency areas of the image. As shown in figure 3(c), it generates images with sharp edges without adding artifacts.
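Equations 6 and 7 could be realized roughly as follows; the grayscale conversion and the clamping of the Sobel response to [0, 1] are our own assumptions about the mask normalization, not details given in the paper.

```python
import torch
import torch.nn.functional as F

def sobel_magnitude(img):
    """Gradient magnitude of a (N, C, H, W) image, averaged over channels."""
    gray = img.mean(dim=1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(F.pad(gray, (1, 1, 1, 1), mode="replicate"), kx)
    gy = F.conv2d(F.pad(gray, (1, 1, 1, 1), mode="replicate"), ky)
    return (gx ** 2 + gy ** 2).sqrt().clamp(0.0, 1.0)

def masked_interp_loss(g_up, x):
    """Eqs. (6) and (7): L1 distance to the bicubic upsampling, weighted towards flat regions."""
    bicubic = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
    f_mask = 1.0 - sobel_magnitude(bicubic)        # Eq. (6): high where the image is smooth
    return (torch.abs(g_up(x) - bicubic) * f_mask).mean()
```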

    4. Implementation

Network configuration Unlike CycleGAN, our DualSR generators ($G_{UP}$ and $G_{DN}$) have different network architectures. For a single image, the LR-HR conversions can be performed by small networks, unlike the huge networks that train on large datasets. For the upsampler, similar to ZSSR, a simple 8-layer fully convolutional network with ReLU activations is employed. There is a global residual connection between the input and output [18, 26]. We upscale the LR image to the output size before feeding it into the network. The network architecture from KernelGAN is used for the downsampler and the corresponding discriminator. The generator used for downsampling is a deep linear network (without any activations) and the discriminator is a fully convolutional PatchGAN [17] with a receptive field of 7x7. The small receptive field forces the discriminator to use only local features (e.g. edges) of the LR image, instead of relying on high-level global features. Hence, the generator $G_{DN}$ learns the kernel that generates images with a patch distribution similar to the input LR image.
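The following is an illustrative sketch of the two generators described above. The feature widths, kernel sizes of the linear downsampler and the use of a stride-2 convolution for the 2x downscale are assumptions consistent with the KernelGAN-style design, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Upsampler(nn.Module):
    """8-layer fully convolutional network with ReLU activations and a global residual."""
    def __init__(self, channels=3, features=64, layers=8):
        super().__init__()
        convs = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(layers - 2):
            convs += [nn.Conv2d(features, features, 3, padding=1), nn.ReLU(inplace=True)]
        convs += [nn.Conv2d(features, channels, 3, padding=1)]
        self.body = nn.Sequential(*convs)

    def forward(self, x):
        # the LR input is upscaled to the output size before entering the network
        up = F.interpolate(x, scale_factor=2, mode="bicubic", align_corners=False)
        return up + self.body(up)                 # global residual connection

class Downsampler(nn.Module):
    """Deep linear network (no activations) that downsamples by a factor of 2."""
    def __init__(self, channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 7, padding=3),
            nn.Conv2d(64, 64, 5, padding=2),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.Conv2d(64, channels, 1, stride=2),  # stride-2 conv performs the 2x downscale
        )

    def forward(self, y):
        return self.body(y)
```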

Training details We train all networks $G_{UP}$, $G_{DN}$ and $D_{DN}$ from scratch for every input image. In each iteration, we train the generators and the discriminator successively, each time with a batch of two patches from the LR input. The patches are 64x64 and 128x128 (patches x and y in figure 2). Since SR kernels are not always symmetric, no geometric transformation is applied during training or test time. We tuned the hyperparameters in equation 2 with a grid search strategy: we changed $\lambda_{cycle}$ from 0 to 7.5 and $\lambda_{interp}$ from 0 to 4 and calculated the average PSNR for the first 10 images in the DIV2KRK [2] dataset. The results are shown in figure 4; the best performance is obtained for $\lambda_{cycle} = 5$ and $\lambda_{interp} = 2$. We train the networks for 3000 iterations with the ADAM optimizer. The initial learning rate is 0.001 for $G_{UP}$ and 0.0002 for $G_{DN}$ and $D_{DN}$, and both are decreased every 750 iterations. The final super-resolved image is obtained by running the trained upsampler on the LR input image.
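A condensed training-loop sketch following this schedule is given below. Upsampler and Downsampler are the earlier sketches; PatchDiscriminator, crop_patches, kernel_reg and lr_image are hypothetical helpers assumed to exist, and the decay factor 0.1 is an assumption (the paper only states that the rates are decreased every 750 iterations).

```python
import torch

g_up, g_dn, d_dn = Upsampler(), Downsampler(), PatchDiscriminator()

opt_g = torch.optim.Adam([{"params": g_up.parameters(), "lr": 1e-3},
                          {"params": g_dn.parameters(), "lr": 2e-4}])
opt_d = torch.optim.Adam(d_dn.parameters(), lr=2e-4)
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=750, gamma=0.1)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=750, gamma=0.1)

for it in range(3000):
    x = crop_patches(lr_image, size=64, batch=2)      # small patches (forward cycle)
    y = crop_patches(lr_image, size=128, batch=2)     # larger patches (backward cycle)

    # generator step: adversarial + cycle + masked interpolation terms (Eq. 2)
    opt_g.zero_grad()
    loss_g = (generator_adv_loss(d_dn, g_dn, y, kernel_reg)
              + 5.0 * cycle_loss(g_up, g_dn, x, y)
              + 2.0 * masked_interp_loss(g_up, x))
    loss_g.backward()
    opt_g.step()

    # discriminator step (Eq. 4)
    opt_d.zero_grad()
    discriminator_loss(d_dn, g_dn, x, y).backward()
    opt_d.step()

    sched_g.step()
    sched_d.step()

sr_image = g_up(lr_image)   # final output: trained upsampler applied to the full LR input
```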

Figure 4: Performance analysis of DualSR with different values for $\lambda_{cycle}$ and $\lambda_{interp}$. For the left plot, we set $\lambda_{interp} = 2$ and change $\lambda_{cycle}$ from 0 to 7.5. For the right plot, we set $\lambda_{cycle} = 5$ and change $\lambda_{interp}$ from 0 to 4. PSNR values (dB) are calculated on the first 10 images in the DIV2KRK [2] dataset.

Run-time Since we train DualSR on fixed-size patches cropped from the input image, the training time is almost independent of the image size. The average training+inference time for our method is 233 seconds on an RTX 2080 Ti GPU. For the combination of KernelGAN [2] + ZSSR [26] the run-time is 281 seconds and for BlindSR [6] it is 370 seconds. Supervised deep learning SR methods like SAN [7] have a very long training time, and the image size significantly influences their inference time. SAN+ takes 298 seconds, on average, to super-resolve each image of the DIV2KRK benchmark [2].

    5. Experiments and results

DualSR is designed to super-resolve real-world LR images. However, these images have no ground truth, so they do not enable a quantitative evaluation. Hence, we use synthetically generated LR images with unknown degradation settings (from the DIV2KRK [2], Urban100 [15] and NTIRE2017 [27] benchmarks) to compare DualSR against state-of-the-art methods. In addition, we experiment on real-world LR images (from the RealSR [5] dataset) and compare the results qualitatively. Finally, we perform an ablation study to investigate the influence of each loss term in the overall loss function.

    5.1. Evaluation on synthesized images

In this section, we evaluate DualSR on three benchmarks that simulate real-world LR images. The first one is the DIV2KRK [2] benchmark, which uses the DIV2K [1] validation set (100 high-quality images) as HR ground truth. It constructs the LR images by convolving the image with an 11x11 anisotropic Gaussian blur kernel before downsampling. Kernels vary for every image and have different shapes and orientations. A uniform multiplicative noise is also applied to each kernel. We use the same degradation approach to generate LR images from the Urban100 [15] dataset.
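To make this setup concrete, an anisotropic Gaussian degradation of this kind could be simulated roughly as follows, reusing the degrade() sketch from the introduction; the sigma values, rotation angle and noise amplitude are illustrative assumptions, not the exact sampling protocol of [2].

```python
import math
import torch

def anisotropic_gaussian_kernel(size=11, sigma_x=2.0, sigma_y=0.8, theta=0.6):
    """11x11 anisotropic Gaussian with orientation theta and uniform multiplicative noise."""
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    yy, xx = torch.meshgrid(ax, ax, indexing="ij")
    xr = xx * math.cos(theta) + yy * math.sin(theta)      # rotate the coordinate grid
    yr = -xx * math.sin(theta) + yy * math.cos(theta)
    k = torch.exp(-0.5 * ((xr / sigma_x) ** 2 + (yr / sigma_y) ** 2))
    k = k * (1.0 + 0.25 * (torch.rand_like(k) - 0.5))     # uniform multiplicative noise
    return k / k.sum()

hr = torch.rand(1, 3, 256, 256)                           # stand-in HR image
lr = degrade(hr, anisotropic_gaussian_kernel(), scale=2)  # degrade() from the sketch in Sec. 1
```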


Category                Method                             DIV2KRK [2]     Urban100 [15]   NTIRE2017 [27]
1st                     Bicubic Interpolation              28.73 / 0.8040  23.32 / 0.6859  27.72 / 0.7689
(Bicubically trained)   Bicubic kernel + ZSSR [26]         29.10 / 0.8215  23.78 / 0.7143  27.80 / 0.7726
                        EDSR+ [20]                         29.17 / 0.8216  23.01 / 0.6857  27.78 / 0.7720
                        SAN+ [7]                           29.21 / 0.8232  23.81 / 0.7153  27.78 / 0.7721
2nd                     KernelGAN [2] + ZSSR               30.36 / 0.8669  24.66 / 0.7706  27.53 / 0.7572
(Blind methods)         KernelGAN + ZSSR (masked interp)   30.40 / 0.8595  24.71 / 0.7692  28.04 / 0.7808
                        KernelGAN + USRNet [35]            27.94 / 0.8084  21.81 / 0.6827  23.84 / 0.6662
                        BlindSR [6]                        31.36 / 0.8720  25.18 / 0.7742  28.35 / 0.7931
                        DualSR (ours)                      30.92 / 0.8728  25.04 / 0.7803  28.82 / 0.8045
3rd                     GT kernel + ZSSR                   32.44 / 0.8955  26.38 / 0.8252  -
(Oracle kernel)         GT kernel + USRNet                 32.23 / 0.8981  24.89 / 0.7821  -
                        GT kernel + DualSR                 32.72 / 0.9030  26.04 / 0.8122  -

Table 1: Quantitative results (PSNR / SSIM) for 2x SR on different datasets. For comparison to other works, the PSNR and SSIM values of the Y channel are reported. The best performance numbers are highlighted in red and the second best numbers are highlighted in blue.

Figure 5: Qualitative comparison for 2x SR on synthetically generated datasets (DIV2KRK [2], NTIRE2017 [27] and Urban100 [15] examples). Compared methods: LR input, Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], KernelGAN [2] + USRNet [35], BlindSR [6], DualSR (ours), HR.


Figure 6: Qualitative comparison for 2x SR on real-world images (RealSR dataset). Compared methods: LR input, Bicubic, SAN+ [7], KernelGAN [2] + ZSSR [26], KernelGAN [2] + USRNet [35], BlindSR [6], DualSR (ours), HR.

As the third benchmark, we use the track 2 dataset of the NTIRE2017 SISR challenge [27]. Similar to DIV2KRK, this benchmark uses the DIV2K validation set as HR ground truth but differs in how the LR counterparts are synthesized: the downsampling operator is more complex and unknown for this benchmark.

Table 1 reports the PSNR and SSIM values. There are three categories of SR results in this table. The first category contains SR methods that are trained on images downsampled with bicubic interpolation. To the best of our knowledge, SAN+ [7] is the state of the art in this category. For comparison, we also report the result of ZSSR [26] without kernel estimation (ZSSR using the bicubic kernel as downsampler). Methods in this category are evaluated on a degradation process different from bicubic and, due to their inflexibility, they perform poorly. The second category contains blind SR results. For non-blind methods like ZSSR and USRNet [35], we use the blur kernel estimated by KernelGAN [2]. In contrast, BlindSR [6] and our DualSR method have an integrated degradation kernel estimator. Finally, in the third category, the potential of the previous methods is evaluated by providing the ground-truth blur kernels. There are no NTIRE2017 numbers, since the ground-truth kernel is not available.

According to table 1, the combination of KernelGAN and ZSSR performs better on DIV2KRK and Urban100 than the methods from the first category. However, for the NTIRE2017 benchmark the combination of KernelGAN and ZSSR provides lower quality. This is mainly because the degradation process is more complex and is not well estimated by KernelGAN. Integrating our novel masked interpolation loss into the ZSSR architecture improves the results for NTIRE2017 substantially, even if the estimated kernel is not accurate. We observe that USRNet performs well when a GT kernel is provided, but for realistic kernel estimation scenarios (kernel estimated by KernelGAN) its performance is inferior to other SR methods. Finally, BlindSR and DualSR provide the best performance. BlindSR assumes that the blur kernel is a convolution of classic filters with an anisotropic Gaussian kernel. As a result, it provides the best PSNR numbers for the DIV2KRK and Urban100 benchmarks, where the blur kernel is Gaussian. However, for the NTIRE2017 benchmark the degradation process is not Gaussian, so here DualSR outperforms BlindSR in both metrics. In addition, DualSR produces images with better perceptual quality and it outperforms the other methods on all datasets in terms of SSIM.

A qualitative comparison of the different SR methods is shown in Figure 5. Note that bicubic interpolation and SAN+ tend to produce blurry and oversmoothed images. On the other hand, the results of KernelGAN+ZSSR and KernelGAN+USRNet are oversharpened and contain severe ringing artifacts around edges and in smooth areas of the image. Thanks to the masked interpolation loss and improved kernel estimation, these artifacts are not present in the DualSR results. In comparison with BlindSR, DualSR generates sharper images with better visual quality.


5.2. Evaluation on real data

After the experiments on synthesized images, we evaluate our method on real-world LR images. For this purpose, we use the new RealSR [5] dataset, which was used in the NTIRE2019 competition [4]. This dataset contains raw images captured by DSLR cameras. Multiple images of the same scene have been captured with different focal lengths. Images taken at longer focal lengths contain finer details and can be considered as HR counterparts for images taken at shorter focal lengths. Although RealSR provides images at different scales, it is hard to obtain image pairs that are perfectly aligned, because of complicated misalignment between images and changes in the imaging system introduced by adjusting the focal length. As a result, we only consider a visual comparison of the SR results. We use images captured at a 28mm focal length as LR inputs and images taken at a 50mm focal length as HR counterparts.

Figure 1 and figure 6 show the visual comparison of 2x upsampling on the RealSR dataset. The degradation process is unknown and complicated for these LR images. Therefore, bicubic and SAN+ produce blurry results. Note that the ringing artifacts produced by KernelGAN+ZSSR and KernelGAN+USRNet are even more severe than in figure 5, which shows that for real-world images KernelGAN does not find the correct kernel. Even BlindSR cannot estimate the degradation process correctly and produces blurry results. That is because, for real-world images, the degradation process is more complicated than a simple anisotropic Gaussian kernel. In contrast, DualSR produces artifact-free, photo-realistic natural images.

    5.3. Ablation study

To study the contribution of each term in the loss function of our dual-path architecture, we compare DualSR with ablations of the full version. Figure 7 shows a visual comparison of the different variants on two real-world images and table 2 reports PSNR and SSIM values on the DIV2KRK benchmark. First, we evaluate our method in the absence of the masked interpolation loss. This model suffers from oversharpening and intense artifacts in the output image. Next, we remove the forward and backward cycle consistency losses in separate experiments (in the presence of the interpolation loss). In both cases, the results are not as sharp as the proposed full DualSR output. Having the cycle consistency losses and the interpolation loss combined in the objective function makes DualSR capable of generating realistic natural images without unwanted artifacts.

    6. Conclusion

Supervised deep learning methods cannot perform well when there is a mismatch between training and test data.

Figure 7: 2x SR results on real-world images (RealSR) generated by different variations of DualSR: w/o masked interp loss, w/o forward cycle loss, w/o backward cycle loss, and the DualSR full loss.

    Method PSNR SSIM

    DualSR w/o masked interp loss 30.15 0.8658

    DualSR w/o forward cycle loss 30.33 0.8505

    DualSR w/o backward cycle loss 30.16 0.8495

    DualSR (final) 30.92 0.8728

    Table 2: Quantitative comparison between different variations of

    DualSR on DIV2KRK dataset.

This is very problematic for the real-world SR problem, where the acquisition process varies for every image. We proposed DualSR, a small dual-path architecture that learns the image-specific LR-HR relations per image. It consists of a downsampler and an upsampler that improve each other during training using cycle consistency losses. In addition, we introduced a new masked interpolation loss that removes artifacts from low-frequency areas of the image without smoothing the edges. Experimental results demonstrate a significant improvement over state-of-the-art SR methods. Not relying on external datasets makes DualSR very adaptive to varying conditions: for example, there is no need to retrain on a large dataset when the camera lens settings change. This flexibility is very important when the degradation process varies a lot. Our future work will aim to extend DualSR to larger scale factors and harder use cases with more extreme degradation settings. Example applications could be in the domain of medical imaging or electron microscopy. In these domains, a super-resolution model should not hallucinate but still generate a good high-resolution image.

    References

    [1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge

    on single image super-resolution: Dataset and study. In Pro-

    ceedings of the IEEE Conference on Computer Vision and

    Pattern Recognition Workshops, pages 126–135, 2017.

    [2] Sefi Bell-Kligler, Assaf Shocher, and Michal Irani. Blind

    super-resolution kernel estimation using an internal-gan. In

    Advances in Neural Information Processing Systems, pages

    284–293, 2019.


[3] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To

    learn image super-resolution, use a gan to learn how to do

    image degradation first. In Proceedings of the European con-

    ference on computer vision (ECCV), pages 185–200, 2018.

    [4] Jianrui Cai, Shuhang Gu, Radu Timofte, and Lei Zhang.

    Ntire 2019 challenge on real image super-resolution: Meth-

    ods and results. In Proceedings of the IEEE Conference on

    Computer Vision and Pattern Recognition Workshops, 2019.

    [5] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei

    Zhang. Toward real-world single image super-resolution: A

    new benchmark and a new model. In Proceedings of the

    IEEE International Conference on Computer Vision, pages

    3086–3095, 2019.

    [6] Victor Cornillère, Abdelaziz Djelouah, Wang Yifan, Olga

    Sorkine-Hornung, and Christopher Schroers. Blind image

    super resolution with spatially variant degradations. ACM

    Transactions on Graphics (proceedings of ACM SIGGRAPH

    ASIA), 38(6), 2019.

    [7] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and

    Lei Zhang. Second-order attention network for single im-

    age super-resolution. In Proceedings of the IEEE Conference

    on Computer Vision and Pattern Recognition, pages 11065–

    11074, 2019.

    [8] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou

    Tang. Image super-resolution using deep convolutional net-

    works. IEEE transactions on pattern analysis and machine

    intelligence, 38(2):295–307, 2015.

    [9] Netalee Efrat, Daniel Glasner, Alexander Apartsin, Boaz

    Nadler, and Anat Levin. Accurate blur models vs. image pri-

    ors in single image super-resolution. In Proceedings of the

    IEEE International Conference on Computer Vision, pages

    2832–2839, 2013.

    [10] Linjing Fang, Fred Monroe, Sammy Weiser Novak, Lyn-

    dsey Kirk, Cara R Schiavon, B Yu Seungyoon, Tong

    Zhang, Melissa Wu, Kyle Kastner, Yoshiyuki Kubota, et al.

    Deep learning-based point-scanning super-resolution imag-

    ing. bioRxiv, page 740548, 2019.

    [11] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong.

    Blind super-resolution with iterative kernel correction. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 1604–1613, 2019.

    [12] Juan Mario Haut, Ruben Fernandez-Beltran, Mercedes E

    Paoletti, Javier Plaza, Antonio Plaza, and Filiberto Pla. A

    new deep generative network for unsupervised remote sens-

    ing single-image super-resolution. IEEE Transactions on

    Geoscience and Remote Sensing, 56(11):6792–6810, 2018.

    [13] He He and Wan-Chi Siu. Single image super-resolution us-

    ing gaussian process regression. In CVPR 2011, pages 449–

    456. IEEE, 2011.

    [14] Yu He, Kim-Hui Yap, Li Chen, and Lap-Pui Chau. A soft

    map framework for blind super-resolution image reconstruc-

    tion. Image and Vision Computing, 27(4):364–373, 2009.

    [15] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single

    image super-resolution from transformed self-exemplars. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 5197–5206, 2015.

    [16] Shady Abu Hussein, Tom Tirer, and Raja Giryes. Correction

    filter for single image super-resolution: Robustifying off-the-

    shelf deep super-resolvers. In Proceedings of the IEEE/CVF

    Conference on Computer Vision and Pattern Recognition,

    pages 1428–1437, 2020.

    [17] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A

    Efros. Image-to-image translation with conditional adver-

    sarial networks. In Proceedings of the IEEE conference on

    computer vision and pattern recognition, pages 1125–1134,

    2017.

    [18] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate

    image super-resolution using very deep convolutional net-

    works. In Proceedings of the IEEE conference on computer

    vision and pattern recognition, pages 1646–1654, 2016.

    [19] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero,

    Andrew Cunningham, Alejandro Acosta, Andrew Aitken,

    Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-

    realistic single image super-resolution using a generative ad-

    versarial network. In Proceedings of the IEEE conference on

    computer vision and pattern recognition, pages 4681–4690,

    2017.

    [20] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and

    Kyoung Mu Lee. Enhanced deep residual networks for single

    image super-resolution. In Proceedings of the IEEE confer-

    ence on computer vision and pattern recognition workshops,

    pages 136–144, 2017.

    [21] Tomer Michaeli and Michal Irani. Nonparametric blind

    super-resolution. In Proceedings of the IEEE International

    Conference on Computer Vision, pages 945–952, 2013.

    [22] Pejman Rasti, Tonis Uiboupin, Sergio Escalera, and Gholam-

    reza Anbarjafari. Convolutional neural network super resolu-

    tion for face recognition in surveillance monitoring. In Inter-

    national conference on articulated motion and deformable

    objects, pages 175–184. Springer, 2016.

    [23] Yaniv Romano, John Isidoro, and Peyman Milanfar. Raisr:

    Rapid and accurate image super resolution. IEEE Transac-

    tions on Computational Imaging, 3(1):110–125, 2016.

    [24] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz,

    Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan

    Wang. Real-time single image and video super-resolution

    using an efficient sub-pixel convolutional neural network. In

    Proceedings of the IEEE conference on computer vision and

    pattern recognition, pages 1874–1883, 2016.

    [25] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani.

    Ingan: Capturing and retargeting the ”dna” of a natural im-

    age. In The IEEE International Conference on Computer

    Vision (ICCV), 2019.

    [26] Assaf Shocher, Nadav Cohen, and Michal Irani. Zero-shot

    super-resolution using deep internal learning. In Proceed-

    ings of the IEEE Conference on Computer Vision and Pattern

    Recognition, pages 3118–3126, 2018.

    [27] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-

    Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single

    image super-resolution: Methods and results. In Proceed-

    ings of the IEEE conference on computer vision and pattern

    recognition workshops, pages 114–125, 2017.

    [28] Kensuke Umehara, Junko Ota, and Takayuki Ishida. Appli-

    cation of super-resolution convolutional neural network for


enhancing image resolution in chest ct. Journal of digital

    imaging, 31(4):441–450, 2018.

    [29] Qiang Wang, Xiaoou Tang, and Harry Shum. Patch based

    blind image super resolution. In Tenth IEEE International

    Conference on Computer Vision (ICCV’05) Volume 1, vol-

    ume 1, pages 709–716. IEEE, 2005.

    [30] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu,

    Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: En-

    hanced super-resolution generative adversarial networks. In

    Proceedings of the European Conference on Computer Vi-

    sion (ECCV), 2018.

    [31] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst,

    Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc

    Levoy, and Peyman Milanfar. Handheld multi-frame super-

    resolution. ACM Transactions on Graphics (TOG), 38(4):1–

    18, 2019.

    [32] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. Dual-

    gan: Unsupervised dual learning for image-to-image trans-

    lation. In Proceedings of the IEEE international conference

    on computer vision, pages 2849–2857, 2017.

    [33] Chenyu You, Guang Li, Yi Zhang, Xiaoliu Zhang, Hong-

    ming Shan, Mengzhou Li, Shenghong Ju, Zhen Zhao,

    Zhuiyang Zhang, Wenxiang Cong, et al. Ct super-resolution

    gan constrained by the identical, residual, and cycle learn-

    ing ensemble (gan-circle). IEEE Transactions on Medical

    Imaging, 39(1):188–203, 2019.

    [34] Yuan Yuan, Siyuan Liu, Jiawei Zhang, Yongbing Zhang,

    Chao Dong, and Liang Lin. Unsupervised image super-

    resolution using cycle-in-cycle generative adversarial net-

    works. In Proceedings of the IEEE Conference on Computer

    Vision and Pattern Recognition Workshops, pages 701–710,

    2018.

    [35] Kai Zhang, Luc Van Gool, and Radu Timofte. Deep unfold-

    ing network for image super-resolution. In Proceedings of

    the IEEE/CVF Conference on Computer Vision and Pattern

    Recognition, pages 3217–3226, 2020.

    [36] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning

    a single convolutional super-resolution network for multi-

    ple degradations. In Proceedings of the IEEE Conference

    on Computer Vision and Pattern Recognition, pages 3262–

    3271, 2018.

    [37] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng

    Zhong, and Yun Fu. Image super-resolution using very deep

    residual channel attention networks. In Proceedings of the

    European Conference on Computer Vision (ECCV), pages

    286–301, 2018.

    [38] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A

    Efros. Unpaired image-to-image translation using cycle-

    consistent adversarial networks. In Proceedings of the IEEE

    international conference on computer vision, pages 2223–

    2232, 2017.
