
A Neural Approach to Blind Motion Deblurring

Ayan Chakrabarti

Toyota Technological Institute at Chicago
[email protected]

Abstract. We present a new method for blind motion deblurring that uses a neural network trained to compute estimates of sharp image patches from observations that are blurred by an unknown motion kernel. Instead of regressing directly to patch intensities, this network learns to predict the complex Fourier coefficients of a deconvolution filter to be applied to the input patch for restoration. For inference, we apply the network independently to all overlapping patches in the observed image, and average its outputs to form an initial estimate of the sharp image. We then explicitly estimate a single global blur kernel by relating this estimate to the observed image, and finally perform non-blind deconvolution with this kernel. Our method exhibits accuracy and robustness close to state-of-the-art iterative methods, while being much faster when parallelized on GPU hardware.

Keywords: Blind deconvolution, motion deblurring, deep learning.

1 Introduction

Photographs captured with long exposure times using hand-held cameras are often degraded by blur due to camera shake. The ability to reverse this degradation and recover a sharp image is attractive to photographers, since it allows rescuing an otherwise acceptable photograph. Moreover, if this ability is consistent and can be relied upon post-acquisition, it gives photographers more flexibility at the time of capture, for example, in terms of shooting with a zoom-lens without a tripod, or trading off exposure time with ISO in low-light settings. Beginning with the seminal work of Fergus et al. [1], the last decade has seen considerable progress [2,3,4,5,6,7,8] in the development of effective blind motion deblurring methods that seek to estimate camera motion in terms of the induced blur kernel, and then reverse its effect. This progress has been helped by the development of principled evaluation on standard benchmarks [3,9] that measure performance over a large and diverse set of images.

Some deblurring algorithms [4,5] emphasize efficiency, and use inexpensive processing of image features to quickly estimate the motion kernel. Despite their speed, these methods can yield remarkably accurate kernel estimates and achieve high-quality restoration for many images, making them a practically useful post-processing tool for photographers. However, due to their reliance on relatively simple heuristics, they also have poor outlier performance and can fail on a significant fraction of blurred images. Other methods are iterative: they reason with parametric prior models for natural images and motion kernels, and use these priors to successively improve the algorithm's estimate of the sharp image and the motion kernel. The two most successful deblurring algorithms [2,3] fall in this category, and while they are able to outperform previous methods by a significant margin, they also have orders of magnitude longer running times.

Fig. 1. Neural Blind Deconvolution. Our method uses a neural network trained for per-patch blind deconvolution. Given an input patch blurred by an unknown motion blur kernel, this network predicts the Fourier coefficients of a filter to be applied to that input for restoration. For inference, we apply this network independently on all overlapping patches in the input image, and compose their outputs to form an initial estimate of the sharp image. We then infer a single global blur kernel that relates the input to this initial estimate, and use that kernel for non-blind deconvolution.

In this work, we explore whether discriminatively trained neural networks can match the performance of traditional methods that use generative natural image priors, and do so without multiple iterative refinements. Our work is motivated by recent successes in the use of neural networks for other image restoration tasks (e.g., [10,11,12,13]). This includes methods [12,13] for non-blind deconvolution, i.e., restoring a blurred image when the blur kernel is known. While the estimation problem in blind deconvolution is significantly more ill-posed than the non-blind case, these works provide insight into the design process of neural architectures for deconvolution.

Hradis et al. [14] explored the use of neural networks for blind deconvolution on images of text. Since text images are highly structured (two-tone, with thin sparse contours), a standard feed-forward architecture was able to achieve successful restoration. Meanwhile, Sun et al. [15] considered a version of the problem with restrictions on motion blur types, and were able to successfully train a neural network to identify the blur in an observed natural image patch from among a small discrete set of oriented box blur kernels of various lengths.


Recently, Schuler et al. [16] tackled the general blind motion deblurring problem using a neural architecture designed to mimic the computational steps of traditional iterative deblurring methods. They designed learnable layers to carry out extraction of salient local image features and kernel estimation based on these features, and stacked multiple copies of these layers to enable iterative refinement. Remarkably, they were able to train this multi-stage network with relative success. However, while their initial results are very encouraging, their current performance still significantly lags behind the state of the art [2,3], especially when the unknown blur kernel is large.

In this paper, we propose a new approach for blind deconvolution of natural images degraded by arbitrary motion blur kernels due to camera shake. At the core of our algorithm is a neural network trained to restore individual image patches. This network differs from previous architectures in two significant ways:

1. Rather than formulate the prediction task as blur kernel estimation through iterative refinement (as in [16]), or as direct regression to deblurred intensity values (as in [10,11,12,13]), we train our network to output the complex Fourier coefficients of a deconvolution filter to be applied to the input patch.

2. We use a multi-resolution frequency decomposition to encode the input patch, and limit the connectivity of initial network layers based on locality in frequency (analogous to convolutional layers that are limited by locality in space). This leads to a significant reduction in the number of weights to be learned during training, which proves crucial since it allows us to successfully train a network that operates on large patches, and therefore can reason about large blur kernels (e.g., in comparison to [16]).

For whole image restoration, the network is independently applied to every overlapping patch in the input image, and its outputs are composed to form an initial estimate of the latent sharp image. Despite reasoning with patches independently and not sharing information about a common global motion kernel, we find that this procedure by itself performs surprisingly well. We show that these results can be further improved by using the restored image to compute a global blur kernel estimate, which is finally used for non-blind deconvolution. Evaluation on a standard benchmark [3] demonstrates that our approach is competitive when considering accuracy, robustness, and running time.

2 Patch-wise Neural Deconvolution

Let y[n] be the observed image of a scene blurred due to camera motion, and x[n] the corresponding latent sharp image that we wish to estimate, with n ∈ Z² indexing pixel location. The degradation due to blur can be approximately modeled as convolution with an unknown blur kernel k:

y[n] = (x ∗ k)[n] + ε[n],   k[n] ≥ 0,   ∑_n k[n] = 1,   (1)

where ∗ denotes convolution, and ε[n] is i.i.d. Gaussian noise.
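To make the degradation model concrete, here is a minimal numpy sketch of (1); the helper name `blur_observation` and the default noise level are ours, not from the paper.

```python
import numpy as np
from scipy.signal import fftconvolve

def blur_observation(x, k, noise_std=0.01, rng=None):
    """Simulate y[n] = (x * k)[n] + eps[n] per Eq. (1)."""
    rng = np.random.default_rng() if rng is None else rng
    assert np.all(k >= 0) and np.isclose(k.sum(), 1.0)   # valid motion kernel
    y = fftconvolve(x, k, mode='same')                   # (x * k)[n]
    return y + noise_std * rng.standard_normal(y.shape)  # i.i.d. Gaussian noise
```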


As shown in Fig. 1, the central component of our algorithm is a neural network that carries out restoration locally on individual patches in y[n]. Formally, our goal is to design a network that is able to recover the sharp intensity values xp = {x[n] : n ∈ p} of a patch p, given as input a larger patch yp+ = {y[n] : n ∈ p+}, p+ ⊃ p, from the observed image. The larger input is necessary since values in xp, especially near the boundaries of p, can depend on observed intensities outside p. In practice, we choose p+ to be of size 65 × 65, with its central 33 × 33 patch corresponding to p. In this section, we describe our formulation of the prediction task for this network, its architecture and connectivity, and our approach to training it.

2.1 Restoration by Predicting Deconvolution Filter Coefficients

As depicted in Fig. 1, the outputs of our network are the complex discrete Fourier transform (DFT) coefficients Gp+[z] ∈ C of a deconvolution filter, where z indexes two-dimensional spatial frequencies in the DFT. This filter is then applied to the DFT Yp+[z] of the input patch yp+[n]:

X̂p+[z] = Gp+[z] × Yp+[z].   (2)

Our estimate x̂p[n] of the sharp image patch is computed by taking the inverse discrete Fourier transform (IDFT) of X̂p+[z], and then cropping out the central patch p ⊂ p+. Since x[n] and y[n] are both real valued and k is unit sum, we assume that Gp+[z] = G*p+[−z] and Gp+[0] = 1. Therefore, the network only needs to output (|p+| − 1)/2 unique complex numbers to characterize Gp+, where |p+| is the number of pixels in p+.
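A minimal numpy sketch of the restoration step in (2), assuming the network's predicted coefficients have already been assembled into a full conjugate-symmetric array G; the function name and cropping logic are illustrative.

```python
import numpy as np

def restore_patch(y_plus, G, out_size=33):
    """Apply a predicted deconvolution filter per Eq. (2) and crop.

    y_plus : 65 x 65 observed patch.
    G      : predicted coefficients G_p+[z], as a full 65 x 65
             conjugate-symmetric array with G[0, 0] = 1.
    """
    X = G * np.fft.fft2(y_plus)        # Eq. (2): elementwise product with Y[z]
    x_plus = np.real(np.fft.ifft2(X))  # IDFT; imaginary part ~ 0 by symmetry
    c = (y_plus.shape[0] - out_size) // 2
    return x_plus[c:c + out_size, c:c + out_size]  # central 33 x 33 patch
```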

Our training objective is that the output coefficients Gp+[z] be optimal with respect to the quality of the final sharp intensities x̂p[n]. Specifically, the loss function for the network is defined as the mean square error (MSE) between the predicted and true sharp intensity values x̂p[n] and xp[n]:

L(x̂p, xp) = (1/|p|) ∑_{n∈p} (x̂p[n] − xp[n])².   (3)

Note that both the IDFT and the filtering in (2) are linear operations, and therefore it is trivial to back-propagate the gradients of (3) to the outputs Gp+[z], and subsequently to all layers within the network.

Motivation. As with any neural-network based method, the validation of the design choices in our approach ultimately has to be empirical. However, we attempt to provide the reader with some insight into our motivation for making these choices. We begin by considering the differences between predicting deconvolution filter coefficients and regressing directly to pixel intensities x̂p[n], as was done in most prior neural restoration methods [10,11,12,13]. Indeed, since we use the predicted coefficients to estimate x̂p[n] and define our loss with respect to the latter, our overall formulation can be interpreted as a regression to x̂p[n]. However, our approach enforces a specific parametric form on the learned mapping from yp+[n] to x̂p[n]. In other words, the notion that the sharp and blurred image patches are related by convolution is "baked in" to the network's architecture. Additionally, providing Yp+[z] separately at the output alleviates the need for the layers within our network to retain a linear encoding of the input patch all the way to the output.

Another alternative formulation could have been to set up the network to predict the blur kernel k itself, as in [16], which also encodes the convolutional relationship between the network's input and output. However, remember that our network works on local patches independently. For many patches, inferring the blur kernel may be impossible from local information alone; for example, a patch with only a vertical edge would have no content in horizontal frequencies, making it impossible to infer the horizontal structure of the kernel. But in these cases, it would still be possible to compute an optimal deconvolution filter and restored image patch (in our example, the horizontal frequency values of Gp+[z] would not matter). Moreover, our goal is to recover the restored image patch, and estimating the kernel solves only a part of the problem, since non-blind deconvolution is not trivial. In contrast, our predicted deconvolution filter can be directly applied for restoration, and because it is trained with respect to restoration quality, the network learns to generate these predictions by reasoning both about the unknown kernel and the sharp image content.

It may be helpful to consider what the optimal values of Gp+[z] should be. One interpretation for these values can be derived from Wiener deconvolution [17], in which ideal restoration is achieved by applying a filter using (2) with coefficients given by

Gp+[z] = ( |K[z]|² Sp+[z] + σ²_ε )⁻¹ K*[z] Sp+[z].   (4)

Here, K[z] is the DFT of the kernel k, and Sp+[z] is the spectral profile of xp+[n] (i.e., a DFT of its auto-correlation function). Note that in blind deconvolution, both Sp+[z] and K[z] are unknown, and iterative algorithms can be interpreted as explicitly estimating these quantities through sequential refinement. In contrast, our network is discriminatively trained to directly predict the ratio in (4).
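For intuition, here is what (4) looks like in code: a hypothetical numpy helper that assumes the kernel, spectral profile, and noise variance are all known, which is exactly what blind deconvolution lacks.

```python
import numpy as np

def wiener_coefficients(k, S, noise_var, shape=(65, 65)):
    """Ideal filter coefficients of Eq. (4), computable only when the blur
    kernel k, spectral profile S, and noise variance are all known."""
    K = np.fft.fft2(k, s=shape)                        # K[z]
    return np.conj(K) * S / (np.abs(K) ** 2 * S + noise_var)
```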

2.2 Network Architecture

Our network needs to work with large input patches in order to successfully handle large blur kernels. This presents a challenge in terms of the number of weights to be learned and the feasibility of training since, as observed by [13] and by us in our own experiments, the traditional strategy of making the initial layers convolutional with limited support performs poorly for deconvolution. This is why the networks for non-blind deconvolution in [12,13] have either used only fully connected layers [12], or large oriented one-dimensional convolutional layers [13].

We adopt a novel approach to parameterizing the input patch and defining the connectivity of the initial layers in our network. Specifically, we use a multi-resolution decomposition strategy (illustrated in Fig. 2 (a)) where higher spatial frequencies are sampled with lower resolution. We compute DFTs at three different levels, corresponding to patches of three different sizes (17 × 17, 33 × 33, and 65 × 65) centered on the input patch, and from each retain the coefficients corresponding to 4 < max |z| ≤ 8. Here, max |z| represents the larger magnitude of the two components (horizontal and vertical) of the frequency indices in z.

Fig. 2. Network Architecture. (a) To limit the number of weights in the network, we use a multi-resolution decomposition to encode the input patch into four "bands", containing low-pass (L), band-pass (B1, B2), and high-pass (H) frequency components. Higher frequencies are sampled at a coarser resolution, and computed from a smaller patch centered on the input. (b) Our network regresses from this input encoding to the complex Fourier coefficients of a restoration filter, and contains seven hidden layers (each followed by a ReLU activation). The connectivity of the initial layers is limited to adjacent frequency bands. (c) Our network is trained with randomly generated synthetic motion blur kernels of different sizes.

This decomposition gives us 104 independent complex coefficients (or 208 scalars) from each DFT level, which we group into "bands". Note that the indices z correspond to different spatial frequencies ω for different sized DFTs, with coefficients from the smaller-size DFTs representing a coarser sampling in the frequency domain. Therefore, the three bands above correspond to high- and band-pass components of the input patch. We also construct a low-pass band by including the coefficients corresponding to max |z| ≤ 4 from the largest (i.e., 65 × 65) decomposition. This band only has 81 scalar components (40 complex coefficients and a scalar DC coefficient). As suggested in [18], we apply a de-correlating linear transform to the coefficients of each band, based on their empirical covariance on input patches in the training set.
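A numpy sketch of this decomposition, under our reading of the text; the de-correlating transform of [18] is omitted, and the exact index conventions and band ordering are assumptions.

```python
import numpy as np

def max_index(n):
    """Grid of max(|z1|, |z2|) over fftshift-ed integer DFT frequency indices."""
    z = np.fft.fftshift(np.fft.fftfreq(n, d=1.0 / n)).astype(int)
    return np.maximum(np.abs(z)[:, None], np.abs(z)[None, :])

def encode_patch(y_plus):
    """Return the four-band encoding [L, B1, B2, H] of a 65 x 65 patch.

    Each crop contributes its DFT coefficients with 4 < max|z| <= 8 (208
    entries, of which 104 complex values are unique by conjugate symmetry);
    the low-pass band keeps max|z| <= 4 from the full 65 x 65 DFT (81 entries;
    40 unique complex values plus the DC coefficient).
    """
    F65 = np.fft.fftshift(np.fft.fft2(y_plus))
    bands = [F65[max_index(65) <= 4]]          # L: low-pass band
    for size in (65, 33, 17):                  # B1, B2, H: finer to coarser
        c = (65 - size) // 2
        F = np.fft.fftshift(np.fft.fft2(y_plus[c:c + size, c:c + size]))
        m = max_index(size)
        bands.append(F[(m > 4) & (m <= 8)])
    return bands
```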


Note that our decomposition also entails a dimensionality reduction: the total number of coefficients in the four bands is lower than the size of the input patch. Such a reduction may have been problematic if the network were directly regressing to patch intensities. However, we find this approximate representation suffices for our task of predicting filter coefficients, since the full input patch yp+[n] (in the form of its DFT) is separately provided to (2) for the computation of the final output x̂p[n].

As depicted in Fig. 2 (b), we use a feed-forward network architecture with seven hidden layers to predict the coefficients Gp+[z] from our encoding of the observed blurry input patch. Units in the first layer are only connected to input coefficients from pairs of adjacent frequency bands, with groups of 1024 units connected to each pair. Note that these groups do not share weights. We adopt a similar strategy for the next layer, connecting units to pairs of adjacent groups from the first layer. Each group in this layer has 2048 units. Restricting connectivity in this way, based on locality in frequency, reduces the number of weights in our network, while still allowing good prediction in practice. This is not entirely surprising, since many iterative algorithms (including [2,3]) also divide the inference task into sequential coarse-to-fine reasoning at individual scales. All remaining layers in our network are fully connected with 4096 units each. Units in all hidden layers have ReLU activations [19].
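A sketch of this connectivity pattern, written with PyTorch for concreteness; the grouping and layer sizes follow the text, while details such as the real/imaginary packing of the output coefficients and the exact count of 4096-unit layers are assumptions.

```python
import torch
import torch.nn as nn

class DeblurNet(nn.Module):
    """Sketch of the grouped-connectivity architecture of Sec. 2.2.

    Takes the four band encodings (L, B1, B2, H) as separate tensors and
    outputs (|p+| - 1)/2 = 2112 complex coefficients as (real, imag) pairs.
    """
    def __init__(self, band_dims=(81, 208, 208, 208)):
        super().__init__()
        L, B1, B2, H = band_dims
        # Layer 1: one 1024-unit group per pair of adjacent frequency bands.
        self.g1 = nn.ModuleList([nn.Linear(L + B1, 1024),
                                 nn.Linear(B1 + B2, 1024),
                                 nn.Linear(B2 + H, 1024)])
        # Layer 2: one 2048-unit group per pair of adjacent layer-1 groups.
        self.g2 = nn.ModuleList([nn.Linear(2048, 2048),
                                 nn.Linear(2048, 2048)])
        # Remaining five hidden layers: fully connected, 4096 units each.
        self.fc = nn.Sequential(*[m for _ in range(5)
                                  for m in (nn.Linear(4096, 4096), nn.ReLU())])
        self.out = nn.Linear(4096, 2 * 2112)  # unique complex coefficients

    def forward(self, bands):
        L, B1, B2, H = bands
        h1 = [torch.relu(g(torch.cat(p, dim=-1)))
              for g, p in zip(self.g1, [(L, B1), (B1, B2), (B2, H)])]
        h2 = [torch.relu(g(torch.cat(p, dim=-1)))
              for g, p in zip(self.g2, [(h1[0], h1[1]), (h1[1], h1[2])])]
        return self.out(self.fc(torch.cat(h2, dim=-1)))
```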

2.3 Training

Our network was trained on a synthetic dataset that is entirely disjoint from the evaluation benchmark [3]. This was constructed by extracting sharp image patches from images in the Pascal VOC 2012 dataset [20], blurring them with synthetically generated kernels, and adding Gaussian noise. We set the noise standard deviation to 1% to match the noise level in the benchmark [3].

The synthetic motion kernels were generated by randomly sampling six points in a limited size grid (we generate an equal number of kernels from grid sizes of 8 × 8, 16 × 16, and 24 × 24), fitting a spline through these points, and setting the kernel values at each pixel on this spline to a value sampled from a Gaussian distribution with mean one and standard deviation of half. We then clipped these values to be positive, and normalized the kernel to be unit sum.

There is an inherent phase ambiguity in blind deconvolution: one can apply equal but opposite translations to the blur kernel and sharp image estimates to come up with equally plausible explanations for an observation. While this ambiguity need not be resolved globally, we need our local Gp+[z] estimates in overlapping patches to have consistent phase. Therefore, we ensured that the training kernels have a "canonical" translation by centering them so that each kernel's center of mass (weighted by kernel values) is at the center of the window. Figure 2 (c) shows some of the kernels generated using this approach.
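A sketch of this kernel generator, assuming scipy's spline routines; the trajectory sampling density, support size, and function name are our choices.

```python
import numpy as np
from scipy.interpolate import splev, splprep

def random_motion_kernel(grid=16, ksize=51, rng=None):
    """Spline-based synthetic kernel: six random points, Gaussian(1, 0.5)
    values along the trajectory, clipped, centered, and normalized."""
    rng = np.random.default_rng() if rng is None else rng
    pts = rng.uniform(0, grid, size=(2, 6))          # six points in the grid
    tck, _ = splprep(pts, s=0)                       # spline through them
    traj = np.round(splev(np.linspace(0, 1, 1000), tck)).astype(int)
    k = np.zeros((ksize, ksize))
    for i, j in zip(traj[0], traj[1]):
        if 0 <= i < ksize and 0 <= j < ksize and k[i, j] == 0:
            k[i, j] = max(rng.normal(1.0, 0.5), 0.0)  # clip to positive
    k /= max(k.sum(), 1e-12)                          # unit sum
    # Move the center of mass (weighted by kernel values) to the window center.
    ii, jj = np.indices(k.shape)
    di = int(round((ksize - 1) / 2 - (k * ii).sum()))
    dj = int(round((ksize - 1) / 2 - (k * jj).sum()))
    return np.roll(k, (di, dj), axis=(0, 1))
```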

We constructed separate training and validation sets with different sharp patches and randomly generated kernels. We used about 520,000 and 3,000 image patches and 100,000 and 3,000 kernels for the training and validation sets respectively. While we extracted multiple patches from the same image, we ensured that the training and validation patches were drawn from different images. To minimize disk access, we loaded the entire set of sharp patches and kernels into memory. Training data was generated on the fly by selecting random pairs of patches and kernels, and convolving the two to create the input patch. We also used rotated and mirrored versions of the sharp patches. This gave us a near inexhaustible supply of training data. Validation data was also generated on the fly, but we always chose the same pairs of patches and kernels to ensure that validation error could be compared across iterations.

We used stochastic gradient descent for minimizing the loss function (3), with a batch-size of 512 and a momentum value of 0.9. We trained the network for a total of 1.8 million iterations, which took about 3 days using an NVIDIA Titan X GPU. We used a learning rate of 32 (higher rates caused gradients to explode) for the first 800k iterations, at which point validation error began to plateau. For the remaining iterations, we dropped the rate by a factor of √2 every 100k iterations. We kept track of the validation error across iterations, and at the end of training, used the weights that yielded the lowest value of that error.
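In code, the schedule described above might look as follows (whether the first √2 drop happens at 800k or 900k iterations is ambiguous in the text; we assume the former):

```python
import math

def learning_rate(it, base=32.0):
    """Schedule of Sec. 2.3: constant for the first 800k iterations, then
    divided by sqrt(2) every 100k iterations (first drop assumed at 800k)."""
    if it < 800_000:
        return base
    return base / math.sqrt(2.0) ** ((it - 800_000) // 100_000 + 1)
```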

3 Whole Image Restoration

Given an observed blurry image y[n], we consider all overlapping patches yp+ in the image, and use our trained network to compute estimates x̂p of their latent sharp versions. We then combine these restored patches to form an initial estimate x̂N[n] of the sharp image, by setting x̂N[n] to the average of its estimates x̂p[n] from all patches p ∋ n that contain it, using a Hanning window to weight the contributions from different patches.
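A numpy sketch of this Hanning-weighted averaging; the patch bookkeeping (a dict keyed by patch position) is purely illustrative.

```python
import numpy as np

def neural_average(estimates, img_shape, psize=33):
    """Compose per-patch restorations into x_N[n] by weighted averaging.

    estimates : dict mapping a patch's top-left corner (i, j) to its
    33 x 33 restored output from the network.
    """
    w = np.outer(np.hanning(psize), np.hanning(psize))  # Hanning weights
    num = np.zeros(img_shape)
    den = np.zeros(img_shape) + 1e-8                    # avoid divide-by-zero
    for (i, j), xp in estimates.items():
        num[i:i + psize, j:j + psize] += w * xp
        den[i:i + psize, j:j + psize] += w
    return num / den
```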

While this feed-forward and purely local procedure achieves reasonable restoration, we have so far not taken into account the fact that the entire image has been blurred by the same motion kernel. To do so, we compute an estimate of the global kernel k[n], by relating the observed image y[n] to our neural-average estimate x̂N[n]. Formally, we estimate this kernel as

k̂ = arg min_k ∑_i ‖ k ∗ (f_i ∗ x̂N) − (f_i ∗ y) ‖²,   (5)

subject to the constraints that k[n] ≥ 0 and ∑_n k[n] = 1. We do not assume that the size of the kernel is known, and always estimate k[n] within a fixed-size support (51 × 51, as is standard for the benchmark [3]). Here, f_i[n] are various derivative filters (we use first and second order derivatives at 8 orientations). Like in [3], we only let strong gradients participate in the estimation process, by setting values of (f_i ∗ x̂N) to zero except those at the two percent of pixel locations with the highest magnitudes.

This approach to estimating a global kernel from an estimate of the latent sharp image is fairly standard. But while it is typically used repeatedly within an iterative procedure that refines the estimates of the sharp image as well (e.g., in [2,3]), we estimate the kernel only once from the neural average output.


We adopt a relatively simple and fast approach to optimizing (5) under the positivity and unit sum constraints on k. Specifically, we minimize L1-regularized versions of the objective:

k̂_λ = arg min_k ∑_i ‖ k ∗ (f_i ∗ x̂N) − (f_i ∗ y) ‖² + λ ∑_n |k[n]|,   (6)

for a small range of values of the regularization weight λ. This optimization, for each value of λ, can be done very efficiently in the Fourier domain using half-quadratic splitting [21]. We clip each kernel estimate k̂_λ[n] to be positive, set very small or isolated values to zero, and normalize the result to be unit sum. We then pick the kernel k̂_λ[n] which yields the lowest value of the original un-regularized cost in (5). Given this estimate of the global kernel, we use EPLL [22], a state-of-the-art non-blind deconvolution algorithm, to deconvolve y[n] and arrive at our final estimate of the sharp image x̂[n].
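One way to realize this optimization, sketched with numpy under our reading of half-quadratic splitting [21]; the splitting weight, its continuation schedule, and the support-cropping convention are assumptions, and the small-value/isolated-value cleanup is omitted.

```python
import numpy as np

def estimate_kernel(g, h, lam=1e-3, beta=1e-2, iters=20, ksize=51):
    """Half-quadratic splitting sketch for the L1-regularized objective (6).

    g : list of filtered sharp estimates, f_i * x_N (2-D arrays)
    h : list of matching filtered observations, f_i * y
    Alternates a closed-form Fourier-domain solve for k with soft-thresholding
    of an auxiliary variable z ~ k.
    """
    G = [np.fft.fft2(gi) for gi in g]
    H = [np.fft.fft2(hi) for hi in h]
    num = sum(np.conj(Gi) * Hi for Gi, Hi in zip(G, H))
    den = sum(np.abs(Gi) ** 2 for Gi in G)
    z = np.zeros(g[0].shape)
    for _ in range(iters):
        # k-step: argmin_k sum_i ||k * g_i - h_i||^2 + beta ||k - z||^2
        k = np.real(np.fft.ifft2((num + beta * np.fft.fft2(z)) / (den + beta)))
        # z-step: soft-threshold, the proximal operator of the L1 term
        z = np.sign(k) * np.maximum(np.abs(k) - lam / (2 * beta), 0.0)
        beta *= 2.0  # continuation on the splitting weight
    k = np.fft.fftshift(np.clip(z, 0.0, None))     # positivity
    c0, c1 = (k.shape[0] - ksize) // 2, (k.shape[1] - ksize) // 2
    k = k[c0:c0 + ksize, c1:c1 + ksize]            # fixed 51 x 51 support
    return k / max(k.sum(), 1e-12)                 # unit sum
```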

4 Experiments

We evaluate our approach on the benchmark dataset of Sun et al. [3], which consists of 640 blurred images generated from 80 high quality natural images, and 8 real motion blur kernels acquired by Levin et al. [9]. We begin by analyzing patch-wise predictions from our neural network, and then compare the performance of our overall algorithm to the state of the art.

4.1 Local Network Predictions

Figure 3 illustrates the typical behavior of our trained neural network on individual patches. All patches in the figure are taken from the same image from [3], which means that they were all blurred by the same kernel (the kernel, and its Fourier coefficients, are also shown). However, we see that the predicted restoration filter coefficients are qualitatively different across these patches.

While some of this variation is due to the fact that the network is reasoning with these patches independently, remember from Sec. 2.1 that we expect the ideal restoration filter to vary based on image content. The predicted filters in Fig. 3 can be understood in that context as attempting to amplify different subsets of the frequencies attenuated by the blur kernel, based on which frequencies the network believes were present in the original image. Comparing the Fourier coefficients of the ground truth sharp patch to our restored outputs, we see that our network restores many frequency components attenuated in the observed patch, without amplifying noise.

These examples also validate our decision to estimate a restoration filter instead of the blur kernel from individual patches. Most patches have no content in entire ranges of frequencies even in their ground-truth sharp versions (most notably, the patch in the last row that is nearly flat), which makes estimating the corresponding frequency components of the kernel impossible. However, we are still able to restore these patches, since that just requires identifying that those frequency components are absent.

Fig. 3. Examples of per-patch restoration using our network. Shown here are different patches extracted from a blurred image from [3]. For each patch, we show the observed blurry patch in the spatial domain, its DFT coefficients Y[z] (in terms of log-magnitude), the predicted filter coefficients G[z] from our network, the DFT of the resulting restored patch X̂[z], and the restored patch in the spatial domain. As comparison, we also show the ground truth sharp image patch and its DFT, as well as the common ground truth kernel and its DFT.

Looking at the restored patches in the spatial domain, we note that while they are sharper than the input, they still have a lot of high-frequency information missing. However, remember that even our direct neural estimate x̂N[n] of the sharp image is composed by averaging estimates from multiple patches at each pixel (see Fig. 1, and the supplementary material for examples of these estimates). Moreover, our final estimates are computed by fitting a global kernel estimate to these locally restored outputs, benefiting from the fact that correctly restored frequencies in all patches are coherent with the same (true) blur kernel.

4.2 Performance Evaluation

Next, we evaluate our overall method and compare it to several recent algorithms [2,3,4,5,6,7,8,16] on the Sun et al. benchmark [3]. Deblurring quality is measured in terms of the MSE between the estimated and the ground truth sharp image, ignoring a fifty pixel wide boundary on all sides in the latter, and after finding the crop of the restored estimate that aligns best with this ground truth. Performance on the benchmark is evaluated [2,3] using quantiles of the error ratio r between the MSE of the estimated image and that of deconvolving the observed image with the ground truth kernel using EPLL [22]. Results with r ≤ 5 are considered to correspond to "successful" restoration [2].
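A sketch of this metric computation; the sub-pixel alignment search over crops is omitted here.

```python
import numpy as np

def error_ratio(x_est, x_oracle, x_gt, border=50):
    """r = MSE(estimate) / MSE(oracle), where the oracle is EPLL deconvolution
    with the ground truth kernel; a 50-pixel boundary is ignored."""
    sl = np.s_[border:-border, border:-border]
    mse = lambda a, b: np.mean((a[sl] - b[sl]) ** 2)
    return mse(x_est, x_gt) / mse(x_oracle, x_gt)

# A restoration counts as "successful" if error_ratio(...) <= 5.
```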


Fig. 4. Cumulative distributions of the error ratio r for different methods on the Sun et al. [3] benchmark. These errors were computed after using EPLL [22] for non-blind deconvolution with the global kernel estimates from each algorithm. The only exception is the "neural average" version of our method, where the errors correspond to those of our initial estimates computed by directly averaging per-patch neural network outputs, without reasoning about a global kernel.

Table 1. Quantiles of error-ratio r and success rate (r ≤ 5), along with kernel estimation time for different methods.

Method                 Mean    95%-ile   Max      Success Rate   Time
(Neural Avg.)          4.92    9.39      19.11    61%            n/a
Proposed               3.01    5.76      11.04    92%            65s (GPU)
Michaeli & Irani [2]   2.57    4.49      9.31     96%            91min (CPU)
Sun et al. [3]         2.38    5.98      23.07    93%            38min (CPU)
Xu & Jia [4]           3.63    9.97      65.33    86%            25s (CPU)
Schuler et al. [16]    4.53    11.21     20.96    67%            22s (CPU)
Cho & Lee [5]          8.69    40.59     111.19   66%            1s (CPU)
Levin et al. [6]       6.56    15.13     40.87    47%            4min (CPU)
Krishnan et al. [7]    11.65   34.93     133.21   25%            3min (CPU)
Cho et al. [8]         28.13   89.67     164.94   12%            1min (CPU)

Figure 4 shows the cumulative distribution of the error-ratio for all methods on the benchmark. We also report specific quantiles of the error ratio (mean, and outlier performance in terms of the 95%-ile and maximum value), as well as the success rate of each method, in Table 1. The results for [2] and [16] were provided by their authors, while those for all other methods are from [3]. Results for all methods were obtained using EPLL for non-blind deconvolution based on their kernel estimates, and are therefore directly comparable to those of our overall method. We also report the performance of our initial estimates from just the direct neural averaging step in Fig. 4 and Table 1, which did not involve any global kernel estimation or the use of non-blind deconvolution with EPLL.


[Fig. 5 bar plot: y-axis Success Rate (0–100%); x-axis: All, then individual kernels by size from 13 × 13 to 27 × 27; methods: Proposed (Full), Michaeli & Irani, Sun et al., Xu & Jia, Schuler et al.]

Fig. 5. Success rates of different methods, evaluated over the entire Sun et al. [3] dataset, and separately over images blurred with each of the 8 kernels. The kernels are sorted according to size (noted on the x-axis).

The performance of the full version of our method is close to that of the two state-of-the-art methods of Michaeli and Irani [2] and Sun et al. [3]. While our mean errors are higher than those of both [2] and [3], we have a near identical success rate and better outlier performance than [3]. Note that our method outperforms the remaining algorithms by a significant margin on all metrics. The best amongst these is the efficient approach of Xu and Jia [4], which is able to perform well on many individual images, but has higher errors and succeeds less often on average than our approach and that of [2,3]. Figure 5 compares the success rate of different methods over individual kernels in the benchmark, to study the effect of kernel size. We see that the previous neural approach of [16] suffers a sharp drop in accuracy for larger kernel sizes. In contrast, our method's performance is more consistent across the whole range of kernels in [3] (albeit, our worst performance is with the largest kernel).

In addition to accuracy, Table 1 also reports the running time for kernel estimation for all methods. We see that while our method has nearly comparable performance to the two state-of-the-art methods [2,3], it offers a significant advantage over them in terms of speed. A MATLAB implementation of our method takes a total of only 65 seconds for kernel estimation using an NVIDIA Titan X GPU. The majority of this time, 45 seconds, is taken to compute the initial neural-average estimate x̂N. On the other hand, [2] and [3] take 91 minutes and 38 minutes respectively, using the MATLAB/C implementations of these methods provided by their authors on an Intel i7 3.3GHz CPU with 6 cores.

While [2,3]'s running times could potentially be improved if they are reimplemented to use a GPU, their ability to benefit from parallelism is limited by the fact that both are iterative techniques whose computations are largely sequential (in fact, we only saw speed-ups of 1.4X and 3.5X in [2] and [3], respectively, when going from one to six CPU cores). In contrast, our method maps naturally to parallel architectures and is able to fully saturate the available cores on a GPU. Batched forward passes through a neural network are especially efficient on GPUs, which is what the bulk of our computation involves: applying the local network independently and in parallel on all patches in the input image.

Fig. 6. Example deblurred results from different methods (columns: Observed, Proposed, Michaeli & Irani, Sun et al., Xu & Jia). The estimated kernels are shown inset, with the ground truth kernel shown with the observed images in the first column. This figure is best viewed in the electronic version.

Some methods in Table 1 are able to use simpler heuristics or priors to achieve lower running times. But these are far less robust and have lower success rates; [4] has the best performance amongst this set. Our method therefore provides a new and practically useful trade-off between reliability and speed.

In Fig. 6, we show some examples of estimated kernels and deblurred outputs from our method and those from [2,3,4]. In general, we find that most of the failure cases of [2,3,4] correspond to scenes that are a poor fit to their hand-crafted generative image priors; e.g., most of [3,4]'s failure cases correspond to images that lack well-separated strong edges. Our discriminatively trained neural network derives its implicit priors automatically from the statistics of the training set, and is relatively more consistent across different scene types, with failure cases corresponding to images where the network encounters ambiguous textures that it can't generalize to. We refer the reader to the supplementary material and our project website at http://www.ttic.edu/chakrabarti/ndeblur for more results. The MATLAB implementation of our method, along with trained network weights, is also available at the latter.

5 Conclusion

In this paper, we introduced a neural network-based method for blind image deconvolution. The key component of our method was a neural network that was discriminatively trained to carry out restoration of individual blurry image patches. We used intuitions from a frequency-domain view of non-blind deconvolution to formulate the prediction task for the network and to design its architecture. For whole image restoration, we averaged the per-patch neural outputs to form an initial estimate of the sharp image, and then estimated a global blur kernel from this estimate. Our approach was found to yield comparable performance to state-of-the-art iterative blind deblurring methods, while offering significant advantages in terms of speed.

We believe that our network can serve as a building block for other applications that involve reasoning with blur. Given that it operates on local regions independently, it is likely to be useful for reasoning about spatially-varying blur, e.g., arising out of defocus and subject motion. We are also interested in exploring architectures and pooling strategies that allow efficient sharing of information across patches. We expect that such sharing can be used to communicate information about a common blur kernel, to exploit "internal" statistics of the image (which forms the basis of the method of [2]), and also to identify and adapt to texture statistics of different scene types (e.g., [16] demonstrated improved performance when training and testing on different image categories).

Acknowledgments. The author was supported by a gift from Adobe Systems, and by the donation of a Titan X GPU from NVIDIA Corporation that was used for this research.


References

1. Fergus, R., Singh, B., Hertzmann, A., Roweis, S.T., Freeman, W.T.: Removing camera shake from a single photograph. SIGGRAPH (2006)
2. Michaeli, T., Irani, M.: Blind deblurring using internal patch recurrence. In: Proc. ECCV (2014)
3. Sun, L., Cho, S., Wang, J., Hays, J.: Edge-based blur kernel estimation using patch priors. In: Proc. ICCP (2013)
4. Xu, L., Jia, J.: Two-phase kernel estimation for robust motion deblurring. In: Proc. ECCV (2010)
5. Cho, S., Lee, S.: Fast motion deblurring. In: SIGGRAPH (2009)
6. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Efficient marginal likelihood optimization in blind deconvolution. In: Proc. CVPR (2011)
7. Krishnan, D., Tay, T., Fergus, R.: Blind deconvolution using a normalized sparsity measure. In: Proc. CVPR (2011)
8. Cho, T.S., Paris, S., Horn, B.K., Freeman, W.T.: Blur kernel estimation using the radon transform. In: Proc. CVPR (2011)
9. Levin, A., Weiss, Y., Durand, F., Freeman, W.T.: Understanding and evaluating blind deconvolution algorithms. In: Proc. CVPR (2009)
10. Burger, H.C., Schuler, C.J., Harmeling, S.: Image denoising: Can plain neural networks compete with BM3D? In: Proc. CVPR (2012)
11. Eigen, D., Krishnan, D., Fergus, R.: Restoring an image taken through a window covered with dirt or rain. In: Proc. ICCV (2013)
12. Schuler, C.J., Burger, H.C., Harmeling, S., Scholkopf, B.: A machine learning approach for non-blind image deconvolution. In: Proc. CVPR (2013)
13. Xu, L., Ren, J.S., Liu, C., Jia, J.: Deep convolutional neural network for image deconvolution. In: NIPS (2014)
14. Hradis, M., Kotera, J., Zemcik, P., Sroubek, F.: Convolutional neural networks for direct text deblurring. In: Proc. BMVC (2015)
15. Sun, J., Cao, W., Xu, Z., Ponce, J.: Learning a convolutional neural network for non-uniform motion blur removal. In: Proc. CVPR (2015)
16. Schuler, C.J., Hirsch, M., Harmeling, S., Scholkopf, B.: Learning to deblur. PAMI (2015)
17. Brown, R.G., Hwang, P.Y.: Introduction to Random Signals and Applied Kalman Filtering. 3rd edn. John Wiley & Sons (1996)
18. LeCun, Y., Bottou, L., Orr, G., Muller, K.: Efficient backprop. In: Neural Networks: Tricks of the Trade. Springer (1998)
19. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proc. ICML (2010)
20. Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The Pascal Visual Object Classes challenge: A retrospective. IJCV (2014)
21. Geman, D., Yang, C.: Nonlinear image recovery with half-quadratic regularization. Trans. Imag. Proc. (1995)
22. Zoran, D., Weiss, Y.: From learning models of natural image patches to whole image restoration. In: Proc. ICCV (2011)