
PROBABILISTIC APPROACHES TO IMAGE REGISTRATION AND DENOISING

By

AJIT RAJWADE

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010

© 2010 Ajit Rajwade


This thesis is being submitted with a feeling of gratitude for my parents and brother,

whom I consider to be my best and closest friends.


ACKNOWLEDGMENTS

I would like to thank my advisors Dr. Anand Rangarajan and Dr. Arunava Banerjee

for sharing with me their endless enthusiasm, knowledge, expertise and love for the

subject. I have come to admire not only their intellect but also their unassuming and

informal nature. They treat their students like friends! Anand and Arunava are two

individuals who are full of ideas, and who are willing to selflessly share those ideas with

everybody. I am indebted to them for giving me the freedom to pursue, for my

Ph.D., a problem that I was passionate about, namely image denoising. I am also

thankful to both of them for having played a big role in encouraging student-student

collaborations on research problems of mutual interest. Such open-mindedness and

enthusiasm is rare!

I would like to thank Dr. Jeffrey Ho, Dr. Baba Vemuri and Dr. Brett Presnell for

serving on my committee. I deeply appreciate Dr. Presnell’s efforts in reading my thesis

and suggesting useful changes, and for discussions on probability density estimation

techniques. A word of sincere appreciation for several faculty members from the CISE

department: Dr. Alper Ungor, Dr. Sanjay Ranka, Dr. Pete Dobbins, Dr. Paul Gader

and Dr. Tim Davis, with whom I have worked as teaching assistant; and for Dr. Meera

Sitharam, with whom I participated in our local chapter of SPICMACAY, an organization

for promotion of Indian classical music.

Gainesville would have been a boring place without my room-mates and lab-mates:

Venkatakrishnan Ramaswamy, Subhajit Sengupta, Karthik Gurumoorthy, Bhupinder

Singh, Amit Dhurandhar, Gnana Sundar Rajendiran, Milapjit Sandhu, Ravneet

Singh Vohra, Sayan Banerjee, Alok Whig, Meizhu Liu, Ting Chen, Guang Chung,

Angelos Barmpoutis, Ritwik Kumar, Fei Wang, Bing Jian, Santhosh Kodipaka, Esen

Yuksel, Wenxing Ye, Yuchen Xie, Dohyung Seo, Sile Hu, Jason Chi, Shahed Nejhum,

Manu Sethi, Mohsen Ali, Adrian Peter, Neil Smith, Karthik Gopalkrishnan, Srikanth

Subramaniam, and many others. They all helped build a lively environment both at


home and in the lab. I consider myself lucky to have had two really wonderful friends:

Venkatakrishnan Ramaswamy (here at UF) and Gurman Singh Gill (at McGill), who have

been such genuine well-wishers all along! I have also come to admire Venkat’s ability to

ask (innumerable :-)) interesting questions on matters both technical and non-technical.

No words can be sufficient to thank my parents, my brother and my grandparents

who never let me feel that I was alone on this long, challenging and sometimes

frustrating journey. This thesis would have been impossible without their support. I

wish to express my sincerest gratitude to the Saswadkar and Iyengar families back

in Pune, who have been friends, philosophers and guides for my family, and who have

helped and supported us in just so many, many priceless ways!


TABLE OF CONTENTS

page

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

CHAPTER

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2 PROBABILITY DENSITY WITH ISOCONTOURS AND ISOSURFACES . . . . 21

2.1 Overview of Existing PDF Estimators . . . . . . . . . . . . . . . . . . . . 21
     2.1.1 The Histogram Estimator . . . . . . . . . . . . . . . . . . . . . . 21
     2.1.2 The Frequency Polygon . . . . . . . . . . . . . . . . . . . . . . . 22
     2.1.3 Kernel Density Estimators . . . . . . . . . . . . . . . . . . . . . 22
     2.1.4 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
     2.1.5 Wavelet-Based Density Estimators . . . . . . . . . . . . . . . . . . 25

2.2 Marginal and Joint Density Estimation . . . . . . . . . . . . . . . . . . . 26
     2.2.1 Estimating the Marginal Densities in 2D . . . . . . . . . . . . . . 27
     2.2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
     2.2.3 Other Methods for Derivation . . . . . . . . . . . . . . . . . . . . 29
     2.2.4 Estimating the Joint Density . . . . . . . . . . . . . . . . . . . . 30
     2.2.5 From Densities to Distributions . . . . . . . . . . . . . . . . . . 33
     2.2.6 Joint Density between Multiple Images in 2D . . . . . . . . . . . . 35
     2.2.7 Extensions to 3D . . . . . . . . . . . . . . . . . . . . . . . . . . 36
     2.2.8 Implementation Details for the 3D case . . . . . . . . . . . . . . . 38
     2.2.9 Joint Densities by Counting Points and Measuring Lengths . . . . . . 39

2.3 Experimental Results: Area-Based PDFs Versus Histograms with Several Sub-Pixel Samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3 APPLICATION TO IMAGE REGISTRATION . . . . . . . . . . . . . . . . . . . . 50

3.1 Entropy Estimators in Image Registration . . . . . . . . . . . . . . . . . 50
3.2 Image Entropy and Mutual Information . . . . . . . . . . . . . . . . . . . 53
3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

     3.3.1 Registration of Two Images in 2D . . . . . . . . . . . . . . . . . . 55
     3.3.2 Registration of Multiple Images in 2D . . . . . . . . . . . . . . . 58
     3.3.3 Registration of Volume Datasets . . . . . . . . . . . . . . . . . . 58

3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60


4 APPLICATION TO IMAGE FILTERING . . . . . . . . . . . . . . . . . . . . . . . 70

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3 Extensions of Our Theory . . . . . . . . . . . . . . . . . . . . . . . . . 75

     4.3.1 Color Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
     4.3.2 Chromaticity Fields . . . . . . . . . . . . . . . . . . . . . . . . 76
     4.3.3 Gray-scale Video . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.4 Level Curve Based Filtering in a Mean Shift Framework . . . . . . . . . . 77
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

     4.5.1 Gray-scale Images . . . . . . . . . . . . . . . . . . . . . . . . . 80
     4.5.2 Testing on a Benchmark Dataset of Gray-scale Images . . . . . . . . 80
     4.5.3 Experiments with Color Images . . . . . . . . . . . . . . . . . . . 81
     4.5.4 Experiments with Chromaticity Vectors and Video . . . . . . . . . . 81

4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 A RELATED PROBLEM: DIRECTIONAL STATISTICS IN EUCLIDEAN SPACE 95

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

     5.2.1 Choice of Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . 96
     5.2.2 Using Random Variable Transformation . . . . . . . . . . . . . . . . 97
     5.2.3 Application to Kernel Density Estimation . . . . . . . . . . . . . . 99
     5.2.4 Mixture Models for Directional Data . . . . . . . . . . . . . . . . 101
     5.2.5 Properties of the Projected Normal Estimator . . . . . . . . . . . . 103

5.3 Estimation of the Probability Density of Hue . . . . . . . . . . . . . . . 104
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6 IMAGE DENOISING: A LITERATURE REVIEW . . . . . . . . . . . . . . . . . . 110

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Partial Differential Equations . . . . . . . . . . . . . . . . . . . . . . 111
6.3 Spatially Varying Convolution and Regression . . . . . . . . . . . . . . . 113
6.4 Transform-Domain Denoising . . . . . . . . . . . . . . . . . . . . . . . . 116

     6.4.1 Choice of Basis . . . . . . . . . . . . . . . . . . . . . . . . . . 117
     6.4.2 Choice of Thresholding Scheme and Parameters . . . . . . . . . . . . 118
     6.4.3 Method for Aggregation of Overlapping Estimates . . . . . . . . . . 119
     6.4.4 Choice of Patch Size . . . . . . . . . . . . . . . . . . . . . . . . 119

6.5 Non-local Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.6 Use of Residuals in Image Denoising . . . . . . . . . . . . . . . . . . . 124

     6.6.1 Constraints on Moments of the Residual . . . . . . . . . . . . . . . 124
     6.6.2 Adding Back Portions of the Residual . . . . . . . . . . . . . . . . 125
     6.6.3 Use of Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . 125
     6.6.4 Residuals in Joint Restoration of Multiple Images . . . . . . . . . 126

6.7 Denoising Techniques using Machine Learning . . . . . . . . . . . . . . . 127
6.8 Common Problems with Contemporary Denoising Techniques . . . . . . . . . . 129


     6.8.1 Validation of Denoising Algorithms . . . . . . . . . . . . . . . . . 129
     6.8.2 Automated Filter Parameter Selection . . . . . . . . . . . . . . . . 131

7 BUILDING UPON THE SINGULAR VALUE DECOMPOSITION FOR IMAGE DENOISING . . . . . 132

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Matrix SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.3 SVD for Image Denoising . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.4 Oracle Denoiser with the SVD . . . . . . . . . . . . . . . . . . . . . . . 134
7.5 SVD, DCT and Minimum Mean Squared Error Estimators . . . . . . . . . . . . 136

     7.5.1 MMSE Estimators with DCT . . . . . . . . . . . . . . . . . . . . . . 136
     7.5.2 MMSE Estimators with SVD . . . . . . . . . . . . . . . . . . . . . . 138
     7.5.3 Results with MMSE Estimators Using DCT . . . . . . . . . . . . . . . 139

          7.5.3.1 Synthetic patches . . . . . . . . . . . . . . . . . . . . . . 139
          7.5.3.2 Real images and a large patch database . . . . . . . . . . . 139

     7.5.4 Results with MMSE Estimators Using SVD . . . . . . . . . . . . . . . 140
          7.5.4.1 Synthetic patches . . . . . . . . . . . . . . . . . . . . . . 140
          7.5.4.2 Real images and a large patch database . . . . . . . . . . . 141

7.6 Filtering of SVD Bases . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.7 Nonlocal SVD with Ensembles of Similar Patches . . . . . . . . . . . . . . 143

     7.7.1 Choice of Patch Similarity Measure . . . . . . . . . . . . . . . . . 147
     7.7.2 Choice of Threshold for Truncation of Transform Coefficients . . . . 149
     7.7.3 Outline of NL-SVD Algorithm . . . . . . . . . . . . . . . . . . . . 150
     7.7.4 Averaging of Hypotheses . . . . . . . . . . . . . . . . . . . . . . 150
     7.7.5 Visualizing the Learned Bases . . . . . . . . . . . . . . . . . . . 150
     7.7.6 Relationship with Fourier Bases . . . . . . . . . . . . . . . . . . 151

7.8 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
     7.8.1 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . 153
     7.8.2 Comparison with KSVD . . . . . . . . . . . . . . . . . . . . . . . . 153
     7.8.3 Comparison with BM3D . . . . . . . . . . . . . . . . . . . . . . . . 154
     7.8.4 Comparison of Non-Local and Local Convolution Filters . . . . . . . 156
     7.8.5 Comparison with 3D-DCT . . . . . . . . . . . . . . . . . . . . . . . 157
     7.8.6 Comparison with Fixed Bases . . . . . . . . . . . . . . . . . . . . 157
     7.8.7 Visual Comparison of the Denoised Images . . . . . . . . . . . . . . 158

7.9 Selection of Global Patch Size . . . . . . . . . . . . . . . . . . . . . . 159
7.10 Denoising with Higher Order Singular Value Decomposition . . . . . . . . 160

     7.10.1 Theory of the HOSVD . . . . . . . . . . . . . . . . . . . . . . . . 160
     7.10.2 Application of HOSVD for Denoising . . . . . . . . . . . . . . . . 161
     7.10.3 Outline of HOSVD Algorithm . . . . . . . . . . . . . . . . . . . . 162

7.11 Experimental Results with HOSVD . . . . . . . . . . . . . . . . . . . . . 164
7.12 Comparison of Time Complexity . . . . . . . . . . . . . . . . . . . . . . 164


8 AUTOMATED SELECTION OF FILTER PARAMETERS . . . . . . . . . . . . . 200

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.2 Literature Review on Automated Filter Parameter Selection . . . . . . . . 201
8.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202

     8.3.1 Independence Measures . . . . . . . . . . . . . . . . . . . . . . . 202
     8.3.2 Characterizing Residual ‘Noiseness’ . . . . . . . . . . . . . . . . 204

8.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
     8.4.1 Validation Method . . . . . . . . . . . . . . . . . . . . . . . . . 207
     8.4.2 Results on NL-Means . . . . . . . . . . . . . . . . . . . . . . . . 208
     8.4.3 Effect of Patch Size on the KS Test . . . . . . . . . . . . . . . . 209
     8.4.4 Results on Total Variation . . . . . . . . . . . . . . . . . . . . . 210

8.5 Discussion and Avenues for Future Work . . . . . . . . . . . . . . . . . . 210

9 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 220

9.1 List of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221

     9.2.1 Trying to Reach the Oracle . . . . . . . . . . . . . . . . . . . . . 221
     9.2.2 Blind and Non-blind Denoising . . . . . . . . . . . . . . . . . . . 221
     9.2.3 Challenging Denoising Scenarios . . . . . . . . . . . . . . . . . . 222

APPENDIX

A DERIVATION OF MARGINAL DENSITY . . . . . . . . . . . . . . . . . . . . . . 224

B THEOREM ON THE PRODUCT OF A CHAIN OF STOCHASTIC MATRICES . 226

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227

BIOGRAPHICAL SKETCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240


LIST OF TABLES

Table page

2-1 Comparison between different methods of density estimation w.r.t. nature of domain, bias, speed, and geometric nature of density contributions . . . . . 43

2-2 Timing values for computation of joint PDFs, and L1 norm of the difference between the PDF computed by sampling and that computed using iso-contours; number of bins is 128 × 128, size of images 122 × 146 . . . . . . . . . . . . 45

3-1 Average and std. dev. of error in degrees (absolute difference between true and estimated angle of rotation) for MI using Parzen windows . . . . . . . . 61

3-2 Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 0.2); Ground truth: θ = 30, s = t = −0.3 . . 66

3-3 Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 1); Ground truth: θ = 30, s = t = −0.3 . . . 67

3-4 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 0.05 . . . . . . . 67

3-5 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 0.2 . . . . . . . . 68

3-6 Average error (absolute diff.) and variance in measuring angle of rotation using MI, NMI calculated with different methods, noise σ = 1 . . . . . . . . . 68

3-7 Three image case: angles of rotation using MMI, MNMI calculated with the iso-contour method and simple histograms, for noise variance σ = 0.05, 0.1, 1 (Ground truth 20 and 30) . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3-8 Error (average, std. dev.) validated over 10 trials with LengthProb and histograms for 128 bins; R refers to the intensity range of the image . . . . 69

4-1 MSE for filtered images using our method and using mean shift with Gaussian kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

4-2 MSE for filtered images using our method, using mean shift with Gaussian kernels and using mean shift with Epanechnikov kernels . . . . . . . . . . . . 84

7-1 Avg, max and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for DCT bases . . . . . . . . . . . . . . . . . . . . . . . . 190

7-2 Avg, max and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for SVD basis of the clean synthetic patch . . . . . . . . . . 190

7-3 PSNR values for noise level σ = 5 on the benchmark dataset . . . . . . . . . . 191

7-4 SSIM values for noise level σ = 5 on the benchmark dataset . . . . . . . . . . 191


7-5 PSNR values for noise level σ = 10 on the benchmark dataset . . . . . . . . . 192

7-6 SSIM values for noise level σ = 10 on the benchmark dataset . . . . . . . . . . 192

7-7 PSNR values for noise level σ = 15 on the benchmark dataset . . . . . . . . . 193

7-8 SSIM values for noise level σ = 15 on the benchmark dataset . . . . . . . . . . 193

7-9 PSNR values for noise level σ = 20 on the benchmark dataset . . . . . . . . . 194

7-10 SSIM values for noise level σ = 20 on the benchmark dataset . . . . . . . . . . 194

7-11 PSNR values: NL-SVD versus DCT for noise level σ = 20 on the benchmark dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

7-12 PSNR values for noise level σ = 25 on the benchmark dataset . . . . . . . . . 195

7-13 SSIM values for noise level σ = 25 on the benchmark dataset . . . . . . . . . . 196

7-14 PSNR values for noise level σ = 30 on the benchmark dataset . . . . . . . . . 197

7-15 SSIM values for noise level σ = 30 on the benchmark dataset . . . . . . . . . . 197

7-16 PSNR values for noise level σ = 35 on the benchmark dataset . . . . . . . . . 198

7-17 SSIM values for noise level σ = 35 on the benchmark dataset . . . . . . . . . . 198

7-18 Patch-size selection for σ = 20 . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

8-1 (NL-Means) Gaussian noise σn^2 = 0.0001 . . . . . . . . . . . . . . . . . 212

8-2 (NL-Means) Gaussian noise σn^2 = 0.0005 . . . . . . . . . . . . . . . . . 215

8-3 (NL-Means) Gaussian noise σn^2 = 0.001 . . . . . . . . . . . . . . . . . . 215

8-4 (NL-Means) Gaussian noise σn^2 = 0.005 . . . . . . . . . . . . . . . . . . 216

8-5 (NL-Means) Gaussian noise σn^2 = 0.01 . . . . . . . . . . . . . . . . . . 216

8-6 (NL-Means) Gaussian noise σn^2 = 0.05 . . . . . . . . . . . . . . . . . . 217

8-7 (NL-Means) Uniform noise width = 0.001 . . . . . . . . . . . . . . . . . . 217

8-8 (NL-Means) Uniform noise width = 0.01 . . . . . . . . . . . . . . . . . . 218

8-9 (TV) Gaussian noise σn^2 = 0.0005 . . . . . . . . . . . . . . . . . . . . 218

8-10 (TV) Gaussian noise σn^2 = 0.005 . . . . . . . . . . . . . . . . . . . . 219


LIST OF FIGURES

Figure page

2-1 p(α) ∝ area between level curves at α and α + ∆α (i.e., region with red dots) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

2-2 (A) Intersection of level curves of I1 and I2: p(α1, α2) ∝ area of dark black regions. (B) Parallelogram approximation: PDF contribution = area (ABCD) . . . 43

2-3 (A) Area of parallelogram increases as angle between level curves decreases (left to right); level curves of I1 and I2 are shown in red and blue respectively. (B) Joint probability contribution in the case of three images . . . . . . . . 43

2-4 A retinogram [1] and its rotated negative . . . . . . . . . . . . . . . . . . . . . 44

2-5 Following left to right and top to bottom, joint densities of the retinogram images computed by histograms (using 16, 32, 64, 128 bins) and by our area-based method (using 16, 32, 64 and 128 bins) . . . . . . . . . . . . . . . . . . . . 44

2-6 Marginal densities of the retinogram image computed by histograms [from (A) to (D)] and our area-based method [from (E) to (H)] using 16, 32, 64 and 128 bins (row-wise order) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2-7 Probability contribution and geometry of isocontour pairs . . . . . . . . . . . . 46

2-8 Splitting a voxel (A) into 12 tetrahedra, two on each of the six faces of the voxel;and (B) into 24 tetrahedra, four on each of the six faces of the voxel . . . . . . 46

2-9 Counting level curve intersections within a given half-pixel . . . . . . . . . . . . 47

2-10 Biased estimates in 3D: (A) Segment of intersection of planar iso-surfaces from the two images, (B) Point of intersection of planar iso-surfaces from the three images (each in a different color) . . . . . . . . . . . . . . . . . . . 47

2-11 Joint probability plots using: (A) histograms, 128 bins, (B) histograms, 256 bins, (C) LengthProb, 128 bins and (D) LengthProb, 256 bins . . . . . . . . 48

2-12 Plots of the difference between the joint PDF (of the images in subfigure [A]) computed by the area-based method and by histogramming with Ns sub-pixel samples, versus log Ns, using (B) L1 norm, (C) L2 norm, and (D) JSD . . . . . . 49

3-1 Graphs showing the average error and error standard deviation with MI as the criterion for 16, 32, 64, 128 bins with noise σ ∈ {0.05, 0.2, 1} . . . . . 62

3-2 MI with 32 and 128 bins for a noise level of 0.05, 0.2 and 1 . . . . . . . . . . . 63

3-3 MR slices of the brain (A) MR-PD slice, (B) MR-T1 slice rotated by 20 degrees,(C) MR-T2 slice rotated by 30 degrees . . . . . . . . . . . . . . . . . . . . . . . 64


3-4 MI computed using (A) histogramming and (B) LengthProb (plotted versus θY and θZ); MMI computed using (C) histogramming and (D) 3DPointProb (plotted versus θ2 and θ3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3-5 MR-PD and MR-T1 slices before and after affine registration . . . . . . . . . . 65

4-1 Image contour maps in a neighborhood . . . . . . . . . . . . . . . . . . . . . . 83

4-2 True, degraded and denoised images . . . . . . . . . . . . . . . . . . . . . . . 85

4-3 True, degraded and denoised images . . . . . . . . . . . . . . . . . . . . . . . 86

4-4 True, degraded and denoised images . . . . . . . . . . . . . . . . . . . . . . . 87

4-5 True, degraded and denoised fingerprint images for three noise levels . . . . . 88

4-6 Performance plot on the benchmark dataset . . . . . . . . . . . . . . . . . . . . 89

4-7 True, degraded and denoised color images . . . . . . . . . . . . . . . . . . . . 90

4-8 True, degraded and denoised color images . . . . . . . . . . . . . . . . . . . . 91

4-9 True, degraded and denoised color images . . . . . . . . . . . . . . . . . . . . 92

4-10 True, degraded and denoised color images . . . . . . . . . . . . . . . . . . . . 93

4-11 True, degraded and denoised frames from a video sequence . . . . . . . . . . 94

5-1 A projected normal distribution (~µ0 = (1, 0), σ0 = 10) and a von-Mises distribution (~µ0 = (1, 0), κ0 = |~µ0|/σ0^2 = 0.01) . . . . . . . . . . . . . . 109

5-2 Plot of projected normal and von-Mises densities . . . . . . . . . . . . . . . . . 109

6-1 Mandrill image: (A) with no noise, (B) with noise of σ = 10, (C) with noise of σ = 20; the noise is hardly visible in the textured fur region (best viewed when zoomed in the PDF file) . . . . . . . . . . . . . . . . . . . . . . . . . 131

7-1 Global SVD Filtering on the Barbara image . . . . . . . . . . . . . . . . . . . . 166

7-2 Patch-based SVD filtering on the Barbara image . . . . . . . . . . . . . . . . . 167

7-3 Oracle filter with SVD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7-4 Fifteen synthetic patches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

7-5 Threshold functions for DCT coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4 . . . . . . . . . . . . . . . . . . . . . . . . . 169

7-6 DCT filtering with MAP and MMSE methods . . . . . . . . . . . . . . . . . . . . 170

7-7 DCT filtering with MAP and MMSE methods . . . . . . . . . . . . . . . . . . . . 171


7-8 Threshold functions for coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4 when projected onto SVD bases of patches from the database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

7-9 SVD filtering with MAP and MMSE methods . . . . . . . . . . . . . . . . . . . . 173

7-10 Motivation for Robust PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

7-11 Barbara image, (A) reference patch, (B) patches similar to the reference patch(similarity measured on noisy image which is not shown here), (C) correlationmatrices (top row) and learned bases . . . . . . . . . . . . . . . . . . . . . . . 174

7-12 Mandrill image, (A) reference patch, (B) patches similar to the reference patch(similarity measured on noisy image which is not shown here), (C) correlationmatrices (top row) and learned bases . . . . . . . . . . . . . . . . . . . . . . . 175

7-13 DCT bases (8× 8). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

7-14 Barbara image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD . . . . . . . . . . . . . . . . . . . . . . . . . 177

7-15 Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD 178

7-16 Boat image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD . . . . . . . . . . . . . . . . . . . . . . . . . 179

7-17 Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD 180

7-18 Stream image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD . . . . . . . . . . . . . . . . . . . . . . . . . 181

7-19 Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD 182

7-20 Fingerprint image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD . . . . . . . . . . . . . . . . . . . . . 183

7-21 Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD 184

7-22 For σ = 20, denoised Barbara image with NL-SVD (A) [PSNR = 30.96] andDCT (C) [PSNR = 29.92]. For the same noise level, denoised boat image withNL-SVD (B) [PSNR = 30.24] and DCT (D) [PSNR = 29.95]. . . . . . . . . . . . 185

7-23 (A) Checkerboard image, (B) Noisy version of the image with σ = 20, (C)Denoised with NL-SVD (PSNR = 34) and (D) DCT (PSNR = 27). Zoom in forbetter view. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186


7-24 Absolute difference between the true Barbara image and the denoised image produced by (A) NL-SVD, (B) BM3D1, (C) BM3D2. All three algorithms were run on the image with noise σ = 20 . . . . . . . . . . . . . . . . . . . . . . . . . . 187

7-25 A zoomed view of Barbara’s face for (A) the original image, (B) NL-SVD and(C) BM3D2. Note the shock artifacts on Barbara’s face produced by BM3D2. . 187

7-26 Reconstructed images when Barbara (with noise σ = 20) is denoised withNL-SVD run on patch sizes (A) 4 × 4, (B) 6 × 6, (C) 8 × 8, (D) 10 × 10, (E)12× 12, (F) 14× 14 and (G) 16× 16. . . . . . . . . . . . . . . . . . . . . . . . . 188

7-27 Residual images when Barbara (with noise σ = 20) is denoised with NL-SVDrun on patch sizes (A) 4× 4, (B) 6× 6, (C) 8× 8, (D) 10× 10, (E) 12× 12, (F)14× 14 and (G) 16× 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

8-1 Plots of CC, MI, P and MSE on an image subjected to up to 16,000 iterations of total variation denoising . . . . . . . . . . . . . . . . . . . . . . . . . 212

8-2 Images produced by filters whose parameters were chosen by different noiseness measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

8-3 Images produced by filters whose parameters were chosen by different noiseness measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

PROBABILISTIC APPROACHES TO IMAGE REGISTRATION AND DENOISING

By

Ajit Rajwade

December 2010

Chair: Anand Rangarajan
Cochair: Arunava Banerjee
Major: Computer Engineering

We present probabilistically driven approaches to two major applications in

computer vision and image processing: image alignment (registration) and filtering

of intensity values corrupted by noise.

Some existing methods for these applications require the estimation of the

probability density of the intensity values defined on the image domain. Most of

the contemporary density estimation techniques employ different types of kernel

functions for smoothing the estimated density values. These kernels are unrelated to

the structure or geometry of the image. The present work chooses to depart from this

conventional approach to one which seeks to approximate the image as a continuous or

piecewise continuous function of the spatial coordinates, and subsequently expresses

the probability density in terms of some key geometric properties of the image, such

as its gradients and iso-intensity level sets. This framework, which regards an image

as a signal rather than a mere collection of samples, is then extended to the case of joint

probability densities between two or more images and for different domains (2D and

3D). A biased density estimate that expressly favors the higher gradient regions of

the image is also presented. These techniques for probability density estimation are

used (1) for the task of affine registration of images drawn from different sensing

modalities, and (2) to build neighborhood filters in the well-known mean shift framework,

for the denoising of corrupted gray-scale and color images, chromaticity fields and


gray-scale video. Using our new density estimators, we demonstrate improvement in the

performance of these applications. A new approach for the estimation of the probability

density of spherical data is also presented, taking into account the fact that the sources of

such data are commonly known or assumed to be Euclidean, particularly within the field

of image analysis.
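The core geometric intuition behind the area-based estimator can be sketched in a few lines of NumPy: the area trapped between two nearby isocontours at levels α and α + ∆α is inversely proportional to the gradient magnitude there, so a simple discrete surrogate of the proposed density weights each pixel's histogram vote by 1/|∇I|. The sketch below is only an illustration of that principle, not the dissertation's full isocontour construction; the function name `area_based_density`, the bin count, and the ε regularizer are all illustrative choices.

```python
import numpy as np

def area_based_density(image, bins=64, eps=1e-6):
    """Gradient-weighted surrogate of an isocontour-area density estimate.

    Flat regions (small |grad I|) occupy more area between nearby
    isocontours and therefore contribute more probability mass, so each
    pixel's histogram vote is weighted by 1 / |grad I|.
    """
    gy, gx = np.gradient(image.astype(float))
    # Weight ~ area between level curves passing near this pixel
    weights = 1.0 / (np.sqrt(gx**2 + gy**2) + eps)
    hist, edges = np.histogram(image, bins=bins, weights=weights)
    # Normalize so the estimate integrates to one over the intensity range
    pdf = hist / (hist.sum() * np.diff(edges))
    return pdf, edges
```

A kernel-based estimator would instead smooth sample counts with a bandwidth parameter; here the smoothing comes from the image's own gradient structure, which is the departure the abstract describes.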

We also develop two patch-based image denoising algorithms that revisit the old

patch-based singular value decomposition (SVD) technique proposed in the seventies.

Noise affects not only the singular values of an image patch but also, quite severely,

its SVD bases, leading to poor-quality denoising if those bases are used directly. With

this in mind, we provide motivation for manipulating the SVD bases of the image patches

for improving denoising performance. To this end, we develop a probabilistic non-local

framework which learns spatially adaptive orthonormal bases that are derived by

exploiting the similarity between patches from different regions of an image. These

bases act as a common SVD for the group of patches similar to any reference patch

in the image. The reference image patches are then filtered by projection onto these

learned bases, manipulation of the transform coefficients and inversion of the transform.

We present principled criteria for the notion of similarity between patches under

noise and for the manipulation of the coefficients, assuming a fixed, known noise model. The

experimental results reported show that our method is simple and efficient, yields

excellent performance as measured by standard image quality metrics, and

has principled parameter settings driven by statistical properties of natural images

and the assumed noise models. We term this technique the non-local SVD (NL-SVD)

and extend it to produce a second, improved algorithm based upon the higher order

singular value decomposition (HOSVD). The HOSVD-based technique filters similar

patches jointly and produces denoising results that are better than most existing popular

methods and very close to the state-of-the-art technique in the field of image denoising.


CHAPTER 1
INTRODUCTION

Image analysis is a flourishing field that has made great progress in the past few

decades. Techniques from image analysis have been employed in fields as diverse as

medicine, mechanical engineering, remote sensing, biometric identification, pathology

and cell biology, molecular chemistry and lithography. An incomplete list of the key

problems that current researchers in the field are working on includes (1) image

inpainting, (2) image denoising and restoration under various degradation models such

as defocus blur or motion blur, fog or haze, rain etc., (3) alignment of images of an object

sensed from different viewpoints, potentially from different sensing modalities (rigid

or affine image registration), and possibly with nontrivial deformations of the object

itself, especially in applications involving medical imaging or face recognition

(non-rigid registration), (4) tomography, (5) image fusion or mosaicing, (6) segmentation

of images into coherent parts or segments, and (7) object recognition under different

views or lighting conditions.

Many of these techniques heavily employ statistical or probabilistic approaches.

A fundamental component of all such approaches is the estimation of the probability

density function (hereafter referred to as the PDF) of the intensity values of the image

defined at different points on the image domain. There exist several techniques for

PDF estimation in the literature. A common component of all of these techniques is

the estimation of frequency counts of the different values of the intensity followed by

smoothing or interpolation between these values, using kernel functions, yielding a

smoothed PDF estimate. These kernels are not related to the geometry of the image

in any manner. This thesis takes the opposite approach: it takes into

account the fact that the image is a geometric object (a 'signal' as opposed to a

'bunch of samples') and interpolates the available samples to create a continuous image

representation, which is itself used for PDF estimation. The use of the interpolant


produces a smoothed estimate that obviates the need for a kernel and critical kernel

parameters such as the bandwidth. Moreover, this method of building a PDF reveals a

clear relationship between probabilistic quantities (such as the PDF itself) and geometric

entities (such as the gradients and the level sets). This estimator is discussed in Chapter

2, following a literature review of contemporary PDF estimators. In Chapters 3 and 4

respectively, the new PDF estimator is employed for two applications: image registration

under affine transformations, and denoising of various types of images affected primarily

by independent and identically distributed noise. The former application considers

images acquired possibly under different lighting conditions or different modalities such

as MR-T1, MR-T2, MR-PD (three different magnetic resonance imaging modalities). The

proposed PDF estimator produces results that are more robust than other techniques

under fine intensity quantization and under image noise. The denoising technique

in Chapter 4 (an interpolant driven local neighborhood method in the mean-shift

framework) is tested on gray-scale images, color images, chromaticity fields and

gray-scale video. For gray-scale and color images, the proposed PDF estimator

produces better denoising results even when the neighborhood for averaging and

the smoothing parameters are small. In Chapter 5, the thesis also discusses a related

problem in the field of spherical (or directional) statistics where the samples are points

on a unit sphere. These data are usually obtained as some function computed from the

original data which are usually known or assumed to lie in Euclidean space. Examples

include chromaticity vectors of color images which are unit-normalized versions of the

red-green-blue (RGB) values output by a camera. In this work, an estimator is presented

which does not impose a kernel directly on the unit vectors, but which uses existing

estimators in the original Euclidean space following random variable transformation.

Chapter 6 presents a detailed overview of contemporary image denoising

techniques. In Chapter 7, we propose a probabilistic technique that starts off by revisiting

the image singular value decomposition (SVD). We perform experiments with global


and local image SVD and propose different ways to manipulate the SVD bases of

noisy image patches, or the coefficients of image patches when projected onto these

bases. We discuss the inefficacy of some of these manipulations, but demonstrate

that replacement of the image patch SVD by a common basis that represents an

ensemble of patches which are all similar to a reference patch, yields excellent filtering

performance. In this technique, which we call the non-local SVD (NL-SVD), a different

basis is produced at every pixel. We present a notion of patch similarity under noise,

which makes use of the properties of the noise model. The actual filtering is performed

at the patch level by projecting the patches onto the basis tuned for that patch, followed

by subsequent modification of the projection coefficients, and inversion of the transform.

Our technique is thus simple, elegant and efficient and it yields performance competitive

with the current state of the art. We also present a second and improved algorithm that

employs the higher-order singular value decomposition (HOSVD), an extension of the

SVD to higher-order tensors.

While the research on image filtering has been extensive, there is very little

literature on automated estimation of the parameters of the filtering algorithms (i.e.

without reference to the true, clean image which is unknown in practical denoising

scenarios). In Chapter 8, we present a new statistically driven criterion for automated

filter parameter selection under the assumption that the noise is i.i.d. with a loose

lower bound on its variance. The criterion measures the statistical similarity between

non-overlapping patches of the residual image (the difference between the noisy and

the denoised image). The criterion is empirically seen to correlate well with known

full-reference quality measures (i.e. those that measure the error between the denoised

image and the true image). We test the criterion in conjunction with the NLMeans

algorithm [2] and the total variation PDE for selecting the smoothing parameter in these

methods.


CHAPTER 2
PROBABILITY DENSITY WITH ISOCONTOURS AND ISOSURFACES

2.1 Overview of Existing PDF Estimators

The most commonly used PDF estimators include the histogram, the frequency

polygon, the Parzen window (or kernel) density estimator, the Gaussian mixture model,

and the much more recent wavelet-based density estimator. In the following, we briefly

review key properties of each. The review material, presented here for the sake of

completeness, is a brief summary of what is found in standard textbooks on the topic

such as [3] and [4].

2.1.1 The Histogram Estimator

The histogram-based density estimator \hat{p}(x) for a density p(x) is defined as follows:

\hat{p}(x) = \frac{F(b_{j+1}) - F(b_j)}{nh}  \qquad (2–1)

where (b_j, b_{j+1}] defines a bin with boundaries b_j and b_{j+1}, h denotes the bin-width, F(b_k) denotes the

number of samples whose value is less than or equal to b_k, and n is the total number of

samples. The histogram is the simplest and most popular estimator owing to

its ease of use. However, it has a number of problems. Firstly, the estimates it produces

are always non-differentiable, even though the underlying density may be differentiable.

The estimate is highly sensitive to the choice of bin boundaries and more importantly

to the choice of the bin-width h. Using a high value of h produces a highly biased (or

over-smoothed) estimate, whereas a very small value of h leads to the problem of very

high variability of the estimate for small changes in the sample values. This tradeoff is

another instance of the classic bias-variance dilemma in machine learning. The specific

expressions for the bias and variance of this estimator are given as follows (due to [4]):

\text{Bias}(\hat{p}(x)) = \frac{h - 2x + 2b_j}{2}\, p'(x) + O(h^2) \quad \text{for } x \in (b_j, b_{j+1}]  \qquad (2–2)

\text{Variance}(\hat{p}(x)) = \frac{p(x)}{nh} + O\!\left(\frac{1}{n}\right).  \qquad (2–3)


The expressions clearly indicate the quadratic increase in bias with increase in h, and

the increase in variance inversely proportional to h. Also clear is the fact that the bias

problem is more pronounced for densities with higher derivative values.

The quality of a density estimator is often given by its mean squared error (MSE)

which is given as follows for the histogram (due to [4]):

\text{MSE}[\hat{p}(x)] = \text{Variance}(\hat{p}(x)) + \text{Bias}^2(\hat{p}(x))  \qquad (2–4)

= \frac{p(x)}{nh} + K\, p'(x)^2 + O\!\left(\frac{1}{n}\right) + O(h^3).  \qquad (2–5)

Upon integrating the MSE across x, we get the mean integrated squared error (MISE),

which is given as (due to [4]):

\text{MISE}[\hat{p}(x)] = \frac{1}{nh} + O\!\left(\frac{1}{n}\right) + O(h^3) + \frac{h^2 \int p'(x)^2\, dx}{12}.  \qquad (2–6)

The bin-width which minimizes the MISE is shown to be O(n^{-1/3}) and inversely

proportional to \int p'(x)^2 dx, leading to an asymptotic MISE value which is O(n^{-2/3})

[4]. This indicates that the optimal rate of convergence of a histogram-based density

estimator is O(n^{-2/3}).
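As a concrete illustration, the estimator of Eq. (2–1) can be sketched in a few lines of NumPy. The function name and the convention of anchoring the bin grid at the sample minimum are ours, purely for illustration:

```python
import numpy as np

def histogram_density(samples, h, x):
    """Histogram estimate of Eq. (2-1): (F(b_{j+1}) - F(b_j)) / (n h),
    where F(b) counts the samples <= b and (b_j, b_{j+1}] contains x."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    b0 = samples.min()                       # anchor the bin grid at the minimum
    j = np.floor((x - b0) / h)               # index of the bin containing x
    lo, hi = b0 + j * h, b0 + (j + 1) * h    # the bin boundaries b_j, b_{j+1}
    return (np.count_nonzero(samples <= hi)
            - np.count_nonzero(samples <= lo)) / (n * h)

rng = np.random.default_rng(0)
data = rng.normal(size=2000)                 # true density: standard normal
p_hat = histogram_density(data, h=0.25, x=0.0)
```

Summing the estimate times h over all bins recovers (essentially) unit probability mass, and the estimate at 0 lands near the true value 1/\sqrt{2\pi}, up to the bias and variance discussed above.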

2.1.2 The Frequency Polygon

Histograms are by definition piecewise constant density estimators. A frequency

polygon is simply a piecewise linear extension to the simple histogram and is obtained

by straightforward linear interpolation in between the estimated density values defined at

the midpoints of adjacent bins. This innocuous change produces an MISE value with a

smaller bias term (O(h^2) as opposed to the earlier O(h)). The analysis in [4], which uses

the bin-width value that optimizes the MISE, indicates an improved convergence rate of

O(n^{-4/5}) as opposed to the earlier O(n^{-2/3}).
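A minimal sketch of the frequency polygon, assuming NumPy's histogram routine and linear interpolation between the bin-midpoint density values (the helper name is ours):

```python
import numpy as np

def frequency_polygon(samples, h, x):
    """Frequency polygon: linearly interpolate the histogram density values
    placed at the midpoints of adjacent bins."""
    samples = np.asarray(samples, dtype=float)
    edges = np.arange(samples.min(), samples.max() + 2 * h, h)   # bin edges
    counts, _ = np.histogram(samples, bins=edges)
    heights = counts / (samples.size * h)        # histogram density per bin
    mids = 0.5 * (edges[:-1] + edges[1:])        # density values sit at midpoints
    return np.interp(x, mids, heights)           # piecewise-linear estimate

rng = np.random.default_rng(1)
data = rng.normal(size=5000)
p0 = frequency_polygon(data, h=0.3, x=0.0)       # near the true value ~0.399
```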

2.1.3 Kernel Density Estimators

To alleviate the non-differentiability of the histogram and the frequency polygon,

kernel density estimators build a differentiable kernel centered at every sample point.


The estimate thus obtained is given as follows:

\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)  \qquad (2–7)

where n is the number of samples and h is the bandwidth. K(\cdot) is called the kernel

function, which is defined to satisfy the following conditions:

\int K(x)\, dx = 1  \qquad (2–8)

\int x K(x)\, dx = 0  \qquad (2–9)

\int x^2 K(x)\, dx = \sigma_K^2 > 0.  \qquad (2–10)

The properties of the kernel density estimator are as follows:

\text{Bias}[\hat{p}(x)] = \frac{h^2 \sigma_K^2\, p''(x)}{2} + O(h^4)  \qquad (2–11)

\text{Variance}[\hat{p}(x)] = \frac{p(x)\, R(K)}{nh} + O\!\left(\frac{1}{n}\right), \quad \text{where } R(K) = \int K(x)^2\, dx  \qquad (2–12)

\text{MISE}[\hat{p}(x)] = O\!\left(\frac{1}{nh}\right) + O(h^4).  \qquad (2–13)

The optimal MISE (corresponding to the value of h that optimizes the MISE) is shown

in [4] to be O(n^{-4/5}), indicating a superior convergence over histograms, and having

the added merit of differentiability over frequency polygons. The common choices of

the kernel function include the Gaussian and the Epanechnikov. The latter is proved

to be the one which produces the best asymptotic MISE, though the Gaussian and

many other known kernels have been proved to be almost as good. This leads to the

conclusion that at least asymptotically, the choice of a kernel is not a major issue in

density estimation. However, a small-sample (i.e. non-asymptotic) analysis of which

kernel is best has not, at least to the author's knowledge, been presented, and

hence the kernel choice will have a distinct effect when only a limited number of samples

is available. Moreover, alongside the advantages mentioned earlier, the kernel density

estimator has two demerits. The first is that the choice of bandwidth h is again quite crucial, with a


large h producing a high bias and a small h producing a high variance. Also, as per [3]

(Section 3.3.2), the ideal width value for minimizing the mean integrated squared error

between the true and estimated density is itself dependent upon the second derivative

of the (unknown) true density. This result therefore does not give any indication to a

practitioner about what the true bandwidth should be. Hence, the typical method to

estimate a bandwidth is a K-fold cross-validation approach, which turns out to be

both computationally expensive and quite error-prone. Secondly, in many applications,

the domain is bounded. However, the estimates produced by this method yield false

values on the boundary of such domains leading to large localized errors (especially if

kernels with unbounded support are used).
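Eq. (2–7) with a Gaussian kernel can be sketched as follows, vectorized over a grid of evaluation points (the function name is illustrative):

```python
import numpy as np

def gaussian_kde(samples, h, x):
    """Kernel density estimate of Eq. (2-7) with the Gaussian kernel
    K(u) = exp(-u^2/2)/sqrt(2*pi); x may be a scalar or an array of points."""
    samples = np.asarray(samples, dtype=float)
    u = (np.atleast_1d(np.asarray(x, dtype=float))[:, None] - samples) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.sum(axis=1) / (samples.size * h)

rng = np.random.default_rng(2)
data = rng.normal(size=3000)
grid = np.linspace(-4.0, 4.0, 81)
p = gaussian_kde(data, h=0.3, x=grid)       # smooth, differentiable estimate
```

Unlike the histogram, the resulting curve is differentiable everywhere, but the quality of the estimate still hinges on the bandwidth h, as discussed above.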

2.1.4 Mixture Models

The mixture model approach to density estimation is also a linear superposition of

kernels, where the number of kernels M is now treated as a modeling parameter [5] and

is usually much smaller than the total number of samples n. The density estimate is

expressed as follows:

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)  \qquad (2–14)

where the coefficients P(j) are called the mixing parameters and are the prior

probabilities that a data point was drawn from the jth component, while p(x|j) is the

conditional density of a data point x given that it was drawn from the jth component. The class

conditional densities are assumed to be parametric (the most popular model being

the Gaussian). As a result, the mixture model is considered to be ‘semi-parametric’ in

nature.

The priors are of course unknown and need to be estimated, as do the parameters

of each individual class. The typical parameters for a Gaussian class are the mean µj

and the covariance matrix Σj . The unknown quantities P(j), µj and Σj are inferred

through an expectation maximization framework (starting from the knowledge of

the samples that are available to the user), which is an iterative procedure prone to


local minima. The choice of the number of components, i.e. M, is also known to be

quite critical, with a very small value leading to inexpressive density estimates. Large

values for M reduce the efficiency of the mixture model over the simple kernel density

estimator.
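A bare-bones EM fit for a one-dimensional Gaussian mixture, illustrating the E- and M-steps described above. The percentile-based initialization is our simplification; a practical implementation would use multiple restarts precisely because of the local-minima issue:

```python
import numpy as np

def gmm_em_1d(x, M=2, iters=200):
    """Fit p(x) = sum_j P(j) N(x; mu_j, var_j) by expectation-maximization."""
    n = x.size
    pi = np.full(M, 1.0 / M)                          # mixing priors P(j)
    mu = np.percentile(x, np.linspace(10, 90, M))     # spread-out initial means
    var = np.full(M, x.var())
    for _ in range(iters):
        # E-step: responsibility r[i, j] = P(component j | sample i)
        logp = -0.5 * ((x[:, None] - mu)**2 / var + np.log(2 * np.pi * var))
        r = pi * np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means and variances from responsibilities
        Nj = r.sum(axis=0)
        pi = Nj / n
        mu = (r * x[:, None]).sum(axis=0) / Nj
        var = (r * (x[:, None] - mu)**2).sum(axis=0) / Nj
    return pi, mu, var

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2.0, 0.5, 1500), rng.normal(3.0, 1.0, 1500)])
pi, mu, var = gmm_em_1d(x, M=2)      # recovers the two modes near -2 and 3
```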

2.1.5 Wavelet-Based Density Estimators

These estimators have been introduced relatively recently and are inspired

by the overwhelming success of wavelets in function approximation. An excellent

tutorial introduction to wavelet density estimation exists in [6] and [7], from which the

following material is summarized. Traditionally, a density estimate \hat{p}(x) (for a true

underlying density p(x)) in this paradigm is expressed in the following manner, as a

linear combination of father and mother wavelet bases (φ(·) and ψ(·) respectively):

\hat{p}(x) = \sum_{k} \alpha_{L,k} \phi_{L,k}(x) + \sum_{j \geq L} \sum_{k} \beta_{j,k} \psi_{j,k}(x)  \qquad (2–15)

where \alpha_{L,k} and \beta_{j,k} are the coefficients of the expansion. Note that the level L

indicates the coarsest scale. The basis functions at a resolution j are expressed in the

following manner:

\phi_{j,k}(x) = 2^{j/2} \phi(2^j x - k)  \qquad (2–16)

\psi_{j,k}(x) = 2^{j/2} \psi(2^j x - k).  \qquad (2–17)

The indices j and k are the scale and translation indices respectively. The

coefficients of the entire wavelet expansion are given by the following formulae:

\alpha_{L,k} = \int_{-\infty}^{+\infty} \phi_{L,k}(x)\, p(x)\, dx  \qquad (2–18)

\beta_{j,k} = \int_{-\infty}^{+\infty} \psi_{j,k}(x)\, p(x)\, dx  \qquad (2–19)

and in practice are estimated as follows:

\hat{\alpha}_{L,k} = \frac{1}{n} \sum_{i=1}^{n} \phi_{L,k}(x_i)  \qquad (2–20)

\hat{\beta}_{j,k} = \frac{1}{n} \sum_{i=1}^{n} \psi_{j,k}(x_i)  \qquad (2–21)

for a sample set xi (1 ≤ i ≤ n). A practitioner using this paradigm needs to choose

a suitable wavelet kernel (Daubechies, symlets, coiflets, Haar etc.) and even more

critically the maximum level at which to truncate the above infinite expansion. This

maximum level (say L_1) decides the finest resolution of the expressed density

\hat{p}(x), and is a model selection issue. Another issue is the thresholding of the wavelet

coefficients after their computation from the given samples. This strategy is adopted

in [8]. A drawback of this method is that the estimate subsequent to thresholding

is not guaranteed to be non-negative, making further renormalization necessary. An

interesting method to circumvent this negativity issue is to express the square root of

the density as the aforementioned summation, as opposed to the density itself. In other

words, we now have:

\sqrt{\hat{p}(x)} = \sum_{k} \alpha_{L,k} \phi_{L,k}(x) + \sum_{j \geq L} \sum_{k} \beta_{j,k} \psi_{j,k}(x)  \qquad (2–22)

which upon squaring yields the density estimate \hat{p}(x), which is now certainly non-negative.

An implicit constraint on the coefficients,

\sum_{k} \alpha_{L,k}^2 + \sum_{j \geq L} \sum_{k} \beta_{j,k}^2 = 1,  \qquad (2–23)

is now imposed, arising from the fact that \int \hat{p}(x)\, dx = 1.
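To make the expansion concrete, here is a sketch using the Haar wavelet, for which the father function φ is the indicator of [0, 1). The coefficients are estimated by the sample means of Eqs. (2–20) and (2–21), and the expansion is truncated at a level Jmax, which plays the role of the maximum level discussed above; the function names and k-ranges are our own illustrative choices:

```python
import numpy as np

def haar_phi(x):                  # Haar father (scaling) function: 1 on [0, 1)
    return ((x >= 0.0) & (x < 1.0)).astype(float)

def haar_psi(x):                  # Haar mother wavelet: +1 on [0,.5), -1 on [.5,1)
    return haar_phi(2.0 * x) - haar_phi(2.0 * x - 1.0)

def wavelet_density(samples, grid, L=0, Jmax=4):
    """Truncated Haar expansion of Eq. (2-15), with empirical coefficients
    alpha_{L,k} and beta_{j,k} from Eqs. (2-20) and (2-21)."""
    est = np.zeros_like(grid, dtype=float)
    lo, hi = samples.min(), samples.max()
    # coarse-scale (father wavelet) terms
    for k in range(int(np.floor(lo * 2**L)) - 1, int(np.ceil(hi * 2**L)) + 1):
        alpha = np.mean(2**(L / 2) * haar_phi(2**L * samples - k))
        est += alpha * 2**(L / 2) * haar_phi(2**L * grid - k)
    # detail (mother wavelet) terms up to the truncation level Jmax
    for j in range(L, Jmax + 1):
        for k in range(int(np.floor(lo * 2**j)) - 1, int(np.ceil(hi * 2**j)) + 1):
            beta = np.mean(2**(j / 2) * haar_psi(2**j * samples - k))
            est += beta * 2**(j / 2) * haar_psi(2**j * grid - k)
    return est

rng = np.random.default_rng(4)
data = rng.uniform(0.0, 1.0, 4000)           # true density: 1 on [0, 1)
grid = np.linspace(0.05, 0.95, 19)
est = wavelet_density(data, grid, L=0, Jmax=4)
```

For Haar bases, the truncated estimate coincides with a dyadic histogram at resolution 2^{-(Jmax+1)}, which makes the model-selection role of Jmax apparent.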

2.2 Marginal and Joint Density Estimation

In this section, we show the derivation of the marginal probability density function (PDF)

of a single 2D image, as well as the joint density for a pair of 2D images. We point out practical

issues and computational considerations, and outline the density derivations for

the case of 3D images as well as multiple images in 2D. The material presented here


is taken from the author's previous publications [9], [10] and [11].¹ The major difference

between the approach presented here and all of the techniques described

in the previous subsections lies in this: the proposed approach regards an image

as a signal and not as a bunch of samples that can be re-arranged

without affecting the density estimate. Therefore essential properties of the signal

(image) can be directly incorporated into the estimation procedure itself.

2.2.1 Estimating the Marginal Densities in 2D

Consider the 2D gray-scale image intensity to be a continuous, scalar-valued

function of the spatial variables, represented as w = I(x, y). Let the total area of the

image be denoted by A. Assume a location random variable Z = (X, Y) with a uniform

distribution over the image field of view (FOV). Further, assume a new random variable

W which is a transformation of the random variable Z, with the transformation given

by the gray-scale image intensity function W = I(X, Y). Then the cumulative distribution

of W at a certain intensity level α is equal to the ratio of the total area of all regions

whose intensity is less than or equal to α to the total area of the image:

\Pr(W \leq \alpha) = \frac{1}{A} \iint_{I(x,y) \leq \alpha} dx\, dy.  \qquad (2–24)

Now, the probability density of W at α is the derivative of the cumulative distribution

in (2–24). This is equal to the difference in the areas enclosed within two level curves

that are separated by an intensity difference of ∆α (or equivalently, the area enclosed

between two level curves of intensity α and α + ∆α), per unit difference, as ∆α → 0 (see

¹ Parts of the content of this and subsequent sections of this chapter have been reprinted with permission from: A. Rajwade, A. Banerjee and A. Rangarajan, 'Probability density estimation using isocontours and isosurfaces: applications to information theoretic image registration', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 475-491, 2009. © 2009, IEEE.


Figure 2-1). The formal expression for this is

p(\alpha) = \frac{1}{A} \lim_{\Delta\alpha \to 0} \frac{\iint_{I(x,y) \leq \alpha + \Delta\alpha} dx\, dy - \iint_{I(x,y) \leq \alpha} dx\, dy}{\Delta\alpha}.  \qquad (2–25)

Hence, we have

p(\alpha) = \frac{1}{A} \frac{d}{d\alpha} \iint_{I(x,y) \leq \alpha} dx\, dy.  \qquad (2–26)

We can now adopt a change of variables from the spatial coordinates (x, y) to u(x, y)

and I (x , y), where u and I are the directions parallel and perpendicular to the level

curve of intensity α, respectively. Observe that I points in the direction of the image

gradient, or the direction of maximum intensity change. Noting this fact, we now obtain

the following:

p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \begin{vmatrix} \frac{\partial x}{\partial I} & \frac{\partial y}{\partial I} \\ \frac{\partial x}{\partial u} & \frac{\partial y}{\partial u} \end{vmatrix}\, du.  \qquad (2–27)

Note that in Eq. (2–27), dα and dI have “canceled” each other out, as they both

stand for intensity change. After performing a change of variables and some algebraic

manipulations (see Appendix A for the complete derivation), we get the following

expression for the marginal density

p(\alpha) = \frac{1}{A} \int_{I(x,y)=\alpha} \frac{du}{\sqrt{I_x^2 + I_y^2}}.  \qquad (2–28)

From the above expression, one can make some important observations. Each

point on a given level curve contributes a certain measure to the density at that intensity

which is inversely proportional to the magnitude of the gradient at that point. In other

words, in regions of high intensity gradient, the area between two level curves at nearby

intensity levels would be small, as compared to that in regions of lower image gradient

(see Figure 2-1). When the gradient value at a point is zero (owing to the existence of a

peak, a valley, a saddle point or a flat region), the contribution to the density at that point

tends to infinity. (The practical repercussions of this situation are discussed later in

this chapter.) Lastly, the density at an intensity level can be estimated by traversing the


level curve(s) at that intensity and integrating the reciprocal of the gradient magnitude.

One can obtain an estimate of the density at several intensity levels (at intensity spacing

of h from each other) across the entire intensity range of the image.
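The cumulative-area view of Eq. (2–26) suggests a direct numerical check: interpolate the image onto a fine grid (bilinear interpolation here, as a simple stand-in for the continuous image model), measure the fraction of the domain with intensity below each level, and differentiate. The helper names are ours, and this finite-difference sketch is an illustration rather than the implementation used in the thesis:

```python
import numpy as np

def bilinear_upsample(img, factor):
    """Evaluate a bilinear interpolant of the grid image on a finer grid."""
    H, W = img.shape
    ys = np.linspace(0.0, H - 1.0, factor * H)
    xs = np.linspace(0.0, W - 1.0, factor * W)
    y0 = np.clip(ys.astype(int), 0, H - 2)
    x0 = np.clip(xs.astype(int), 0, W - 2)
    fy = (ys - y0)[:, None]
    fx = (xs - x0)[None, :]
    I00 = img[y0][:, x0]; I01 = img[y0][:, x0 + 1]
    I10 = img[y0 + 1][:, x0]; I11 = img[y0 + 1][:, x0 + 1]
    return (1 - fy) * (1 - fx) * I00 + (1 - fy) * fx * I01 \
         + fy * (1 - fx) * I10 + fy * fx * I11

def isocontour_density(img, levels, factor=8):
    """Marginal density of Eq. (2-26): p(alpha) = (1/A) d/d(alpha) of the
    area {I <= alpha}, by finite differences of the cumulative area."""
    fine = bilinear_upsample(np.asarray(img, dtype=float), factor)
    cdf = np.array([(fine <= a).mean() for a in levels])   # Pr(W <= alpha)
    return np.gradient(cdf, levels)

# A ramp image I(x, y) = x has uniformly distributed intensities, so p is flat.
ramp = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))
levels = np.linspace(0.05, 0.95, 19)
p = isocontour_density(ramp, levels)
```

For the ramp, the gradient magnitude is constant, so by Eq. (2–28) each level curve contributes equally and the recovered density is flat, as expected.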

2.2.2 Related Work

A similar density estimator has also been developed by another group of researchers

[12], completely independently of this work. Their density estimator is motivated

exclusively by random variable transformations and does not incorporate the notion

of level sets. Furthermore, apart from differences in the derivation of the results, there

are differences in implementation. Moreover the applications they have targeted are

mainly image segmentation, particularly in the biomedical domain [13]. Similar notions

of densities obtained from random variable transformations have been mentioned in [14]

in the context of histogram preserving continuous transformations, with applications to

studying different projections of 3D models. However, in their actual implementation,

only digital samples are used, and there is no notion of any joint statistics. The density

estimator presented in this thesis was specifically developed in the context of an image

registration application (more about this in Chapter 3), and has been extended for

various special cases such as images defined in 3D, two or more than two images in

2D, and biased density estimators in 2D as well as 3D (as will be seen in subsequent

sections of this chapter).

2.2.3 Other Methods for Derivation

There exist at least two other methods of deriving the expression above, which are

discussed below.

1. Using Dirac-delta functions: The Dirac-delta function (with its domain being the real line) is defined as follows:

\delta(x) = +\infty \text{ if } x = 0, \quad \delta(x) = 0 \text{ if } x \neq 0,  \qquad (2–29)

in such a way that

\int_{-\infty}^{+\infty} \delta(x)\, dx = 1.  \qquad (2–30)

The delta function has analogous definitions in higher dimensions. It is a well-known property of the delta function (in any dimension) that

\int_{-\infty}^{+\infty} f(\vec{x})\, \delta(I(\vec{x}))\, d\vec{x} = \int_{I^{-1}(0)} \frac{f(\vec{x})\, du}{|\nabla I(\vec{x})|}.  \qquad (2–31)

Setting f(\vec{x}) to be unity throughout and considering that I(\vec{x}) is the image function, it is easy to see that

p(I(\vec{x}) = \alpha) = \int \delta(I(\vec{x}) - \alpha)\, d\vec{x} = \int_{I^{-1}(\alpha)} \frac{du}{|\nabla I(\vec{x})|}.  \qquad (2–32)

2. An intuitive geometric approach: Again consider the 2D gray-scale image intensity to be a continuous, scalar-valued function of the spatial variables, represented as z = I(x, y). Assuming locations are i.i.d., the cumulative distribution at a certain intensity level α can be written as follows:

\Pr(z < \alpha) = \frac{1}{A} \iint_{z < \alpha} dx\, dy.  \qquad (2–33)

Now, the probability density at α is the derivative of the cumulative distribution. This is equal to the difference in the areas enclosed within two level curves that are separated by an intensity difference of ∆α (or equivalently, the area enclosed between two level curves of intensity α and α + ∆α), per unit difference, as ∆α → 0 (see Figure 2-1). At every location (x, y) along the level curve at α, the perpendicular distance (in terms of spatial coordinates) to the level curve at α + ∆α is given as ∆α/g(x, y), where g(x, y) stands for the magnitude of the intensity gradient at (x, y). Hence the total area enclosed between the two level curves can be calculated as this distance integrated all along the contour at α. Denoting the tangent to the level curve as u, and taking the limit as ∆α → 0, we obtain the same expression.

2.2.4 Estimating the Joint Density

Consider two images represented as continuous scalar-valued functions w_1 =

I_1(x, y) and w_2 = I_2(x, y), whose overlap area is A. As before, assume a location

random variable Z = (X, Y) with a uniform distribution over the (overlap) field of view.

Further, assume two new random variables W_1 and W_2 which are transformations of

the random variable Z, with the transformations given by the gray-scale image

intensity functions W_1 = I_1(X, Y) and W_2 = I_2(X, Y). Let the set of all regions whose


intensity in I1 is less than or equal to α1 and whose intensity in I2 is less than or equal

to α2 be denoted by L. The cumulative distribution Pr(W1 ≤ α1,W2 ≤ α2) at intensity

values (α1,α2) is equal to the ratio of the total area of L to the total overlap area A. The

probability density p(α1,α2) in this case is the second partial derivative of the cumulative

distribution w.r.t. α1 and α2. Consider a pair of level curves from I1 having intensity

values α1 and α1 + ∆α1, and another pair from I2 having intensity α2 and α2 + ∆α2. Let

us denote the region enclosed between the level curves of I1 at α1 and α1 + ∆α1 as Q1

and the region enclosed between the level curves of I2 at α2 and α2 + ∆α2 as Q2. Then

p(α1,α2) can geometrically be interpreted as the area of Q1 ∩ Q2, divided by ∆α1∆α2,

in the limit as ∆α1 and ∆α2 tend to zero. The regions Q1, Q2 and also Q1 ∩ Q2 (dark

black region) are shown in Figure 2-2(left). Using a technique very similar to that shown

in Eqs. (2–25)-(2–27), we obtain the expression for the joint cumulative distribution as

follows:

\Pr(W_1 \leq \alpha_1, W_2 \leq \alpha_2) = \frac{1}{A} \iint_{L} dx\, dy.  \qquad (2–34)

By doing a change of variables, we arrive at the following formula:

\Pr(W_1 \leq \alpha_1, W_2 \leq \alpha_2) = \frac{1}{A} \iint_{L} \begin{vmatrix} \frac{\partial x}{\partial u_1} & \frac{\partial y}{\partial u_1} \\ \frac{\partial x}{\partial u_2} & \frac{\partial y}{\partial u_2} \end{vmatrix}\, du_1\, du_2.  \qquad (2–35)

Here u1 and u2 represent directions along the corresponding level curves of the two

images I1 and I2. Taking the second partial derivative with respect to α1 and α2, we get

the expression for the joint density:

p(\alpha_1, \alpha_2) = \frac{1}{A} \frac{\partial^2}{\partial \alpha_1 \partial \alpha_2} \iint_{L} \begin{vmatrix} \frac{\partial x}{\partial u_1} & \frac{\partial y}{\partial u_1} \\ \frac{\partial x}{\partial u_2} & \frac{\partial y}{\partial u_2} \end{vmatrix}\, du_1\, du_2.  \qquad (2–36)

It is important to note here again that the joint density in (2–36) may not exist

because the cumulative may not be differentiable. Geometrically, this occurs if (a) both


the images have locally constant intensity, (b) if only one image has locally constant

intensity, or (c) if the level sets of the two images are locally parallel. In case (a), we

have area-measures and in the other two cases, we have curve-measures. These cases

are described in detail in the following section, but for the moment, we shall ignore these

degeneracies.

To obtain a complete expression for the PDF in terms of gradients, it would be

highly intuitive to follow purely geometric reasoning. One can observe that the joint

probability density p(α1,α2) is the sum total of “contributions” at every intersection

between the level curves of I1 at α1 and those of I2 at α2. Each contribution is the

area of parallelogram ABCD [see Figure 2-2(right)] at the level curve intersection, as

the intensity differences ∆α1 and ∆α2 shrink to zero. (We consider a parallelogram

here, because we are approximating the level curves locally as straight lines.) Let the

coordinates of the point B be (x , y) and the magnitude of the gradient of I1 and I2 at

this point be g1(x , y) and g2(x , y). Also, let θ(x , y) be the acute angle between the

gradients of the two images at B. Observe that the intensity difference between the

two level curves of I1 is ∆α1. Then, using the definition of gradient, the perpendicular

distance between the two level curves of I_1 is given as ∆α1/g1(x, y). Looking at triangle CDE

(wherein CE is perpendicular to the level curves) we can now deduce the length of

CD (or equivalently that of AB). Similarly, we can also find the length CB. The two

expressions are given by:

|AB| = \frac{\Delta\alpha_1}{g_1(x, y) \sin\theta(x, y)}, \quad |CB| = \frac{\Delta\alpha_2}{g_2(x, y) \sin\theta(x, y)}.  \qquad (2–37)

Now, the area of the parallelogram is equal to

|AB|\,|CB| \sin\theta(x, y) = \frac{\Delta\alpha_1 \Delta\alpha_2}{g_1(x, y)\, g_2(x, y) \sin\theta(x, y)}.  \qquad (2–38)


With this, we finally obtain the following expression for the joint density:

p(\alpha_1, \alpha_2) = \frac{1}{A} \sum_{C} \frac{1}{g_1(x, y)\, g_2(x, y) \sin\theta(x, y)}  \qquad (2–39)

where the set C represents the (countable) locus of all points where I1(x , y) = α1

and I2(x , y) = α2. It is easy to show through algebraic manipulations that Eqs. (2–36)

and (2–39) are equivalent formulations of the joint probability density p(α1,α2). These

results could also have been derived purely by manipulation of Jacobians (as done

while deriving marginal densities), and the derivation for the marginals could also have

proceeded following geometric intuitions.

The formula derived above tallies beautifully with intuition in the following ways.

Firstly, the area of the parallelogram ABCD (i.e. the joint density contribution) in regions

of high gradient [in either or both image(s)] is smaller as compared to that in the case of

regions with lower gradients. Secondly, the area of parallelogram ABCD (i.e. the joint

density contribution) is the least when the gradients of the two images are orthogonal

and maximum when they are parallel or coincident [see Figure 2-3(a)]. In fact, the

joint density tends to infinity in the case where either (or both) gradient(s) is (are)

zero, or when the two gradients align, so that sin θ is zero. The repercussions of this

phenomenon are discussed in the following section.
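Eq. (2–36) can be checked numerically in the same spirit as the marginal case: estimate the joint cumulative of Eq. (2–34) by the fraction of the overlap domain where both intensities fall below the given levels, and take a mixed second finite difference. Pixel samples of two smooth ramp images stand in for the continuous model in this illustrative sketch, and the names are ours:

```python
import numpy as np

def joint_density(I1, I2, levels1, levels2):
    """Joint density p(alpha1, alpha2) of Eq. (2-36): mixed second finite
    difference of the joint cumulative Pr(W1 <= alpha1, W2 <= alpha2)."""
    v1 = np.asarray(I1, dtype=float).ravel()
    v2 = np.asarray(I2, dtype=float).ravel()
    cdf = np.array([[np.mean((v1 <= a1) & (v2 <= a2)) for a2 in levels2]
                    for a1 in levels1])
    d1 = np.gradient(cdf, levels1, axis=0)     # d/d(alpha1)
    return np.gradient(d1, levels2, axis=1)    # then d/d(alpha2)

# Two orthogonal ramps: their level curves always intersect transversally
# (sin(theta) = 1 everywhere), and the joint density over the unit square
# is uniform (identically 1), consistent with Eq. (2-39).
n = 256
I1 = np.tile(np.linspace(0.0, 1.0, n), (n, 1))   # I1(x, y) = x
I2 = I1.T                                        # I2(x, y) = y
lv = np.linspace(0.1, 0.9, 9)
p = joint_density(I1, I2, lv, lv)
```

The degenerate cases discussed above show up here too: if the two ramps were parallel instead of orthogonal, the mixed difference would concentrate all mass on the diagonal, mirroring the sin θ → 0 divergence of Eq. (2–39).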

2.2.5 From Densities to Distributions

In the two preceding sub-sections, we observed the divergence of the marginal

density in regions of zero gradient, or of the joint density in regions where either (or

both) image gradient(s) is (are) zero, or when the gradients locally align. The gradient

goes to zero in regions of the image that are flat in terms of intensity, and also at peaks,

valleys and saddle points on the image surface. We can ignore the latter three cases

as they are a finite number of points within a continuum. The probability contribution

at a particular intensity in a flat region is proportional to the area of that flat region.

Some ad hoc approaches could involve simply “weeding out” the flat regions altogether,


but that would require the choice of sensitive thresholds. The key observation is

that in these regions the density does not exist but the probability distribution does.

So, we can switch entirely to probability distributions everywhere by introducing a

non-zero lower bound on the “values” of ∆α1 and ∆α2. Effectively, this means that

we always look at parallelograms representing the intersection between pairs of level

curves from the two images, separated by non-zero intensity difference, denoted

as, say, h. Since these parallelograms have finite areas, we have circumvented the

situation of choosing thresholds to prevent the values from becoming unbounded,

and the probability at (α1, α2), denoted p(α1, α2), is obtained from the areas of such

parallelograms. We term this area-based method of density estimation AreaProb.

Later in this chapter, we shall show that the switch to distributions is principled and does not reduce our technique to standard histogramming in any manner whatsoever.

The notion of an image as a continuous entity is one of the pillars of our approach.

We adopt a locally linear formulation here for the sake of simplicity, though our technical contributions are in no way tied to any specific interpolant.

For each image grid point, we estimate the intensity values at its four neighbors within

a horizontal or vertical distance of 0.5 pixels. We then divide each square defined by

these neighbors into a pair of triangles. The intensities within each triangle can be

represented as a planar patch, which is given by the equation z1 = A1x + B1y + C1 in

I1. Iso-intensity lines at levels α1 and α1 + h within this triangle are represented by the

equations A1x +B1y +C1 = α1 and A1x +B1y +C1 = α1+ h (likewise for the iso-intensity

lines of I2 at intensities α2 and α2 + h, within a triangle of corresponding location). The

contribution from this triangle to the joint probability at (α1,α2), i.e. p(α1,α2) is the

area bounded by the two pairs of parallel lines, clipped against the body of the triangle

itself, as shown in Figure 2-7. In the case that the corresponding gradients from the two

images are parallel (or coincident), they enclose an infinite area between them, which

when clipped against the body of the triangle, yields a closed polygon of finite area, as

shown in Figure 2-7. When both the gradients are zero (which can be considered to be a

special case of gradients being parallel), the probability contribution is equal to the area

of the entire triangle. In the case where the gradient of only one of the images is zero,

the contribution is equal to the area enclosed between the parallel iso-intensity lines

of the other image, clipped against the body of the triangle (see Figure 2-7). Observe

that though we have to treat pathological regions specially (despite having switched to

distributions), we now do not need to select thresholds, nor do we need to deal with a

mixture of densities and distributions. The other major advantage is added robustness to

noise, as we are now working with probabilities instead of their derivatives, i.e. densities.
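The per-triangle computation just described can be sketched in a few lines. The following is our own illustrative code, not the thesis implementation: it clips a triangular half-pixel against the two intensity strips α_k ≤ A_k·x + B_k·y + C_k ≤ α_k + h (Sutherland–Hodgman half-plane clipping) and measures the remaining area with the shoelace formula.

```python
# Illustrative sketch (ours, not the thesis implementation) of one AreaProb
# contribution: clip a triangular half-pixel against the two intensity strips
# alpha_k <= A_k*x + B_k*y + C_k <= alpha_k + h and measure the resulting area.

def clip_halfplane(poly, a, b, c):
    """Keep the part of convex polygon `poly` where a*x + b*y + c >= 0."""
    out = []
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        f1, f2 = a * x1 + b * y1 + c, a * x2 + b * y2 + c
        if f1 >= 0:
            out.append((x1, y1))
        if (f1 >= 0) != (f2 >= 0):               # edge crosses the boundary
            t = f1 / (f1 - f2)
            out.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return out

def shoelace_area(poly):
    """Area of a simple polygon given as a vertex list."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])))

def joint_contribution(tri, patch1, patch2, alpha1, alpha2, h):
    """Area of `tri` inside both strips alpha_k <= I_k <= alpha_k + h,
    where patch_k = (A_k, B_k, C_k) defines I_k = A_k*x + B_k*y + C_k."""
    poly = list(tri)
    for (A, B, C), alpha in ((patch1, alpha1), (patch2, alpha2)):
        poly = clip_halfplane(poly, A, B, C - alpha)        # I_k >= alpha_k
        poly = clip_halfplane(poly, -A, -B, alpha + h - C)  # I_k <= alpha_k + h
        if len(poly) < 3:
            return 0.0
    return shoelace_area(poly)
```

Note that the parallel-gradient case needs no special handling here: the unbounded strip between parallel level lines is simply clipped against the triangle, leaving a finite polygon, as in Figure 2-7.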

The issue that now arises is how the value of h may be chosen. It should be

noted that although there is no “optimal” h, our density estimate would convey more

and more information as the value of h is reduced (in complete contrast to standard

histogramming). In Figure 2-5, we have shown plots of our joint density estimate and

compared it to standard histograms for P equal to 16, 32, 64 and 128 bins in each

image (i.e. 322, 642 etc. bins in the joint), which illustrate our point clearly. We found

that the standard histograms had a far greater number of empty bins than our density

estimator, for the same number of intensity levels. The corresponding marginal discrete

distributions for the original retinogram image [1] for 16, 32, 64 and 128 bins are shown

in Figure 2-6.

2.2.6 Joint Density between Multiple Images in 2D

For the simultaneous registration of multiple (d > 2) images, the use of a single

d-dimensional joint probability has been advocated in previous literature [15], [16]. Our

joint probability derivation can be easily extended to the case of d > 2 images by using

similar geometric intuition to obtain the polygonal area between d intersecting pairs of

level curves [see Figure 2-3(right) for the case of d = 3 images]. Note here that the

d-dimensional joint distribution lies essentially in a 2D subspace, as we are dealing

with 2D images. A naïve implementation of such a scheme has a complexity of O(NP^d)

where P is the number of intensity levels chosen for each image and N is the size of

each image. Interestingly, however, this exponential cost can be side-stepped by first

computing the at most (d(d−1)/2)P² points of intersection between pairs of level curves

from all d images with one another, for every pixel. Secondly, a graph can be created,

each of whose nodes is an intersection point. Nodes are linked by edges labeled with

the image number (say the k-th image) if they lie along the same iso-contour of that image. In

most cases, each node of the graph will have a degree of four (and in the unlikely case

where level curves from all images are concurrent, the maximal degree of a node will

be 2d). Now, this is clearly a planar graph, and hence, by Euler’s formula, we have the

number of (convex polygonal) faces F = (d(d−1)/2)·4P² − (d(d−1)/2)·P² + 2 = O(P²d²), which is quadratic in the number of images. The areas of the polygonal faces are contributions

to the joint probability distribution. In a practical implementation, there is no requirement

to even create the planar graph. Instead, we can implement a simple incremental

face-splitting algorithm ([17], section 8.3). In such an implementation, we create a list of

faces F which is updated incrementally. To start with, F consists of just the triangular

face constituting the three vertices of a chosen half-pixel in the image. Next, we consider

a single level-line l at a time and split into two any face in F that l intersects. This

procedure is repeated for all level lines (separated by a discrete intensity spacing) of all

the d images. The final output is a listing of all polygonal faces F created by incremental

splitting which can be created in just O(FPd) time. The storage requirement can be

made polynomial by observing that for d images, the number of unique intensity tuples

will be at most FN in the worst case (as opposed to P^d). Hence all intensity tuples can

be efficiently stored and indexed using a hash table.
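The incremental face-splitting step can be sketched as follows; this is our own illustration (with hypothetical level lines), not the thesis code. Each level line a·x + b·y + c = 0 splits any face it crosses into two halves, obtained by clipping against the two half-planes it defines.

```python
# Sketch (ours) of incremental face splitting: maintain a list of convex faces;
# each new level line a*x + b*y + c = 0 splits any face it crosses into two.
# Applying all level lines of all images to an initial half-pixel triangle
# yields the subdivision whose face areas are the joint probability contributions.

def clip(poly, a, b, c):
    """Part of convex polygon `poly` with a*x + b*y + c >= 0."""
    out = []
    for i in range(len(poly)):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
        f1, f2 = a * x1 + b * y1 + c, a * x2 + b * y2 + c
        if f1 >= 0:
            out.append((x1, y1))
        if (f1 >= 0) != (f2 >= 0):
            t = f1 / (f1 - f2)
            out.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return out

def area(poly):
    """Shoelace area of a simple polygon."""
    return 0.5 * abs(sum(x1 * y2 - x2 * y1
                         for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1])))

def split_faces(faces, a, b, c):
    """Replace every face crossed by the line with its two halves."""
    result = []
    for poly in faces:
        for half in (clip(poly, a, b, c), clip(poly, -a, -b, -c)):
            if len(half) >= 3 and area(half) > 1e-12:   # drop degenerate slivers
                result.append(half)
    return result

faces = [[(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]]     # one half-pixel triangle
for a, b, c in [(1, 0, -0.5), (0, 1, -0.5)]:       # two hypothetical level lines
    faces = split_faces(faces, a, b, c)            # -> 3 faces, areas sum to 0.5
```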

2.2.7 Extensions to 3D

When estimating the probability density from 3D images, the choice of an optimal

smoothing parameter is a less critical issue, as a much larger number of samples

are available. However, at a theoretical level this still remains a problem, which would

worsen in the multiple image case. In 3D, the marginal probability can be interpreted as

the total volume sandwiched between two iso-surfaces at neighboring intensity levels.

The formula for the marginal density p(α) of a 3D image w = I (x , y , z) is given as

follows:

p(α) = (1/V) · d/dα ∫∫∫_{I(x,y,z) ≤ α} dx dy dz.    (2–40)

Here V is the volume of the image I (x , y , z). We can now adopt a change of variables

from the spatial coordinates x , y and z to u1(x , y , z), u2(x , y , z) and I (x , y , z), where I

is the perpendicular to the level surface (i.e. parallel to the gradient) and u1 and u2 are

mutually perpendicular directions parallel to the level surface. Noting this fact, we now

obtain the following:

p(α) = (1/V) ∫∫_{I(x,y,z)=α} |det J| du1 du2,    (2–41)

where J is the 3×3 Jacobian matrix

J = [ ∂x/∂I    ∂y/∂I    ∂z/∂I  ]
    [ ∂x/∂u1   ∂y/∂u1   ∂z/∂u1 ]
    [ ∂x/∂u2   ∂y/∂u2   ∂z/∂u2 ].

Upon a series of algebraic manipulations just as before, we are left with the following

expression for p(α):

p(α) = (1/V) ∫∫_{I(x,y,z)=α} du1 du2 / √((∂I/∂x)² + (∂I/∂y)² + (∂I/∂z)²).    (2–42)
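This sandwiched-volume interpretation is easy to verify numerically. The following sketch is our own illustration (not thesis code), using the hypothetical image I(x, y, z) = x on the unit cube, for which the marginal density is uniform, p(α) = 1: the fraction of volume between consecutive level surfaces at α and α + h, divided by h, approximates the density.

```python
import numpy as np

# Sketch (ours): the marginal density at alpha is the limiting volume fraction
# between the iso-surfaces at alpha and alpha + h, divided by h. For the
# hypothetical 3D image I(x, y, z) = x on the unit cube, p(alpha) = 1.

n = 64
xs = np.linspace(0, 1, n)
X, Y, Z = np.meshgrid(xs, xs, xs, indexing="ij")
I = X                                    # hypothetical image
alpha, h = 0.5, 0.1
band = (I >= alpha) & (I < alpha + h)    # voxels between consecutive level surfaces
p_est = band.mean() / h                  # volume fraction / intensity gap
```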

For the joint density case, consider two 3D images represented as w1 = I1(x , y , z)

and w2 = I2(x , y , z), whose overlap volume (the field of view) is V . The cumulative

distribution Pr(W1 ≤ α1,W2 ≤ α2) at intensity values (α1,α2) is equal to the ratio of

the total volume of all regions whose intensity in the first image is less than or equal

to α1 and whose intensity in the second image is less than or equal to α2, to the total

image volume. The probability density p(α1,α2) is again the second partial derivative

of the cumulative distribution. Consider two regions R1 and R2, where R1 is the region

trapped between level surfaces of the first image at intensities α1 and α1 + ∆α1, and R2

is defined analogously for the second image. The density is proportional to the volume

of the intersection of R1 and R2 divided by ∆α1 and ∆α2 when the latter two tend to zero.

It can be shown through some geometric manipulations that the area of the base of

the parallelepiped formed by the iso-surfaces is given as ∆α1∆α2/|~g1 × ~g2| = ∆α1∆α2/(g1 g2 |sin θ|), where ~g1

and ~g2 are the gradients of the two images, and θ is the angle between them. Let ~h be

a vector which points in the direction of the height of the parallelepiped (parallel to the

base normal, i.e. ~g1 × ~g2), and d~h be an infinitesimal step in that direction. Then the

probability density is given as follows:

p(α1,α2) = (1/V) · ∂²/∂α1∂α2 ∫∫∫_{Vs} dx dy dz
         = (1/V) · ∂²/∂α1∂α2 ∫∫∫_{Vs} du1 du2 dh / |~g1 × ~g2|
         = (1/V) ∫_C dh / |~g1 × ~g2|.    (2–43)

In Eq. (2–43), ~u1 and ~u2 are directions parallel to the iso-surfaces of the two images, and

~h is their cross-product (and parallel to the line of intersection of the individual planes),

while C is the 3D space curve containing the points where I1 and I2 have values α1 and

α2 respectively, and Vs ≝ {(x, y, z) : I1(x, y, z) ≤ α1, I2(x, y, z) ≤ α2}.
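A small numerical illustration of Eq. (2–43), ours and not from the thesis: with I1 = x and I2 = y on the unit cube, Pr(W1 ≤ a1, W2 ≤ a2) = a1·a2, so the joint density is uniformly 1. The curve C is the vertical line x = a1, y = a2 of length 1, and |~g1 × ~g2| = 1, so the formula agrees; a finite-difference band count gives the same answer.

```python
import numpy as np

# Consistency check (ours) of the 3D joint density: I1 = x, I2 = y on the
# unit cube give p(a1, a2) = 1 both analytically and via Eq. (2-43).

g1 = np.array([1.0, 0.0, 0.0])                      # gradient of I1 = x
g2 = np.array([0.0, 1.0, 0.0])                      # gradient of I2 = y
p_formula = 1.0 / np.linalg.norm(np.cross(g1, g2))  # curve length 1, V = 1

# Cross-check by the finite-difference definition of the joint density:
n, h = 64, 0.1
xs = np.linspace(0, 1, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")           # z integrates out trivially
band = (X >= 0.4) & (X < 0.5) & (Y >= 0.4) & (Y < 0.5)
p_band = band.mean() / h**2                         # volume fraction / (h*h)
```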

2.2.8 Implementation Details for the 3D case

The density formulation for the 3D case suffers from the same problem of

divergence to infinity, as in the 2D case. Similar techniques can be employed, this

time using level surfaces that are separated by finite intensity gaps. To trace the level

surfaces, each cube-shaped voxel in the 3D image can be divided into 12 tetrahedra.

The apex of each tetrahedron is located at the center of the voxel and the base is

formed by dividing one of the six square faces of the cube by one of the diagonals of

that face [see Figure 2-8(a)]. Within each triangular face of each such tetrahedron, the

intensity can be assumed to be a linear function of location. Note that the intensities

in different faces of one and the same tetrahedron can thus be expressed by different

functions, all of them linear. Hence the iso-surfaces at different intensity levels within a

single tetrahedron are non-intersecting but not necessarily parallel. These level surfaces

at any intensity within a single tetrahedron turn out to be either triangles or quadrilaterals

in 3D. This interpolation scheme does have some bias in the choice of the diagonals

that divide the individual square faces. A scheme that uses 24 tetrahedra with the apex

at the center of the voxel, and four tetrahedra based on every single face, has no bias

of this kind [see Figure 2-8(b)]. However, we still used the former (and faster) scheme

as it is simpler and does not noticeably affect the results. Level surfaces are again

traced at a finite number of intensity values, separated by equal intensity intervals. The

marginal density contributions are obtained as the volumes of convex polyhedra trapped

in between consecutive level surfaces clipped against the body of individual tetrahedra.

The joint distribution contribution from each voxel is obtained by finding the volume of

the convex polyhedron resulting from the intersection of corresponding convex polyhedra

from the two images, clipped against the tetrahedra inside the voxel. We refer to this

scheme of finding joint densities as VolumeProb.
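The 12-tetrahedra decomposition can be sketched as follows; this is our own illustration of the geometry, not the thesis implementation. Each tetrahedron has its apex at the voxel center and its base one of the two triangles obtained by cutting a square face along a diagonal; the 12 volumes sum to the voxel volume.

```python
import numpy as np

# Sketch (ours) of the VolumeProb voxel decomposition: 12 tetrahedra per unit
# voxel, apex at the center, base a half of one of the six square faces.

def tet_volume(a, b, c, d):
    """Volume of the tetrahedron with vertices a, b, c, d."""
    return abs(np.dot(np.cross(b - a, c - a), d - a)) / 6.0

center = np.full(3, 0.5)
cyc = [(0, 0), (0, 1), (1, 1), (1, 0)]         # cyclic order of a face's corners

faces = []
for axis in range(3):                           # six faces of the unit cube
    for val in (0, 1):
        quad = []
        for u, w in cyc:
            v = np.zeros(3)
            v[axis] = val
            v[(axis + 1) % 3] = u
            v[(axis + 2) % 3] = w
            quad.append(v)
        faces.append(quad)

tets = []
for q in faces:                                 # split each face along a diagonal
    tets.append((q[0], q[1], q[2], center))
    tets.append((q[0], q[2], q[3], center))

total = sum(tet_volume(*t) for t in tets)       # the 12 volumes tile the voxel
```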

2.2.9 Joint Densities by Counting Points and Measuring Lengths

For the specific case of registration of two images in 2D, we present another

method of density estimation. This method, which was presented by us earlier in [10],

is a biased estimator that does not assume a uniform distribution on location. In this

technique, the total number of co-occurrences of intensities α1 and α2 from the two

images respectively, is obtained by counting the total number of intersections of the

corresponding level curves. Each half-pixel can be examined to see whether level

curves of the two images at intensities α1 and α2 can intersect within the half-pixel.

This process is repeated for different (discrete) values from the two images (α1 and α2),

separated by equal intervals and selected a priori (see Figure 2-9). The co-occurrence

counts are then normalized so as to yield a joint probability mass function (PMF). We

denote this method as 2DPointProb. The marginals are obtained by summing up the

joint PMF along the respective directions. This method, too, avoids the histogramming

binning problem as one has the liberty to choose as many level curves as desired.

However, it is a biased density estimator because more points are picked from regions

with high image gradient. This is because more level curves (at equi-spaced intensity

levels) are packed together in such areas. It can also be regarded as a weighted version

of the joint density estimator presented in the previous sub-section, with each point

weighted by the gradient magnitudes of the two images at that point as well as the sine

of the angle between them. Thus the joint PMF by this method is given as

p(α1,α2) = ∂²/∂α1∂α2 [ (1/K) ∫∫_D g1(x, y) g2(x, y) sin θ(x, y) dx dy ]    (2–44)

where D denotes the regions where I1(x , y) ≤ α1, I2(x , y) ≤ α2 and K is a normalization

constant. This simplifies to the following:

p(α1,α2) = (1/K) Σ_C 1.    (2–45)

Hence, we have p(α1,α2) = |C|/K, where C is the (countable) set of points where

I1(x , y) = α1 and I2(x , y) = α2. The marginal (biased) density estimates can be regarded

as lengths of the individual iso-contours. With this notion in mind, the marginal density

estimates are seen to have a close relation with the total variation of an image, which

is given by TV = ∫_{I=α} |∇I(x, y)| dx dy [18]. We clearly have TV = ∫_{I=α} du, by doing

the same change of variables (from x , y to u, I ) as in Eqs. (2–27) and (2–28), thus

giving us the length of the iso-contours at any given intensity level. In 3D, we consider

the segments of intersection of two iso-surfaces and calculate their lengths, which

become the PMF contributions. We refer to this as LengthProb [see Figure 2-10(a)].

Both 2DPointProb and LengthProb, however, require us to ignore those regions in which

level sets do not exist because the intensity function is flat, or those regions where level

sets from the two images are parallel. The case of flat regions in one or both images can

be fixed to some extent by slight blurring of the image. The case of aligned gradients

is trickier, especially if the two images are in complete registration. However, in the

multi-modality case or if the images are noisy/blurred, perfect registration is a rare

occurrence, and hence perfect alignment of level surfaces will rarely occur.
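A single 2DPointProb vote can be sketched as follows; this is our own illustration (names are ours), assuming planar patches I_k = A_k·x + B_k·y + C_k on a triangular half-pixel. Solving the 2×2 system I1 = α1, I2 = α2 gives the intersection point of the two level lines; a vote is cast if it lies inside the triangle, and no vote is possible when the gradients are parallel (singular system), matching the limitation noted above.

```python
# Sketch (ours) of one 2DPointProb vote within a triangular half-pixel with
# planar patches I_k(x, y) = A_k*x + B_k*y + C_k, k = 1, 2.

def level_line_vote(patch1, patch2, alpha1, alpha2, tri):
    """True if the level lines I1 = alpha1 and I2 = alpha2 meet inside `tri`."""
    (A1, B1, C1), (A2, B2, C2) = patch1, patch2
    det = A1 * B2 - A2 * B1            # zero when the two gradients are parallel
    if abs(det) < 1e-12:
        return False
    # Cramer's rule for A1*x + B1*y = alpha1 - C1, A2*x + B2*y = alpha2 - C2:
    x = ((alpha1 - C1) * B2 - B1 * (alpha2 - C2)) / det
    y = (A1 * (alpha2 - C2) - A2 * (alpha1 - C1)) / det
    # Barycentric test for membership in the triangle:
    (x0, y0), (xa, ya), (xb, yb) = tri
    d = (ya - yb) * (x0 - xb) + (xb - xa) * (y0 - yb)
    l1 = ((ya - yb) * (x - xb) + (xb - xa) * (y - yb)) / d
    l2 = ((yb - y0) * (x - xb) + (x0 - xb) * (y - yb)) / d
    return min(l1, l2, 1.0 - l1 - l2) >= 0.0
```

Normalizing the accumulated votes over all half-pixels and all (α1, α2) pairs yields the joint PMF.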

To summarize, in both these techniques, location is treated as a random variable with a distribution that is not uniform, but instead peaked at (biased towards)

locations where specific features of the image itself (such as gradients) have large

magnitudes or where gradient vectors from the two images are closer towards being

perpendicular than parallel. Such a bias towards high gradients is principled, as these

are the more salient regions of the two images. Empirically, we have observed that

both these density estimators work quite well on affine registration, and that LengthProb is more than 10 times faster than VolumeProb. This is because the computation

of segments of intersection of planar iso-surfaces is much faster than computing

polyhedron intersections. Joint PMF plots for histograms and LengthProb for 128 bins

and 256 bins are shown in Figure 2-11.

There exists one more major difference between AreaProb and VolumeProb on

one hand, and LengthProb or 2DPointProb on the other. The former two can be easily

extended to compute joint density between multiple images (needed for co-registration

of multiple images using measures such as modified mutual information (MMI) [15]). All

that is required is the intersection of multiple convex polyhedra in 3D or multiple convex

polygons in 2D (see Section 2.2.6). However, 2DPointProb is strictly applicable to the

case of the joint PMF between exactly two images in 2D, as the problem of intersection

of three or more level curves at specific (discrete) intensity levels is over-constrained.

In 3D, LengthProb also deals with strictly two images only, but one can extend the

LengthProb scheme to also compute the joint PMF between exactly three images. This

can be done by making use of the fact that three planar iso-surfaces intersect in a point

(excepting degenerate cases) [see Figure 2-10(b)]. The joint PMFs between the three

images are then computed by counting point intersections. We shall name this method

as 3DPointProb. The differences between all the aforementioned methods (AreaProb, 2DPointProb, VolumeProb, LengthProb and 3DPointProb) are summarized in Table 2-1

for quick reference. It should be noted that 2DPointProb, LengthProb and 3DPointProb

Figure 2-1. p(α) ∝ area between level curves at α and α+∆α (i.e. region with red dots)

compute PMFs, whereas AreaProb and VolumeProb compute cumulative measures

over finite intervals.

2.3 Experimental Results: Area-Based PDFs Versus Histograms with Several Sub-Pixel Samples

The histogram estimate will no doubt approach the true PDF as the number of samples Ns (drawn from sub-pixel locations) tends to infinity. However,

we wish to point out that our method implicitly and efficiently considers every point as

a sample, thereby constructing the PDF directly, i.e. the accuracy of what we calculate

with the area-based method will always be an upper bound on the accuracy yielded by

any sample-based approach, under the assumption that the true interpolant is known to

us. We illustrate this with an example in which the number of histogram samples Ns is varied from 5000 to 2 × 10^9. The L1 and L2 norms of the difference

between the joint PDF of two 90 × 109 images (down-sampled MR-T1 and MR-T2 slices

obtained from Brainweb [19]) as computed by our method and that obtained by the

histogram method, as well as the Jensen-Shannon divergence (JSD) between the two

joint PDFs, are plotted versus log Ns in Figure 2-12. The number

of bins used was 128× 128 (i.e. h = 128). Visually, it was observed that the joint density

surfaces begin to appear ever more similar as Ns increases. The timing values for the

joint PDF computation are shown in Table 2-2, clearly showing the greater efficiency of

our method.
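For reference, the three comparison metrics used above can be computed from two joint PMFs in a few lines of NumPy; this is our own sketch, with small example PMFs standing in for the actual joint densities.

```python
import numpy as np

# Sketch (ours) of the comparison metrics: L1, L2 and the Jensen-Shannon
# divergence (JSD) between two PMFs stored as (flattened) arrays.

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two PMFs (natural log)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        nz = a > 0
        return np.sum(a[nz] * np.log(a[nz] / (b[nz] + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.40, 0.30, 0.20, 0.10])   # example PMFs (ours)
q = np.array([0.25, 0.25, 0.25, 0.25])
l1 = np.abs(p - q).sum()
l2 = np.sqrt(((p - q) ** 2).sum())
d = jsd(p, q)
```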

Figure 2-2. (A) Intersection of level curves of I1 and I2: p(α1,α2) ∝ area of dark black regions. (B) Parallelogram approximation: PDF contribution = area(ABCD)

Figure 2-3. (A) Area of parallelogram increases as angle between level curves decreases (left to right). Level curves of I1 and I2 are shown in red and blue lines respectively. (B) Joint probability contribution in the case of three images

Table 2-1. Comparison between different methods of density estimation w.r.t. nature of domain, bias, speed, and geometric nature of density contributions

Method       | 2D/3D | Density Contr. | Bias | No. of images
AreaProb     | 2D    | Area           | No   | Any
VolumeProb   | 3D    | Volume         | No   | Any
LengthProb   | 3D    | Length         | Yes  | 2 only
2DPointProb  | 2D    | Point count    | Yes  | 2 only
3DPointProb  | 3D    | Point count    | Yes  | 3 only

Figure 2-4. A retinogram [1] and its rotated negative

Figure 2-5. Following left to right and top to bottom, joint densities of the retinogram images computed by histograms (using 16, 32, 64, 128 bins) and by our area-based method (using 16, 32, 64 and 128 bins)

Figure 2-6. Marginal densities of the retinogram image computed by histograms [from (A) to (D)] and our area-based method [from (E) to (H)] using 16, 32, 64 and 128 bins (row-wise order)

Table 2-2. Timing values for computation of joint PDFs and L1 norm of difference between PDF computed by sampling with that computed using iso-contours; number of bins is 128 × 128, size of images 122 × 146

Method                 | Time (secs.) | Diff. with iso-contour PDF
Iso-contours           | 5.1          | 0
Hist. 10^6 samples     | 1            | 0.0393
Hist. 10^7 samples     | 11           | 0.01265
Hist. 10^8 samples     | 106          | 0.0039
Hist. 5 × 10^8 samples | 450          | 0.00176
Hist. 2 × 10^9 samples | 1927         | 8.58 × 10^−4

Figure 2-7. Left: Probability contribution equal to area of parallelogram between level curves clipped against the triangle, i.e. half-pixel. Middle: Case of parallel gradients. Right: Case when the gradient of one image is zero (blue level lines) and that of the other is non-zero (red level lines). In each case, probability contribution equals area of the dark black region

Figure 2-8. Splitting a voxel (A) into 12 tetrahedra, two on each of the six faces of the voxel; and (B) into 24 tetrahedra, four on each of the six faces of the voxel

Figure 2-9. Counting level curve intersections within a given half-pixel

Figure 2-10. Biased estimates in 3D: (A) Segment of intersection of planar iso-surfaces from the two images, (B) Point of intersection of planar iso-surfaces from the three images (each in a different color)

Figure 2-11. Joint probability plots using: (A) histograms, 128 bins, (B) histograms, 256 bins, (C) LengthProb, 128 bins and (D) LengthProb, 256 bins

Figure 2-12. Plots of the difference between the joint PDF (of the images in subfigure [A]) computed by the area-based method and by histogramming with Ns sub-pixel samples versus log Ns using (B) L1 norm, (C) L2 norm, and (D) JSD

CHAPTER 3
APPLICATION TO IMAGE REGISTRATION

3.1 Entropy Estimators in Image Registration

Information-theoretic tools have long been established as the de facto standard for image registration, especially in the domains of medical imaging [20] and

remote sensing [21] which deal with a large number of modalities. The ground-breaking

work for this was done by Viola and Wells [22], and Maes et al. [23] in their widely cited

papers1 . A detailed survey of subsequent research on information theoretic techniques

in medical image registration is presented in the works of Pluim et al. [20] and Maes

et al. [24]. A required component of all information theoretic techniques in image

registration is a good estimator of the joint entropies of the images being registered.

Most techniques employ plug-in entropy estimators, wherein the joint and marginal

probability densities of the intensity values in the images are first estimated and these

quantities are then used to obtain the entropy. There also exist recent methods which

define a new form of entropy using cumulative distributions instead of probability

densities (see [25], [26] and [27]). Furthermore, there also exist techniques which

directly estimate the entropy, without estimating the probability density or distribution as

an intermediate step [28]. Below, we present a bird’s eye view of these techniques and

their limitations. Subsequently, we introduce our method and bring out its salient merits.
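For concreteness, a minimal plug-in MI estimator can be sketched as follows; this is our own illustration, using a plain 2D histogram for the density step (the geometric estimators of Chapter 2 would replace that step).

```python
import numpy as np

# Sketch (ours) of a plug-in estimator: estimate the joint PMF first (here by
# a simple 2D histogram), then plug it into the entropy and MI formulas.

def mutual_information(im1, im2, bins=32):
    """Plug-in Shannon MI between two intensity arrays."""
    joint, _, _ = np.histogram2d(im1.ravel(), im2.ravel(), bins=bins)
    joint /= joint.sum()                                  # normalize to a PMF
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)         # marginals
    h12 = -np.sum(joint[joint > 0] * np.log(joint[joint > 0]))  # joint entropy
    h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))
    h2 = -np.sum(p2[p2 > 0] * np.log(p2[p2 > 0]))
    return h1 + h2 - h12                                  # MI = H1 + H2 - H12
```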

The plug-in entropy estimators rely upon techniques for density estimation as a

key first step. The most popular density estimators are the simple image histogram and

the Parzen window. The latter has been widely employed as a differentiable density estimator for image registration in [22]. The problems associated with these estimators

1 Parts of the contents of this chapter have been reprinted with permission from: A. Rajwade, A. Banerjee and A. Rangarajan, 'Probability density estimation using isocontours and isosurfaces: applications to information theoretic image registration', IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 3, pp. 475-491, 2009. © 2009, IEEE.

have been discussed in the previous chapter. The kernel width parameter in Parzen

windows can be estimated by techniques such as maximum likelihood (see Section

3.3.1 of [29]). Such methods, however, require complicated iterative optimizations,

and also a training and validation set. From an image registration standpoint, the joint

density between the images undergoes a change in each iteration, which requires

re-estimation of the kernel width parameters. This step is an expensive iterative process

with a complexity that is quadratic in the number of samples. Methods such as the fast

Gauss transform [30] reduce this cost to some extent but they require a prior clustering

step. Also, the fast Gauss transform is only an approximation to the true Parzen density

estimate, and hence, one needs to analyze the behavior of the approximation error over

the iterations if a gradient-based optimizer is used. Yet another drawback of Parzen

window based density estimators is the well-known “tail effect” in higher dimensions,

due to which a large number of samples will fall in those regions where the Gaussian

has very low value [3]. Mixture models have been used for joint density estimation in

registration [31], but they are quite inefficient and require choice of the kernel function

for the components (usually chosen to be Gaussian) and the number of components.

This number again will change across the iterations of the registration process, as the

images move with respect to one another. Wavelet based density estimators have also

been recently employed in image registration [32] and in conjunction with MI [7]. The

problems with a wavelet based method for density estimation include a choice of wavelet

function, as well as the selection of the optimal number of levels, which again requires

iterative optimization.

Direct entropy estimators avoid the intermediate density estimation phase. While

there exists a plethora of papers in this field (surveyed in [28]), the most popular entropy

estimator used in image registration is the approximation of the Renyi entropy as the

weight of a minimal spanning tree [33] or a K -nearest neighbor graph [34]. Note that

the entropy used here is the Renyi entropy as opposed to the more popular Shannon

entropy. Drawbacks of this approach include the computational cost in construction of

the data structure in each step of registration (the complexity whereof is quadratic in the

number of samples drawn), the somewhat arbitrary choice of the α parameter for the

Renyi entropy and the lack of differentiability of the cost function. Some work has been

done recently, however, to introduce differentiability in the cost function [35]. A merit

of these techniques is the ease of estimation of entropies of high-dimensional feature

vectors, with the cost scaling up just linearly with the dimensionality of the feature space.

Recently, a new form of the entropy defined on cumulative distributions, and

related cumulative entropic measures such as cross cumulative residual entropy

(CCRE) have been introduced in the literature on image registration [25], [26], [27].

The cumulative entropy and the CCRE measure have perfectly compatible discrete

and continuous versions (quite unlike the Shannon entropy, though not unlike the

Shannon mutual information), and are known to be noise resistant (as they are defined

on cumulative distributions and not densities). Our method of density estimation can be

easily extended to computing cumulative distributions and CCRE.

All the techniques reviewed here are based on different principles, but have

one crucial common point: they treat the image as a set of pixels or samples, which

inherently ignores the fact that these samples originate from an underlying continuous

(or piece-wise continuous) signal. None of these techniques take into account the

ordering between the given pixels of an image. As a result, all these methods can be

termed sample-based. Furthermore, most of the aforementioned density estimators

require a particular kernel, the choice of which is extrinsic to the image being analyzed

and not necessarily linked even to the noise model. In this chapter, we employ our

density estimator discussed in the previous chapter. Our approach here is based on

the author’s earlier work presented in [9] and [11] (the essence of which is to regard

the marginal probability density as the area between two iso-contours at infinitesimally

close intensity values) and in [10] (using biased density estimators for registration).

Other prior work on image registration using such image based techniques includes

[36] and [37]. The work in [36], however, reports results only on template matching with

translations, whereas the main focus of [37] is on estimation of densities in vanishingly

small circular neighborhoods. The formulae derived are very specific to the shape

of the neighborhood. Their paper [37] shows that local mutual information values in

small neighborhoods are related to the values of the angles between the local gradient

vectors in those neighborhoods. The focus of this method, however, is too local in nature,

thereby ignoring the robustness that is an integral part of more global density estimates.

Note that our method, based on finding areas between iso-contours, is significantly

different from Partial Volume Interpolation (PVI) [23], [38]. PVI uses a continuous

image representation to build a joint probability table by assigning fractional votes to

multiple intensity pairs when a digital image is warped during registration. The fractional

votes are assigned typically using a bilinear or bicubic kernel function in cases of

non-alignment with pixel grids after image warping. In essence, the density estimate in

PVI still requires histogramming or Parzen windowing.
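PVI's fractional-vote mechanism can be made concrete with a toy sketch (the function name, the point-list format and the [0, 1] intensity assumption are ours, not taken from [23] or [38]): each warped point carries a fixed-image intensity and lands between grid nodes of the moving image; rather than interpolating a new intensity, it casts bilinearly weighted fractional votes into the joint histogram.

```python
import numpy as np

def pvi_joint_hist(warped_points, moving, bins=32):
    """Toy Partial Volume Interpolation: each warped point distributes a
    fractional vote over the joint-histogram bins of its four neighboring
    moving-image grid intensities, weighted bilinearly."""
    H = np.zeros((bins, bins))
    h, w = moving.shape
    scale = bins - 1                       # intensities assumed in [0, 1]
    for (x, y), fixed_val in warped_points:
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        if not (0 <= x0 < h - 1 and 0 <= y0 < w - 1):
            continue                       # point fell outside the grid
        dx, dy = x - x0, y - y0
        i = int(round(fixed_val * scale))  # fixed-image intensity bin
        for xx, yy, wgt in [(x0, y0, (1 - dx) * (1 - dy)),
                            (x0 + 1, y0, dx * (1 - dy)),
                            (x0, y0 + 1, (1 - dx) * dy),
                            (x0 + 1, y0 + 1, dx * dy)]:
            j = int(round(moving[xx, yy] * scale))
            H[i, j] += wgt                 # fractional vote, no new intensity
    return H / max(H.sum(), 1e-12)
```

Note that the votes for each point sum to one, so the table normalizes to a joint PMF; the density estimate is still, in essence, a histogram.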

The main merit of the proposed geometric technique is the fact that it side-steps

the parameter selection problem that affects other density estimators and also does not

rely on any form of sampling. If the image interpolant is known, the accuracy of our

technique always upper-bounds that of sample-based methods (see Section 3.4). In fact,

the estimate obtained by any sample-based method converges to the one yielded by

our method only in the limit as the number of samples tends to infinity. Empirically,

we demonstrate the robustness of our technique to noise, and superior performance in

image registration. We conclude with a discussion and clarification of some properties of

our method.

3.2 Image Entropy and Mutual Information

We are ultimately interested in using the estimated values of the joint density

p(α1,α2) to calculate (Shannon) joint entropy and MI. A major concern is that, in the limit


as the bin-width h → 0, the Shannon entropy does not approach the continuous entropy,

but becomes unbounded [39]. There are two ways to deal with this. Firstly, a normalized

version of the joint entropy (NJE) obtained by dividing the Shannon joint entropy (JE) by

logP (where P is the number of bins), could be employed instead of the Shannon joint

entropy. As h → 0 and the Shannon entropy tends toward +∞, NJE would still remain

stable, owing to the division by log P, which would also tend toward +∞ (in fact, NJE will

have a maximal upper bound of log P²/ log P = 2, for a uniform joint distribution). Alternatively

(and this is the more principled strategy), we observe that unlike the case with Shannon

entropy, the continuous MI is indeed the limit of the discrete MI as h → 0 (see [39] for the

proof). Now, as P increases, we effectively obtain an increasingly better approximation

to the continuous mutual information.
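A quick numerical illustration of this point (our own toy example, using the analytic joint density f(x, y) = x + y on the unit square rather than any image): as the number of bins P grows, the discrete joint entropy climbs like 2 log P, while the discrete MI settles toward the continuous value.

```python
import numpy as np

def discrete_stats(P):
    """Quantize the joint density f(x, y) = x + y on [0, 1]^2 into a P x P
    probability table; return the discrete Shannon joint entropy and MI."""
    c = (np.arange(P) + 0.5) / P                      # bin centers
    p = c[:, None] + c[None, :]
    p /= p.sum()                                      # P x P joint PMF
    px, py = p.sum(1), p.sum(0)                       # marginals
    H = -(p * np.log(p)).sum()                        # grows like 2 log P
    MI = (p * np.log(p / (px[:, None] * py[None, :]))).sum()
    return H, MI
```

Evaluating `discrete_stats` at P = 16, 64, 256 shows the entropy increasing by roughly 2 log 4 ≈ 2.77 nats per step, while the MI values remain nearly identical.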

In the multiple image case (d > 2), we avoid using a pair-wise sum of MI values

between different image pairs, because such a sum ignores the simultaneous joint

overlap between multiple images. Instead, we can employ measures such as modified

mutual information (MMI) [15], which is defined as the KL divergence between the d-way

joint distribution and the product of the marginal distributions, or its normalized version

(MNMI) obtained by dividing MMI by the joint entropy. The expressions for MI between

two images and MMI for three images are given below:

MI (I1, I2) = H1(I1) + H2(I2)− H12(I1, I2) (3–1)

which can be explicitly written as

MI(I_1, I_2) = \sum_{j_1} \sum_{j_2} p(j_1, j_2) \log \frac{p(j_1, j_2)}{p(j_1)\, p(j_2)}   (3–2)

where the summation indices j1 and j2 range over the sets of possibilities of I1 and I2

respectively. For three images,

MMI (I1, I2, I3) = H1(I1) + H2(I2) + H3(I3)− H123(I1, I2, I3) (3–3)


which has the explicit form

MMI(I_1, I_2, I_3) = \sum_{j_1} \sum_{j_2} \sum_{j_3} p(j_1, j_2, j_3) \log \frac{p(j_1, j_2, j_3)}{p(j_1)\, p(j_2)\, p(j_3)}   (3–4)

where the summation indices j1, j2 and j3 range over the sets of possibilities of I1, I2

and I3 respectively. Though NMI (normalized mutual information) and MNMI are not

compatible in the discrete and continuous formulations (unlike MI and MMI), in our

experiments, we ignored this fact as we chose very specific intensity levels.
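Once the joint PMFs are in hand, Equations 3–2 and 3–4 reduce to a few lines of code; a minimal NumPy sketch (function names are ours):

```python
import numpy as np

def mutual_info(p12):
    """MI from a 2-D joint PMF, per Equation 3-2."""
    p1, p2 = p12.sum(1), p12.sum(0)                   # marginals
    nz = p12 > 0                                      # skip empty bins
    return (p12[nz] * np.log(p12[nz] / np.outer(p1, p2)[nz])).sum()

def modified_mutual_info(p123):
    """MMI: KL divergence between the 3-way joint PMF and the product of
    its marginals, per Equation 3-4."""
    p1 = p123.sum((1, 2)); p2 = p123.sum((0, 2)); p3 = p123.sum((0, 1))
    prod = p1[:, None, None] * p2[None, :, None] * p3[None, None, :]
    nz = p123 > 0
    return (p123[nz] * np.log(p123[nz] / prod[nz])).sum()
```

As a sanity check, a diagonal joint PMF (identical images) gives MI equal to the marginal entropy, and a fully factorized joint PMF gives MI = MMI = 0.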

3.3 Experimental Results

In this section, we describe our experimental results for (a) the case of registration

of two images in 2D, (b) the case of registration of multiple images in 2D and (c) the

case of registration of two images in 3D.

3.3.1 Registration of Two images in 2D

For this case, we took pre-registered MR-T1 and MR-T2 slices from Brainweb [19],

down-sampled to size 122 × 146 (see Figure 2-12) and created a 20° rotated version

of the MR-T2 slice. To this rotated version, zero-mean Gaussian noise of different

variances was added using the imnoise function of MATLAB®. The chosen variances

were 0.01, 0.05, 0.1, 0.2, 0.5, 1 and 2. All these variances are chosen for an intensity

range between 0 and 1. To create the probability distributions, we chose bin counts of

16, 32, 64 and 128. For each combination of bin-count and noise, a brute-force search

was performed so as to optimally align the synthetically rotated noisy image with the

original one, as determined by finding the maximum of MI or NMI between the two

images. Six different techniques were used for MI estimation: (1) simple histograms with

bilinear interpolation for image warping (referred to as “Simple Hist”), (2) our proposed

method using iso-contours (referred to as “Iso-contours”), (3) histogramming with

partial volume interpolation (referred to as “PVI”), (4) histogramming with cubic spline

interpolation (referred to as “Cubic”), (5) the method 2DPointProb proposed in [10], and

(6) simple histogramming with 10⁶ samples taken from sub-pixel locations uniformly


randomly followed by usual binning (referred to as “Hist Samples”). These experiments

were repeated for 30 noise trials at each noise standard deviation. For each method, the

mean and the variance of the error (absolute difference between the predicted alignment

and the ground truth alignment) was measured (Figure 3-1). The same experiments

were also performed using a Parzen-window based density estimator using a Gaussian

kernel and σ = 5 (referred to as “Parzen”) over 30 trials. In each trial, 10,000 samples

were chosen. Out of these, 5000 were chosen as centers for the Gaussian kernel and

the rest were used for the sake of entropy computation. The error mean and variance

was recorded (see Table 3-1).
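The brute-force protocol above can be sketched as follows (our own minimal version, using simple histogramming for the joint PMF and SciPy's `rotate` for bilinear warping; it is not the thesis code, and the test image and angle grid are illustrative):

```python
import numpy as np
from scipy.ndimage import rotate  # assumed available; order=1 is bilinear

def best_rotation(fixed, moving, angles, bins=32):
    """Brute-force search over candidate angles for the rotation of `moving`
    that maximizes MI with `fixed` (intensities assumed in [0, 1])."""
    def mi(a, b):
        H, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins,
                                 range=[[0, 1], [0, 1]])
        p = H / H.sum()
        p1, p2 = p.sum(1), p.sum(0)
        nz = p > 0
        return (p[nz] * np.log(p[nz] / np.outer(p1, p2)[nz])).sum()
    scores = [mi(fixed, np.clip(rotate(moving, a, reshape=False, order=1),
                                0, 1)) for a in angles]
    return list(angles)[int(np.argmax(scores))]
```

Swapping the `mi` inner function for the iso-contour estimate is exactly where the methods compared in this section differ; the search loop itself is unchanged.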

The adjoining error plots (Figure 3-1) show results for all these methods for

all bin counts, for noise levels of 0.05, 0.2 and 1. The accompanying trajectories

(for all methods except histogramming with multiple sub-pixel samples) with MI for

bin-counts of 32 and 128 and noise level 0.05, 0.2 and 1.00 are shown as well, for sake

of comparison, for one arbitrarily chosen noise trial (Figure 3-2). From these figures,

one can appreciate the superior resistance to noise shown by both our methods, even

at very high noise levels, as evidenced both by the shape of the MI and NMI trajectories,

as well as the height of the peaks in these trajectories. Amongst the other methods,

we noticed that PVI is more stable than simple histogramming with either bilinear or

cubic-spline based image warping. In general, the other methods perform better when

the number of histogram bins is small, but even there our method yields a smoother

MI curve. However, as expected, noise does significantly lower the peak in the MI as

well as NMI trajectories in the case of all methods including ours, due to the increase

in joint entropy. Though histogramming with 10⁶ sub-pixel samples performs well (as

seen in Figure 3-1), our method efficiently and directly (rather than asymptotically)

approaches the true PDF and hence the true MI value, under the assumption that we

have access to the true interpolant. Parzen windows with the chosen σ value of 5 gave

good performance, comparable to our technique, but we wish to re-emphasize that the


choice of the parameter was arbitrary and the computation time was much greater for

Parzen windows.

All the aforementioned techniques were also tested on affine image registration

(except for histogramming with multiple sub-pixel samples and Parzen windowing,

which were found to be too slow). For the same image as in the previous experiment,

an affine-warped version was created using the parameters θ = 30°, t = −0.3,

s = -0.3 and φ = 0. During our experiments, we performed a brute force search on

the three-dimensional parameter space so as to find the transformation that optimally

aligned the second image with the first one. The exact parameterization for the affine

transformation is given in [40]. Results were collected for a total of 20 noise trials

and the average predicted parameters were recorded as well as the variance of

the predictions. For a low noise level of 0.01 or 0.05, we observed that all methods

performed well for a quantization up to 64 bins. With 128 bins, all methods except

the two we have proposed broke down, i.e. yielded a false optimum of θ around 38°,

and s and t around 0.4. For higher noise levels, all methods except ours broke down

at a quantization of just 64 bins. The 2DPointProb technique retained its robustness

until a noise level of 1, whereas the area-based technique still produced an optimum

of θ = 28°, s = −0.3, t = −0.4 (which is very close to the ideal value). The area-based

technique broke down only at an extremely high noise level of 1.5 or 2. The average and

standard deviation of the estimate of the parameters θ, s and t, for 32 and 64 bins, for all

five methods and for noise levels 0.2 and 1.00 are presented in Tables 3-2 and 3-3. We

also performed two-sided Kolmogorov-Smirnov tests [41] for statistical significance on

the absolute errors (between the true and estimated affine transformation parameters)

yielded by standard histogramming and the isocontour method, both for 64 bins

and a noise of variance 1. We found that the difference in the error values for MI, as

computed using standard histogramming and our iso-contour technique, was statistically

significant, as ascertained at a level of 0.01.


We also performed experiments on determining the angle of rotation using larger

images with varying levels of noise (σ = 0.05, 0.2, 1). The same Brainweb images,

as mentioned before, were used, except that their original size of 183 × 219 was

retained. For a bin count of up to 128, most methods performed quite well (using a

brute-force search) even under high noise. However with a large bin count (256 bins),

the noise resistance of our method stood out. The results of this experiment with

different methods and under varying noise are presented in Tables 3-4, 3-5 and 3-6.

3.3.2 Registration of Multiple Images in 2D

The images used were pre-registered MR-PD, MR-T1 and MR-T2 slices (from

Brainweb) of size 90 × 109. The latter two were rotated by θ1 = 20° and θ2 = 30°

respectively (see Figure 3-3). For different noise levels and intensity quantizations,

a set of experiments was performed to optimally align the latter two images with the

former using modified mutual information (MMI) and its normalized version (MNMI) as

criteria. These criteria were calculated using our area-based method as well as simple

histogramming with bilinear interpolation. The range of angles was from 1° to 40° in

steps of 1°. The estimated values of θ1 and θ2 are presented in Table 3-7.

3.3.3 Registration of Volume Datasets

Experiments were performed on sub-volumes of size 41 × 41 × 41 from MR-PD

and MR-T2 datasets from the Brainweb simulator [19]. The MR-PD portion was warped

by 20° about the Y as well as Z axes. A brute-force search (from 5° to 35° in steps of

1°, with a joint PMF of 64 × 64 bins) was performed so as to optimally register the

MR-T2 volume with the pre-warped MR-PD volume. The PMF was computed both

using LengthProb as well as using simple histogramming, and used to compute the

MI/NMI just as before. The computed values were also plotted against the two angles as

indicated in the top row of Figure 3-4. As the plots indicate, both techniques yielded

the MI peak at the correct point in the (θY, θZ) plane, i.e. at (20°, 20°). When the same

experiments were run using VolumeProb, we observed that the joint PMF computation


for the same intensity quantization was more than ten times slower. Similar experiments

were performed for registration of three volume datasets in 3D, namely 41 × 41 × 41

sub-volumes of MR-PD, MR-T1 and MR-T2 datasets from Brainweb. The three datasets

were warped through −2°, −21° and −30° around the X axis. A brute force search was

performed so as to optimally register the latter two datasets with the former using MMI

as the registration criterion. Joint PMFs of size 64 × 64 × 64 were computed and these

were used to compute the MMI between the three images. The MMI peak occurred

when the second dataset was warped through θ2 = 19° and the third was warped

through θ3 = 28°, which is the correct optimum. The plots of the MI values calculated by

simple histogramming and 3DPointProb versus the two angles are shown in Figure 3-4

(bottom row) respectively.

The next experiment was designed to check the effect of zero mean Gaussian

noise on the accuracy of affine registration of the same datasets used in the first

experiment, using histogramming and LengthProb. Additive Gaussian noise of variance

σ2 was added to the MR-PD volume. Then, the MR-PD volume was warped by a 4 × 4

affine transformation matrix (expressed in homogeneous coordinate notation) given

as A = SHRzRyRxT where Rz , Ry and Rx represent rotation matrices about the Z ,

Y and X axes respectively, H is a shear matrix and S represents a diagonal scaling

matrix whose diagonal elements are given by 2^{s_x}, 2^{s_y} and 2^{s_z}. (A translation matrix T

is included as well. For more information on this parameterization, please see [42].)

The MR-T1 volume was then registered with the MR-PD volume using a coordinate

descent on all parameters. The actual transformation parameters were chosen to be 7°

for all angles of rotation and shearing, and 0.04 for sx, sy and sz. For a smaller number

of bins (32), it was observed that both the methods gave good results under low noise

and histogramming occasionally performed better. Table 3-8 shows the performance

of histograms and LengthProb for 128 bins, over 10 different noise trials. Summarily,

we observed that our method produced superior noise resistance as compared to


histogramming when the number of bins was larger. To evaluate the performance on real

data, we chose volumes from the Visible Human Dataset2 (Male). We took sub-volumes

of MR-PD and MR-T1 volumes of size 101 × 101 × 41 (slices 1110 to 1151). The

two volumes were almost in complete registration, so we warped the former using an

affine transformation matrix with 5° for all angles of rotation and shearing, and a value

of 0.04 for sx, sy and sz, resulting in a matrix with a sum of absolute values of 3.6686. A

coordinate descent algorithm for 12 parameters was executed on mutual information

calculated using LengthProb so as to register the MR-T1 dataset with the MR-PD

dataset, producing a registration error of 0.319 (see Figure 3-5).
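The coordinate descent used here can be sketched generically (our own simplified scheduler with a halving step size; the thesis experiments may differ in details such as the step schedule and parameter ordering):

```python
import numpy as np

def coordinate_descent(cost, p0, step=1.0, iters=20):
    """Cycle through the parameters, trying +/- step moves on each one and
    keeping any move that lowers the cost (e.g. negated MI); once no single
    move helps, halve the step to refine."""
    p = np.array(p0, dtype=float)
    best = cost(p)
    for _ in range(iters):
        improved = False
        for i in range(len(p)):
            for d in (+step, -step):
                q = p.copy(); q[i] += d
                c = cost(q)
                if c < best:
                    p, best, improved = q, c, True
        if not improved:
            step /= 2                       # refine the search grid
            if step < 1e-3:
                break
    return p, best
```

For the 12-parameter affine search, `cost` would evaluate the negated MI of the warped volume pair; here we only require that it be a callable returning a scalar.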

3.4 Discussion

Thus far in this chapter and the previous one, we have presented a new density

estimator which is essentially geometric in nature, using continuous image representations

and treating the probability density as area sandwiched between iso-contours at

intensity levels that are infinitesimally apart. We extended the idea to the case of joint

density between two images, both in 2D and 3D, as well as to the case of multiple images

in 2D. Empirically, we showed superior noise resistance on registration experiments

involving rotations and affine transformations. Furthermore, we also suggested a

faster, biased alternative based on counting pixel intersections which performs well,

and extended the method to handle volume datasets. The relationship between our

techniques and histogramming with multiple sub-pixel samples was also discussed.

A few clarifications are in order:

1. Comparison to histogramming on an up-sampled image: If an image is up-sampled several times and histogramming is performed on it, there will be more samples for the histogram estimate. At a theoretical level, though, there is still the issue of not being able to relate the number of bins to the available number of samples. Furthermore, it is recommended that the rate of increase in the number of bins be less than the square root of the number of samples when computing the joint density between two images [16], [43]. If there are d images in all, the number of bins ought to be less than N^{1/d}, where N is the total number of pixels, or samples to be taken [16], [43]. Consider that this criterion suggested that N samples were enough for a joint density between two images with χ bins. Suppose that we now wished to compute a joint density with χ bins for d images of the same size. This would require the images to be up-sampled by a factor of at least N^{(d−2)/2}, which is exponential in the number of images. Our simple area-based method clearly avoids this problem.

2 Obtained from the Visible Human Project® (http://www.nlm.nih.gov/research/visible/getting_data.html).
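The growth rate in point 1 is easy to make concrete (using the 122 × 146 slice size from Section 3.3.1 purely as an example):

```python
# Up-sampling factor N^((d-2)/2) needed to keep the bin count fixed when
# moving from 2 images to d images, for N = 122 * 146 pixels.
N = 122 * 146
factors = {d: N ** ((d - 2) / 2) for d in (2, 3, 4, 5)}
# d = 2 needs no up-sampling (factor 1); by d = 4 the factor is already N,
# i.e. the up-sampled image would need N^2 pixels in total.
```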

2. Choice of interpolant: We chose a (piece-wise) linear interpolant for the sake of simplicity, though in principle any other interpolant could be used. It is true that we are making an assumption on the continuity of the intensity function which may be violated in natural images. However, given a good enough resolution of the input image, interpolation across a discontinuity will have a negligible impact on the density, as those discontinuities are essentially a measure-zero set. One could even incorporate an edge-preserving interpolant [44] by running an anisotropic diffusion to detect the discontinuities and then taking care not to interpolate across the two sides of an edge.

3. Non-differentiability: The PDF estimates of our method are not differentiable, which can pose a problem for non-rigid registration applications. Differentiability could be achieved by fitting (say) a spline to the obtained probability tables. However, this again requires smoothing the density estimate in a manner that is not tied to the image geometry, and hence goes against the philosophy of our approach. For practical or empirical reasons, however, there is no reason why one should not experiment with this. Moreover, we currently do not have a closed-form expression for our density estimate. Expressing the marginal and joint densities solely in terms of the parameters of the chosen image interpolant is a challenging problem.

Table 3-1. Average and std. dev. of error in degrees (absolute difference between true and estimated angle of rotation) for MI using Parzen windows

Noise Variance | Avg. Error | Std. Dev. of Error
0.05 | 0.0667 | 0.44
0.2 | 0.33 | 0.8
1 | 3.6 | 3
2 | 4.7 | 12.51



Figure 3-1. Graphs showing the average error A (i.e. abs. diff. between the estimated and the true angle of rotation) and error standard deviation S with MI as the criterion for 16, 32, 64, 128 bins (row-wise) with a noise of 0.05 [from (A) to (D)], with a noise of 0.2 [from (E) to (H)] and with a noise of 1 [from (I) to (L)]. Inside each sub-figure, error-bars are plotted for six different methods, in the following order: Simple Histogramming, Iso-contours, PVI, Cubic, 2DPointProb, Histogramming with 10⁶ samples. Error-bars show the values of A − S, A, A + S. If S is small, only the value of A is shown.



Figure 3-2. First two: MI for 32, 128 bins with noise level of 0.05; third and fourth: with a noise level of 0.2; fifth and sixth: with a noise level of 1.0. In all plots, dark blue: iso-contours, cyan: 2DPointProb, black: cubic, red: simple histogramming, green: PVI. (Note: these plots should be viewed in color.)



Figure 3-3. MR slices of the brain: (A) MR-PD slice, (B) MR-T1 slice rotated by 20 degrees, (C) MR-T2 slice rotated by 30 degrees


Figure 3-4. MI computed using (A) histogramming and (B) LengthProb (plotted versus θY and θZ); MMI computed using (C) histogramming and (D) 3DPointProb (plotted versus θ2 and θ3)


Figure 3-5. TOP ROW: original PD image (left), warped T1 image (middle), image overlap before registration (right). MIDDLE ROW: PD image warped using the predicted matrix (left), warped T1 image (middle), image overlap after registration (right). BOTTOM ROW: PD image warped using the ideal matrix (left), warped T1 image (middle), image overlap after registration in the ideal case (right)


Table 3-2. Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 0.2); ground truth: θ = 30, s = t = −0.3

Method | Bins | θ | s | t
MI Hist | 32 | 30, 0 | −0.3, 0 | −0.3, 0
NMI Hist | 32 | 30, 0 | −0.3, 0 | −0.3, 0
MI Iso | 32 | 30, 0 | −0.3, 0 | −0.3, 0
NMI Iso | 32 | 30, 0 | −0.3, 0 | −0.3, 0
MI PVI | 32 | 30, 0 | −0.3, 0 | −0.3, 0
NMI PVI | 32 | 30, 0 | −0.3, 0 | −0.3, 0
MI Spline | 32 | 30.8, 0.2 | −0.3, 0 | −0.3, 0
NMI Spline | 32 | 30.6, 0.7 | −0.3, 0 | −0.3, 0
MI 2DPt. | 32 | 30, 0 | −0.3, 0 | −0.3, 0
NMI 2DPt. | 32 | 30, 0 | −0.3, 0 | −0.3, 0
MI Hist | 64 | 29.2, 49.7 | 0.4, 0 | 0.27, 0.07
NMI Hist | 64 | 28.8, 44.9 | 0.4, 0 | 0.33, 0.04
MI Iso | 64 | 30, 0 | −0.3, 0 | −0.3, 0
NMI Iso | 64 | 30, 0 | −0.3, 0 | −0.3, 0
MI PVI | 64 | 30, 0 | −0.3, 0 | −0.3, 0
NMI PVI | 64 | 30, 0 | −0.3, 0 | −0.3, 0
MI Spline | 64 | 24, 21.5 | 0.4, 0 | 0.33, 0.04
NMI Spline | 64 | 24.3, 20.9 | 0.4, 0 | 0.33, 0.04
MI 2DPt. | 64 | 30, 0 | −0.3, 0 | −0.3, 0
NMI 2DPt. | 64 | 30, 0 | −0.3, 0 | −0.3, 0


Table 3-3. Average value and variance of parameters θ, s and t predicted by various methods (32 and 64 bins, noise σ = 1); ground truth: θ = 30, s = t = −0.3

Method | Bins | θ | s | t
MI Hist | 32 | 33.7, 18.1 | 0.4, 0 | 0.13, 0.08
NMI Hist | 32 | 34.3, 15.9 | 0.4, 0 | 0.13, 0.08
MI Iso | 32 | 30, 0.06 | −0.3, 0 | −0.3, 0
NMI Iso | 32 | 30, 0.06 | −0.3, 0 | −0.3, 0
MI PVI | 32 | 28.1, 36.25 | 0.26, 0.08 | 0.19, 0.1
NMI PVI | 32 | 28.1, 36.25 | 0.3, 0.05 | 0.21, 0.08
MI Spline | 32 | 30.3, 49.39 | 0.4, 0 | 0.09, 0.1
NMI Spline | 32 | 31.2, 48.02 | 0.4, 0 | 0.05, 0.1
MI 2DPt. | 32 | 30.3, 0.22 | −0.3, 0 | −0.3, 0
NMI 2DPt. | 32 | 30.3, 0.22 | −0.3, 0 | −0.3, 0
MI Hist | 64 | 27.5, 44.65 | 0.4, 0 | 0.25, 0.08
NMI Hist | 64 | 27, 43.86 | 0.4, 0 | 0.246, 0.08
MI Iso | 64 | 30.5, 0.12 | −0.27, 0.035 | −0.28, 0.02
NMI Iso | 64 | 31.2, 0.1 | −0.27, 0.058 | −0.28, 0.02
MI PVI | 64 | 26.2, 36.96 | 0.4, 0 | 0.038, 0
NMI PVI | 64 | 26.8, 41.8 | 0.4, 0 | 0.038, 0
MI Spline | 64 | 25.9, 40.24 | 0.4, 0 | 0.3, 0.06
NMI Spline | 64 | 25.7, 26.7 | 0.4, 0 | 0.3, 0.06
MI 2DPt. | 64 | 30.5, 0.25 | −0.24, 0.0197 | −0.23, 0.01
NMI 2DPt. | 64 | 30.5, 0.25 | −0.26, 0.0077 | −0.22, 0.02

Table 3-4. Average error (absolute diff.) and variance in measuring the angle of rotation using MI, NMI calculated with different methods, noise σ = 0.05

Method | 128 bins | 256 bins
MI Hist. | 0, 0 | 0.13, 0.115
NMI Hist. | 0, 0 | 0.067, 0.062
MI Iso. | 0, 0 | 0, 0
NMI Iso. | 0, 0 | 0, 0
MI PVI | 0, 0 | 0, 0
NMI PVI | 0, 0 | 0, 0
MI Spline | 0, 0 | 0.33, 0.22
NMI Spline | 0, 0 | 0.33, 0.22
MI 2DPt. | 0, 0 | 0, 0
NMI 2DPt. | 0, 0 | 0, 0


Table 3-5. Average error (absolute diff.) and variance in measuring the angle of rotation using MI, NMI calculated with different methods, noise σ = 0.2

Method | 128 bins | 256 bins
MI Hist. | 0.07, 0.196 | 0.2, 0.293
NMI Hist. | 0.07, 0.196 | 0.13, 0.25
MI Iso. | 0, 0 | 0, 0
NMI Iso. | 0, 0 | 0, 0
MI PVI | 0, 0 | 0, 0
NMI PVI | 0, 0 | 0, 0
MI Spline | 2.77, 10 | 4.77, 10
NMI Spline | 2.77, 10 | 18, 0.06
MI 2DPt. | 0, 0 | 0, 0
NMI 2DPt. | 0, 0 | 0, 0

Table 3-6. Average error (absolute diff.) and variance in measuring the angle of rotation using MI, NMI calculated with different methods, noise σ = 1

Method | 128 bins | 256 bins
MI Hist. | 1.26, 31 | 27.9, 3.1
NMI Hist. | 1.2, 30 | 28, 3.3
MI Iso. | 0, 0 | 0, 0
NMI Iso. | 0, 0 | 0, 0
MI PVI | 0, 0.26 | 26.9, 14.3
NMI PVI | 0, 0.26 | 26.8, 14.5
MI Spline | 10, 0.2 | 18, 0.33
NMI Spline | 9.8, 0.15 | 18, 0.06
MI 2DPt. | 0.07, 0.06 | 0.07, 0.06
NMI 2DPt. | 0.267, 0.32 | 0.07, 0.06

Table 3-7. Three-image case: angles of rotation using MMI, MNMI calculated with the iso-contour method and simple histograms, for noise variance σ = 0.05, 0.2, 1 (ground truth: 20 and 30)

Noise Variance | Method | 32 bins | 64 bins
0.05 | MMI Hist. | 21, 30 | 22, 31
0.05 | MNMI Hist. | 21, 30 | 22, 31
0.05 | MMI Iso. | 20, 30 | 20, 30
0.05 | MNMI Iso. | 20, 30 | 20, 30
0.2 | MMI Hist. | 15, 31 | 40, 8
0.2 | MNMI Hist. | 15, 31 | 40, 8
0.2 | MMI Iso. | 22, 29 | 20, 30
0.2 | MNMI Iso. | 22, 29 | 20, 30
1 | MMI Hist. | 40, 9 | 38, 4
1 | MNMI Hist. | 40, 9 | 34, 4
1 | MMI Iso. | 22, 30 | 35, 23
1 | MNMI Iso. | 22, 30 | 40, 3


Table 3-8. Error (average, std. dev.) validated over 10 trials with LengthProb and histograms for 128 bins; R refers to the intensity range of the image

Noise Level | Error with LengthProb | Error with histograms
0 | 0.09, 0.02 | 0.088, 0.009
√50 R | 0.135, 0.029 | 0.306, 0.08
√100 R | 0.5, 0.36 | 1.47, 0.646
√150 R | 0.56, 0.402 | 1.945, 0.56


CHAPTER 4
APPLICATION TO IMAGE FILTERING

4.1 Introduction

Filtering of images has been one of the most fundamental problems studied in

low-level vision and signal processing. Over the past decades, several techniques for

data filtering have been proposed with impressive results on practical applications

in image processing. As straightforward image smoothing is known to blur across

significant image structures, several anisotropic approaches to image smoothing have

been developed using partial differential equations (PDEs) with stopping terms to

control image diffusion in different directions [44]. The PDE-based approaches have

been extended to filtering of color images [45] and chromaticity vector fields [46]. Other

popular approaches to image filtering include adaptive smoothing [47] and kernel

density estimation based algorithms [48]. All these methods produce some sort of

weighted average over an image neighborhood for the purpose of data smoothing,

where the weights are obtained from the difference between the intensity values of the

central pixel and the pixels in the neighborhood, or from the pixel gradient magnitudes.

Beyond this, techniques such as bilateral filtering [49] produce a weighted combination

that is also influenced by the relative location of the central pixel and the neighborhood

pixels. The highly popular mean-shift procedure [50], [51] is grounded in similar ideas

as bilateral filtering, with the addition that the neighborhood around a pixel is allowed

to change dynamically until a convergence criterion is met. The authors prove that this

convergence criterion is equivalent to finding the mode of a local density built jointly on

the spatial parameters (image domain) and the intensity parameters (image range).

In this chapter, we present a new approach to data filtering that is rooted in simple

yet elegant geometric intuitions. At the core of our theory is the representation of an

image as a function that is at least C0 continuous everywhere. A key property of the

image level sets is used to drive the diffusion process, which we then incorporate in a


framework of dynamic neighborhoods a la mean-shift. We demonstrate the relationship

of our method to many of the existing filtering techniques such as those driven by

kernel density estimation. The efficacy of our approach is supported with extensive

experimental results. To the best of our knowledge, ours is the first attempt to explicitly

utilize image geometry (in terms of its level curves) for this particular application.

This chapter is organized as follows. Section 2 presents the key theoretical

framework. Section 3 presents extensions to our theory. In section 4, we present

the relationship between our method and mean-shift. Extensive experimental results are

presented in section 5, and we present further discussions and conclusions in section 6.

All or most of the material contained in this chapter has been previously published by the

author in [52].¹

4.2 Theory

Consider an image over a discrete domain Ω = {1, ... ,H} × {1, ... ,W} where

the intensity of each discrete location (x , y) is given by I (x , y). Moreover consider

a neighborhood N (xi , yi) around the pixel (xi , yi). It is well-known that a simple

averaging of all intensity values in N (xi , yi) will blur edges, so a weighted combination is

calculated, where the weight of the j-th pixel is given by w^{(1)}(x_j, y_j) = g(|I(x_i, y_i) − I(x_j, y_j)|)

for a non-increasing function g(·) to facilitate anisotropic diffusion, with common

examples being g(z) = e^{−z²/σ²} or g(z) = σ²/(σ² + z²), or their truncated versions. This approach

is akin to the kernel density estimation (KDE) approach proposed in [48], where the

1 Parts of the content of this chapter have been reprinted with permission from: A.Rajwade, A. Banerjee and A. Rangarajan, ‘Image Filtering by Level Curves’, EnergyMinimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2009,pages 359-372. c©2009, Springer Verlag.


filtered value of the central pixel is calculated as:

\hat{I}(x_i, y_i) = \frac{\sum_{(x_j, y_j) \in N(x_i, y_i)} I(x_j, y_j)\, K(I(x_j, y_j) - I(x_i, y_i); W_r)}{\sum_{(x_j, y_j) \in N(x_i, y_i)} K(I(x_j, y_j) - I(x_i, y_i); W_r)}.   (4–1)

Here the kernel K centered at I(x_i, y_i) (and parameterized by W_r) is related to the function g and determines the weights. The major limitations of the kernel-based approach to anisotropic diffusion are that the entire procedure is sensitive to the parameter W_r and to the size of the neighborhood, and might suffer from a small-sample-size problem. Furthermore, in a discrete implementation, for any neighborhood size

larger than 3 × 3, the procedure depends only on the actual pixel values and does not

account for any gradient information, whereas in a filtering application, it is desirable

to place greater importance on those regions of the neighborhood where the gradient

values are lower.
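A direct discrete implementation of the weighted average in Equation 4–1 looks as follows (a sketch with a Gaussian choice of kernel; the neighborhood radius and σ are illustrative, and intensities are assumed to lie in [0, 1]):

```python
import numpy as np

def kde_filter(img, radius=2, sigma=0.1):
    """Kernel-weighted neighborhood average: each pixel is replaced by a
    weighted mean of its neighborhood, with weight g(z) = exp(-z^2 / sigma^2)
    on the intensity difference z, so dissimilar pixels contribute little."""
    out = img.astype(float).copy()
    h, w = img.shape
    for x in range(h):
        for y in range(w):
            x0, x1 = max(0, x - radius), min(h, x + radius + 1)
            y0, y1 = max(0, y - radius), min(w, y + radius + 1)
            nbhd = img[x0:x1, y0:y1].astype(float)
            wts = np.exp(-((nbhd - img[x, y]) ** 2) / sigma ** 2)
            out[x, y] = (wts * nbhd).sum() / wts.sum()
    return out
```

With a small σ, a sharp step edge is left essentially untouched, since pixels from across the edge receive near-zero weight; this is exactly the edge-preserving behavior the stopping term g is meant to provide.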

Now consider that the image is treated as a continuous function I (x , y) of the spatial

variables, by interpolating in between the pixel values. The earlier discrete average is

replaced by the following continuous average to update the value at (xi , yi):

\hat{I}(x_i, y_i) = \frac{\int\int_{N(x_i, y_i)} I(x, y)\, g(|I(x, y) - I(x_i, y_i)|)\, dx\, dy}{\int\int_{N(x_i, y_i)} g(|I(x, y) - I(x_i, y_i)|)\, dx\, dy}.   (4–2)

The above formula is usually not available in closed form. We now show a principled

approximation to this formula, by resorting to geometric intuition. Imagine a contour map

of this image, with multiple iso-intensity level curves C_m = {(x, y) | I(x, y) = α_m} (referred

to henceforth as ‘level curves’) separated by an intensity spacing of ∆. Consider a

portion of this contour map in a small neighborhood centered around the point (xi , yi)

(see Figure 4-1A). Those regions where the level curves (separated by a fixed intensity

spacing) are closely packed together correspond to the higher-gradient regions of the

neighborhood, whereas in lower-gradient regions of the image, the level curves lie


far away from one another. Now as seen in Figure 4-1A, this contour map induces a

tessellation of the neighborhood into some K facets, where each facet corresponds to a

region in between two level curves of intensity αm and αm + ∆, bounded by the rim of the

neighborhood. Let the area ak of the k th facet of this tessellation be denoted as ak . Now,

if we make ∆ sufficiently small, we can regard even the facets from high-gradient regions

as having constant intensity value Ik = αm. This now leads to the following weighted

average in which the weighting function has a very clean geometric interpretation, unlike

the arbitrary choice for w (1) in the previous technique:

I(x_i, y_i) = [Σ_{k=1}^{K} a_k I_k g(|I_k − I(x_i, y_i)|)] / [Σ_{k=1}^{K} a_k g(|I_k − I(x_i, y_i)|)].   (4–3)

As the number of facets is typically much larger than the number of pixels, and given

the fact that the facets have arisen from a locally smooth interpolation method to obtain

a continuous function from the original digital pixel values, we now have a more robust

average than that provided by Equation 4–1. To introduce anisotropy, we still require the

stopping term g(|Ik − I (xi , yi)|) to prevent smearing across the edge, just as in Equation

4–1.
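As a concrete illustration, the facet-weighted average of Equation 4–3 can be sketched as follows. The facet areas a_k and intensities I_k are assumed to come from a level-curve tessellation computed elsewhere, and the Gaussian profile for the stopping function g is an illustrative choice (the derivation does not fix a particular g):

```python
import numpy as np

def facet_weighted_average(areas, intensities, center_val, sigma_r=10.0):
    """Weighted average of Equation 4-3: each facet of the level-curve
    tessellation contributes its (approximately constant) intensity I_k,
    weighted by its area a_k and a stopping term g(|I_k - I(x_i, y_i)|).
    A Gaussian stopping function is assumed here for illustration."""
    areas = np.asarray(areas, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    g = np.exp(-((intensities - center_val) ** 2) / (2.0 * sigma_r ** 2))
    w = areas * g
    return np.sum(w * intensities) / np.sum(w)

# Facets near an edge: the three facets close to the central value dominate;
# the very different (I_k = 200) facet is suppressed by the stopping term.
val = facet_weighted_average(
    areas=[0.4, 0.3, 0.2, 0.1],
    intensities=[100.0, 102.0, 98.0, 200.0],
    center_val=100.0,
)
```

Note that large-area (low-gradient) facets automatically dominate the average, which is exactly the geometric weighting the derivation above motivates.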

Equation 4–2 essentially performs an integration of the intensity function over the

domain N (xi , yi). If we now perform a change of variables transforming the integral on

(x , y) to an integral over the range of the image, we obtain the expression

I(x_i, y_i) = [∬_{N(x_i,y_i)} I(x, y) w^(1)(x, y) dx dy] / [∬_{N(x_i,y_i)} w^(1)(x, y) dx dy]

= [∫_{q=q_1}^{q_2} ∫_{C(q)} q g(|q − I(x_i, y_i)|) / |∇I| dl dq] / [∫_{q=q_1}^{q_2} ∫_{C(q)} g(|q − I(x_i, y_i)|) / |∇I| dl dq]

= lim_{∆→0} [Σ_{α=q_1}^{q_2} ∫_{q=α}^{α+∆} ∫_{C(q)} q g(|q − I(x_i, y_i)|) / |∇I| dl dq] / [Σ_{α=q_1}^{q_2} ∫_{q=α}^{α+∆} ∫_{C(q)} g(|q − I(x_i, y_i)|) / |∇I| dl dq]   (4–4)


where C(q) = N(x_i, y_i) ∩ f⁻¹(q), q_1 = inf{I(x, y) | (x, y) ∈ N(x_i, y_i)}, q_2 =
sup{I(x, y) | (x, y) ∈ N(x_i, y_i)}, and l denotes the arc-length parameter along the curve f⁻¹(q).

This approach is inspired by the smooth co-area formula for regular functions [53] which

is given as

∫_Ω φ(u) |∇u| dx dy = ∫_{−∞}^{+∞} Length(γ_q) φ(q) dq   (4–5)

where γq is the level set of u at the intensity q and φ(u) represents a function of u.

Note that the term ∫_{q=α}^{α+∆} ∫_{C(q)} dl dq / |∇I| in Equation 4–4 actually represents the area in

N (xi , yi) that is trapped between two contours whose intensity value differs by ∆. Our

work described in the previous chapters considers this quantity when normalized

by |Ω| to be actually equal to the probability that the intensity value lies in the range

[α,α + ∆]. Bearing this in mind, Equation 4–3 now acquires the following probabilistic

interpretation:

I(x_i, y_i) = [Σ_{α=q_1}^{q_2} Pr(α < I < α + ∆ | N) α g(|α − I(x_i, y_i)|)] / [Σ_{α=q_1}^{q_2} Pr(α < I < α + ∆ | N) g(|α − I(x_i, y_i)|)].   (4–6)

As ∆→ 0, this produces an increasingly better approximation to Equation 4–2.

It should be pointed out that there exist methods such as adaptive filtering [47], [54]

in which the weights in Equation 4–1 are obtained as w^(2)(x_j, y_j) = g(|∇I(x_j, y_j)|). These

methods place more importance on the lower-gradient pixels of the neighborhood, but

do not exploit level curve relationships in the way we do, and the choice of the weighting

function does not have the geometric interpretation that exists in our technique.

Moreover the original formulation in [47] was designed for 3 × 3 neighborhoods. For

larger neighborhoods, the gradient-based terms will have to be augmented with an

intensity-based term to prevent blurring across edges.

There also exists an extension to the standard neighborhood filter in Equation

4–1 reported in [55], which performs a weighted least squares polynomial fit to the


intensity values (of the pixels) in the neighborhood of a location (x , y). The value of

this polynomial at (x , y) is then considered to be the smoothed intensity value. This

technique differs from the one we present here in two fundamental ways. Unlike our

method, it does not use areas between level sets as weights to explicitly perform a

weighted averaging. Secondly, as proved in [55], its limiting behavior when W_r → 0 and
|N(x, y)| → 0 resembles that of the geometric heat equation with a linear polynomial,

and resembles higher order PDEs when the degree of the polynomial is increased. Our

method is the true continuous form of the KDE-based filter from Equation 4–1. This

KDE-based filter behaves like the Perona-Malik filter, as proved in [55].

4.3 Extensions of Our Theory

4.3.1 Color Images

We now extend our technique to color (RGB) images. Consider a color image

defined as I (x , y) = (R(x , y),G(x , y),B(x , y)) : Ω → R3 where Ω ⊂ R2. In color

images, there is no concept of a single iso-contour with constant values of all three

channels. Hence it is more sensible to consider an overlay of the individual iso-contours

of the R, G and B channels. The facets are now induced by a tessellation involving the

intersection of three iso-contour sets within a neighborhood, as shown in Figure 4-1B.

Each facet represents those portions of the neighborhood for which α_R < R(x, y) <
α_R + ∆_R, α_G < G(x, y) < α_G + ∆_G, α_B < B(x, y) < α_B + ∆_B. The probabilistic

interpretation for the update on the R,G,B values is as follows

(R(x_i, y_i), G(x_i, y_i), B(x_i, y_i)) = [Σ_{β⃗} Pr(β⃗ < (R, G, B) < β⃗ + ∆⃗ | N) β⃗ g(R, G, B)] / [Σ_{β⃗} Pr(β⃗ < (R, G, B) < β⃗ + ∆⃗ | N) g(R, G, B)]

where β⃗ = (α_R, α_G, α_B), ∆⃗ = (∆_R, ∆_G, ∆_B) and g(R, G, B) = g(|R − R(x_i, y_i)| + |G − G(x_i, y_i)| + |B − B(x_i, y_i)|). Note that in this case, I(x, y) is a function from a subset of

R2 to R3, and hence the three-dimensional joint density is ill-defined in the sense that

it is defined strictly on a 2D subspace of R3. However given that the implementation


considers joint cumulative interval measures, this does not pose any problem in a

practical implementation. We wish to emphasize that the averaging of the R,G,B values

is performed in a strictly coupled manner, all affected by the joint cumulative interval

measure.
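A sketch of this coupled averaging, assuming the facets and their (approximately constant) R, G, B values are given. The single joint weight per facet, built from the facet area and the stopping term g on the summed channel differences, scales all three channels together; the Gaussian profile for g and the particular facet data are illustrative assumptions:

```python
import numpy as np

def rgb_facet_average(areas, colors, center_rgb, sigma_r=15.0):
    """Coupled RGB update: one scalar weight per facet, derived from the
    facet area and g(|R-Rc| + |G-Gc| + |B-Bc|), scales all three channels
    together, so the channels are never averaged independently."""
    areas = np.asarray(areas, dtype=float)
    colors = np.asarray(colors, dtype=float)      # shape (K, 3)
    center = np.asarray(center_rgb, dtype=float)
    dist = np.abs(colors - center).sum(axis=1)    # summed channel differences
    w = areas * np.exp(-dist ** 2 / (2.0 * sigma_r ** 2))
    return (w[:, None] * colors).sum(axis=0) / w.sum()

# Two facets close to the center color dominate; the bright outlier facet
# is suppressed jointly in all three channels by the single weight.
out = rgb_facet_average(
    areas=[0.5, 0.3, 0.2],
    colors=[[120, 60, 30], [122, 62, 31], [250, 250, 250]],
    center_rgb=[121, 61, 30],
)
```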

4.3.2 Chromaticity Fields

Previous research on filtering chromaticity noise (which affects only the direction

and not the magnitude of the RGB values at image pixels) includes the work in [46]

using PDEs specially tuned for unit-vector data, and the work in [48] (page 142) using

kernel density estimation for directional data. The more recent work on chromaticity

filtering in [56] actually treats chromaticity vectors as points on a Grassmann manifold

G1,3 as opposed to treating them as points on S2, which is the approach presented here

and in [48] and [46].

We extend our theory from the previous section to unit vector data and incorporate

it in a mean-shift framework for smoothing. Let I (x , y) : Ω → R3 be the original

RGB image, and let J(x , y) : Ω → S2 be the corresponding field of chromaticity

vectors. A possible approach would involve interpolating the chromaticity vectors

by means of commonly used spherical interpolants to create a continuous function,

followed by tracing the level curves of the individual unit-vector components ~v(x , y) =

(v1(x , y), v2(x , y), v3(x , y)) and computing their intersection. However for ease of

implementation for this particular application, we resorted to a different strategy. If the

intensity intervals ~∆ = (∆R , ∆G , ∆B) are chosen to be fine enough, then each facet

induced by a tessellation that uses the level curves of the R, G and B channel values,

can be regarded as having a constant color value, and hence the chromaticity vector

values within that facet can be regarded as (almost) constant. Therefore it is possible

to use just the R,G,B level curves for the task of chromaticity smoothing as well. The

update equation is very similar to Equation 4–7 with the R,G,B vectors replaced by their

unit normalized versions. However as the averaging process does not preserve the unit


norm, the averaged vector needs to be renormalized to produce the spherical weighted

mean.
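The renormalization step can be sketched as follows. The weights would come from the facet areas and the chromaticity stopping term in practice, but are arbitrary positive numbers here; the arc-distance helper is included because the great-circle distance is the natural metric on S² for chromaticity vectors:

```python
import numpy as np

def spherical_weighted_mean(unit_vecs, weights):
    """Weighted Euclidean average of unit vectors, renormalised back onto
    the sphere: plain averaging does not preserve unit norm, so the result
    is projected back to S^2 to obtain the spherical weighted mean."""
    unit_vecs = np.asarray(unit_vecs, dtype=float)
    w = np.asarray(weights, dtype=float)
    m = (w[:, None] * unit_vecs).sum(axis=0)
    return m / np.linalg.norm(m)

def arc_distance(v1, v2):
    """Great-circle (geodesic) distance between two unit vectors."""
    return np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
m = spherical_weighted_mean([v1, v2], [0.7, 0.3])
```

The mean pulls toward the more heavily weighted direction while remaining a valid chromaticity vector of unit norm.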

4.3.3 Gray-scale Video

For the purpose of this application, the video is treated as a single 3D signal

(volume). The extension in this case is quite straightforward, with the areas between

level curves being replaced by volumes between the level surfaces at nearby intensities.

However we take into account the causality factor in defining the temporal component of

the neighborhood around a pixel, by performing the averaging at each pixel over frames

only from the past.

4.4 Level Curve Based Filtering in a Mean Shift Framework

All the above techniques are based on an averaging operation over only the image

intensities (i.e. in the range domain). On the other hand, techniques such as bilateral

filtering [49] or local mode-finding [57] combine both range and spatial domain, thus

using weights of the form w_j = g^(s)((x_i − x_j)² + (y_i − y_j)²) g^(r)(|I(x_i, y_i) − I(x_j, y_j)|) in
Equation 4–1, where g^(s) and g^(r) affect the spatial and range kernels respectively. The

mean-shift framework [51] is based on similar principles, but changes the filter window

dynamically for several iterations until it finds a local mode of the joint density of the

spatial and range parameters, estimated using kernels based on the functions g(r) and

g(s). Our level curve based approach fits easily into this framework with the addition

of a spatial kernel. One way to do this would be to consider the image as a surface

embedded in 3D (a Monge patch), as done in [58], and compute areas of patches in

3D for the probability values. However such an approach may not necessarily favor

the lower gradient areas of the image. Instead we adopt another method wherein we

assume two additional functions of x and y , namely X (x , y) = x and Y (x , y) = y . We

compute the joint probabilities for a range of values of the joint variable (X ,Y , I ) by

drawing local level sets and computing areas in 2D. Assuming a uniform spatial kernel

for g^(s) within a radius W_s and a rectangular kernel on the intensity for g^(r) with threshold
value W_r (though our core theory is unaffected by other choices), we now perform the

averaging update on the vector (X (x , y),Y (x , y), I (x , y)), as opposed to merely on

I (x , y) as was done in Equation 4–6. This is given as:

(X(x_i, y_i), Y(x_i, y_i), I(x_i, y_i)) = [Σ_{k=1}^{K} (x_k, y_k, I_k) a_k g^(r)(|I_k − I(x_i, y_i)|)] / [Σ_{k=1}^{K} a_k g^(r)(|I_k − I(x_i, y_i)|)].   (4–7)

In the above equation, (x_k, y_k) stands for a representative point (say, the centroid) of
the k-th facet of the induced tessellation², and K is the total number of facets within
the specified spatial radius. Note that the area of the k-th facet, i.e. a_k, can also be
interpreted as the joint probability of the event {x < X(x, y) < x + ∆x, y < Y(x, y) <
y + ∆y, α < I(x, y) < α + ∆}, if we assume a uniform distribution over the spatial variables

x and y . Here ∆ is the usual intensity bin-width, (∆x , ∆y) are the pixel dimensions,

and (x , y) is a pixel grid-point. The main difference between our approach and all

the aforementioned range-spatial domain approaches is the fact that we naturally

incorporate a weight in favor of the lower-gradient areas of the filter neighborhood.

Hence the mean-shift vector in our case will have a stronger tendency to move towards

the region of the neighborhood where the local intensity change is as low as possible

(even if a uniform spatial kernel is used). Moreover just like conventional mean shift,

our iterative procedure is guaranteed to converge to a mode of the local density in

a finite number of steps, by exploiting the fact that the weights at each point (i.e. the

areas of the facets) are positive. Hence Theorem 5 of [50] can be readily invoked.

This is because in Equation 4–7, the threshold function g(r) for the intensity is the

rectangular kernel, and hence the corresponding update formula is equivalent to one

² The notion of the centroid will become clearer in Section 4.5.


with a weighted rectangular kernel, with the weights being determined by the areas of

the facets.
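A minimal sketch of this facet-based mean-shift iteration, assuming the facet centroids, intensities and areas have already been computed from the tessellation. The uniform spatial kernel of radius W_s and rectangular intensity kernel of width W_r match the choices above; the specific facet data are made up for illustration:

```python
import numpy as np

def facet_mean_shift(facets, start, Ws=3.0, Wr=10.0, max_iter=50):
    """Mean shift over the joint (x, y, I) variable in the spirit of
    Equation 4-7: facets (centroid x, y, intensity, area) replace pixels
    as samples, and the facet area acts as a positive sample weight.
    Uniform spatial kernel, rectangular intensity kernel. Returns the
    local mode of (x, y, I)."""
    F = np.asarray(facets, dtype=float)   # columns: x, y, I, area
    z = np.asarray(start, dtype=float)    # current (x, y, I) estimate
    for _ in range(max_iter):
        in_win = ((F[:, 0] - z[0]) ** 2 + (F[:, 1] - z[1]) ** 2 <= Ws ** 2) \
                 & (np.abs(F[:, 2] - z[2]) <= Wr)
        sel = F[in_win]
        if sel.size == 0:
            break
        w = sel[:, 3]                     # facet areas as positive weights
        z_new = (w[:, None] * sel[:, :3]).sum(axis=0) / w.sum()
        if np.allclose(z_new, z):         # converged to a local mode
            break
        z = z_new
    return z

# Two intensity populations: the shift settles among the large-area
# (low-gradient) facets near intensity 100 and never crosses to 200.
facets = [(0.0, 0.0, 100.0, 0.9), (1.0, 0.0, 103.0, 0.8),
          (0.5, 1.0, 98.0, 0.7), (2.5, 0.5, 200.0, 0.2)]
mode = facet_mean_shift(facets, start=(0.5, 0.5, 101.0))
```

Because every weight (facet area) is positive and the intensity kernel is rectangular, the iteration behaves like weighted rectangular-kernel mean shift, which is what allows Theorem 5 of [50] to be invoked.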

A major advantage of our technique is that the parameter ∆ can be set to as small

a value as desired (as it just means that more and more level curves are being used),

and the interpolation gives rise to a robust average. This is especially useful in the case

of small neighborhood sizes, as the intensity quantization is now no more limited by the

number of available pixels. In conventional mean-shift, the proper choice of bandwidth

is a highly critical issue, as very few samples are available for the local density estimate.

Though variable bandwidth procedures for mean-shift algorithms have been developed

extensively, they themselves require either the tuning of other parameters using rules of

thumb, or else some expensive exhaustive searches for the automatic determination of

the bandwidth [59], [60]. Although our method does require the selection of W_s and W_r,

the filtering results are less sensitive to the choice of these parameters in our method

than in standard mean shift.

4.5 Experimental Results

In this section we present experimental results to compare the performance of our

algorithm in a mean shift framework w.r.t. conventional kernel-based mean shift. For our

algorithm, we obtain a continuous function approximation to the digital image, by means

of piecewise linear interpolants fit to a triple of intensity values in half-pixels of the image

(in principle, we could have used any other smooth interpolant). The corresponding level

sets for such a function are also very easy to trace, as they are just segments within

each half-pixel. The level sets induce a polygonal tessellation. We choose to split the

polygons by the square pixel boundaries as well as the pixel diagonals that delineate the

half-pixel boundaries, thereby convexifying all the polygons that were initially non-convex

(see Figure 4-1C). Each polygon in the tessellation can now be characterized by the x , y

coordinates of its centroid, the intensity value of the image at the centroid, and the area

of the polygon. Thus, if the intensity value at grid location xi , yi is to be smoothed, we


choose a window of spatial radius W_s and intensity radius W_r around (x_i, y_i, I(x_i, y_i)),
over which the averaging is performed. In other words, the averaging is performed only
over those locations (x, y) for which (x − x_i)² + (y − y_i)² < W_s² and |I(x, y) − I(x_i, y_i)| < W_r.

We would like to point out that though the interpolant used for creating the continuous

image representation is indeed isotropic in nature, this still does not make our filtering

algorithm isotropic. This is because polygonal regions, whose intensity value does not

satisfy the constraint |I (x , y)− I (xi , yi)| < Wr , do not contribute to the averaging process

(see the stopping term in Equation 4–3), and hence the contribution from pixels with

very different intensity values will be nullified.
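The level-set tracing inside a triangular half-pixel described above can be sketched as follows. With a linear interpolant on each triangle, every level curve restricted to the half-pixel is a straight segment whose endpoints lie on the edges that the level crosses, found by linear interpolation along those edges. The function name and the omission of the degenerate case where the level passes exactly through a vertex are illustrative simplifications:

```python
import numpy as np

def level_segment(tri_xy, tri_vals, alpha):
    """Trace the iso-intensity segment I(x, y) = alpha inside one
    triangular half-pixel carrying a linear interpolant. Returns the two
    segment endpoints, or None if the level does not cross the triangle."""
    pts = []
    for i in range(3):
        j = (i + 1) % 3
        v0, v1 = tri_vals[i], tri_vals[j]
        if (v0 - alpha) * (v1 - alpha) < 0:        # edge crosses the level
            t = (alpha - v0) / (v1 - v0)           # linear interpolation
            p = (1 - t) * np.asarray(tri_xy[i], float) \
                + t * np.asarray(tri_xy[j], float)
            pts.append(p)
    return (pts[0], pts[1]) if len(pts) == 2 else None

# Lower-left half-pixel of a unit pixel with corner intensities 0, 10, 10:
# the level curve at alpha = 5 crosses two of the three edges.
seg = level_segment([(0, 0), (1, 0), (0, 1)], [0.0, 10.0, 10.0], alpha=5.0)
```

Collecting such segments over all half-pixels, together with the pixel boundaries and diagonals, yields the convex polygonal facets whose centroids, intensities and areas feed the averaging update.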

4.5.1 Gray-scale Images

We ran our filtering algorithm over four arbitrarily chosen images from the popular

Berkeley image dataset [61], and the Lena image. To all these images, zero mean

Gaussian noise of variance 0.003 (per unit gray-scale range) was added. The filtering

was performed using W_s = W_r = 3 for our algorithm and compared to mean-shift

using Gaussian and Epanechnikov kernels with the same parameter. Our method

produced superior filtering results to conventional mean shift with both Gaussian and

Epanechnikov kernels. The results for our method and for Gaussian kernel mean shift

are displayed in Figures 4-2 and 4-3 respectively. The visually superior appearance

was confirmed objectively with mean squared error (MSE) values in Table 4-1. It should

be noted that the aim was to compare our method to standard mean shift for the exact

same setting of the parametersWr andWs , as they have the same meaning in all these

algorithms. Although increasing the value ofWr will provide more samples for averaging,

this will allow more and more intensity values to leak across edges.

4.5.2 Testing on a Benchmark Dataset of Gray-scale Images

Further empirical results with our algorithm (using W_s = W_r = 5) were obtained

on Lansel’s benchmark dataset [62]. The dataset contains noisy versions of 13 different

images. Each noisy image is obtained from one of three noise models: additive


Gaussian, Poisson, and multiplicative noise model, for one of five different values of

the noise standard deviation σ ∈ {5/255, 10/255, 15/255, 20/255, 25/255}, leading to a total of 195 images.

We report denoising results on all these images without tweaking any parameters

depending on the noise model (we chose W_r = W_s = 5 for all images at all noise levels).

The average MSE and MSSIM (an image quality metric defined in [63]) are shown in

the plots in Figure 4-6. We have also displayed the denoised versions of a
fingerprint image from this dataset under three different values of σ for additive noise in
Figure 4-5.

4.5.3 Experiments with Color Images

Similar experiments were run on colored versions of the same four images from the

Berkeley dataset [61]. The original images were degraded by zero mean Gaussian noise

of variance 0.003 (per unit intensity range), added independently to the R,G,B channels.

For our method, independent interpolation was performed on each channel and the joint

densities were computed as described in the previous sections. Level sets at intensity

gaps of ∆R = ∆G = ∆B = 1 were traced in every half pixel. Experimental results

were compared with conventional mean shift using a Gaussian kernel. The parameters

chosen for both algorithms were W_s = W_r = 6. Despite the documented advantages of

color spaces such as Lab [48], all experiments were performed in the R,G,B space for

the sake of simplicity, and also because many well-known color denoising techniques

operate in this space [45]. As seen in Figures 4-7, 4-8 and Table 4-2, our method

produced better results than Gaussian kernel mean shift for the chosen parameter

values.

4.5.4 Experiments with Chromaticity Vectors and Video

Two color images were synthetically corrupted with chromaticity noise altering just

the direction of the color-triple vector. These images are shown in Figures 4-9 and 4-10.

These images were filtered using our method and Gaussian kernel mean shift with a

spatial window of size W_s = 4 and a chromaticity threshold of W_r = 0.1 radians. Note


that in this case, the distance between two chromaticity vectors ~v1 and ~v2 is defined to be

the length of the arc between the two vectors along the great circle joining them, which

turns out to be θ = cos⁻¹(v⃗₁ᵀ v⃗₂). The specific expression for the joint spatial-chromaticity
density using the Gaussian kernel was exp(−((x − x_i)² + (y − y_i)²) / (2W_s²)) · exp(−θ² / (2W_r²)). The filtered images using

both methods are shown in Figures 4-9 and 4-10. Despite the visual similarity of the

output, our method produced a mean-squared error of 378 and 980.8, as opposed to

534.9 and 1030.7 for Gaussian kernel mean shift.

We also performed an experiment on video denoising using the David sequence

obtained from http://www.cs.utoronto.ca/~dross/ivt/. The first 100 frames from the

sequence were extracted and artificially degraded with zero mean Gaussian noise of

variance 0.006. Two frames of the corrupted and denoised (using our method) sequence

are shown in Figure 4-11, along with a temporal slice through the entire video sequence

(for the tenth row of each frame). For this experiment, the value of ∆ was set to 8 in our

method.

4.6 Discussion

We have presented a new method for image denoising, whose principle is rooted

in the notion that the lower-gradient portions of an image inside a neighborhood

around a pixel should contribute more to the smoothing process. The geometry of

the image level sets (and the fact that the spatial distance between level sets is inversely

proportional to the gradient magnitudes) is the driving force behind our algorithm.

We have linked our approach to existing probability-density based approaches, and

our method has the advantage of robust decoupling of the edge definition parameter

from the density estimate. In some sense, our method can be viewed as a continuous

version of mean-shift. It should be noted that a modification to standard mean-shift

based on simple image up-sampling using interpolation will be an approximation to

our area-based method (given the same interpolant). We have performed extensive

experiments on gray-scale and color images, chromaticity fields and video sequences.


To the best of our knowledge, ours is the first piece of work on denoising which explicitly

incorporates the relationship between image level curves and uses local interpolation
between pixel values in order to perform filtering. Future work will involve a

more detailed investigation into the relationship between our work and that in [58], by

computing the areas of the contributing regions with explicit treatment of the image

I (x , y) as a surface embedded in 3D. Secondly, we also plan to develop topologically

inspired criteria to automate the choice of the spatial neighborhood and the parameter

Wr for controlling the anisotropic smoothing.

It should be noted that the main aim of this chapter was to demonstrate the effect

of using interpolant information for denoising. Our contributions lie within the mean

shift framework, and therefore we have performed comparisons with other methods

that lie within this framework. For this reason, we have not performed experimental

comparisons with some leading local convolution approaches like [64] or [65].

Figure 4-1. Image contour maps in a neighborhood: (A) high and low gradient regions in a neighborhood around a pixel (dark dot); in the low-gradient region there is a large gap between level sets, while in the higher-gradient region the level sets are closely packed; (B) a contour map of an RGB image in a neighborhood; red, green and blue contours correspond to contours of the R, G, B channels respectively, and the tessellation induced by these level-curve pairs contains 19 facets; (C) a tessellation induced by RGB level-curve pairs and the square pixel grid


Table 4-1. MSE for filtered images using (M1) our method with W_s = W_r = 3, (M2) mean shift with Gaussian kernels with W_s = W_r = 3, and (M3) mean shift with Gaussian kernels with W_s = W_r = 5. MSE = mean-squared error in the corrupted image. Intensity scale is from 0 to 255.

Image   M1      M2      M3       MSE
1       110.95  176.57  151.27   181.27
2       53.85   170.18  106.32   193.5
3       106.64  185.15  148.379  191.76
4       113.8   184.77  153.577  190
Lena    78.42   184.16  128.04   194.82

Table 4-2. MSE for filtered images using (M1) our method with W_s = W_r = 6, (M2) mean shift with Gaussian kernels with W_s = W_r = 6, and (M3) mean shift with Epanechnikov kernels with W_s = W_r = 6. MSE = mean-squared error in the corrupted image. Intensity scale is from 0 to 255 for each channel.

Image   M1      M2      M3      MSE
1       319.88  496.7   547.9   572.54
2       354.76  488.7   543.4   568.69
3       129.12  422.79  525.48  584.24
4       306.14  477.25  526.8   547.9


Figure 4-2. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: results obtained by our algorithm, and bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the PDF file


Figure 4-3. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: results obtained by our algorithm, and bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the PDF file


Figure 4-4. Top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: results obtained by our algorithm, and bottom right: mean shift with Gaussian kernel. For both methods, W_s = W_r = 3; viewed best when zoomed in the PDF file


Figure 4-5. (A), (C) and (E): Fingerprint image subjected to additive Gaussian noise of std. dev. σ = 5/255, 10/255 and 15/255 respectively. (B), (D) and (F): Denoised versions of (A), (C) and (E) respectively. Viewed best when zoomed in the PDF file (in color).


[Plots: MSE (top) and MSSIM (bottom) versus σ, comparing our method (spatial and intensity windows of 5) against the noisy input under AWGN, MWGN and Poisson noise.]

Figure 4-6. A plot of the performance of our algorithm on the benchmark dataset, averaged over all images from each noise model (additive Gaussian (AWGN), multiplicative Gaussian (MWGN) and Poisson) and over all five σ values, using MSE (top) and MSSIM (bottom) as the metric; viewed best when zoomed in the PDF file (in color)


Figure 4-7. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: results obtained by our algorithm, and bottom right: mean shift with Gaussian kernel; for both methods, W_s = W_r = 3; viewed best when zoomed in the PDF file


Figure 4-8. For each image, top left: original image, top right: degraded image with zero mean Gaussian noise of std. dev. 0.003, bottom left: results obtained by our algorithm, and bottom right: mean shift with Gaussian kernel; for both methods, W_s = W_r = 3; viewed best when zoomed in the PDF file


Figure 4-9. An image and its corrupted version obtained by adding chromaticity noise (top left and top right respectively). Results obtained by filtering with our method (bottom left), and with Gaussian mean shift (bottom right); viewed best when zoomed in the PDF file (in color)


Figure 4-10. An image and its corrupted version obtained by adding chromaticity noise (top left and top right respectively). Results obtained by filtering with our method (bottom left), and with Gaussian mean shift (bottom right); viewed best when zoomed in the PDF file (in color)


Figure 4-11. First two images: frames from the corrupted sequence. Third and fourth: images filtered by our algorithm. Fifth and sixth images: a slice through the tenth row of the corrupted and filtered video sequences; images are numbered left to right, top to bottom


CHAPTER 5
A RELATED PROBLEM: DIRECTIONAL STATISTICS IN EUCLIDEAN SPACE

5.1 Introduction

When the samples do not reside in Euclidean space, conventional density

estimation techniques such as mixtures of Gaussians or kernel density estimation

(KDE) using Gaussian kernels cannot be applied directly. For the special case when the
data reside on Sⁿ, i.e. the unit sphere embedded in Rⁿ⁺¹, there exists extensive literature from

the field of directional statistics that is summarized in several exemplary books such as

[66]. Conventionally, for KDE or mixture model density estimation of unit vectors, the

Gaussian kernel has been replaced by von Mises or von Mises-Fisher (voMF) kernels

for circular and spherical data respectively. These computational techniques have

been applied for solving numerous problems in computer vision, image processing,

medical imaging and computer graphics. Mixture modeling for directional data was

proposed originally by Kim et al. [67]. Banerjee et al. [68] also proposed a mixture

model for directional data and applied it for clustering problems. In medical imaging,

McGraw et al. [69] have modeled the displacement of water molecules in high angular

resolution diffusion images by means of a voMF mixture model. More recently, mixture

models of circular data have also been used for trajectory shape analysis in studying

object motion [70]. KDE of unit-vector data has been used in the context of smoothing

chromaticity vectors in color images [48]. Applications of such density estimators

in computer graphics include the work on approximation of the Torrance-Sparrow

Bidirectional Reflectance Functions (BRDF) as reported in [71], or the recent work in

[72] for approximating the distribution of surface normals. Eugeciouglu et al. [73] use a

kernel based on powers of cosines instead of a voMF in KDE, motivated by the superior

computational speed of the cosine estimator, and apply their technique for the analysis

of flow vectors in fluid mechanics.


The above techniques ignore the fact that the directional data are often obtained

as a transformation of the original measurements which are typically assumed to

reside in Euclidean space. Therefore the true probability density of the unit vector data

is related to that of the original data by means of a relationship dictated by random

variable transformations, a key concept in basic probability theory [74]. However,

a kernel density estimate or a mixture model estimate using (say) voMF kernels

ignores this very fundamental relationship. The technique proposed here exploits

exactly this relationship in the following way: (1) It performs density estimation in the

original space, and (2) It then transforms this density to the directional space using

random variable transformations. Thereby, it avoids the aforementioned inconsistency.

Secondly, conventional density estimation techniques for directional data also require

the solution of complicated nonlinear equations for key parameter updates such as

the covariance. This issue is completely circumvented by the presented technique.

A density estimator is built also for another directional quantity: hue in color images

(part of the HSI or hue-saturation-intensity color model), which is computed from a very

different transformation of the RGB color values obtained from a sensor (camera).

This chapter is organized as follows. Section 5.2 is a review on the choice of

kernel for density estimation for circular and spherical data. The drawbacks of these

approaches are enumerated and a new approach to density estimation for directional

data is introduced. This concept is extended for hue data in Section 5.3. A discussion is

presented in Section 5.4.

5.2 Theory

In this section, the theory of the new method is presented, starting with a review on

the choice of kernels for directional density estimation in contemporary vision literature.

5.2.1 Choice of Kernel

There exist a plethora of kernels used for estimating the density for unit vector

data, and the reasons for choosing one over the other require careful study. For KDE


of directional data, the voMF kernel is highly popular [69]. It has great computational

convenience because (1) it is symmetric, (2) it yields elegant closed-form formulae

for the Renyi entropy of a voMF mixture model, and for the distance between two

voMF distributions [69], and (3) the information-geometric properties of voMF mixtures

are simple [69]. Despite these algebraic properties, there are ambiguities [75] in the

oft-repeated [68] notion that the voMF is the ‘spherical analogue of the Gaussian’. The

voMF distribution does possess properties similar to a Gaussian such as those related

to maximum likelihood, and maximum differential entropy for fixed mean and variance,

besides symmetry. However, the voMF also differs from the Gaussian in the sense that

(1) the central limit theorem on the sphere does not involve the voMF but a uniform

distribution instead [75], (2) the voMF is not the solution to the isotropic heat equation

on the sphere [67] and (3) the convolution of two voMF distributions does not produce

exactly another voMF [66]. If we restrict ourselves to just the non-negative orthant of the

sphere (i.e. axial data), then the Bingham distribution also possesses many properties

similar to the Gaussian [75]. Another popular kernel for axial statistics is the Watson

distribution [76]. Some papers even consider a symmetrized version of the voMF kernel,

for instance [69]. However, the choice between Bingham, Watson and symmetrized

voMF kernels is unclear, and they will produce different density estimates for finite

sample sizes. Often, the motivation for choosing one over the other is computational

convenience, which is the chief reason behind the popularity of the voMF kernel.

5.2.2 Using Random Variable Transformation

The aforementioned density estimation techniques for directional data typically

assume that only the final unit vector data are available. However, very often in

computer vision applications, the original data are available as the output of a sensor.

These are then converted into unit vectors typically (though not always - see Section

5.3) by means of a projective transformation (unit normalization). The most common instance is that of color images, which a camera usually outputs in RGB format; the intensity triple at each pixel is then unit-normalized to produce a chromaticity vector.

Similarly, surface normals output by a 3D scanner are unit-normalized to produce the

corresponding unit vectors. KDE or mixture modeling techniques for spherical data are

applied thereafter.

We now present a new approach to density estimation for directional data that directly exploits the fact that the unit vectors are a transformation of the original data.

Consider the original data to be a random variable X with a probability density function

(PDF) p(X ). Let Y = f (X ) be a known function of X . Then the PDF of Y is given by

p(Y = y) = \int_{f^{-1}(y)} \frac{p(X = x)}{|f'(x)|}\, dx. (5–1)

Here f^{-1}(y) denotes the set of all values x such that f(x) = y. This is known as

a random variable transformation in density estimation [74], and is a very fundamental

concept in probability theory.

This principle is applied to the density estimation of unit vectors as follows. Let the original random variable in R^2 be \vec{W} with density p(\vec{W}), and let \vec{V} = \vec{W}/|\vec{W}| = g(\vec{W}) be its directional component. Clearly, \vec{V} is defined on S^1. Let \vec{w} = (x, y) be a sample of \vec{W} and \vec{v} = \vec{w}/|\vec{w}| be the directional component of \vec{w}. Let the polar coordinate representation of \vec{w} be (r, \theta). Now, the joint density of (r, \theta) is given by

p(r, \theta) = \frac{p(x, y)}{|\partial(r, \theta)/\partial(x, y)|} = r\, p(x, y). (5–2)

By integrating out the radius, we have the density of θ, i.e. the density of the unit-vector

\vec{v} = \vec{w}/|\vec{w}|, as follows:

p(\vec{V} = \vec{v}) = \int_{r=0}^{\infty} p(r, \theta)\, dr = \int_{r=0}^{\infty} r\, p(x, y)\, dr. (5–3)

If ~w is a sample from an isotropic Gaussian distribution of variance σ2 and centered at

(0, 0), then it follows that

p(\vec{V} = \vec{v}) = \frac{1}{2\pi\sigma^2} \int_{r=0}^{\infty} r\, e^{-r^2/(2\sigma^2)}\, dr = \frac{1}{2\pi}. (5–4)

If ~w is a sample from an isotropic Gaussian distribution of variance σ2 and centered at

(x0, y0), then it follows that

p(\vec{V} = \vec{v}) = \frac{1}{2\pi\sigma^2} \int_{r=0}^{\infty} r\, e^{-(r^2 + r_0^2 - 2 r r_0 \cos(\theta - \theta_0))/(2\sigma^2)}\, dr (5–5)

where (r0, θ0) is a polar coordinate representation for (x0, y0). Upon simplification, we

have:

p(\vec{v}) = \frac{1}{2\pi\sigma^2}\left[\sigma^2 \exp\left(-\frac{r_0^2}{2\sigma^2}\right) + \sigma\sqrt{\frac{\pi}{2}}\, r_0 \cos(\theta - \theta_0)\left(1 + \mathrm{erf}\left(\frac{r_0\cos(\theta - \theta_0)}{\sigma\sqrt{2}}\right)\right)\exp\left(-\frac{r_0^2\sin^2(\theta - \theta_0)}{2\sigma^2}\right)\right]. (5–6)

As seen from the previous equations, a random variable transformation of a vector-valued

Gaussian random variable followed by marginalization over the magnitude component

does not yield a von-Mises distribution. In fact, the von-Mises is obtained by conditioning the value of r to some constant (typically r = 1) as opposed to integrating over r (see pages 107-108 of [5], and [77]), and therefore represents a conditional, not a marginal, density. The density in Equation 5–6 above is known in the statistics literature as the projected normal distribution [78] or angular Gaussian distribution [66]; however, to the best of this author's knowledge, it has not been introduced to the computer vision community so far. Furthermore, it has not been employed in a

KDE or mixture modeling framework so far (see Section 5.2.3 and Section 5.2.4).
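The closed form in Equation 5–6 is easy to verify numerically against the marginalization in Equation 5–5. A minimal sketch (Python with NumPy/SciPy; the function name and the chosen mean and bandwidth are illustrative):

```python
import numpy as np
from scipy.special import erf

def trapz(y, x):
    """Simple trapezoidal rule (kept local to avoid NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def projected_normal_s1(theta, mu, sigma):
    """Equation 5-6: density on S^1 of the direction of a 2D Gaussian
    N(mu, sigma^2 I); (r0, theta0) is the polar form of mu."""
    r0 = np.hypot(mu[0], mu[1])
    theta0 = np.arctan2(mu[1], mu[0])
    c, s = np.cos(theta - theta0), np.sin(theta - theta0)
    term1 = sigma**2 * np.exp(-r0**2 / (2 * sigma**2))
    term2 = (sigma * np.sqrt(np.pi / 2) * r0 * c
             * (1 + erf(r0 * c / (sigma * np.sqrt(2))))
             * np.exp(-(r0 * s)**2 / (2 * sigma**2)))
    return (term1 + term2) / (2 * np.pi * sigma**2)

mu, sigma = np.array([1.0, 0.5]), 0.8
# Check against direct numerical marginalization over r (Equation 5-5)
theta = 0.3
r = np.linspace(0.0, 20.0, 40001)
x, y = r * np.cos(theta), r * np.sin(theta)
numeric = trapz(r * np.exp(-((x - mu[0])**2 + (y - mu[1])**2) / (2 * sigma**2))
                / (2 * np.pi * sigma**2), r)
assert abs(numeric - projected_normal_s1(theta, mu, sigma)) < 1e-6
# The closed form is a proper density on the circle
grid = np.linspace(-np.pi, np.pi, 8001)
assert abs(trapz(projected_normal_s1(grid, mu, sigma), grid) - 1.0) < 1e-4
```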

5.2.3 Application to Kernel Density Estimation

Now, suppose ~w follows some unknown distribution. The density of ~w is conventionally

approximated by means of kernel methods acting on N samples of the random variable.

If a Gaussian kernel of variance \sigma^2 is centered at each sample, then we have:

p(\vec{w}) = \frac{1}{2\pi N\sigma^2} \sum_{i=1}^{N} \exp\left(-\frac{|\vec{w} - \vec{w}_i|^2}{2\sigma^2}\right). (5–7)

The earlier procedure yields the following estimate of the density of \vec{v}:

p_1(\vec{v}) = \int_{r=0}^{\infty} p(r, \theta)\, dr = \int_{r=0}^{\infty} \frac{r}{2\pi N\sigma^2} \sum_{i=1}^{N} e^{-(r^2 + r_i^2 - 2 r r_i \cos(\theta - \theta_i))/(2\sigma^2)}\, dr (5–8)

where (ri , θi) is the standard polar coordinate representation for the sample point

~wi = (xi , yi). After evaluating the integral, we obtain the following expression:

p_1(\vec{v}) = \frac{1}{2\pi N\sigma^2} \sum_{i=1}^{N}\left[\sigma^2 \exp\left(-\frac{r_i^2}{2\sigma^2}\right) + \sigma\sqrt{\frac{\pi}{2}}\, r_i \cos(\theta - \theta_i)\left(1 + \mathrm{erf}\left(\frac{r_i\cos(\theta - \theta_i)}{\sigma\sqrt{2}}\right)\right)\exp\left(-\frac{r_i^2\sin^2(\theta - \theta_i)}{2\sigma^2}\right)\right]. (5–9)

Let p2(~v) be the estimate of the density of θ using the popular von-Mises kernel with a

concentration parameter κ. Then we have:

p_2(\vec{v}) = \frac{1}{2\pi I_0(\kappa) N} \sum_{i=1}^{N} e^{\kappa \vec{v}^T \vec{w}_i/|\vec{w}_i|} (5–10)

where I0(κ) is the modified Bessel function of order zero. It is easy to see that for finite

sample sizes, p_1(\vec{v}) \neq p_2(\vec{v}) in general, even if a suitable variable-bandwidth kernel

density estimate is used for p2(~v). Equation 5–9 is clearly different from a superposition

of von-Mises kernels, and can be considered as a directional density estimator for

unit-vector data on S1 obtained by a unit-normalization operation of original data in R2,

using a new kernel G:

p(\vec{v}) = \frac{1}{N} \sum_{i=1}^{N} G(\vec{v}; \vec{w}_i, \sigma) (5–11)

where G is defined as follows:

G(\vec{v}; \vec{w}_i, \sigma) = \frac{1}{2\pi}\exp\left(-\frac{|\vec{w}_i|^2}{2\sigma^2}\right) + \frac{\vec{v}\cdot\vec{w}_i}{2\sqrt{2\pi}\,\sigma}\left(1 + \mathrm{erf}\left(\frac{\vec{v}\cdot\vec{w}_i}{\sigma\sqrt{2}}\right)\right)\exp\left(-\frac{|\vec{w}_i|^2 - (\vec{v}\cdot\vec{w}_i)^2}{2\sigma^2}\right). (5–12)
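The kernel of Equation 5–12 can be used directly on raw R^2 samples; the following sketch (NumPy/SciPy assumed; function names and data are illustrative) builds the estimator of Equation 5–11 and checks that it is a valid density on S^1:

```python
import numpy as np
from scipy.special import erf

def G(v, w, sigma):
    """Kernel of Equation 5-12: contribution of a Euclidean sample w in R^2
    to the density of the unit vector v on S^1."""
    a = float(v @ w)               # v . w_i
    w2 = float(w @ w)              # |w_i|^2
    return (np.exp(-w2 / (2 * sigma**2)) / (2 * np.pi)
            + a / (2 * np.sqrt(2 * np.pi) * sigma)
            * (1 + erf(a / (sigma * np.sqrt(2))))
            * np.exp(-(w2 - a**2) / (2 * sigma**2)))

def directional_kde(theta, samples, sigma):
    """Equation 5-11: KDE on S^1 built directly from raw R^2 samples;
    bandwidth selection happens entirely in Euclidean space."""
    v = np.array([np.cos(theta), np.sin(theta)])
    return np.mean([G(v, w, sigma) for w in samples])

rng = np.random.default_rng(0)
samples = rng.normal([2.0, 1.0], 0.5, size=(200, 2))   # sensor-like data in R^2
grid = np.linspace(-np.pi, np.pi, 1441)
pdf = np.array([directional_kde(t, samples, 0.5) for t in grid])
mass = float(np.sum((pdf[1:] + pdf[:-1]) * np.diff(grid)) / 2)
assert abs(mass - 1.0) < 1e-3      # each kernel G integrates to 1 over the circle
```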

A similar PDF can be defined for unit-vector data (denoted \vec{v}) on S^2, obtained by projective transformation of data \vec{w} = (x, y, z) residing in R^3 and drawn from an isotropic Gaussian distribution centered at \vec{w}_i = (x_i, y_i, z_i). This yields

p(\vec{v}) = \frac{1}{\sigma^3(2\pi)^{1.5}} \int_{r=0}^{\infty} r^2\, e^{-((x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2)/(2\sigma^2)}\, dr, \quad (x, y, z) = r\vec{v}, (5–13)

which evaluates to

p(\vec{v}) = \frac{e^{-(|\vec{w}_i|^2 - (\vec{v}\cdot\vec{w}_i)^2)/(2\sigma^2)}}{2\sigma^2(2\pi)^{1.5}}\left(\sqrt{2\pi}\left[\mathrm{erf}\left(\frac{\vec{v}\cdot\vec{w}_i}{\sigma\sqrt{2}}\right) + 1\right]\left[(\vec{v}\cdot\vec{w}_i)^2 + \sigma^2\right] + 2\sigma(\vec{v}\cdot\vec{w}_i)\, e^{-(\vec{v}\cdot\vec{w}_i)^2/(2\sigma^2)}\right). (5–14)

The key feature of the kernel density estimation approach in this section (and also the

pith of this chapter, in general) is that the model-fitting (selection of parameters such as

σ) can all be done in Euclidean space. The new kernels proposed in Equations 5–12

and 5–14 appear only in an emergent way out of the random variable transformation.

5.2.4 Mixture Models for Directional Data

Existing mixture modeling algorithms have difficulties associated with the choice

of the number of mixture components and local minima issues during model fitting.

Additionally, there are other practical difficulties involved in mixture modeling for the case

of directional data. Firstly, if von-Mises kernels [68] are used, the maximum-likelihood

estimate of the variance (or concentration parameter, often denoted as κ) is not

available in closed form and requires the solution to a non-linear equation involving

Bessel functions. In [68], the parameter κ is updated using various approximations for

the Bessel functions that are part of the normalization constant for voMF distributions,

followed by the addition of an empirically discovered bias that is a polynomial function of

the estimated mean vectors. The difficulties faced by a mixture of voMF distributions in

modeling data that are spread out anisotropically are mitigated by the use of a mixture


of Watson kernels, as claimed in [76]. Nonetheless, iterative numerical procedures to estimate κ are still required, and the case where a full covariance must be obtained is even more complicated. Moreover, the method in [76] also requires solving non-linear equations for the update of the centers of the individual components, in addition to the κ values. Furthermore, the update of the mean vectors in both [68] and [76] involves vector addition followed by unit normalization, which is unstable when antipodal vectors are involved, as the norm of the resultant vector becomes very small.

The approach based on the theory presented in the previous subsections

overcomes these difficulties by following a two-step procedure: (1) a mixture-model fit in

the original Euclidean space given a set of N samples, followed by (2) a transformation

of random variables. If a Gaussian mixture model with M components is fit to the original data samples, with priors p_k, centers (\mu_{x_k}, \mu_{y_k}) = (r_k\cos\theta_k, r_k\sin\theta_k) and variances \sigma_k^2, then a random variable transformation results in the following form of directional mixture model:

p(\vec{v}) = \int_{r=0}^{\infty} \sum_{k=1}^{M} \frac{p_k\, r}{2\pi\sigma_k^2}\, e^{-((x - \mu_{x_k})^2 + (y - \mu_{y_k})^2)/(2\sigma_k^2)}\, dr = \sum_{k=1}^{M} p_k\, G(\vec{v}; \vec{\mu}_k, \sigma_k), (5–15)

where G was defined in Equation 5–12. Since the entire mixture-modeling procedure is

performed in the original space, the aforementioned difficulties in estimating the mean

and concentration parameters are automatically avoided.

If we continue to follow this line of reasoning, we can now achieve a fresh

perspective on mixtures of voMF distributions as well. As mentioned previously and as

clearly documented in [5], the voMF distribution is obtained from a Gaussian distribution

by conditioning the magnitude of the random variable to be some constant. If we fit a Gaussian mixture model to the original data and express it in polar coordinates, we obtain the following expression:

p(r, \theta) = \sum_{k=1}^{M} \frac{p_k}{2\pi\sigma_k^2}\, e^{-((r\cos\theta - r_k\cos\theta_k)^2 + (r\sin\theta - r_k\sin\theta_k)^2)/(2\sigma_k^2)}. (5–16)

By conditioning on r = 1, we have:

p(\theta\,|\,r = 1) = \sum_{k=1}^{M} \frac{p_k}{2\pi I_0(r_k/\sigma_k^2)}\, e^{r_k\cos(\theta - \theta_k)/\sigma_k^2}. (5–17)

This procedure basically suggests again that the entire mixture modeling algorithm

can be executed in Euclidean space, and that a mixture of voMF distributions can be

obtained by conditioning the magnitude of the random variable to be 1 (or some other constant)1. The polar coordinate transformation yields a formula for the concentration parameter \kappa_k of the k-th component, namely \kappa_k = r_k/\sigma_k^2. This procedure therefore suggests a viable alternative to fitting a mixture of voMF distributions when the original data are

available (and not just the unit-vector data). Similar expressions can be derived for the

case of data on S2 derived from R3 as well.
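A minimal sketch of this two-step procedure (a hand-rolled EM fitter for spherical Gaussians stands in for any off-the-shelf mixture-model routine; the data and all parameter values are synthetic and illustrative):

```python
import numpy as np

def fit_spherical_gmm(X, M=2, iters=200):
    """Plain EM for an M-component spherical Gaussian mixture in R^2
    (step 1: all model fitting happens in Euclidean space)."""
    n = len(X)
    mu = np.array([X[0], X[-1]], dtype=float)  # deterministic, well-separated init
    var = np.full(M, X.var())
    pi = np.full(M, 1.0 / M)
    for _ in range(iters):
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)        # (n, M)
        logr = np.log(pi) - d2 / (2 * var) - np.log(var)      # log-resp. up to const
        r = np.exp(logr - logr.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)                          # E-step
        nk = r.sum(0)
        pi = nk / n                                           # M-step
        mu = (r.T @ X) / nk[:, None]
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * d2).sum(0) / (2 * nk)                      # spherical variance
    return pi, mu, var

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([3.0, 0.0], 0.4, size=(300, 2)),
               rng.normal([0.0, 2.0], 0.6, size=(300, 2))])
pi, mu, var = fit_spherical_gmm(X)

# Step 2: transform. Conditioning r = 1 turns component k into a von-Mises
# component with kappa_k = r_k / sigma_k^2 (Equation 5-17); no Bessel-function
# equations need to be solved for kappa.
kappa = np.linalg.norm(mu, axis=1) / var
theta = np.arctan2(mu[:, 1], mu[:, 0])
```

The true concentrations here are roughly 3/0.16 and 2/0.36, and the fit recovers them without any iterative estimation of κ itself.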

5.2.5 Properties of the Projected Normal Estimator

The projected normal distribution is symmetric and unimodal, just like the von-Mises distribution. Figure 5-1 shows the projected normal distribution corresponding to an original Gaussian distribution centered at \vec{\mu}_0 = (1, 0) with standard deviation \sigma_0 = 10, and a von-Mises distribution centered at (1, 0) with \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01. Similarly, plots of the projected normal distribution on S^2 for an original Gaussian distribution with \vec{\mu}_0 = (1, 0, 0) and \sigma_0 = 10, and a voMF distribution with mean \vec{\mu}_0 = (1, 0, 0) and concentration \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 are shown in Figure 5-2. As indicated by the plots, both distributions have a distinct peak at \theta \approx 54°, \phi \approx 45°, as expected.

1 Note that the voMF distribution or a mixture of voMF distributions are conditional and not marginal distributions.


From Equations 5–9, 5–12 and 5–14, it can be seen that the density estimator does

not require the conversion of the original samples to unit vectors, but operates entirely in

the original space.

5.3 Estimation of the Probability Density of Hue

Directional data are usually obtained by the process of unit-normalization of

the original vector data measured by a sensor. However, this isn’t always the case.

For instance, color sensors typically output values in the RGB color format. These

values are then converted to other color systems such as HSI using transformations

of a different kind, presented below. The HSI color model is based on the notion of

separating a color into three quantities - the hue H (which is the basic color such as red

or green), the saturation S (which indicates the amount of white present in a color) and

the value I (which indicates the amount of shading or black). The component hue (H) is

an angular quantity. The rules for conversion between the RGB and HSI color models

are as follows [48]:

H = \cos^{-1}\left(\frac{0.5(2R - G - B)}{\sqrt{(R - G)^2 + (R - B)(G - B)}}\right)

S = 1 - \frac{3}{R + G + B}\min(R, G, B)

I = \frac{1}{3}(R + G + B). (5–18)

The inverse transformation from HSI to RGB, for hue values 0 < H \leq 2\pi/3, is:

B = I(1 - S)

R = I\left(1 + \frac{S\cos H}{\cos(\pi/3 - H)}\right)

G = 3I - (R + B). (5–19)

For hue values 2\pi/3 < H \leq 4\pi/3, the formulae are given by:

H = H - 2\pi/3

R = I(1 - S)

G = I\left(1 + \frac{S\cos H}{\cos(\pi/3 - H)}\right)

B = 3I - (R + G). (5–20)

For hue values 4\pi/3 < H \leq 2\pi,

H = H - 4\pi/3

G = I(1 - S)

B = I\left(1 + \frac{S\cos H}{\cos(\pi/3 - H)}\right)

R = 3I - (G + B). (5–21)
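The conversion rules of Equations 5–18 to 5–21 can be sketched as follows (an illustrative scalar implementation; the B > G reflection handles the fact that arccos only covers [0, π], and the test colors are arbitrary):

```python
import numpy as np

def rgb_to_hsi(R, G, B):
    """RGB -> HSI per Equation 5-18 (assumes not R == G == B)."""
    num = 0.5 * (2 * R - G - B)
    den = np.sqrt((R - G)**2 + (R - B) * (G - B))
    H = np.arccos(np.clip(num / den, -1.0, 1.0))
    if B > G:                      # reflect: arccos only covers [0, pi]
        H = 2 * np.pi - H
    S = 1 - 3 * min(R, G, B) / (R + G + B)
    I = (R + G + B) / 3
    return H, S, I

def hsi_to_rgb(H, S, I):
    """HSI -> RGB per Equations 5-19 to 5-21 (three 120-degree sectors)."""
    def pq(h):                     # the two formulas shared by all sectors
        return I * (1 - S), I * (1 + S * np.cos(h) / np.cos(np.pi / 3 - h))
    if H <= 2 * np.pi / 3:
        B, R = pq(H)
        G = 3 * I - (R + B)
    elif H <= 4 * np.pi / 3:
        R, G = pq(H - 2 * np.pi / 3)
        B = 3 * I - (R + G)
    else:
        G, B = pq(H - 4 * np.pi / 3)
        R = 3 * I - (G + B)
    return R, G, B

# The transformation is one-one and onto: round trips are exact.
for rgb in [(0.6, 0.3, 0.2), (0.2, 0.7, 0.4), (0.3, 0.2, 0.8)]:
    assert np.allclose(hsi_to_rgb(*rgb_to_hsi(*rgb)), rgb, atol=1e-9)
```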

If p(R,G ,B) is the density of the RGB values, and taking into account the fact that the

RGB to HSI transformation is one-one and onto, the density of the HSI values is given

as:

p(H, S, I) = \frac{p(R, G, B)}{|\partial(H, S, I)/\partial(R, G, B)|} = \left|\frac{\partial(R, G, B)}{\partial(H, S, I)}\right|\, p(R, G, B). (5–22)

Now, for all hue values, we have:

\left|\frac{\partial(R, G, B)}{\partial(H, S, I)}\right| = \frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right]. (5–23)

Suppose the RGB values are drawn from a Gaussian distribution centered at (R_i, G_i, B_i) with variance \sigma^2; then the distribution of HSI is given by:

p(H, S, I) = \left(\frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right]\right)\left(\frac{1}{\sigma^3(2\pi)^{1.5}}\, e^{-((R - R_i)^2 + (G - G_i)^2 + (B - B_i)^2)/(2\sigma^2)}\right). (5–24)

Further simplification gives

p(H, S, I) = \left(\frac{2\sqrt{3}\sec^2 H}{(1 + \sqrt{3}\tan H)^2}\left[IS(1 - S) + I^2 S(S + 2)\right]\right)\left(\frac{1}{\sigma^3(2\pi)^{1.5}}\, e^{-((I + ISk - R_i)^2 + (I - IS - B_i)^2 + (I + IS - ISk - G_i)^2)/(2\sigma^2)}\right) (5–25)

where k = \frac{2}{1 + \sqrt{3}\tan H}. To find the marginal density of hue, we integrate over the values of

I and S (both lying in the interval [0, 1]), giving us:

p(H) = \int_{I=0}^{1}\int_{S=0}^{1} p(H, S, I)\, dS\, dI. (5–26)

Unlike in the preceding section, this formula is not available in closed form. However, it is easy to approximate numerically, as it involves only a 2D definite integral over a bounded range of values (of S and I).
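The 2D definite integral of Equation 5–26 is straightforward to approximate with standard quadrature; a sketch for the first hue sector (SciPy's dblquad assumed; the sample color and σ are illustrative):

```python
import numpy as np
from scipy.integrate import dblquad

SQ3 = np.sqrt(3.0)

def hue_marginal(H, Ri, Gi, Bi, sigma):
    """Equation 5-26 by numerical 2D integration of Equation 5-25 over S and I.
    Sketch for the first hue sector (0 < H <= 2*pi/3), where
    R = I + I*S*k, G = I + I*S - I*S*k, B = I - I*S."""
    k = 2.0 / (1.0 + SQ3 * np.tan(H))
    jac_H = 2 * SQ3 / (np.cos(H)**2 * (1 + SQ3 * np.tan(H))**2)

    def integrand(S, I):
        jac = jac_H * (I * S * (1 - S) + I**2 * S * (S + 2))
        R, G, B = I * (1 + S * k), I * (1 + S - S * k), I * (1 - S)
        gauss = np.exp(-((R - Ri)**2 + (G - Gi)**2 + (B - Bi)**2)
                       / (2 * sigma**2)) / (sigma**3 * (2 * np.pi)**1.5)
        return jac * gauss

    val, _ = dblquad(integrand, 0.0, 1.0, 0.0, 1.0)  # I in [0,1], S in [0,1]
    return val

# The marginal peaks near the hue of the Gaussian's center (Ri, Gi, Bi):
# the hue of (0.6, 0.3, 0.2) is roughly 0.24 rad.
assert hue_marginal(0.24, 0.6, 0.3, 0.2, 0.1) > hue_marginal(1.2, 0.6, 0.3, 0.2, 0.1)
```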

Instead of marginalizing, if we condition S and I to take the value 1, then the conditional density of H is obtained as follows:

p(H\,|\,S = I = 1) = \left(\frac{6\sqrt{3}\sec^2 H}{\sigma^3(2\pi)^{1.5}(1 + \sqrt{3}\tan H)^2}\right)\left(e^{-((1 + k - R_i)^2 + B_i^2 + (2 - k - G_i)^2)/(2\sigma^2)}\right). (5–27)

Notice that Equation 5–27 is analogous to Equation 5–17 in the sense that both are conditional densities (obtained by conditioning other variables to constant values). On the other hand, Equation 5–26 is analogous to Equation 5–9 in the sense that both are marginal densities (obtained by integrating out other variables).

We would like to draw the reader’s attention to the fact that both these approaches

are radically different from that proposed in [79]. The latter approach performs density

estimation of the hue by first converting the RGB samples to hue values. Then, it

centers a kernel with a different bandwidth around each hue sample. The value of the

bandwidth for the i th sample Hi is determined by the partial derivatives ∂Hi∂R, ∂Hi∂G, ∂Hi∂B

which indicates the sensitivity of Hi w.r.t. the original RGB values. As hue is a non-linear

function of RGB, the sensitivity in the hue values varies with the RGB values of the

samples obtained from the sensor. For instance, hue is highly unstable at RGB values

that are close to the achromatic axis R = G = B.


5.4 Discussion

Most techniques that estimate the PDF of directional data assume that only the directional data are available. In contrast, the fact that the original (Euclidean) data are often available is exploited here to derive a new approach for density estimation of directional data: the density is first estimated in the original space and then subjected to a random variable transformation. To the best of our knowledge, this is the only circular/spherical density estimator in the computer vision community that is consistent, in the sense of random variable transformations (a key concept in probability theory), with the estimate of the density of the original data from which the directional data are derived. Secondly,

this method circumvents issues involved in solving complicated non-linear equations

that arise in maximum likelihood estimates for the parameters of conventional density

estimators, as it operates in the original space, and therefore uses the much simpler

mixture-modeling or KDE techniques that are popular for Euclidean data. The theory

for this estimator is built for unit-normal vectors as well as quantities such as hue in

color imaging. Though this work deals strictly with directional data, the underlying

philosophy of this approach is easily extensible to data residing on other kinds of

manifolds. Therefore it has the potential of posing as a viable alternative to existing

kernel density estimators that require the usage of non-trivial mathematical techniques

(such as computation of geodesic distances between samples on a given manifold) in

order to be tuned to data that reside on non-Euclidean manifolds [80].

The approach presented in this chapter also raises the following question. Consider

a random variable f whose estimated PDF (say using a kernel method) using samples

f1, f2, ..., fn is given by

p_f(f = \alpha) = \frac{1}{n}\sum_{i=1}^{n} K_f(\alpha - f_i; \sigma_f). (5–28)

Now consider a transformation T of f , yielding the transformed random variable

g = T(f). One method could be to apply a kernel density method directly to the

transformed samples g1 = T (f1), g2 = T (f2), ..., gn = T (fn), yielding the density

estimate

p_g(g = \beta) = \frac{1}{n}\sum_{i=1}^{n} K_g(\beta - g_i; \sigma_g) (5–29)

where β = T (α). Alternatively, one could apply a random variable transformation to

pf (f = α) to yield

p_g(g = \beta) = \int_{\gamma = T^{-1}(\beta)} \frac{p_f(f = \gamma)}{|T'(\gamma)|}\, d\gamma. (5–30)

The relationship between these two estimates of p_g(\cdot) will depend upon the choice of kernels K_f(\cdot)

and Kg(.) and the parameters σf and σg, which requires further investigation. Note

that the PDF estimator for image intensities from Chapter 2 follows the approach

in Equation 5–30 as it is an explicit random variable transformation from location to

intensity, whereas all the sample-based methods reviewed in Chapter 2 follow the former

approach in Equation 5–29.
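The difference between the two estimates is easy to observe numerically; a sketch with T(x) = e^x (an illustrative choice of transformation, sample size, and bandwidths):

```python
import numpy as np

def kde(x, samples, sigma):
    """Plain Gaussian KDE on the real line."""
    x = np.atleast_1d(x)
    return np.exp(-(x[:, None] - samples)**2 / (2 * sigma**2)).mean(1) \
           / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(2)
f = rng.normal(0.0, 1.0, 2000)           # samples of f
T = np.exp                               # the transformation g = T(f)
beta = np.linspace(0.2, 5.0, 400)

p_direct = kde(beta, T(f), 0.15)         # Equation 5-29: KDE on g_i = T(f_i)
# Equation 5-30: transform the KDE of f; here T^{-1}(beta) = log(beta)
# and |T'(gamma)| = e^gamma = beta.
p_transformed = kde(np.log(beta), f, 0.15) / beta

# Both are nonnegative, but they are distinct estimators for finite n:
assert np.all(p_direct >= 0) and np.all(p_transformed >= 0)
assert np.max(np.abs(p_direct - p_transformed)) > 1e-3
```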

Consider yet another scenario where the technique from Chapter 2 was used

to estimate the density of the intensity values in an image I (x , y). Now let J(x , y) =

T (I (x , y)) be a transformation of the image I . There are two ways to arrive at the PDF

of J(x, y): one estimate (denoted p_1(\cdot)) is obtained by interpolating the values of I and then applying the random variable transformation. The other estimate (denoted p_2(\cdot)) is

obtained by first computing the J values at the discrete locations and then interpolating

those values to yield another estimate of the density of J. In this case, the two estimates

would be related by the specific interpolants employed. Let us consider the specific case

where I (x , y) was an RGB image, and J(x , y) was the image of chromaticity vectors.

Consider that the interpolant used for I (x , y) was such that the directions of the subpixel

RGB values were spherical linear functions of the spatial coordinates, whereas the

magnitudes were linear functions of the spatial coordinates. Consider also that the

interpolant used for J(x , y) was spherical linear in nature. It can be seen easily that the

estimates p1(.) and p2(.) using these rules would be equal.


Figure 5-1. A projected normal distribution (\vec{\mu}_0 = (1, 0), \sigma_0 = 10) and a von-Mises distribution (\vec{\mu}_0 = (1, 0), \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01).

Figure 5-2. Plots of (A) a projected normal density (\vec{\mu}_0 = (1, 0, 0), \sigma_0 = 10), (B) a voMF density (\vec{\mu}_0 = (1, 0, 0), \kappa_0 = |\vec{\mu}_0|/\sigma_0^2 = 0.01), and (C) the L1 norm of the difference between the two densities.


CHAPTER 6
IMAGE DENOISING: A LITERATURE REVIEW

6.1 Introduction

In this chapter, we give a detailed review of contemporary literature on image

denoising. We make an attempt to cover as many diverse approaches as possible,

though a complete overview is beyond the scope of the thesis, given the sheer

magnitude of existing research on this topic. To the best of our knowledge, there

exist very few surveys on image denoising. The review in [2] focuses on mathematical

characteristics of the residual images (defined as the difference between the given

noisy and the denoised image) for different types of image filters ranging from

partial differential equations to wavelet based methods. A summary of recent trends

in denoising was presented by Donoho and Weissman at the IEEE International

Symposium on Information Theory (ISIT) in 2007 [81]. This tutorial focussed on wavelet

and other transform based methods, some learning based methods and non-local

methods. In the present review, we discuss and critique methods based on partial

differential equations, local convolution and regression, transform domain methods

using wavelets and the discrete cosine transform (DCT), non-local approaches,

methods based on analysis of the properties of residuals and methods that use

various machine learning tools. The aforementioned categories constitute the bulk

of modern image denoising literature. The focus of the survey is on gray-scale image

denoising, though we make occasional references to papers on color image denoising.

Throughout this chapter and in subsequent chapters, we consider noise to be a random

signal independent of the original signal that it corrupts. Apart from a descriptive

survey of the contemporary techniques as such, we also cover some common issues

concerning almost all contemporary denoising techniques: methods for validation of filter

performance and methods for automated parameter selection.


6.2 Partial Differential Equations

The isotropic heat equation was used for image smoothing in [82]. It is known

that executing this partial differential equation (PDE) on the image is equivalent to

convolution with a Gaussian kernel, where the kernel parameter (often denoted

by σ) is related to the time step and number of iterations of the PDE. However,

isotropic smoothing blurs away significant image features such as edges along with

the noise, and hence is not used in contemporary denoising algorithms. Instead, in most

contemporary diffusion methods, the diffusion process is directed by edge information

in the form of a diffusivity function which prevents blurring across edges and allows

diffusion along them [44]. The chosen diffusivity function is actually a monotonically

decreasing function of the gradient magnitude. The equation for the PDE can be written

as follows:

\frac{\partial I}{\partial t} = \mathrm{div}(g(|\nabla I|)\nabla I) (6–1)

where I : \Omega \to R is a gray-scale image defined on domain \Omega and g(|\nabla I|) is a diffusivity function typically defined as

g(|\nabla I|; \lambda) = \frac{1}{1 + |\nabla I|^2/\lambda^2}. (6–2)

Several different diffusivity functions have been proposed, for instance those by Perona

and Malik [44], Weickert [83] and Black et al. [84]. A regularized version of the above

equation has been proposed in [85]. Connections between robust statistics and

anisotropic diffusion (which show up in the choice of diffusivity function) have been

established in [84].
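A minimal sketch of this diffusion (an explicit Euler scheme on a 4-neighbour stencil with periodic boundaries; this particular discretization and all parameter values are standard illustrative choices, not taken from the text):

```python
import numpy as np

def perona_malik(img, n_iter=50, dt=0.2, lam=10.0):
    """Anisotropic diffusion (Equations 6-1, 6-2): diffuse along edges,
    not across them.  dt <= 0.25 keeps the explicit scheme stable."""
    def g(d):                       # diffusivity of Equation 6-2
        return 1.0 / (1.0 + (d / lam)**2)
    I = img.astype(float).copy()
    for _ in range(n_iter):
        dN = np.roll(I, -1, axis=0) - I     # differences to the 4 neighbours
        dS = np.roll(I,  1, axis=0) - I
        dE = np.roll(I, -1, axis=1) - I
        dW = np.roll(I,  1, axis=1) - I
        I += dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
    return I

rng = np.random.default_rng(3)
clean = np.zeros((64, 64)); clean[:, 32:] = 100.0           # a step edge
noisy = clean + rng.normal(0.0, 5.0, clean.shape)
out = perona_malik(noisy)
assert np.std(out - clean) < np.std(noisy - clean)          # noise reduced
assert abs(out.mean() - noisy.mean()) < 1e-8                # mass conserved
```

Because the fluxes are antisymmetric between neighbours, the total intensity is conserved; the large gradient at the edge drives g toward zero, so the edge survives while the flat regions are smoothed.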

Some PDEs are obtained from the Euler-Lagrange equations corresponding to

energy functionals. One example is the image total variation defined as

E(I) = \int_{\Omega} |\nabla I(x, y)|\, dx\, dy (6–3)

giving rise to the PDE

\frac{\partial I}{\partial t} = \mathrm{div}\left(\frac{\nabla I}{|\nabla I|}\right). (6–4)

It should be noted that the aforementioned techniques are based on the assumption

that natural images are piecewise constant, which is not necessarily a valid assumption.

They also require the choice of the parameter λ in the diffusivity. This parameter need

not be constant throughout the image. The number of iterations for which these PDEs

are executed is an important parameter critical for good performance. In the limit of

infinite iterations, constant or piecewise-constant images are produced. Some authors

remedy the stopping time selection issue by introducing a prior model in the energy

formulation, for example the following modification of the total variation model, starting

with an initial image I0

E(I) = \int_{\Omega} |\nabla I(x, y)|\, dx\, dy + \mu\int_{\Omega} (I(x, y) - I_0(x, y))^2\, dx\, dy (6–5)

where µ is a parameter that trades data fidelity with regularity. The implicit assumption

in the term (I (x , y) − I0(x , y))2 is a Gaussian noise model. Assuming that the image

has been corrupted with zero mean Gaussian noise of known variance σ2n, a constrained

version of the objective function has been proposed in [86]:

\min_I E(I) = \int_{\Omega} |\nabla I(x, y)|\, dx\, dy (6–6)

subject to

\int_{\Omega} (I - I_0)^2\, dx\, dy = \sigma_n^2 (6–7)

\int_{\Omega} I(x, y)\, dx\, dy = \int_{\Omega} I_0(x, y)\, dx\, dy. (6–8)

For different noise models, such as Poisson or impulse noise, different priors can be

used [87]. A highly comprehensive review of several such PDE-based approaches can

be found in exemplary books such as [83] and [53], to name a few. Recently, some

authors have also introduced the concept of diffusion with complex numbers, which


brings about denoising in conjunction with edge enhancement [88], [89]. The latter

technique performs the complex diffusion by treating the image I : Ω → R as a graph of

the form (x , y , I (x , y)), a framework for diffusion developed in [58].

Some researchers have developed PDEs based on a piecewise linear assumption

on natural images, examples being [90] and [91]. These turn out to be fourth order

PDEs and their energy functions penalize deviation in the intensity gradient as opposed

to deviation in intensity, and preserve fine shading better. However in some cases such

as [90], speckle artifacts have been observed which need to be retroactively remedied

using median filters [90]. Another class of approaches consists of independently filtering

the gradients in the x and y directions, and then using some prior assumption on the

image geometry to reconstruct the image intensity from the smoothed gradient values

[92].

6.3 Spatially Varying Convolution and Regression

A rich class of techniques for image filtering involve the so-called spatially varying

convolutions. In these methods, an image is convolved with a pointwise varying mask

which is derived from the local geometry extracted from the signal. A closely related

idea is the modeling of the local geometry of an image (signal) by means of a low-order

polynomial function. The signal is approximated locally by a pointwise-varying weighted

polynomial fit. The coefficients of the polynomial are computed by a least-squares

regression, and these are then used to compute the value of the (filtered) signal at

a central point. For instance, the signal could be modeled as follows, restricted to a

neighborhood Ω around a point x0:

I(x) = a_0 + \sum_{i=1}^{m} a_i (x - x_0)^i (6–9)

where ai (0 ≤ i ≤ m) are coefficients of the polynomial. These coefficients are

obtained by least squares fitting, and the filtered signal value is given by I (x0) = a0.

This procedure is not guaranteed to preserve edges as it allows even disparate intensity


values to affect the polynomial fit. Instead, in practice, the signal is modeled as follows:

I(x) = a_0 + \sum_{i=1}^{m} a_i\, w(x - x_0, I(x) - I(x_0); h_s, h_v)\,(x - x_0)^i. (6–10)

Here w(x − x0, I (x) − I (x0); hs , hv) is a weighting scheme which is basically a

non-increasing function of the difference between spatial locations, i.e. x − x0, and

the difference between the signal values at those locations, i.e. I (x) − I (x0). The

function w is parameterized by hs and hv which act as spatial and intensity smoothing

parameters respectively. The fitting procedure is now, of course, a weighted least

squares regression. These ideas trace back to the Savitzky-Golay filter [93], [94] and

are the subject of beautiful books such as [95]. Two-dimensional versions of these ideas

have been recently used in modified forms for image filtering applications in [65] and

[55]. In [65], the parameter hs is replaced by a matrix, which is selected in a manner

dictated by local image edge geometry and no penalty is applied on intensity deviation.

On the other hand, in the latter case [55], the weights for regression are affected solely

by intensity difference. The popular bilateral filtering technique [49], [96] is again based

on a weighted linear combination of intensities, with weights driven by both location and

intensity differences. In fact, the kernel regression approach in [65] has been framed as

a higher-order generalization of the bilateral filter. If the polynomial order is restricted to

one and the weights are applied only on intensity differences, one gets the so-called kernel density based filter [48], also called the anisotropic neighborhood filter [55].

A version where the weights are obtained from intensity gradient magnitudes has been

presented in [47] and is called the adaptive filter. An extension to the anisotropic

neighborhood filter using interpolation between noisy image intensity values (and the

induced isocontour map) has been recently presented by us in [11] and in Chapter

4. In all these techniques, a crucial parameter is the size and also the shape of the

neighborhood for local signal modeling. An important contribution toward solving this


problem is a data-driven approach presented in [97], which derives a multi-directional

star-shaped neighborhood (of largest possible size) around each image pixel.
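The weighted local polynomial fit of Equations 6–9 and 6–10 can be sketched in 1D as follows (the Gaussian form of the weight w and all parameter values are illustrative choices, not taken from the text):

```python
import numpy as np

def local_poly_filter(signal, m=2, half=5, hs=3.0, hv=10.0):
    """Pointwise weighted least-squares polynomial fit: at each x0, fit
    I(x) ~ sum_i a_i (x - x0)^i with weights decaying in both the spatial
    and the intensity difference, and output a0 (the fitted value at x0)."""
    n = len(signal)
    out = np.empty(n)
    for j in range(n):
        lo, hi = max(0, j - half), min(n, j + half + 1)
        dx = np.arange(lo, hi, dtype=float) - j        # x - x0
        dv = signal[lo:hi] - signal[j]                 # I(x) - I(x0)
        w = np.exp(-dx**2 / (2 * hs**2) - dv**2 / (2 * hv**2))
        A = np.vander(dx, m + 1, increasing=True)      # columns: 1, dx, dx^2
        sw = np.sqrt(w)                                # weighted least squares
        coeffs, *_ = np.linalg.lstsq(sw[:, None] * A, sw * signal[lo:hi],
                                     rcond=None)
        out[j] = coeffs[0]
    return out

# A clean step survives: far-side samples get ~zero intensity weight.
step = np.r_[np.zeros(20), 100.0 * np.ones(20)]
assert np.allclose(local_poly_filter(step), step, atol=1e-6)
```

With the intensity weight removed (hv very large) this reduces to a Savitzky-Golay-style smoother; with m = 1 and spatial-only weights removed it behaves like the neighborhood filters discussed above.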

The mean-shift procedure, a clustering technique proposed in [98], and applied to

filtering (and segmentation) in [51], can be considered as a generalization of bilateral

filtering, where the window for local signal modeling is allowed to grow dynamically.

This growth is directed by an ascent on a local joint density function of spatial as well

as intensity values. It should be noted that both bilateral filtering and mean shift are

related to the Beltrami flow PDE developed in [58]. These relationships have been

explored in [99]. The connections between nonlinear diffusion PDEs over small periods

of time and spatially varying convolutions have been shown in [100]. In [45], the authors

present so-called trace-based PDEs for smoothing of color images and prove that the

corresponding diffusion is exactly equivalent to convolutions with oriented Gaussians,

where the orientation is dictated by local image geometry or edge direction.

Thus, spatially varying convolutions for filtering have a rich history. The most recent

contribution in this area is the one presented in [64] and [101]. This framework is based

upon the Jian-Vemuri continuous mixture model from the field of diffusion-weighted

magnetic resonance imaging (DW-MRI) [102]. In [64], complicated local image

geometries such as edges as well as X, Y or T junctions are modeled using a Gabor

filter bank at different orientations. The collection of Gabor-filter responses is expressed

as a discrete mixture of a continuous mixture of Gaussians (with Wishart mixing

densities) or a discrete mixture of a continuous mixture of Watson distributions (with

Bingham mixing densities) to respectively yield two different types of kernels for local

geometry-preserving convolutions. The number of components of the discrete mixture

is given by an appropriate sampling of the 2D orientation space and the weights of the

discrete mixture are solved by local regularized least squares fitting. The novelty of this

technique is (1) the automatic setting of weights for geometry-preserving smoothing,

and (2) the ability to preserve features such as image corners and junctions (which are


ignored by the other convolution-based methods mentioned before). While techniques

such as curvature-preserving PDEs [103] attempt preservation of such geometries,

their behavior at X, Y or T junctions (where curvature is not defined) may need further

exploration.

The mean shift procedure or other local convolution filters can also be applied to

the image gradients to better facilitate the preservation of shading. An extensive survey

of various applications with different types of filtering operations on image gradients,

followed by image reconstruction using a projection onto the nearest integrable surface

[104] or by solving the Poisson equation, has been presented in [105] in a short course

at the International Conference on Computer Vision, 2007, and in papers such as [106].

6.4 Transform-Domain Denoising

Transform-domain denoising approaches typically work at the level of small image

patches. In these approaches, the image patch is projected onto a chosen orthonormal

basis (such as a wavelet basis or the DCT basis) to yield a set of coefficients. It is

well-known that the coefficients in the transform domain are highly compressible in the

sense that the vast majority of these coefficients are very close to zero. In the literature,

this property is referred to as ‘sparsity’, though in a strict sense, sparsity would require

most coefficients to be equal to zero. In the rest of the thesis, we shall stick to this

usage of the word ‘sparsity’ even though we imply compressibility. It is known that the

coefficients in the wavelet or DCT transform domain are decorrelated from one another

[107]. It should be noted that the smaller coefficients usually correspond to the higher

frequency components of the signal which are often dominated by noise. To perform

denoising, the smaller coefficients are modified (typically, those coefficients whose

magnitude is below some λ are set to zero, in a process termed ‘hard thresholding’), and

the patch is reconstructed by inversion of the transform. This procedure is repeated for

every patch. If the patches are chosen to be non-overlapping, one can observe seam

artifacts at the patch boundaries. Furthermore, the thresholding of the coefficients is


also known to produce ringing artifacts around image edges or salient features. Artifacts

of both types can be remedied by performing the aforementioned three steps in a sliding

window fashion from pixel to pixel. This yields an overcomplete transform as each

pixel now acquires multiple hypotheses from overlapping patches. These hypotheses

are aggregated (typically by simple averaging) together to yield a final estimate. This

process of averaging of multiple hypotheses has been reported to consistently yield

superior results [108], [109], and is termed ‘translation invariant denoising’, or ‘cycle

spinning’ [108].

The performance of transform-based techniques is affected by the following

parameters: the choice of basis, the choice of a thresholding mechanism, a method

for aggregation of overlapping estimates and the patch size. We discuss these points

below.

6.4.1 Choice of Basis

Somewhat surprisingly, it has been observed that the sliding window DCT

outperforms most wavelet bases [109]. However, given a library of orthonormal bases,

the choice of the best one (from the point of view of denoising) from amongst these is

largely an open problem in signal processing. In many existing approaches, the image

patch (of size n1 × n2) is represented as a matrix and the bases for representation are

obtained from the outer product of the bases that represent the rows with the bases

that represent the columns [109], [108]. This is called a separable representation. In

other cases, the image patch is represented as a 1D vector of size n1n2 using a basis

of size n1n2 × n1n2. In the separable case, it has been observed that the transform may

be biased towards images whose salient features are aligned with the Cartesian axes.

If the local image geometry deviates from these axes, the transform may not be able

to represent them compactly enough. This has been remedied by using non-separable

bases such as the steerable wavelet [110], or the curvelet transform [111], which are

designed by taking image geometry into account.


6.4.2 Choice of Thresholding Scheme and Parameters

The most common thresholding method is hard thresholding, given as follows:

T(c; \lambda) =
  \begin{cases}
    c & \text{if } |c| \geq \lambda \\
    0 & \text{if } |c| < \lambda.
  \end{cases}    (6–11)

Another popular method, known as soft thresholding, not only nullifies coefficients

smaller than the threshold but also reduces the value of coefficients that are larger than

the threshold. Mathematically, soft thresholding is expressed as follows:

T(c; \lambda) =
  \begin{cases}
    c - \lambda & \text{if } |c| > \lambda, \; c > 0 \\
    c + \lambda & \text{if } |c| > \lambda, \; c < 0 \\
    0 & \text{if } |c| \leq \lambda
  \end{cases}    (6–12)
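Both rules are one-liners in NumPy; the following is a minimal sketch, not tied to any particular cited implementation:

```python
import numpy as np

def hard_threshold(c, lam):
    # Keep coefficients whose magnitude is at least lam, zero the rest.
    return np.where(np.abs(c) >= lam, c, 0.0)

def soft_threshold(c, lam):
    # Shrink every coefficient toward zero by lam; those below lam vanish.
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)
```

Note that soft thresholding is continuous in the coefficient value while hard thresholding is not, which is one reason the two rules admit different risk interpretations.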

There exist several other thresholding schemes (or rather, schemes for modification of

transform coefficients). These methods can be interpreted as the result of minimizing

different types of risk functions. For example, the hard thresholding scheme (sometimes

termed the best subset selection problem) is the result of minimizing the hard threshold penalty, soft thresholding has an interpretation in terms of minimizing the L1 penalty, whereas minimization of the smoothly clipped absolute deviation (SCAD) penalty leads to a thresholding scheme that lies intermediate between hard and soft thresholding (see Figures 1 and 2 and Section 2.1 of [112]). Almost all these methods

of thresholding lead to monotonic functions of the coefficient magnitude. Despite the

several sophisticated thresholding functions available, the best denoising results that

have been reported using wavelet transforms are the ones with hard thresholding,

with a translation invariant approach [62]. The choice of the parameter λ has been

studied in detail in the community. For instance, in [113], the authors prove that under

a hard thresholding scheme, the choice \lambda = \sigma_n \sqrt{2 \log N} is optimal from a statistical risk standpoint, under zero-mean Gaussian noise of standard deviation \sigma_n, where N is the size (i.e., the number of pixels) of the image or image patch (see Theorem 4 and


Equation 31 of [113]). In the experiments to be presented in Chapter 7, we have

observed empirically that the threshold λ = 3σ produces excellent denoising results for a

Gaussian noise model with 8×8 patches, which approximately tallies with the result from

[113]. This is in tune with the fact that the coefficients of zero-mean Gaussian noise of standard deviation σ, when projected onto an orthonormal basis, remain Gaussian with the same standard deviation, and are therefore less than 3σ in magnitude with very high probability.
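This can be verified numerically: project i.i.d. Gaussian noise onto a random orthonormal basis and measure how often the resulting coefficients exceed 3σ. The basis dimension and sample count below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 64
# A random orthonormal basis, obtained from the QR factorization
# of a Gaussian random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
noise = sigma * rng.standard_normal((n, 20000))
coeffs = Q.T @ noise            # projections remain N(0, sigma^2)
frac = np.mean(np.abs(coeffs) < 3 * sigma)
print(round(float(frac), 3))    # close to 0.997, the Gaussian 3-sigma mass
```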

6.4.3 Method for Aggregation of Overlapping Estimates

The most common approach for aggregation is a simple averaging of (or a median

operation on) all the hypotheses generated for the pixel.

6.4.4 Choice of Patch Size

The patch size choice presents the classical bias-variance tradeoff. Very small

patches allow preservation of finer details of the image but may overfit (undersmooth),

whereas larger patch sizes perform better in smoothing larger homogeneous regions

but may oversmooth some subtle details. Very little work exists on optimal patch

size selection. In fact, the patch size need not be constant throughout the image

and can vary as per local geometry. Some papers such as [114] propose the use of

multi-scale approaches by combining estimates at different scales. However the optimal

combination of such estimates is still a problem, much like that of optimal aggregation of

overlapping estimates. We present a correlation-coefficient criterion for the automated

selection of a single global patch size in Chapter 7.

A common criticism of transform-domain thresholding techniques (especially hard

thresholding) is their inability to distinguish high frequency information from noise. Some

authors try to remedy this by observing that there exist dependencies among the transform coefficients at the same spatial location but at different scales [115], or at

adjacent spatial locations [116]. These dependencies are exploited by using multivariate

thresholding methods. For instance in [115], bivariate shrinkage rules are developed,

which exploit the interdependency between coefficients at two adjacent scales leading


to superior image denoising performance. Another popular wavelet-based denoising

technique which exploits interdependency of the coefficients is the BLS-GSM (Bayesian

least squares for Gaussian scale mixtures) developed in [117]. This method assumes

that the distribution of a neighborhood of wavelet coefficients (defined as coefficients at

adjacent scales, orientations or locations) can be modeled as a Gaussian scale mixture

(a positive hidden variable multiplied by a Gaussian random variable). Assuming a

suitable prior on this hidden random variable, and given a set of wavelet coefficients

from a noisy image, one can form an estimate of the true wavelet coefficient given its

neighbors using a Bayesian least squares method.

It should be noted that estimates using coefficient thresholding schemes are shown

to be maximum a posteriori (MAP) estimates of the true signal coefficients given those

of the degraded signal, by making suitable assumptions on the statistics of wavelet

coefficients of clean natural images [116], [118]. Typically, the generalized Gaussian

family yields an excellent prior for the densities of natural image wavelet coefficients

[119]. This prior can be written as follows:

p(z; \sigma_p, p) \propto e^{-|z/\sigma_p|^p}    (6–13)

A Gaussian prior (p = 2) is known to yield the empirical Wiener estimate for the

coefficients of the true image, a Laplacian prior (p = 1) corresponds to the soft

thresholding scheme and the hard thresholding scheme is approximated by smaller

values of p [116]. Doubting the validity of these priors for every natural image in

question, the authors of [120] learn a minimum mean square error (MMSE) estimator

for the true wavelet coefficients given the corresponding noisy coefficients. For this

purpose, they build a training set of patches from clean natural images and their

degraded version (assuming a fixed noise model). Following this, they solve a simple

regression problem to optimally perturb coefficients of the degraded patches so as

to yield values close to those of the corresponding clean patches. For overcomplete


representations, the authors of [120] report that the regression procedure produces

non-monotonic thresholding functions, a deviation from all earlier thresholding schemes

driven by image priors.

6.5 Non-local Techniques

These techniques which were popularized by the recent ‘non-local means

(NL-Means)’ algorithm, published in [2] and [121], exploit the fact that natural images

(and especially textures) often contain several patches that are very similar to each

other (as measured in the L2 sense, for instance). NL-Means obtains a denoised image

by minimizing a penalty term on the average weighted distance between an image

patch and all other patches in the image, where the weights are dependent on the

squared difference between the intensity values in the patches. This is expressed below

mathematically:

I = \arg\min_I E(I)    (6–14)

E(I) = -\frac{1}{\beta} \sum_{x_i, y_i} \log \left[ \sum_{x_j, y_j} \exp\left( -\beta \left\| I^{(0-)}_{\text{patch}}(x_i, y_i) - I^{(0-)}_{\text{patch}}(x_j, y_j) \right\|^2 \right) \right]    (6–15)

where I^{(0-)}_{\text{patch}}(x_i, y_i) is a patch centered at pixel (x_i, y_i) of the image I, excluding the central pixel. Taking the derivative of E(I) with respect to any pixel value I(x_i, y_i) and setting it to zero yields the following update equation:

I(x_i, y_i) = \frac{\sum_{x_j, y_j} w_j I(x_j, y_j)}{\sum_{x_j, y_j} w_j}    (6–16)

w_j = \exp\left( -\beta \left\| I^{(0-)}_{\text{patch}}(x_i, y_i) - I^{(0-)}_{\text{patch}}(x_j, y_j) \right\|^2 \right).    (6–17)

It can be observed from the previous equation that NL-Means is a pixel-based algorithm

and indeed can be interpreted as a spatially varying convolution in which the convolution

mask is derived using non-local image similarity.
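A single-pixel update corresponding to Eq. 6–16 can be sketched as follows. This is a hedged illustration: the search is restricted to a window around the pixel (as is common in practice, rather than truly non-local), the central pixel is not excluded from the patch distance, and the patch size, search radius and β are arbitrary illustrative values.

```python
import numpy as np

def nl_means_pixel(img, i, j, patch=5, search=10, beta=0.01):
    # One NL-Means update (Eq. 6-16) for pixel (i, j): a weighted average
    # of pixels whose surrounding patches resemble the patch around (i, j).
    r = patch // 2
    pad = np.pad(img.astype(float), r, mode='reflect')
    ref = pad[i:i + patch, j:j + patch]
    num, den = 0.0, 0.0
    for x in range(max(0, i - search), min(img.shape[0], i + search + 1)):
        for y in range(max(0, j - search), min(img.shape[1], j + search + 1)):
            cand = pad[x:x + patch, y:y + patch]
            w = np.exp(-beta * np.sum((ref - cand) ** 2))  # Eq. 6-17
            num += w * img[x, y]
            den += w
    return num / den
```

Collecting the weights w over all candidate pixels gives exactly the spatially varying convolution mask referred to above.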

Usually, just one update step yields good results [2]. However, for higher noise

levels, the algorithm can certainly be iterated several times [122]. The implicit assumption


in NL-Means is that patches that are similar in a noisy image will also be similar in the

original image, as noise is i.i.d. The essential principle of image self-similarity underlying

NL-Means is the same as the one used in fractal image coding methods [123]. The

NL-Means algorithm can also be interpreted as a minimizer of the conditional entropy

of a central pixel value given the intensity values in its neighborhood [124], [125], and

hence it is rooted in similar principles as the famous Efros-Leung algorithm for texture

synthesis [126]. The conditional entropy is estimated from the conditional density

which is obtained using only the noisy image in [124] or an external patch database in

[125]. Other denoising algorithms that exploit image self-similarity include [127] or the

long-range correlation method proposed in [128] and [129]. A variational formulation

for the NL-Means technique is presented in [130]. The concept of non-local similarity

is typically used only in the context of translations, but can also be extended to handle

changes in rotation, scale or affine transformations, as also changes in illumination.

Such models have been studied in [131], [132].

Critique: The performance of the NL-Means algorithm will be affected in those

regions of an image which do not have similar patches elsewhere in the image. The

performance of the technique is also dependent on the parameter β and the patch size.

Indeed, for large values of β, for large patch sizes or if the algorithm is iterated several

times, the residuals produced may contain discernible image features (see Figure 9 of [122]).

It is easy to interpret one iteration of NL-means as the product of a row-stochastic

matrix A1 of size N × N with the noisy image (represented as a column vector). Here

N is the number of pixels. The entries of A1 are given by the weights from Eqn. 6–17.

If NL-Means is executed iteratively, the weight matrix will change. Let us denote the

weight matrix at the i-th iteration by Ai. Therefore, after multiple iterations, the resulting

image is obtained from the product of the matrix Aπ with the original vectorized image,

where Aπ is given by

A_\pi = \prod_{i=1}^{n} A_i.    (6–18)


It has been proved recently [133] that the limiting product of any sequence of row-stochastic

matrices yields a matrix with all rows identical to one another. When such a matrix is

multiplied with the image vector, it invariably produces a flat image. This theorem is

mentioned in Appendix B. This proves that the limit of the NL-Means algorithm is a flat

image.
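A small numerical experiment illustrates this limiting behavior, using a single fixed row-stochastic matrix with positive entries (a special case of the sequence A_1, A_2, ...):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
image = rng.random(n) * 255          # a vectorized 'image'
A = rng.random((n, n)) + 0.1         # strictly positive weights
A /= A.sum(axis=1, keepdims=True)    # normalize rows: row-stochastic
x = image.copy()
for _ in range(100):                 # iterate the smoothing step
    x = A @ x
print(float(np.ptp(x)))              # peak-to-peak spread: essentially zero
```

After many iterations the entries of x become identical to machine precision, i.e., a flat image.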

The aforementioned non-local formulation has led to the development of the BM3D

(block matching in three dimensions) method [134] which is considered the current

state of the art in image denoising with excellent performance shown on a variety of

images. This method operates at the patch level and for each reference patch in the

image, it collects a group of similar patches. In the particular implementation in [134],

similarity is defined in terms of the Euclidean distance between pre-filtered patches.

These similar patches are then stacked together to form a 3D array. The entire 3D array

is projected onto a 3D transform basis, where coefficients below a selected threshold

value are set to zero. The filtered patches are then reconstructed by inversion of the

transform. This process is repeated over the entire image in a sliding window fashion.

At each step, all patches in the group are filtered and the multiple hypotheses generated

for a pixel are averaged. The authors term the collective filtering of a group of patches

as ‘collaborative filtering’ and claim that the group of patches exhibit greater sparsity

collectively than each individual patch in the group, citing that as the reason for the state

of the art performance of the BM3D method. In the specific implementation in [134],

the 3D transform is implemented in the following way. First, the individual patches are

filtered by projection onto a 2D transform basis (in this case the 2D DCT basis) followed

by hard thresholding of the coefficients. Once all these patches in the group are filtered

individually, each pixel stack (consisting of the corresponding pixels from all the patches)

is again filtered by means of a 1D Haar transform. The multiple hypotheses appearing at

any pixel are averaged to produce the final smoothed image.
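The core grouping-and-filtering step can be sketched as follows. This is a simplified illustration of the collaborative filtering idea, not the actual BM3D implementation of [134]: a 1D DCT along the stack stands in for the 1D Haar transform, and a single joint threshold is applied to the 3D coefficients.

```python
import numpy as np

def dct_mat(n):
    # Orthonormal DCT-II matrix; rows are the basis vectors.
    k, x = np.arange(n)[:, None], np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * x + 1) / (2 * n))
    D[0, :] = np.sqrt(1.0 / n)
    return D

def filter_group(stack, lam):
    # 'stack' holds K similar p x p patches (shape K x p x p).
    # Apply a separable 3D transform: a 2D DCT within each patch and a
    # 1D DCT along the stack (a stand-in for the 1D Haar transform of
    # [134]), hard-threshold the 3D coefficients, and invert.
    K, p, _ = stack.shape
    D2, D1 = dct_mat(p), dct_mat(K)
    c = np.einsum('ij,kjl,ml->kim', D2, stack, D2)  # 2D DCT per patch
    c = np.einsum('ij,jkl->ikl', D1, c)             # transform along stack
    c[np.abs(c) < lam] = 0.0                        # 3D hard thresholding
    c = np.einsum('ji,jkl->ikl', D1, c)             # invert stack transform
    return np.einsum('ji,kjl,lm->kim', D2, c, D2)   # invert 2D DCT
```

Since both transforms are orthonormal, setting lam = 0 recovers the input group exactly.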


The denoising results using the BM3D method are truly outstanding. However

the method is complex with several tunable parameters such as patch size, transform

thresholds, similarity measures, etc. Therefore, it may not be very easy to isolate

the exact effect of each component on the denoising performance. Furthermore, the

stacking together of similar patches to form a 3D array imposes a signal structure in the

third dimension. In fact, one would expect the ordering of the individual patches in the

3D array to affect the filter performance.

Another very competitive (albeit computationally expensive) approach for image

denoising, which makes use of non-local similarity, is the total least squares regression

method introduced in [135]. In this method, for each reference patch in the noisy image,

a group of similar patches is created. The reference patch is then expressed as a linear

combination of the similar patches and the coefficients of this linear combination are

obtained using total least squares regression. As compared to a simple least squares

regression, the total least squares regression accounts for the fact that the noise exists

in the reference patch as well as the other patches in the group. The computational

complexity is cubic in the number of patches in the group, which is a drawback of this

approach.
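The total least squares solve at the heart of this approach admits a compact SVD-based solution. The following is a sketch of the classical TLS construction, not the authors' code:

```python
import numpy as np

def tls_coefficients(X, y):
    # Express y as X @ b while allowing errors in both X (the similar
    # patches, as columns of data) and y (the reference patch):
    # classical total least squares via the SVD.
    n = X.shape[1]
    Z = np.hstack([X, y[:, None]])
    _, _, Vt = np.linalg.svd(Z)
    v = Vt[-1]              # right singular vector of the smallest
    return -v[:n] / v[n]    # singular value yields the coefficients
```

The cubic cost noted above stems from this SVD of the stacked patch matrix.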

6.6 Use of Residuals in Image Denoising

There exists some research which tries to make use of the properties of the residual

to drive or constrain the image filtering process. Under the assumption of a noise model,

the overall idea is to drive the denoising technique in such a way that the residual

possesses the same characteristics as the noise model.

6.6.1 Constraints on Moments of the Residual

One of the earliest among these is an approach from [86] which assumes a

Gaussian (i.i.d.) noise model of known σ and tries to impose constraints on the statistics

of the residual (mean and variance) in each iteration of the filtering process. Starting

from a noisy image I0, their algorithm tries to find a smoothed image I (both defined


on a domain Ω) that minimizes the energy functional given in Equation 6–6. The

corresponding Euler-Lagrange has a Lagrange multiplier which is computed by gradient

projection, taking care to ensure that the constraints are not violated [86]. A similar

approach has also been independently proposed in [136].

6.6.2 Adding Back Portions of the Residual

In traditional denoising, the filtering algorithm is run (for some K iterations) to

produce a smoothed image, and the residual is ignored. In the approach by Tadmor,

Nezzar and Vese (called ‘TNV’) [137], a smoothed image J1 is obtained from a noisy

image J0 by minimizing an energy functional containing two terms: the total variation

of J1, and the mean square difference between J0 and J1 integrated over the domain

(data fidelity term). This constitutes the first step of the algorithm. The residual J0 − J1,

however, is not discarded. Instead the same filtering algorithm is now again run on the

residual, in a second step. This decomposes J0 − J1 into the sum of a smoothed image

J2 and another residual J0 − J1 − J2. J2 is added back to the denoised output of the

first step, i.e. to J1. This procedure is repeated some K times, yielding a final ‘denoised

image’ J1 + · · ·+ JK . As K → ∞, the authors of [137] prove that the original noisy image

is obtained again. In practice, an upper bound is imposed on K as a free parameter. A

similar algorithm has also been developed by Osher et al. [138] with a modified data

fidelity term.
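The hierarchical decomposition can be sketched with any smoother standing in for the TV minimization step. The following toy illustration uses a 3-tap binomial filter on a 1D signal; the actual TNV algorithm solves a TV/L2 problem at each level.

```python
import numpy as np

def smooth(x):
    # Stand-in smoother: 3-tap binomial filter [1, 2, 1] / 4.
    return np.convolve(x, np.array([0.25, 0.5, 0.25]), mode='same')

def tnv_sum(noisy, K):
    # J0 = J1 + J2 + ... + JK + residual: smooth the current residual
    # at each level and add the smoothed part back to the estimate.
    total = np.zeros_like(noisy, dtype=float)
    resid = noisy.astype(float).copy()
    for _ in range(K):
        Jk = smooth(resid)
        total += Jk
        resid -= Jk
    return total
```

As K grows, the accumulated sum approaches the noisy input itself, which illustrates why K must be bounded in practice.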

Critique: For both techniques, K is a crucial free parameter. Also when the

smoothed residual is added back at every step, some noise may also get added to

the signal. In experimental results published in a comprehensive survey of image

denoising algorithms [2], the residuals obtained on the Lena image using methods from

[137] and [138] are not totally devoid of image features.

6.6.3 Use of Hypothesis Tests

This approach proposed in [139] assumes that the exact noise model is known

a priori and that the underlying image is piecewise flat. To filter a noisy image, the


algorithm tries to approximate it locally (i.e. in the neighborhood of some radius W

around each point) by a constant value in such a way that the residual satisfies the noise

hypothesis. In fact, it chooses the maximum value of W for which the local distribution

of the residual in a small neighborhood around any image pixel is close to the assumed

noise distribution. Here, ‘closeness’ is defined using one of the canonical hypothesis

tests. This procedure is repeated at every point in the image domain. It should be

noted that this problem is difficult in two or more dimensions, whereas in 1D it can be

solved easily using a dynamic programming method like a segmented least squares

approach [140]. Another related paper [141] presents an algorithm that is similar to the

TNV approach described earlier, with one important change: the ‘denoised’ residual is

added back only at those points (x , y) such that the residual in a neighborhood around

(x , y) violates the hypothesis that it consists of a set of samples from the assumed

noise distribution. Experimental results are demonstrated with a very simple isotropic

Gaussian smoothing algorithm, showing a decrease in the amount of features that are visible in the residual.

6.6.4 Residuals in Joint Restoration of Multiple Images

The authors of [142] observe that when multiple images of an object are acquired,

the noise affecting the individual images is often independent across the images, even

if the noise model is not independent of the underlying signal. They exploit this in a

denoising framework that enforces the individual residuals for each of the images to be

independent of one another. The particular independence measure chosen is the sum

of pairwise mutual information values. An iterative optimization procedure is proposed.

Critique: As argued in section 6.6.1, mere statistical constraints do not guarantee

‘noiseness’ of the residual, especially if more complicated image models are to be

considered. A bigger problem is that merely satisfying the properties of the residual is

not guaranteed to lead to adequate restoration of the image geometry. In fact, a direct

enforcement of noise-like properties of the residual can lead to serious undersmoothing.


The properties of the residuals can, however, be used for automatically finding individual

smoothing parameters, as will be discussed in Chapter 8.

6.7 Denoising Techniques using Machine Learning

In the transform domain methods discussed in Section 6.4, a fixed transform basis

is chosen for signal representation. There exist several papers which attempt to tune

the transform basis based on the statistics of image features or patches. For instance

in [118], the authors use noise-free training data to learn independent components

of the training vectors. The learned ICA basis is then used to denoise noisy image

patches using a maximum likelihood model, leading to a soft shrinkage operation. In

this particular case, the learned basis is orthonormal. However, there has been recent

interest in learning overcomplete bases (also called dictionaries), where the number

of vectors in the dictionary exceeds their dimension. This has largely been pioneered

by works such as [143], [144], [145]. These approaches are of interest because the

inherent redundancy of the vectors in the dictionary leads to more compact (sparser)

representation of natural signals. In fact, these papers specially tune the dictionaries in

such a way that natural image patches possess sparse representations when projected

onto the dictionary.

In the more recent literature, the KSVD algorithm [146], [147] has gained popularity

in the image denoising community. In this technique, starting from overlapping patches

from a noisy image, an overcomplete dictionary as well as sparse representations of

the patches in that dictionary are learned in an alternating minimization framework.

The algorithm has produced excellent results on denoising [146]. The name KSVD

stems from the fact that the K columns of the dictionary are updated one at a time

using a singular value decomposition (SVD) operation. A multi-scale variant of this

algorithm (known as MS-KSVD) learns dictionaries to represent patches at two or more

scales leading to further redundancy [114]. This algorithm has yielded state of the art

performance, on par with the BM3D algorithm [134] described in the previous section


[114]. However KSVD and MS-KSVD both require an expensive iterated optimization

procedure for which no convergence proof has been established so far. The alternating

minimization framework is subject to local minima [114] and requires parameters such

as level of sparsity. Some of these parameters are chosen to be a direct function of

the noise variance. However, in successive iterations of the optimization, the image is partially smoothed; this changes the effective noise variance and therefore affects the quality of subsequent parameter updates (see Sections 3 and 4

of [146]).
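The sparse-coding step inside such dictionary-based methods is typically a greedy pursuit. The following is a minimal orthogonal matching pursuit sketch, an illustrative stand-in for the pursuit stage used in methods like [146], not the KSVD dictionary update itself:

```python
import numpy as np

def omp(D, y, k):
    # Orthogonal matching pursuit: greedily select k dictionary atoms
    # (columns of D) to represent y, re-fitting the coefficients of all
    # selected atoms by least squares after each selection.
    resid, idx = y.astype(float).copy(), []
    for _ in range(k):
        idx.append(int(np.argmax(np.abs(D.T @ resid))))
        coef, *_ = np.linalg.lstsq(D[:, idx], y, rcond=None)
        resid = y - D[:, idx] @ coef
    x = np.zeros(D.shape[1])
    x[idx] = coef
    return x
```

In KSVD, this coding step alternates with the SVD-based update of the dictionary columns.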

In the KSVD approach, a single overcomplete dictionary is learned for the entire

image. As opposed to this, the authors of [148] perform a clustering step on the patches

from the noisy image and then represent the patches from each cluster separately

using principal components analysis (PCA). In practice, the clustering step (K-means) is

performed on coarsely pre-filtered patches and the learned PCA bases are necessarily

of lower rank. The denoised intensity values are produced by means of a kernel

regression framework from [65]. The entire procedure is iterated for better performance.

The authors call this the KLLD (K locally learned dictionaries) approach [148]. The

idea of using a union of different orthonormal (PCA) bases for each cluster (as opposed

to a single complex basis for the union of all clusters) is interesting. However, the

method has free parameters such as the pre-filtering procedure, the clustering algorithm

and the number of clusters.

It should be noted that both KSVD and KLLD use non-local patch similarity in

learning the bases. Hence, they can also be classified as non-local approaches

described in Section 6.5. The sparse dictionary based methods have gained popularity

not only in image denoising but also other restoration problems such as super-resolution

[149].


6.8 Common Problems with Contemporary Denoising Techniques

There are common issues concerning most contemporary denoising techniques

which we briefly review in this section.

6.8.1 Validation of Denoising Algorithms

There is no clear consensus on the methods for validation of the performance of

denoising algorithms. Given two denoising algorithms and their outputs on a noisy input image of N pixels, the primary requirement is that of a valid quality

measure for comparing their relative performance. The quality measure decides the

proximity between the denoised image and the true image (i.e. the clean image, devoid

of any degradation). The most common quality measure is the mean squared error

(MSE) defined as follows

MSE(A, B) = \frac{1}{N} \sum_{i=1}^{N} (A_i - B_i)^2    (6–19)

and the peak signal-to-noise ratio (PSNR), which is computed from the MSE as

PSNR(A, B) = 10 \log_{10} \frac{255^2}{\text{MSE}(A, B)}.    (6–20)

The lower the MSE (or higher the PSNR), the better the performance of the denoising

algorithm.
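Both measures take a few lines of NumPy (here with the 8-bit peak value 255, as in Eq. 6–20):

```python
import numpy as np

def mse(a, b):
    # Mean squared error between two images of equal size (Eq. 6-19).
    return float(np.mean((a.astype(float) - b.astype(float)) ** 2))

def psnr(a, b):
    # Peak signal-to-noise ratio in dB (Eq. 6-20), for 8-bit images.
    return 10.0 * np.log10(255.0 ** 2 / mse(a, b))
```

For two images differing by one gray level everywhere, MSE = 1 and PSNR ≈ 48.13 dB.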

Now, the ideal quality measure should be in tune with what we as humans perceive

to be a ‘better’ image. This is a tricky issue as it is affected by several factors including

the type of display system. While the MSE is a very intuitive measure (and its square root, the RMSE, is a metric), it is not necessarily in tune with perceptual quality because it weighs

errors in every pixel equally. However, the human eye is hardly sensitive to minor errors

in high-frequency textured regions such as the fur of the mandrill in Figure 6-1, or

carpet texture. Therefore, even if an image contains perturbations in its high-frequency textured portions and consequently has a high MSE, it may still be regarded


as a good quality image from a perceptual point of view. Several such limitations of the

MSE/PSNR have been documented in [150] with numerous examples.

To address these limitations, the authors of [150], [151] propose a new quality measure termed the structural similarity index (SSIM), which measures the similarity between the

corresponding patches of images A and B. The similarity is measured in terms of

the proximity of the mean values of the patches, their variances and also a structural

similarity in terms of the correlation coefficient. Given two patches A(i) and B(i) from

images A and B respectively, this is represented as follows:

SSIM(A^{(i)}, B^{(i)}) = \frac{2\mu_a \mu_b}{\mu_a^2 + \mu_b^2} \times \frac{2\sigma_a \sigma_b}{\sigma_a^2 + \sigma_b^2} \times \frac{\sigma_{ab}}{\sigma_a \sigma_b}    (6–21)

= \frac{2\mu_a \mu_b}{\mu_a^2 + \mu_b^2} \cdot \frac{2\sigma_{ab}}{\sigma_a^2 + \sigma_b^2}    (6–22)

where µa and µb are the mean values of patches A(i) and B(i) respectively, σa and σb are

their respective standard deviations and σab is the covariance between the patches. For

comparison between the complete images, the measure is defined as

SSIM(A, B) = \frac{1}{N_P} \sum_i SSIM(A^{(i)}, B^{(i)})    (6–23)

where NP is the number of non-overlapping patches. In practice, the statistics from

all patch locations are not weighted equally, but with a symmetric Gaussian window of some chosen (small) standard deviation [151]. The SSIM is known to correlate well with the human visual system [151]; however, it is unstable when any of the terms in the denominators approaches zero, and it requires appropriate scale selection. While there

exists a multi-scale equivalent [63] (denoted as MSSIM) which combines SSIM values

at different image scales, the choice of scale for measurement of the statistics is still

an open issue. In the experimental results reported in Chapter 7, we perform validation

using PSNR and using SSIM with the default parameter settings of the authors of [63] (for instance, a window size of 11 × 11).
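The per-patch computation of Eq. 6–22 is straightforward. The sketch below omits the small stabilizing constants that the full SSIM definition of [151] adds to each denominator, so it is an illustration rather than the published index:

```python
import numpy as np

def ssim_patch(a, b):
    # Simplified per-patch SSIM (Eq. 6-22); the published index adds
    # small constants to both denominators for numerical stability.
    a, b = a.astype(float), b.astype(float)
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    return (2 * mu_a * mu_b) / (mu_a ** 2 + mu_b ** 2) * \
           (2 * cov) / (var_a + var_b)
```

A patch compared with itself scores 1; a contrast-inverted patch drives the structural term negative.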


6.8.2 Automated Filter Parameter Selection

While research on image denoising has been very extensive, the literature on

automated methods for selecting appropriate filter parameters is not very large. Most

techniques select the best parameter retroactively in terms of optimizing a full-reference

image quality measure. We defer further discussion on this topic to Chapter 8.

Figure 6-1. Mandrill image: (A) with no noise, (B) with noise of σ = 10, (C) with noise of σ = 20; the noise is hardly visible in the textured fur region (viewed best when zoomed in the PDF file).


CHAPTER 7
BUILDING UPON THE SINGULAR VALUE DECOMPOSITION FOR IMAGE DENOISING

7.1 Introduction

This chapter describes two new algorithms for gray-scale image denoising. Our

methods are largely based upon the classical technique of singular value decomposition

(SVD), a popular concept in linear algebra. The SVD was first applied for image

filtering and compression applications in [152] and [153]. On a stand-alone basis, its

performance on filtering leaves much to be desired. However during this thesis, we

have explored several ideas which build upon the SVD, leading to simple and elegant

techniques with excellent performance. Many of the intermediate ideas that were explored failed to produce good results in terms of denoising performance. While the

vast majority of contemporary research literature focuses only on positive results, we

choose to adopt a different philosophy. We shall present negative results (and wherever

possible, analyze the reasons for the negative results) in addition to the positive ones

that are on par or better than the state of the art. We hope that this will provide readers

of this thesis with better insight and open up ideas for future research.

We observe that a principled denoising technique can be motivated by the following

considerations. What constitutes a good model for the images being dealt with? What

is known about the noise model? What properties distinguish a clean image from

one containing pure noise? We make the following assumptions in the theoretical

description and experimental results. We assume a gray-scale image in the intensity

range [0, 255] defined on a discrete rectangular domain Ω. In the techniques that we

investigate, we exploit different well-known properties of natural images. We assume

a zero mean i.i.d. (independent and identically distributed) Gaussian noise model of a

fixed standard deviation σ, as the common degradation process. We do not cover the

case of signal-dependent noise or those without precise probabilistic characteristics

(such as noise induced by lossy compression algorithms) in this thesis.


7.2 Matrix SVD

The matrix SVD is a popular technique in linear algebra with a wide variety of

applications in signal processing, such as filtering, compression and least-squares

regression to name a few. Given a matrix A of size m × n defined on the field of real

numbers, there always exists a factorization of the following form [154]

A = U S V^T (7–1)

where U is an m × m orthonormal matrix, S is an m × n diagonal matrix of non-negative

'singular' values and V is an n × n orthonormal matrix. Conventionally, the entries of S

are arranged in descending order of magnitude. Moreover, if the singular values are

distinct, the matrix SVD is unique, modulo simultaneous sign changes on the columns of

U and V. The columns of V (called the right singular vectors) are the eigenvectors of

A^T A, whereas the columns of U (called the left singular vectors) are the eigenvectors of

AA^T, and the singular values in S turn out to be the square roots of the eigenvalues of

A^T A (equivalently AA^T). A geometric interpretation of the SVD (assuming real vector

spaces) is presented in [154]. The matrix SVD has beautiful mathematical properties

such as providing a principled method for the nearest orthonormal matrix, and the best

lower-rank approximation to a matrix (both in the sense of the Frobenius norm) [154].
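These properties are easy to verify numerically. The following sketch (ours, not part of the thesis) checks the factorization and the eigenvalue relation with NumPy:

```python
import numpy as np

# Illustrative check of the SVD properties stated above (our sketch).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U S V^T

# Rebuild the m x n diagonal matrix S and verify the factorization.
S = np.zeros((5, 3))
S[:3, :3] = np.diag(s)
assert np.allclose(A, U @ S @ Vt)

# Singular values are the square roots of the eigenvalues of A^T A,
# conventionally arranged in descending order.
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(np.sqrt(np.clip(eigvals, 0.0, None)), s)
```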

7.3 SVD for Image Denoising

It is well-known that the singular values of natural images follow an exponential

decay rule [155]. This property also holds true for Fourier coefficient magnitudes. In

fact, the SVD bases have a frequency interpretation. The smaller singular values of the

image correspond to higher frequencies and the large values correspond to the lower

frequency components. This property of the SVD has been used both in denoising [152]

as well as in compression [153].

Now, consider a noisy image A (a degraded version of an underlying clean

image A_c) affected by additive Gaussian noise of standard deviation σ. Filtering is


accomplished by computing the decomposition A = U S V^T and then nullifying the

smaller singular values of A, which effectively discards higher frequency components (which are

known to correspond mostly to noise) [152]. An example of this procedure is illustrated

in Figure 7-1, where all singular values smaller than some k th singular value were set to

zero. It is clearly seen that low rank truncation (i.e. if the index k is chosen to be small)

produces blurry images and increasing the rank adds in image details but introduces

more and more noise. Taking this sub-par performance into account, this decomposition

is instead performed at the level of image patches. Indeed, small patches capture local

information which can be compactly represented with small-sized bases. The SVD is

computed in a sliding window fashion and filtered versions of overlapping patches are

averaged in order to produce a final filtered image. The averaging is useful for removing

seam artifacts at patch boundaries and also brings in multiple hypotheses. These results

are shown in Figure 7-2 for different settings: (1) rank 1 and rank 2 truncation of each

patch, (2) nullification of patch singular values below a fixed threshold of σ√(2 log N)

(where N is the number of image pixels), and (3) truncation of singular values in such

a way that the residual at each patch has a standard deviation of σ (i.e. a standard

deviation equal to that of the noise).
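Setting (2) above can be sketched as follows (an illustrative implementation with our own naming, not the thesis code): hard-threshold each sliding-window patch's singular values and average the overlapping filtered patches.

```python
import numpy as np

def patch_svd_denoise(noisy, sigma, p=8):
    """Illustrative sketch (not the thesis code) of setting (2) above:
    hard-threshold each patch's singular values at sigma*sqrt(2*log(N)),
    N being the number of image pixels, and average the overlapping
    filtered patches."""
    H, W = noisy.shape
    thresh = sigma * np.sqrt(2.0 * np.log(H * W))
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for i in range(H - p + 1):
        for j in range(W - p + 1):
            P = noisy[i:i + p, j:j + p]
            U, s, Vt = np.linalg.svd(P, full_matrices=False)
            s[s < thresh] = 0.0                          # nullify small singular values
            acc[i:i + p, j:j + p] += U @ np.diag(s) @ Vt # accumulate hypothesis
            cnt[i:i + p, j:j + p] += 1.0                 # hypotheses per pixel
    return acc / cnt
```

Settings (1) and (3) would replace the fixed threshold with a rank cutoff or a per-patch residual criterion respectively.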

7.4 Oracle Denoiser with the SVD

Despite the improvement in results with the patch-based method (as seen upon

comparing Figures 7-2 and 7-1), the filtering performance is still far from desirable.

The main reason for this is that the singular vectors are unable to adequately separate

signal from noise. There are two key observations we make here. Firstly, let Q and

Q_n be corresponding patches from a clean image and its noisy version respectively.

Given the decomposition Q_n = U_n S_n V_n^T, the projection of Q (the true patch) onto the

bases (U_n, V_n) is given as S_Q = U_n^T Q V_n. This matrix S_Q is non-diagonal and hence

contains more non-zero elements than S_n. Note the fact that S_Q is 'denser' than S_n.

Despite this, if we could somehow change the entries in S_n to match those in S_Q, we


would now have a perfect denoising technique. Nevertheless, SVD-based filtering

techniques emphasize low-rank truncation or other methods of increasing the sparsity

of the matrix of singular values. We shall dwell more on this point in Section 7.5 in

the context of filtering with SVD bases as well as universal bases such as the DCT.

The second important observation is that the additive noise doesn’t just affect the

singular values of the patch but the singular vectors (which are the eigenvectors of the

row-row and column-column correlation matrices of the patch) as well. Bearing this in

mind, it is strange that SVD-based denoising techniques do not seek to manipulate the

orthonormal bases and instead focus only on changing the singular values. We now

perform the following experiment which starts with a noisy image and assumes that

the true singular vectors of the clean patch underlying every noisy patch in the image

are known or provided to us by an oracle. The denoising technique now proceeds as

follows:

1. Let the SVD of the clean patch Q^(i) underlying the noisy patch Q_n^(i) be Q^(i) = U S V^T. Project the noisy patch Q_n^(i) onto these bases to produce a matrix S_Q^(i) = U^T Q_n^(i) V.

2. Set to zero all elements in S_Q^(i) such that |S_Q^(i)| < λσ.

3. Produce the denoised version of Q_n^(i) by inverting the projection.

4. Repeat the above procedure in sliding window fashion and average all the hypotheses at every pixel to yield a denoised image.

We term this method the 'oracle denoiser'. Given 8 × 8 patches, we choose the

threshold of λ = 3 for the following reasons. Firstly, zero mean Gaussian random

variables with standard deviation σ (i.e. belonging to N(0, σ)) have magnitude less than 3σ

with high probability, and projections of matrices of Gaussian random variables onto

orthonormal bases also obey this rule (experimentally, this probability was observed to

be very close to 1). Secondly, λ = 3 comes close to the ideal threshold of λ = √(2 log n²)

from [113] for patches of size n × n.
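One step of this oracle procedure can be sketched as follows (illustrative only; the function name and interface are ours, not the thesis's):

```python
import numpy as np

def oracle_denoise_patch(Q_clean, Q_noisy, sigma, lam=3.0):
    """Sketch of one step of the 'oracle denoiser': project the noisy
    patch onto the SVD bases (U, V) of the clean patch (supplied by the
    oracle), zero the coefficients with magnitude below lam*sigma, and
    invert the projection. Illustrative implementation."""
    U, _, Vt = np.linalg.svd(Q_clean)
    S_Q = U.T @ Q_noisy @ Vt.T            # step 1: project onto oracle bases
    S_Q[np.abs(S_Q) < lam * sigma] = 0.0  # step 2: hard threshold
    return U @ S_Q @ Vt                   # step 3: invert the projection
```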


Sample experimental results with the above technique are shown in Figure 7-3

for two noise levels: 20 and 40. The resulting PSNR values of this ideal denoiser far

exceed those of state-of-the-art methods such as BM3D [134]. Clearly, this experiment is

not possible in practice, however it serves as a benchmark, drives home an important

deficiency of contemporary SVD filtering approaches, and chalks out a path for us to

explore: manipulating the SVD bases of a noisy patch, or somehow using bases that are

‘better’ than the SVD bases of the noisy patch, may be the key to improving denoising

performance.

7.5 SVD, DCT and Minimum Mean Squared Error Estimators

As has been described in Section 6.4, nullification of the smaller coefficient values

from the projection of a noisy patch onto a basis is actually a MAP estimator of the

coefficients of the true patch. The MAP estimator is driven by sparsity-promoting image

priors which hold for image ensembles but not necessarily for every individual image.

We therefore explore minimum mean square error (MMSE) estimators for estimation of

the true projection coefficients.

7.5.1 MMSE Estimators with DCT

The idea of using MMSE estimators is inspired by the work in [120]. However, there

is one major difference between the approach from [120] and the one we present here.

In [120], the authors learn a generic rule to optimally perturb the DCT coefficients of

an ensemble of noisy image patches so as to reduce the mean squared error with the

DCT coefficients of their corresponding underlying clean patches. A different rule is

learned for each DCT coefficient (the number of coefficients is equal to the patch-size)

or for each sub-band, though all the rules are common across patches. However, we

have observed experimentally that the optimal rules for patches of different geometric

structures differ significantly from one another (see Figure 7-5). Therefore, we move

away from the notion of a single set of rules for the entire ensemble and instead learn

a different set of rules for each training patch. We make the definition of the word ‘rule’


more precise in the following. Consider the i-th patch from a database PD of N patches.

We shall denote the patch as I_i. Let its size be n × n and let its k-th DCT coefficient be

I_i^(k), where 1 ≤ k ≤ n². Let us denote by J_ij the j-th noisy instance of patch I_i, where

1 ≤ j ≤ M, and let J_ij^(k) be its k-th DCT coefficient. Then for each 1 ≤ k ≤ n² and for

each patch 1 ≤ i ≤ N, we may seek a perturbation ε_i^k such that

ε_i^k = argmin_ε ∑_{j=1}^M (J_ij^(k) + ε − I_i^(k))². (7–2)

Unfortunately, the values of corresponding DCT coefficients belonging to multiple noisy

instances of a patch show considerable variance, which prevents the learning of any

meaningful perturbation rule. To alleviate this problem, we quantize the values of each

DCT coefficient into a fixed number of bins, say B. Thus for the k th coefficient of the i th

patch, we no more learn a single scalar value, but a set of B perturbation values εkib,

one for each bin. This can be mathematically expressed as

ε_ib^k = argmin_ε ∑_{j=1}^M δ_b(J_ij^(k)) (J_ij^(k) + ε − I_i^(k))² (7–3)

where

δ_b(J_ij^(k)) = 1 if ⌊ (J_ij^(k) − m_ik^(2)) B / (m_ik^(1) − m_ik^(2)) ⌋ = b, and 0 otherwise. (7–4)

In the above equation, we define the following terms:

m_ik^(1) = max_j J_ij^(k) (7–5)

m_ik^(2) = min_j J_ij^(k). (7–6)

Note that the quantization of the coefficients is motivated by the fact that the perturbation

of the coefficients owing to corruption by Gaussian noise shows some regularity, since

random variables from N (0,σ) lie within a bounded interval [−3σ, +3σ] with very high

probability.


Now, given a noisy image (which does not appear in the database PD), we divide

it into patches. For each noisy patch P, we search for its nearest neighbor from the

training patch database PD. Let the index of this nearest neighbor be s. We now apply

the corresponding rules already learned, i.e. the perturbations ε_sb^k (1 ≤ k ≤ n²),

to denoise the patch P. As per this rule, the k-th coefficient of P, denoted by P^(k), is

changed to

P̂^(k) = P^(k) + ε_sb^k (7–7)

where b is the bin for which δ_b(P^(k)) = 1.

It is quite possible that the value of a particular DCT coefficient P^(k) falls outside

the range [m_sk^(2), m_sk^(1)]. In such cases we follow the heuristic method of applying the

perturbation from the bin that lies closest to the value P(k). From Equation 7–3, we also

see an implicit assumption that the perturbation values are constant within any bin.

While more sophisticated perturbation functions (say, linear within any bin) are possible,

we stick to piecewise constant functions for simplicity.
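The learning and application of these bin-wise rules can be sketched for a single coefficient k as follows (an illustrative implementation with our own names; the thesis learns such rules for all n² coefficients of every training patch). The MSE-minimizing constant shift within a bin is simply the mean residual over that bin.

```python
import numpy as np

def learn_bin_perturbations(J_k, I_k, B=20):
    """Sketch of the bin-wise rule of Equation 7-3 for one coefficient k.
    J_k holds the k-th coefficient of M noisy instances of one patch,
    I_k is the clean coefficient. Illustrative; names are ours."""
    m1, m2 = J_k.max(), J_k.min()
    bins = np.minimum(((J_k - m2) / (m1 - m2) * B).astype(int), B - 1)
    eps = np.zeros(B)
    for b in range(B):
        in_bin = bins == b
        if in_bin.any():
            # Constant shift minimizing sum_j (J + eps - I)^2 over the bin.
            eps[b] = np.mean(I_k - J_k[in_bin])
    return m1, m2, eps

def apply_perturbation(p_k, m1, m2, eps):
    """Apply the learned rule to coefficient p_k of a new noisy patch,
    clamping to the nearest bin if p_k falls outside [m2, m1]."""
    B = len(eps)
    b = int(np.clip((p_k - m2) / (m1 - m2) * B, 0, B - 1))
    return p_k + eps[b]
```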

7.5.2 MMSE Estimators with SVD

We have previously motivated the fact that using better SVD bases can help in

improving denoising results. Suppose that for each patch I_i in the patch database PD,

we compute its SVD as I_i = U_i S_i V_i^T. We conjecture that the bases (U_i, V_i) can serve as

effective denoising filters. Again, let J_ij be the j-th noisy instance of patch I_i (1 ≤ j ≤ M).

The projection of J_ij onto (U_i, V_i) is S_ij = U_i^T J_ij V_i. We seek to learn rules ε_ib^k for values

of J_ij^(k) quantized into B bins just as in Equation 7–3. Now, given a patch P from a noisy

image, let its nearest neighbor from the database be patch I_s. We then project P onto

(U_s, V_s), giving us the matrix S_s = U_s^T P V_s, and we modify the coefficients in S_s using the

perturbation rules ε_sb^k (1 ≤ k ≤ n², 1 ≤ b ≤ B) already learned for I_s. The perturbation

is carried out in the same way as in Equation 7–7.


7.5.3 Results with MMSE Estimators Using DCT

7.5.3.1 Synthetic patches

We first experiment with a set of 15 synthetically generated patches (all of size

8 × 8) of different geometric structures. We generated 500 noise instances of each

patch from N (0, 20). A quantization of 20 bins was used for every DCT coefficient.

The synthetic patches are shown in Figure 7-4. The statistics of the mean squared

errors between the true and reconstructed patches (namely the average, maximum

and median reconstruction errors, all measured across the different noise instances)

are shown in Table 7-1 for two methods: the MAP estimator which sets to zero all DCT

coefficients whose absolute value is below 3σ (to which we shall henceforth refer as

the MAP estimator), and the MMSE estimator described previously. Clearly, the MMSE

errors are consistently lower. For some patches (such as the X-shaped patch in Figure

7-4), we obtained perturbation functions that were not strictly monotonic, as can be seen

in Figure 7-8.

7.5.3.2 Real images and a large patch database

Next, we built a corpus of 12000 patches of size 8 × 8 taken from the first five

images of the Berkeley database [61], all converted to gray-scale. The size of each

image was about 320 × 480. We generated 500 noise instances of each patch from

N (0, 20). The perturbation values were learned as indicated in Equation 7–3 for a

quantization of 30 bins per coefficient. During training, we again consistently observed

lower reconstruction errors for the MMSE estimator than the MAP estimator. Next,

given a noisy image, we divided it into non-overlapping patches and denoised each

patch as per the perturbation functions learned for the nearest neighbor (in the corpus)

corresponding to each patch. The reconstruction results with this MMSE method as

well as the MAP estimator are shown in Figures 7-6 and 7-7. A quick glance reveals

that reconstruction with the MAP estimator exhibits considerably more ringing artifacts

than the MMSE estimator. But owing to the non-overlapping nature of the patches, both


the MMSE and MAP reconstructions show patch seam artifacts. These seam artifacts

can be eliminated by denoising overlapping patches and then averaging the results as

shown in Figures 7-6 and 7-7. Surprisingly, we obtain lower PSNR values for the MMSE

method with overlap than for MAP with overlap. We ascribe this drop in performance of

the MMSE estimator to two factors: errors in the results of the nearest neighbor search

for noisy adjacent patches (the accuracy of which will be affected by noise), and much

more importantly, errors due to the limited patch representation in the database. Indeed,

the nearest neighbor from the database may not be close enough to produce an MMSE

estimator that produces a reconstruction close enough to the true underlying patch.

7.5.4 Results with MMSE Estimators Using SVD

We now explore what happens if similar experiments are performed on SVD bases

(which are properties of individual patches) rather than on universal bases.

7.5.4.1 Synthetic patches

We ran the experiment on the same 15 synthetic patches as in Section 7.5.3.1,

with 500 noisy instances of each patch drawn from N (0, 20). A quantization of 20

bins was used for every SVD coefficient. The synthetic patches are shown in Figure

7-4. The statistics of the mean squared errors between the true and reconstructed

patches (namely the average, maximum and median reconstruction errors, all measured

across the different noise instances) are shown in Table 7-2 for two methods: the MAP

estimator, and the MMSE estimator described previously. The MMSE errors are again

consistently lower than the MAP errors. For some patches (such as the X-shaped patch

in Figure 7-4), we again obtained perturbation functions that were not strictly monotonic,

as can be seen in Figure 7-4. Notice that the errors with MMSE estimators on SVD are

much lower than those with DCT (compare Tables 7-1 and 7-2), the reason being that

in this experiment, we have access to the SVD bases of the true underlying patches

(whereas the DCT was a universal basis).


7.5.4.2 Real images and a large patch database

We used the same corpus of patches generated in Section 7.5.3.2. The SVD

bases were computed for all 12000 patches. Perturbation rules were learned to change

the values of the projection matrix to optimize average MSE across noise instances

and these rules were stored. Next, patches from a given noisy image (again, different

from any of the training images) were projected onto the SVD bases of the nearest

neighbor in the corpus. The coefficients were manipulated with the MAP rule as well

as the learned MMSE rules to produce two separate outputs. To our surprise, the

performance of the MMSE estimator was very poor. The MAP estimator with SVD

performed reasonably well but not as well as the one applied on DCT bases. These

results are shown in Figure 7-9 on the Barbara image which was subjected to noise

from N (0, 20) (starting PSNR 21.5). The PSNR values with MMSE on SVD, MAP on

SVD and the oracle estimator were 25.2, 28.85 and 36.6 respectively. Based on this,

we draw the following conclusions. The MMSE errors were very low during training but

high during testing. This clearly indicates an overfitting problem when dealing with SVD

bases, one much more severe than with DCT bases. Consider that

we are given an arbitrary training database, and an arbitrarily chosen noisy image for

testing. It is highly unlikely that we could find an exact match in the database for every

image patch. The rules that were learned on the noisy instances of the exact same

patch do not seem to apply very well to other ‘similar’ patches.

However, we wish to emphasize that there is still merit in the idea of attempting to

manipulate the SVD bases. This is evidenced by the improvement in the performance of

the MAP estimator applied on projections onto the SVD bases of the nearest neighbor

from the database, over that of the same estimator applied to the SVD bases of the

noisy patch itself.


7.6 Filtering of SVD Bases

We have observed that the SVD bases of adjacent patches (i.e. patches with their

top-left corners at adjacent pixels) from clean natural images tend to exhibit greater

similarity than those from noisy versions of those images. The similarity is quantified in

terms of the angles between unit vectors from corresponding columns of the U matrices

(or those of the V matrices) of the adjacent patches. This observation is clearly a

property of natural image patches (and not a mere consequence of the fact that we

computed SVD bases of matrices that had several rows or columns in common). With

this in mind, we explored the effect of smoothing the U and V bases of adjacent patches

from the image using some averaging techniques. There are three ways this could be

done:

1. Smooth (say by some sort of averaging scheme) the corresponding columns from the U matrices of adjacent patches, and the corresponding columns from the V matrices of adjacent patches.

2. Smooth (say by some sort of averaging scheme) the outer products U_i V_j^T (1 ≤ i ≤ n, 1 ≤ j ≤ n), i.e. outer products of the corresponding columns from the U and V bases computed from adjacent patches.

3. Run a diffusion PDE defined specifically for orthonormal matrices on the U bases and also on the V bases (independently).

There are mathematical complications that arise in the first method: the averaging really

ought to be done by respecting the geometry of the space of orthonormal matrices.

However, the orthonormal matrices with determinant +1 are disjoint from those with

determinant -1. This is problematic from the point of view of computing intrinsic

averages. Furthermore, independent averaging of the U and V matrices ignores the

inherent coupling between them (as given a patch P, they are eigenvectors of PTP and

PPT respectively). Taking averages of outer products of corresponding columns from U

and V helps bring in this dependence. However, it still ignores the dependence between

the different outer products themselves.


Ignoring the above mathematical issues, we computed Euclidean averages. As the

resultant matrices were no longer orthonormal, we orthonormalized them using a QR

decomposition. While computing averages of outer products (in method 2), there are

considerable complications in forcing the averaged outer product to lie in the space of

matrices of the form v_1 v_2^T where ‖v_1‖ = ‖v_2‖ = 1; these were ignored in our experiments.

We performed image denoising experiments by first smoothing the bases computed

from 8 × 8 patches using any of the three techniques, projecting the patches onto

the bases, applying the MAP rule on the coefficients of the projection matrix and

reconstructing the patch by inverting the transform. In case of the diffusion PDE defined

for orthonormal matrices, we used the following isotropic heat equation defined in [156]

for a matrix U ∈ SO(p):

dU^k/dt = −L^k + ∑_{i=1}^p (L^i · U^k) U^i (7–8)

where

L^k = U^k_xx + U^k_yy (7–9)

and U^k stands for the k-th column of U.

Note that coupling between the U and V matrices can be imposed indirectly by

introduction of a data fidelity constraint on the patch P in addition to the smoothness

term on the U and V matrices, and then executing alternating PDEs (Euler-Lagrange

equations) on U and V . However, experimental results on averaging of the SVD bases

were in general not satisfactory. Similar experiments were repeated with nonlocal

averaging of similar U and V matrices from different regions of the image, and there was

no improvement in the results. We conjecture that the smoothness of U and V bases

from adjacent patches may not be a strong enough property of natural images.

7.7 Nonlocal SVD with Ensembles of Similar Patches

We now present an algorithm for image denoising using a non-local extension of the

SVD. We call this algorithm non-local SVD or NL-SVD.


We know that the SVD of a matrix P ∈ R^{m×n} is given as P = U S V^T, where the

columns of U consist of the eigenvectors of the matrix

C_r = P P^T (7–10)

whose element in the i-th row and j-th column is

C_r,ij = ∑_k P_ik P_jk = ⟨P_i, P_j⟩ (7–11)

where P_i and P_j stand for the i-th and j-th rows of P respectively. Similarly, the columns of

V consist of the eigenvectors of the matrix

C_c = P^T P (7–12)

whose element in the i-th row and j-th column is

C_c,ij = ∑_k P_ki P_kj = ⟨P_i^T, P_j^T⟩ (7–13)

where P_i^T and P_j^T stand for the i-th and j-th columns of P respectively. Note that C_r and

C_c are the row-row and column-column correlation matrices of P respectively. We also

know that the SVD gives us the optimal low-rank decomposition of P. In other words, the

optimal solution to

E(P̂) = ‖P − P̂‖² (7–14)

subject to the constraint

rank(P̂) = k, k < m, k < n (7–15)

is given by

P̂ = U_k S_k V_k^T (7–16)

where U_k and V_k are the first k columns of U and V respectively and S_k contains the k

largest singular values of S. This is often called the Eckart-Young theorem [154].
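A quick numerical illustration of this optimality property (our sketch, not from the thesis): the Frobenius error of the rank-k truncation equals the energy of the discarded singular values.

```python
import numpy as np

# Illustrative check of the Eckart-Young property: the rank-k SVD
# truncation achieves Frobenius error equal to the norm of the
# discarded singular values.
rng = np.random.default_rng(1)
P = rng.standard_normal((8, 8))
U, s, Vt = np.linalg.svd(P)
k = 3
P_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation
err = np.linalg.norm(P - P_hat)                 # Frobenius norm of residual
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```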


Given the inadequate performance of the local patch SVD, we continue our search

for ‘better’ bases to represent each patch. With this in mind, we now explore what would

happen if we were to consider a non-local generalization of the SVD. Given a patch P

from the noisy image, we look for other patches in the image that are ‘similar’ to P. We

will give a precise definition of similarity later in Section 7.7.1. Let us consider that there

are K such similar patches (including P), which we label P_i, where 1 ≤ i ≤ K.

Next, we ask the following question: what single pair of orthonormal matrices U_k and V_k

will provide the best rank-k approximation to all the patches P_i? In other words, what

(U_k, V_k) minimizes the following energy?

E(U_k, {S_i}, V_k) = ∑_{i=1}^K ‖P_i − U_k S_i V_k^T‖² (7–17)

where

U_k^T U_k = I (7–18)

V_k^T V_k = I (7–19)

∀i, S_i ∈ R^{k×k}. (7–20)

The solution to this problem is given by an iterative minimization (starting from random

initial conditions) presented in [157]. Note that the matrices S_i in this case are not

diagonal. Note also that the basis pair (U_k, V_k) does not correspond to the individual SVD

bases but to a basis pair that is common to all the chosen patches. Related work in

[155] presents an alternating minimization framework with the additional (heuristically

driven) constraint that all the matrices Si are diagonal. This constraint is imposed at

every step of the alternating minimization framework. An approximate solution to the

energy function in Equation 7–17 is presented in [158]. This solution, called

the 2D-SVD, can be computed in closed form and obviates the need for expensive

iterative optimizations. The 2D-SVD for the patch collection {P_i} is given as follows.


Consider the row-row and column-column correlation matrices

C_r = ∑_{i=1}^K P_i P_i^T (7–21)

C_c = ∑_{i=1}^K P_i^T P_i. (7–22)

Then U_k contains the first k eigenvectors of C_r corresponding to the k largest eigenvalues

of C_r, and V_k contains the first k eigenvectors of C_c corresponding to the k largest

eigenvalues of Cc . The precise error bounds for the approximate solution w.r.t. the true

global solution are derived in [158].
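The closed-form 2D-SVD can be sketched as follows (an illustrative implementation of the unweighted form of Equations 7–21 and 7–22; the function name is ours):

```python
import numpy as np

def twod_svd_bases(patches, k):
    """Sketch of the closed-form 2D-SVD of [158] as described above:
    accumulate the row-row and column-column correlation matrices over
    the K similar patches and take the top-k eigenvectors of each.
    Illustrative implementation."""
    Cr = sum(P @ P.T for P in patches)   # Equation 7-21
    Cc = sum(P.T @ P for P in patches)   # Equation 7-22
    # eigh returns eigenvalues in ascending order; reverse for the top k.
    _, Ur = np.linalg.eigh(Cr)
    _, Uc = np.linalg.eigh(Cc)
    return Ur[:, ::-1][:, :k], Uc[:, ::-1][:, :k]  # (U_k, V_k)
```

With K = 1 and k = n this reduces to the ordinary (full-rank) patch SVD bases, up to sign and ordering.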

We use this non-local SVD framework in a denoising algorithm and we shall show

later that it produces results competitive with the state of the art. We start off by

dividing the given noisy image into patches. For each ‘reference’ patch, we collect

patches similar to it and obtain the common basis for them using the non-local SVD

method. However, this leaves open the problem of deciding on the best rank k for the

bases, which need not be constant across patches of different geometric structure.

We obviate the need for selection of this parameter by following a different approach.

We compute the full-rank orthonormal bases U and V , i.e. we choose k = n for n × n

patches. Now the given noisy patch P is projected onto the pair (U,V ) producing the

matrix S^(P) = U^T P V. Essentially, we can write the entries of P as

P_ij = ∑_{k,l} S^(P)_kl U_ik V_jl (7–23)

which is equivalent to a linear combination of outer products of the form U_k V_l^T (1 ≤ k ≤

n, 1 ≤ l ≤ n). We conjecture that this formulation has an interpretation in terms of 2D

spatial frequencies wherein the smaller coefficient values correspond to higher values

of at least one of the frequencies. Therefore, we choose to nullify the coefficients with

smaller values (as decided by a threshold). Given such a 'filtered' projection matrix S^(P),

we reconstruct the patch. This operation is repeated on overlapping patches in a sliding


window fashion and the overlapping hypotheses are aggregated by averaging leading to

a final filtered image. Crucial to the performance of this filter is the choice of a notion of

patch similarity and also the choice of thresholds for removing smaller coefficients. We

discuss these choices below.

7.7.1 Choice of Patch Similarity Measure

Given a reference patch P^ref in a noisy image, we can compute its K nearest

neighbors from the image, but this requires a choice of K which may not be the same

across different image patches. Hence, we resort to a distance threshold τ_d and select

all patches P_i such that the total squared difference between P^ref and P_i is below τ_d.

Note that we have throughout assumed a fixed and known noise model, N(0, σ). If we

were to assume that P^ref and P_i were different noisy versions of the same underlying

patch, we observe that the following random variable has a χ² density with z = n²

degrees of freedom:

x = ∑_{k=1}^{n²} (P_k^ref − P_ik)² / (2σ²). (7–24)

The cumulative of a χ2 random variable with z degrees of freedom is given by the

expression

F(x; z) = γ(x/2, z/2) (7–25)

where γ(x , a) stands for the incomplete gamma function defined as follows:

γ(x, a) = (1/Γ(a)) ∫_0^x e^{−t} t^{a−1} dt (7–26)

with Γ(a) being the Gamma function defined as

Γ(a) = ∫_0^∞ e^{−t} t^{a−1} dt. (7–27)

We observe that if z ≥ 3, for any x ≥ 3z , we have F (x ; z) ≥ 0.99. Therefore for a

patch-size of n × n and under the given σ, we choose the following threshold for the total


squared difference between the patches:

τ_d = 6σ²n². (7–28)

Thus if two patches are noisy versions of the same clean patch, this threshold will

pick them with a very high probability. But the converse is not true, and therefore we may

end up collecting patch pairs that satisfy the threshold but are quite different structurally.

To eliminate such 'false positives', we observe that if P^ref and P_i are noisy versions of

the same patch, the values in P^ref − P_i belong to N(0, √2σ). This motivates us to use a

hypothesis test, in this particular case the one-sided Kolmogorov-Smirnov (K-S) test. To

avoid having to choose a fixed significance level, we use the p-values output by the K-S

tests as a weighting factor in the computation of the correlation matrices. Therefore we

rewrite them as follows:

C_r = ∑_{i=1}^K p_KS(P^ref, P_i) P_i P_i^T (7–29)

C_c = ∑_{i=1}^K p_KS(P^ref, P_i) P_i^T P_i (7–30)

with p_KS(P^ref, P_i) being the p-value of the K-S test checking how well the values in

P^ref − P_i conform to N(0, √2σ). This thus gives us a robust version of the 2D-SVD.

There is a difference between our approach and robust versions of PCA, such as the

L1-norm (robust) PCA in [159]. We do not need to choose an arbitrary robust norm,

but use a weighting function directed by a hypothesis test instead. This is akin to

computation of fuzzy covariance matrices in fuzzy robust PCA [160].
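The combined selection-and-weighting test can be sketched as below (illustrative; we use SciPy's two-sided `kstest` where the thesis specifies a one-sided test, and the function name is ours):

```python
import numpy as np
from scipy.stats import kstest, norm

def similarity_weight(P_ref, P_i, sigma):
    """Sketch of the patch-similarity test described above: accept P_i
    only if the total squared difference is below tau_d, then weight it
    by the p-value of a K-S test of P_ref - P_i against N(0, sqrt(2)*sigma).
    Illustrative implementation (two-sided K-S for simplicity)."""
    n2 = P_ref.size
    tau_d = 3.0 * sigma ** 2 * n2             # bound used in the experiments
    d = np.sum((P_ref - P_i) ** 2)
    if d > tau_d:
        return 0.0                            # rejected outright
    diff = (P_ref - P_i).ravel()
    cdf = norm(loc=0.0, scale=sigma * np.sqrt(2.0)).cdf
    return kstest(diff, cdf).pvalue           # weight for Equations 7-29/7-30
```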

In practice, we observed that the threshold τ_d = 6σ²n² was too permissive.

That is, most patches P_i which differed from the reference patch by more than 3σ²n²

yielded p-values p_KS(P^ref, P_i) that were very close to zero. Hence we used the tighter

bound τ_d = 3σ²n² in our experiments. This also led to some improvement

in computational speed. We implemented a variant of our method in which only the


threshold τd was used for patch selection, and the hypothesis test was entirely ignored.

Surprisingly, we did not experience any significant drop in performance on our datasets

if the hypothesis test was neglected. Nonetheless in all reported results, we still used

the hypothesis test because it is a principled way of mitigating the effect of false

positives. An example of the phenomenon of false positives is illustrated in Figure

7-10. The two images in Figure 7-10 are structurally very different (containing graylevels

of 10 and 40), and yet the MSE between their noisy versions (σ = 20) is only 4075 which

falls below the threshold of 3σ2 = 4800. However the K-S test yields a p-value very close

to 0, thereby providing a better indication of structural dissimilarity.

It should be further noted that even the bound τ_d = 3σ²n² is quite conservative. It

can be refined using the fact that the χ² density can be approximated as N(n², √2 n) if

n² is large. This result follows from the central limit theorem and holds good for n² ≥ 64.

This gives us the following refined bound:

τd = (n2 +

√2× 2.362n)2σ2 = 2(n2 + 3.29n)σ2 (7–31)

from the inverse cumulative for N (n2,√2n) at 0.99.
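As a numeric sanity check (a sketch assuming SciPy is available; n = 8 and σ = 20 are arbitrary illustrative values), the refined bound of Equation 7–31 is indeed tighter than τd = 3σ²n²:

```python
import numpy as np
from scipy.stats import norm

n, sigma = 8, 20.0
simple = 3 * sigma**2 * n**2                # tau_d = 3 sigma^2 n^2
# 0.99 quantile of the N(n^2, sqrt(2) n) approximation to chi^2 with
# n^2 degrees of freedom, scaled by 2 sigma^2 (Equation 7-31)
refined = 2 * sigma**2 * norm.ppf(0.99, loc=n**2, scale=np.sqrt(2) * n)
```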

7.7.2 Choice of Threshold for Truncation of Transform Coefficients

As our noise model is N(0, σ), we observe that the corresponding noise variables

in n × n patches have magnitude less than σ√(2 log n²) with very high probability,

as do the entries of the corresponding projection matrices (onto orthonormal

bases/basis-pairs). Hence we assume that coefficients smaller than this threshold have

been produced by noise. This threshold happens to be the universally optimal

threshold for wavelet denoising with hard thresholding [113] (also see Section 6.4), and

holds specifically for i.i.d. Gaussian noise for any given orthonormal basis. While hard

thresholding may lead to elimination of some useful high-frequency information, this loss

is compensated through the redundancy from overlapping patches [108].
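The universal hard threshold described above amounts to a one-line rule (a sketch; `hard_threshold` is a hypothetical helper name):

```python
import numpy as np

def hard_threshold(coeffs, sigma, n):
    """Zero out transform coefficients below the universal threshold
    sigma * sqrt(2 * log(n^2)) for hard thresholding of an n x n patch."""
    tau = sigma * np.sqrt(2.0 * np.log(n * n))
    return np.where(np.abs(coeffs) < tau, 0.0, coeffs)
```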


7.7.3 Outline of NL-SVD Algorithm

The NL-SVD is outlined here below:

1. Divide the image into overlapping patches.

2. For each patch Pref (called the ‘reference patch’), find patches Pi from the image that are similar to it in the sense explained in Section 7.7.1.

3. Compute the weighted row-row and column-column correlation matrices Cr and Cc from the patches Pi as per Equations 7–29 and 7–30.

4. Find the eigenvectors of Cr to give the orthonormal matrix U and those of Cc to give the orthonormal matrix V.

5. Project Pref onto (U, V) to give Sref = U^T Pref V.

6. Set all small entries of Sref to zero, as discussed in Section 7.7.2.

7. Reconstruct Pref using Pref = U Sref V^T and accumulate the pixel values at the appropriate location in the image.

8. Repeat the above steps for all image patches.

9. Aggregate all the hypotheses and average them to produce the final filtered image.
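Steps 3–7 of the outline can be sketched as follows (a simplified illustration: the KS-test weights of Equation 7–29 are replaced by unit weights for brevity, and `nlsvd_filter_patch` is a hypothetical name):

```python
import numpy as np

def nlsvd_filter_patch(P_ref, similar, sigma):
    """One NL-SVD step (steps 3-7): learn (U, V) from the similar patches,
    hard-threshold the projection of P_ref, and reconstruct.
    `similar` is a list of n x n arrays that includes P_ref itself."""
    n = P_ref.shape[0]
    Cr = sum(P @ P.T for P in similar)      # row-row correlation
    Cc = sum(P.T @ P for P in similar)      # column-column correlation
    _, U = np.linalg.eigh(Cr)               # columns of U: eigenvectors
    _, V = np.linalg.eigh(Cc)
    S = U.T @ P_ref @ V                     # project onto (U, V)
    tau = sigma * np.sqrt(2.0 * np.log(n * n))
    S[np.abs(S) < tau] = 0.0                # universal hard threshold
    return U @ S @ V.T                      # reconstruct the patch
```

In the sliding-window scheme, the reconstructed patches are accumulated and averaged pixel-wise to form the filtered image.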

7.7.4 Averaging of Hypotheses

Note that the procedure for averaging the hypotheses produced for a patch is

common to contemporary patch-based algorithms, not only in image denoising

applications [134], [146], [108] (where it is called ‘translation-invariant denoising’),

but also in several other applications such as texture synthesis [161]. We have

experimented with other aggregation procedures such as taking the median of all

available hypothesis values, re-filtering of pixel values, or learning weights for weighted

linear combinations. While more computationally expensive, none of these

procedures improved the performance beyond simple averaging.

7.7.5 Visualizing the Learned Bases

We now present two examples of the learned bases, to show the effect of patch

structure and to visualize the corresponding bases. The first example (in

Figure 7-11) is a patch of size 8 × 8 containing oriented texture from the Barbara image.

The patches similar to it (as measured in the noisy version of that image) are shown

alongside, together with the learned bases. The bases that we visualize are actually the 64 outer

products of the form U_i V_j^T (1 ≤ i, j ≤ 8). We present a second example which contains

a high-frequency fur texture from the mandrill image, in Figure 7-12. In Figure 7-13, we

show outer-products of 8 × 8 DCT bases for comparison with those in Figure 7-11 and

Figure 7-12.

7.7.6 Relationship with Fourier Bases

It is a well-known result that the principal components of natural image patches

(in this case, just rows or columns from image patches) are the Fourier bases [107].

Furthermore, this property is a consequence of the translation invariance property of the

covariance between natural images. In fact, it is proved in [162] (see section 5.8.2) that

under the assumption of translation invariance, the eigenvectors of the covariance matrix

of natural image patches turn out to be sinusoidal functions of different frequencies.

The aforementioned fact can be experimentally observed by computing the principal

components of a large ensemble of patches of fixed size - the results are very close to

DCT bases (the real components of the Fourier bases). We computed the row-row and

column-column covariance matrices of 8 × 8 patches sampled at every 4 pixels from

all the 300 images of the Berkeley database [61] converted to gray-scale (i.e., a total of

2.88 × 10^6 patches). The eigenvectors of these matrices were very similar to DCT bases,

as measured by the angles between corresponding basis vectors: 0.2, 4, 4, 6.8, 5.6, 6, 4

and 3 degrees.
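This kind of comparison can be mimicked on synthetic stationary patches (a toy sketch, not the Berkeley experiment; `basis_angles` is a hypothetical name): patches with correlated rows yield a near-Toeplitz row-row covariance whose eigenvectors can be compared against the DCT basis vectors.

```python
import numpy as np
from scipy.fft import dct

def basis_angles(patches):
    """Angles (degrees) between the eigenvectors of the row-row covariance
    of the patches and the corresponding DCT basis vectors."""
    n = patches[0].shape[0]
    C = sum(P @ P.T for P in patches) / len(patches)
    _, U = np.linalg.eigh(C)
    U = U[:, ::-1]                            # descending eigenvalue order
    D = dct(np.eye(n), axis=0, norm='ortho')  # rows of D = DCT basis vectors
    angles = []
    for k in range(n):
        c = abs(D[k] @ U[:, k])               # |cosine|, sign-invariant
        angles.append(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))
    return angles
```

On natural-image patches, these angles come out small, as the measurements above indicate.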

For NL-SVD, the consequence of this result is as follows. If, for every reference

patch, the correlation matrices were computed from several patches without attention

to similarity, we would get a filter very similar to the sliding-window DCT filter (modulo

asymptotics, and barring the difference due to robust PCA).


7.8 Experimental Results

We now describe our experimental results. For our noise model (i.e., additive

i.i.d. N(0, σ)), we pick σ ∈ {5, 10, 15, 20, 25, 30, 35}. We perform experiments on

Lansel’s benchmark dataset [62] consisting of 13 commonly used images, all of size

512 × 512. We pit NL-SVD against the following: NL-Means [2], KSVD [146], our

implementation of a 3D-DCT algorithm (see Section 7.8.5), BM3D [134] and the oracle

denoiser from Section 7.4. For comparison at each noise level, we use PSNR values

as well as SSIM values at patch-size 11 × 11 (as per the implementation in [63]). (For

the definition of SSIM, refer to Section 6.8.1.) All these metrics were measured by first

writing the images to a file in one of the standard image formats (usually pgm) and

then reading them back into memory. Though this introduces minor quantization effects

(and usually reduces the PSNR/SSIM values slightly for all methods), we follow this

approach as it represents realistic digital storage of images.

In the case of BM3D and NL-Means, we used the software provided by the authors

online. For KSVD, we used the results already reported by the authors on the denoising

benchmark [62]. These results were available only for noise levels up to and including

σ = 25. For BM3D, we report results on both stages of their algorithm: the intermediate

stage, as well as the final stage which performs empirical Wiener filtering on the output

of the earlier stage. We refer to these stages as ‘BM3D1’ and ‘BM3D2’ respectively.

To each of the above algorithms, the noise level σ is specified as input (which is useful

for optimal parameter selection in the provided software). For NL-SVD, we used 8 × 8

patches in all experiments and a search window radius of 20 around each point. The

search window radius is not a free parameter as it affects only computational efficiency

and not accuracy. In fact larger sizes of the search window did not improve the results

in our experiments. There are no other free parameters in our technique apart from the

patch-size, which is true of all other patch-based algorithms in the field as well. Later, in

Section 7.9, we present a criterion for patch-size selection by measuring the correlation


coefficient between patches from the residual image (i.e. difference between noisy and

denoised images). For NL-means, we used 9 × 9 patches throughout, with a search

window radius of 20. For BM3D implementation, we used the default settings of all the

various parameters as obtained from the authors’ software (their selected patch-size is

again 8 × 8). The results for KSVD have been reported by the authors themselves, and

hence we assume that the optimal parameter settings were already used for generating

those results.

7.8.1 Discussion of Results

From the PSNR results presented in Tables 7.12, 7.12, 7-7, 7-9, 7-12, 7-14 and

7-16, and the corresponding SSIM results in Tables 7-4, 7-6, 7-8, 7-10, 7-13, 7-15 and

7-17, we make several observations. NL-SVD is consistently superior to NL-Means in

terms of PSNR and SSIM. These tables also contain results of the HOSVD algorithm,

our second technique, which we shall be presenting later in Section 7.10. In all the

tables at the end of the chapter, we have used numbers to refer to image names to save

space. The numbers and the corresponding names are as follows: 13 - airplane, 12 -

Barbara, 11 - boats, 10 - couple, 9 - elaine, 8 - fingerprint, 7 - goldhill, 6 - Lena, 5 - man,

4 - mandrill, 3 - peppers, 2 - stream, 1 - Zelda.

7.8.2 Comparison with KSVD

Our PSNR and SSIM values are comparable to those reported for KSVD. However,

NL-SVD has several other advantages as compared to KSVD from a conceptual as well

as implementation point of view. KSVD learns an overcomplete dictionary on the fly

from the noisy image. This procedure requires iterated optimizations and is expensive.

The method is also prone to local minima and this puts artificial limits on the size of the

dictionary that should (or can) be learned [114]. The algorithm requires parameters that

are not easy to tune: the number of dictionary vectors (K), the parameter for the stopping

criterion of the pursuit projection algorithm, and the tradeoff between data fidelity and

sparsity terms. On the other hand, NL-SVD derives a spatially adaptive basis at each


pixel in one step and requires no further iterations. Moreover, given patches of size p × p,

we learn matrix bases of size p × p at each point (see Section 7.12), whereas KSVD

learns one dictionary of size p² × K where K ≫ p². There exists a multiscale version

of KSVD [114] which has improved the performance of the original

algorithm from [146] (see Table 3 of [114]), but we have not included it in the comparisons

as we were unable to obtain an efficient implementation of it.

7.8.3 Comparison with BM3D

The current state of the art technique in image denoising is the BM3D method

from [134]. The BM3D algorithm works on an ensemble of patches from the image

that are similar to each reference patch. It treats the ensemble as a 3D array, and a 3D

transform is applied to this patch ensemble for the purpose of filtering. This treatment of

a sequence of (possibly overlapping) patches as a signal is conceptually strange.

The specific implementation in [134] adopts the following steps. Firstly, the similarity

between a patch from the noisy image and other patches from the same image is

measured using the L2 distance between their respective DCT coefficients after first

setting to zero all coefficients below a threshold. In other words, the patches are

pre-filtered (solely) for the purpose of similarity computation. Next, the individual noisy

patches in the group are filtered using a 2D-DCT or 2D biorthogonal wavelets with hard

thresholding (with the threshold for coefficients chosen as a fixed multiple of the known

noise σ). Finally, the individual pixel stacks created from the filtered patches (from the

earlier step) are further filtered by using a Haar transform. The multiple hypotheses

appearing at each pixel are aggregated to produce a filtered image. This is called

the ‘intermediate stage’ of the BM3D algorithm (which we refer to as ‘BM3D1’). This is

followed by a second stage which further filters the output of BM3D1. Patches from the

output image of BM3D1 that are similar to a reference patch from that image are again

stacked together, and a 3D transform is applied. The transform coefficients are modified

using an empirical Wiener filter. The transform is inverted followed by aggregation


of multiple hypotheses to produce the final filtered image. This final stage is termed

‘BM3D2’. The exact flow-chart for all these steps is given in [134].

The overall BM3D algorithm contains a number of parameters: the choice of

transform for 2D and 1D filtering (whether Haar/DCT/Biorthogonal wavelet), the

distance threshold for patch similarity, the thresholds for truncation of transform domain

coefficients, a parameter to restrict the maximum number of patches that are similar

to any one reference patch, and the choice of pre-filter while computing the similarity

between patches in the first stage (BM3D1). There is an analogous set of parameters

for the second stage that uses empirical Wiener filtering (BM3D2) over and above the

results from stage 1. In fact, given the complex nature of this algorithm, it may be difficult

to isolate the relative contribution of each of its components. Note that NL-SVD too

requires thresholds for patch similarity and truncation of transform domain coefficients,

but these are obtained in a principled manner from the noise model as explained in

Section 7.7.1 and 7.7.2. The BM3D implementation in [134] uses fixed thresholds

with an imprecise relationship to the noise variance. For instance, it uses a distance

threshold of 2500 if the noise σ ≤ 40 and a threshold of 5000 otherwise, a transform-domain

threshold of 2.7σ, and, in the Wiener filtering step, a patch size of 32 × 32 and a

distance threshold of 400. Unlike BM3D, we do not resort to any pre-filtering methods for

finding the distance between noisy patches, but instead use principled approaches like

hypothesis tests. Furthermore, NL-SVD tunes the bases in a spatially adaptive manner

instead of using fixed bases. It must also be mentioned that the Wiener filtering step in

BM3D2 makes the implicit assumption that the transform coefficients of the underlying

image are Gaussian distributed. It is this Gaussian assumption alone that makes a

Wiener filter (or a linear minimum mean squares estimator) the optimal least squares

estimator [163]. The Gaussian assumption is generally not true for DCT or other

transform coefficients of natural images or image patches. In terms of empirical results,

the PSNR values for NL-SVD were lower than those produced by BM3D1 (by a margin of

0.3 dB) and BM3D2 (by a margin of 0.7 dB). However, our algorithm has the advantage of

being simple to implement, being conceptually clean and having parameters that are

obtained in a principled manner.

7.8.4 Comparison of Non-Local and Local Convolution Filters

As described in Section 6.3, convolution filters are a rich class of denoising

techniques. Some of these [101], [156] make explicit use of local image geometries.

For instance, the work in [101] presents an innovative method of exploiting rich local

geometric structures for deriving convolution filters, and pays special attention to

structures such as corners/junctions in addition to edges. The NL-SVD technique in

this thesis takes a different path: it is based on learning spatially adaptive bases that

sparsely represent image patches. Indeed, NL-SVD draws its primary inspiration from

NL-Means which differs in its foundations from local convolution filters on at least two

counts: (1) it draws information from different parts of the image which exhibit some

measure of similarity to the pixel intensity at the current processing location and then

(2) uses this non-local information to modulate the diffusion at the current pixel. The

non-local nature of NL-Means is expected to give it an edge in comparison to purely

local techniques like the aforementioned convolution filters [101].

Following this line of reasoning, it comes as somewhat of a surprise that the purely

local convolution technique in [101] is able to empirically outperform NL-Means on the

commonly used ‘house’ image when degraded by noise drawn from N (0, 20). On five

noise realizations at a fixed σ = 20, the technique from [101] produced a denoised

image having an average PSNR of 33.447 and 33.464 (MSE 29.402 and 29.284)

respectively, depending on the type of kernel used¹, whereas NL-Means produced

a PSNR of only 32.72 (MSE 34.760). This suggests that the inclusion of additional

1 We gratefully acknowledge the efforts of Sile Hu in collecting this result.


geometric information such as corners/junctions allows purely local convolution methods

to compete on certain images with non-local techniques such as NL-Means.

7.8.5 Comparison with 3D-DCT

In Section 7.8.3, we stated that given the multitude of steps in the BM3D algorithm,

it may be difficult to isolate the individual contribution of each step. We seek to illustrate

this point by comparing NL-SVD with our implementation of BM3D involving purely the

DCT in 3D (on the ensemble of noisy patches that are similar to the reference patch, the

ensemble being represented as a 3D stack). We put an upper limit of K = 30 on the

number of similar patches in an ensemble, which is similar to the BM3D implementation

in [134]. We term this variant ‘3D-DCT’. The hard threshold for the 3D-DCT

coefficients is σ√(2 log(n²K)), again as per the rule from [113]. As can be seen in the tables at the end of the chapter,

NL-SVD consistently outperforms 3D-DCT. We believe this sufficiently illustrates the

advantages of our method for non-local basis learning.
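Our 3D-DCT variant can be sketched as follows (an illustration assuming SciPy's `dctn`/`idctn`; `dct3d_filter` is a hypothetical name):

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct3d_filter(stack, sigma):
    """Filter an n x n x K stack of similar patches with a 3D-DCT and the
    hard threshold sigma * sqrt(2 * log(n^2 * K))."""
    n, _, K = stack.shape
    coeffs = dctn(stack, norm='ortho')
    tau = sigma * np.sqrt(2.0 * np.log(n * n * K))
    coeffs[np.abs(coeffs) < tau] = 0.0      # hard-threshold the 3D-DCT
    return idctn(coeffs, norm='ortho')
```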

7.8.6 Comparison with Fixed Bases

The choice of ‘best’ basis optimized for denoising performance is still largely an

open issue in signal processing. As a consequence, it may be difficult to compare the

relative merits and demerits of learned bases over universal bases. Learned bases have

the advantage that they allow for tunability to the characteristics of the underlying data.

In our experiments, we have observed better performance with NL-SVD as

compared to filters using a sliding window 2D-DCT. We present a few examples: the

boats image and Barbara image in Figure 7-22, for which we obtained up to 1 dB PSNR

improvement over DCT. We also present an example with a large number of repeating

patterns, which clearly illustrates the virtues of nonlocal basis learning over using a

fixed basis. This is illustrated with the checkerboard image in Figure 7-23. Comparative

figures over the benchmark database are presented in Table 7-11.


7.8.7 Visual Comparison of the Denoised Images

The original and noisy image (from N (0, 20)), and the denoised images produced

by our algorithm, NL-Means, BM3D1 and BM3D2 can be viewed in Figures 7-14, 7-16,

7-18 and 7-20. The reader is urged to zoom into the .pdf file to view the images more

carefully. The corresponding residual images can be viewed in Figures 7-15, 7-17, 7-19

and 7-21. Note that the residual is calculated as the difference between the noisy and

denoised image, with the difference image normalized between 0 and 255. Clearly,

NL-Means produces residuals with a discernible amount of structure. Finer structural

details can be observed in the residuals produced by our algorithm as well as those

by BM3D1. BM3D2 does produce very noisy residuals. If the images are zoomed in,

one can however observe some strange shock-like artifacts in certain portions of the

denoised images produced by BM3D, especially by BM3D2. One example is Barbara’s

face from Figure 7-14 - see Figure 7-25 for a zoomed-in view. These artifacts are

absent in NL-SVD. However BM3D seems to preserve some finer edges somewhat

better than our technique. See, for instance, the portion of the tablecloth lying on the

table in the Barbara image. We performed a more detailed comparison between our

output and the BM3D output on the Barbara image. For this we computed the absolute

difference between the true image and our output, and the absolute difference between

the true image and the output of BM3D1/BM3D2. These difference images are shown

in Figure 7-24. The mean absolute error values over the entire image were 5.36 (for

NL-SVD), 5.28 (BM3D1) and 4.87 (BM3D2). The mean L2 errors were 53.12, 51.34

and 44.36 respectively. The errors produced by NL-SVD were greater than those by

BM3D1/BM3D2 for roughly only 50 percent of the pixels. We also ran a Canny edge

detector (with the default parameters from the MATLAB implementation) on the true

image, and computed the errors only on the edge pixels. The mean absolute errors

on edge pixels were 6.7, 6.7 and 6.4 for NL-SVD, BM3D1 and BM3D2 respectively,

whereas the mean L2 errors on edge pixels were 77.4, 76.7 and 70.5 respectively.


However, the error for NL-SVD was greater than that for BM3D1/BM3D2 for only

around 45 percent of the edge pixels.

7.9 Selection of Global Patch Size

All results in Section 7.8 were reported for a fixed patch-size of 8 × 8, as this

is a commonly used parameter in patch-based algorithms (including JPEG). Here,

we present an objective criterion for selecting the patch-size that will yield the best

denoising performance. For this, we consider the residual images after denoising with a

fixed patch-size p × p, with the threshold for discarding the smaller coefficients chosen

to be σ√2 log p2. Each residual image is divided into non-overlapping patches of size

q × q where q ∈ 8, 9, ..., 15, 16. For each value of q, we compute the average absolute

correlation coefficient between all pairs of patches in the residual image, and then

calculate the total of these average values. The absolute correlation coefficient between

vectors v1 and v2 (of size q² × 1) is defined as follows:

ρ_pq(v1, v2) = (1/q²) · |(v1 − µ1)^T (v2 − µ2)| / (σ_v1 σ_v2)    (7–32)

where µ1 and µ2 are the mean values of vectors v1 and v2, and σ_v1 and σ_v2 are their

corresponding standard deviations.
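Equation 7–32 reduces to the absolute value of the (population) Pearson correlation coefficient; a minimal sketch (`abs_corr` is a hypothetical name):

```python
import numpy as np

def abs_corr(v1, v2):
    """Absolute correlation coefficient rho_pq (Equation 7-32) between two
    residual patches given as q^2-vectors."""
    q2 = v1.size
    d1, d2 = v1 - v1.mean(), v2 - v2.mean()
    # np.std uses the population convention (ddof=0), matching the 1/q^2 factor
    return abs(d1 @ d2) / (q2 * d1.std() * d2.std())
```

It equals 1 for affinely related patches and is small for decorrelated residual patches.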

Our intuition is that an optimal denoiser will produce residual patches that are highly

decorrelated with one another as measured by ρpq. However ρpq is certainly dependent

upon the patch-size q × q that is used for computation of the statistics. Hence, we sum

up the cross-correlation values over q and over all patch pairs, thus giving us

ρ_p = Σ_{i∈Ω, j∈Ω, q} ρ_pq(v_i, v_j)    (7–33)

as the final measure. Here vi and vj denote patches (in vector form) with their upper left

corner at locations i and j (respectively) in the image domain Ω. The patch-size p × p

which produces the least value of ρp is selected as the optimal parameter value. In our

experiments, we varied p from 3 to 16. We have observed that the PSNR corresponding


to the optimal ρp is very close to the optimal PSNR. This can be seen in Table 7-18

where for each image in the benchmark database, we report the following: (1) the

highest PSNR across p ∈ {3, 4, 5, ..., 16}, (2) the patch-size which produced that

PSNR, (3) the lowest ρp value across p, (4) the patch-size which produced the lowest ρp

value and (5) the PSNR for the best patch-size as per the criterion ρp. One can see from

Table 7-18 that the drop in PSNR (if any) is very low. The denoised images and their

residuals for different patch-sizes are also shown alongside in Figures 7-26 and 7-27.

The noise-level for all these results is σ = 20.

Ideally, there may not be a single optimal patch-size for the entire image. A better

approach would be to adapt the patch-size based on the local structure of the image.

However, given the aggregation of hypotheses from (and consequent dependence on)

neighboring patches, this turns out to be a non-trivial problem.

7.10 Denoising with Higher Order Singular Value Decomposition

We now present a second algorithm for image denoising, which is also rooted in

the non-local basis learning framework. The main difference is that this algorithm now

groups together similar patches as a 3D stack and filters the entire stack using a 3D

transform - namely, the higher order singular value decomposition (HOSVD) of the stack.

The core idea of grouping together similar patches and applying 3D transforms is taken

from the BM3D algorithm which was described in Section 6.5 and in greater detail in

Section 7.8.3. The main difference is that we incorporate this notion in a basis learning

strategy unlike BM3D.

7.10.1 Theory of the HOSVD

The higher order singular value decomposition (HOSVD) is the extension of the

SVD of (2D) matrices to higher-order matrices (often called tensors). The HOSVD was

first proposed in the psychology literature by Tucker for the case of 3D matrices where it

was called the Tucker3 decomposition [164]. A very extensive development of the theory


of HOSVD for matrices of all orders is presented in the thesis of Lathauwer [165], from

where the following brief description is summarized.

Given a higher-order matrix A ∈ R^{N1×N2×...×ND}, the HOSVD decomposes it in the

following manner:

A = S ×₁ U^(1) ×₂ U^(2) ×₃ ... ×_D U^(D)    (7–34)

where U^(1) ∈ R^{N1×N1}, U^(2) ∈ R^{N2×N2}, ..., U^(D) ∈ R^{ND×ND} are all orthonormal matrices, and

S ∈ R^{N1×N2×...×ND} is a higher-order matrix that satisfies some special properties. Here,

the symbol ×n stands for the nth-mode tensor product defined in [165]. Fixing the nth

index to α, let the corresponding subtensor of S be denoted S_{n,α}. Then S satisfies ⟨S_{n,α}, S_{n,β}⟩ = 0

for all α, β, n with α ≠ β. This is called the all-orthogonality property. Furthermore, we

also have ‖S_{n,1}‖ ≥ ‖S_{n,2}‖ ≥ ... ≥ ‖S_{n,Nn}‖ for all n.

Let us visualize A as a hypercube whose edges are coincident with the Cartesian

axes. The nth unfolding of A can be visualized as the tensor obtained by slicing A

parallel to the plane spanned by the Cartesian axes of the first and nth dimensions and

then arranging the slices in succession to yield a 2D matrix. In practice, the HOSVD can

be computed from the SVD of suitable unfoldings of the higher-order matrix A. It turns

out that Equation 7–34 has the following equivalent representation in terms of tensor

unfoldings [165]:

A_(n) = U^(n) · S_(n) · (U^(n+1) ⊗ U^(n+2) ⊗ ... ⊗ U^(D) ⊗ U^(1) ⊗ U^(2) ⊗ ... ⊗ U^(n−1))^T    (7–35)

For a thorough introduction to multi-linear algebra and the HOSVD, we refer the reader

to [165]. An interesting application of the HOSVD to face recognition is presented in

[166].
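For the third-order case used later in this chapter, the decomposition can be computed directly from the SVDs of the three unfoldings (a sketch; `hosvd3` is a hypothetical name, and the core is recovered via mode products with the transposed factor matrices):

```python
import numpy as np

def hosvd3(A):
    """HOSVD of a 3D array A (Equation 7-34): factor matrices from the SVDs
    of the three unfoldings, plus the core tensor S."""
    U = []
    for mode in range(3):
        # mode-n unfolding: bring axis `mode` to the front, flatten the rest
        unfold = np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfold, full_matrices=True)
        U.append(u)
    # core: S = A x1 U1^T x2 U2^T x3 U3^T
    S = np.einsum('ijk,ia,jb,kc->abc', A, U[0], U[1], U[2])
    return S, U
```

Reconstructing A as S ×₁ U^(1) ×₂ U^(2) ×₃ U^(3) recovers the original array exactly, since the factor matrices are orthogonal.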

7.10.2 Application of HOSVD for Denoising

We now describe how the HOSVD is applied for joint denoising of multiple image

patches. For each reference patch in the noisy image, all patches similar to it are

collected and represented as a 3D array Z ∈ R^{p×p×K}, where the patches have size p × p

and K is the number of similar patches in the ensemble (note that K is spatially varying).

A patch P is said to be similar to the reference patch if ‖P − Pref‖² ≤ τd, where τd is

defined earlier in Section 7.7.1. The HOSVD of Z is given as follows:

Z = S ×₁ U^(1) ×₂ U^(2) ×₃ U^(3)    (7–36)

where the orthonormal matrices U^(1) ∈ R^{p×p}, U^(2) ∈ R^{p×p} and U^(3) ∈ R^{K×K} can be

computed from the SVD of the unfoldings Z_(1), Z_(2) and Z_(3) respectively. The exact

equations are as follows:

Z_(1) = U^(1) · S_(1) · (U^(2) ⊗ U^(3))^T    (7–37)

Z_(2) = U^(2) · S_(2) · (U^(3) ⊗ U^(1))^T    (7–38)

Z_(3) = U^(3) · S_(3) · (U^(1) ⊗ U^(2))^T    (7–39)

However, the complexity of the SVD computation for K × K matrices is O(K³).

To prevent the computations from becoming unwieldy, we put an upper cap on the number

of allowed similar patches, i.e., we impose the constraint K ≤ 30. The patches

from Z are then projected onto the HOSVD transform. The parameter for thresholding

the transform coefficients is picked to be σ√(2 log p²K), again as per the rule from [113].

The stack Z is then reconstructed after inverting the transform, thereby filtering all

the individual patches. Note that unlike NL-SVD (see Section 7.7.3), we filter all the

individual patches in the ensemble and not just the reference patch. This affords

additional smoothing on all the patches, which was required due to the upper limit of

K ≤ 30, unlike the case with NL-SVD. Again, the reference patch is moved in a sliding-window

fashion and the hypotheses appearing at each pixel are averaged to produce

the final filtered image.

7.10.3 Outline of HOSVD Algorithm

The HOSVD for denoising is outlined here below:

1. Divide the image into overlapping patches of size p × p.


2. For each patch Pref (called the ‘reference patch’), find patches Pi from the image that are similar to it in the sense explained in Section 7.7.1.

3. Stack the similar patches in a 3D array Z ∈ R^{p×p×K}.

4. Compute the unfoldings Z_(1), Z_(2) and Z_(3) and then compute their SVDs to yield the matrices U^(1), U^(2) and U^(3) respectively.

5. Compute any one unfolding of the tensor S, say S_(1).

6. Set to zero all entries of S_(1) that are smaller (in absolute value) than σ√(2 log p²K).

7. Reconstruct the entire stack using Equation 7–37, which filters every patch in the ensemble.

8. Repeat the above steps for all image patches.

9. Aggregate all the hypotheses and average them to produce the final filtered image.
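Steps 3–7 can be sketched as follows (a simplified illustration; `hosvd_filter_stack` is a hypothetical name, and thresholding the full core tensor is equivalent to thresholding its unfolding S_(1) in step 6):

```python
import numpy as np

def hosvd_filter_stack(Z, sigma):
    """Hard-threshold the HOSVD core of a p x p x K stack Z at
    sigma * sqrt(2 * log(p^2 * K)) and reconstruct all patches."""
    p, _, K = Z.shape
    U = []
    for mode in range(3):
        unfold = np.moveaxis(Z, mode, 0).reshape(Z.shape[mode], -1)
        u, _, _ = np.linalg.svd(unfold, full_matrices=False)
        U.append(u)
    # core tensor: S = Z x1 U1^T x2 U2^T x3 U3^T
    S = np.einsum('ijk,ia,jb,kc->abc', Z, U[0], U[1], U[2])
    tau = sigma * np.sqrt(2.0 * np.log(p * p * K))
    S[np.abs(S) < tau] = 0.0               # hard-threshold the core
    # invert the transform, filtering every patch in the stack
    return np.einsum('abc,ia,jb,kc->ijk', S, U[0], U[1], U[2])
```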

We would like to emphasize that there are two key differences between our HOSVD

algorithm and BM3D. Firstly, we learn a spatially varying basis whereas BM3D uses

universal bases (2D-DCT or biorthogonal wavelets depending upon the noise level,

followed by a Haar basis in the third dimension). Secondly, as BM3D stacks together

similar patches and performs a Haar transform in the third dimension, it thus implicitly

treats the patches as a signal in the third dimension. On the other hand, our HOSVD

method does not impose any such ‘signalness’ in the third dimension. In fact, scrambling

the order of the patches in the third dimension will produce the same values of the

projection coefficients, except for corresponding permutation operations. Indeed, both

HOSVD and NL-SVD do not treat the patches as signals in any dimension unlike bases

such as the DCT. The SVD of a patch is itself invariant to row and column permutations.

However this is not a problem, because it is unlikely to encounter patches from real

images that are row/column permutations of one another. On the other hand, the

ordering of patches in the third dimension (the choice of which is a free parameter) may

potentially alter the output of a denoising algorithm such as BM3D, whereas our method

will still remain invariant to this change.


7.11 Experimental Results with HOSVD

The PSNR results for HOSVD are presented in Tables 7.12, 7.12, 7-7, 7-9, 7-12,

7-14 and 7-16. The corresponding SSIM results can be found in Tables 7-4, 7-6,

7-8, 7-10, 7-13, 7-15 and 7-17. From these tables, it can be observed that HOSVD

is superior to KSVD, NL-Means, 3D-DCT and NL-SVD. Indeed, it is also superior to

BM3D1 at higher noise levels (σ ≥ 20) on most images in terms of PSNR/SSIM values,

though it lags slightly behind BM3D2. The average difference between the PSNR values

produced by HOSVD and BM3D2 at noise levels 10, 20 and 30 is 0.346, 0.281 and

0.343 respectively (see Tables 7.12, 7-9 and 7-14).

A comparison between NL-SVD and HOSVD reveals that the latter outperforms

the former on the weaker or finer edges or textures. We have observed that the images

denoised by HOSVD sometimes tend to have a faint grainy appearance. The reason

for this is that HOSVD smoothes an ensemble of patches by projection onto a common

basis followed by truncation of transform coefficients. We have observed experimentally

that this tends to slightly under-smooth the patches, when compared to patches that

are smoothed individually as in techniques like NL-SVD. The under-smoothing is

compensated for by the averaging operations and the filtering of all patches from the

stack. The faint grainy appearance can be mitigated by running a linear smoothing

filter on patch stacks from the filtered output of HOSVD, such as a PCA in the third

dimension (similar to the Wiener-filter idea implemented in BM3D2, but with a learning

component: a PCA on the stack of corresponding pixels from similar patches); this

seems to improve subjective visual quality (in our opinion). We leave rigorous

testing of this idea for future work.

7.12 Comparison of Time Complexity

We now present a time complexity analysis of all the competing algorithms. Assume that the number of image pixels is N, that the average time needed to find the patches similar to a reference patch is T_S, and that the average number of such similar patches is K. Let the patch size be n × n. The time complexity of NL-SVD is then O([T_S + Kn³]N), because the eigendecomposition of an n × n matrix is O(n³) and the multiplication of two n × n matrices is also an O(n³) operation. The BM3D implementation in [134] requires O(Kn³) time for the 2D transforms and O(K²n²) time for the 1D transforms, if the transforms are implemented using simple matrix multiplication. This leads to a total complexity of O([T_S + Kn³ + K²n²]N). If fast transform algorithms such as the fast Fourier transform are used, this complexity reduces to O([T_S + Kn² log n + n²K log K]N). If we assume that n is o(K) (i.e., that the average number of ‘similar’ patches is much greater than the patch width/height, a very reasonable assumption), then NL-SVD is in fact better than BM3D in terms of time complexity. The complexity of HOSVD is obtained as follows. Given a patch stack of size n × n × K, two of its unfoldings have size n × nK, and the SVD of each consumes O(Kn³) time. The third unfolding has size K × n², the SVD of which consumes O(min(K²n², Kn⁴)) time. Hence the total complexity of the method is O([T_S + Kn³ + min(K²n², Kn⁴)]N).
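Where these SVD terms come from can be made concrete with a short NumPy sketch (our own illustration under assumed names, not the exact experimental implementation; the hard threshold τ is left to the caller):

```python
import numpy as np

def hosvd_bases(stack):
    """Left singular bases of the three unfoldings of an n1 x n2 x K
    patch stack.  The two spatial unfoldings have size n x nK, so each
    SVD costs O(K n^3); the third unfolding has size K x n^2, whose SVD
    costs O(min(K^2 n^2, K n^4))."""
    n1, n2, K = stack.shape
    U1 = np.linalg.svd(stack.reshape(n1, n2 * K), full_matrices=False)[0]
    U2 = np.linalg.svd(stack.transpose(1, 0, 2).reshape(n2, n1 * K),
                       full_matrices=False)[0]
    U3 = np.linalg.svd(stack.transpose(2, 0, 1).reshape(K, n1 * n2),
                       full_matrices=False)[0]
    return U1, U2, U3

def hosvd_filter_stack(stack, tau):
    """Project the stack onto its HOSVD bases, hard-threshold the
    transform coefficients at tau, and invert the transform."""
    U1, U2, U3 = hosvd_bases(stack)
    coeffs = np.einsum('ai,bj,ck,abc->ijk', U1, U2, U3, stack)
    coeffs[np.abs(coeffs) < tau] = 0.0
    return np.einsum('ia,jb,kc,abc->ijk', U1, U2, U3, coeffs)
```

With τ = 0 the transform is inverted exactly; in denoising, τ would typically be chosen to grow with the noise level σ.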

Note again that NL-SVD and HOSVD follow the concept of matrix-based patch representations, in tune with the philosophy followed by [155], [152], [158], [167] and [168]. We could instead have represented each n × n patch as an n² × 1 vector and built a covariance matrix of size n² × n² to produce the spatially adaptive bases; indeed, such an approach was taken in [169]. However, the complexity of that method is O([T_S + Kn⁴ + n⁶]N), which is greater than ours. The KSVD technique also follows a similar vector-based patch representation, and the K learned bases have size n² × 1 (with K ≫ n²). An important point is that the SVD is a characteristic of a matrix/patch; there is no analog of the SVD for the vectorial representation of a patch.

Figure 7-1. Global SVD filtering on the Barbara image: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with rank 1 truncation (PSNR = 14.7), (D) filtered image with rank 10 truncation (PSNR = 20.17), (E) filtered image with rank 100 truncation (PSNR = 24.3), (F) filtered image with rank 200 truncation (PSNR = 23.03)

Figure 7-2. Patch-based SVD filtering on the Barbara image: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with rank 1 truncation in each patch (PSNR = 23.9), (D) filtered image with rank 2 truncation in each patch (PSNR = 25.05), (E) filtered image with nullification of singular values below 3σ in each patch (PSNR = 23.42), (F) filtered image with truncation of singular values in each patch so as to match the noise variance (PSNR = 25.8)

Figure 7-3. Oracle filter with SVD: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with nullification of all values in the projection matrix below 3σ in each patch (PSNR = 36.9), (D) noisy image with Gaussian noise of σ = 40 (PSNR = 22.11), (E) filtered image with nullification of all values in the projection matrix below 3σ in each patch (PSNR = 31.34)

Figure 7-4. Fifteen synthetic patches


Figure 7-5. Threshold functions for DCT coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4

Figure 7-6. DCT filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator on non-overlapping 8 × 8 patches (PSNR = 26.26), (D) filtered image with MAP estimator on non-overlapping 8 × 8 patches (PSNR = 26.19), (E) filtered image with MMSE estimator on overlapping 8 × 8 patches (PSNR = 28.03), (F) filtered image with MAP estimator on overlapping 8 × 8 patches (PSNR = 29.94)

Figure 7-7. DCT filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator on non-overlapping 8 × 8 patches (PSNR = 27.12), (D) filtered image with MAP estimator on non-overlapping 8 × 8 patches (PSNR = 26.9), (E) filtered image with MMSE estimator on overlapping 8 × 8 patches (PSNR = 29.1), (F) filtered image with MAP estimator on overlapping 8 × 8 patches (PSNR = 29.94)


Figure 7-8. Threshold functions for coefficients of (A) the sixth and (B) the seventh patch from Figure 7-4 when projected onto SVD bases of patches from the database

Figure 7-9. SVD filtering with MAP and MMSE methods: (A) clean Barbara image, (B) noisy image with Gaussian noise of σ = 20 (PSNR = 22.11), (C) filtered image with MMSE estimator with SVD bases of patches from the database, on overlapping 8 × 8 patches (PSNR = 25.2), (D) filtered image with MAP estimator with SVD bases of patches from the database, on overlapping 8 × 8 patches (PSNR = 28.85), (E) filtered image with MAP estimator with SVD bases of the true patches, on overlapping 8 × 8 patches (PSNR = 36.6)

Figure 7-10. Motivation for robust PCA: though the patches are structurally different, the difference between the two noisy patches falls below the threshold of 3σ²n²

Figure 7-11. Barbara image: (A) reference patch, (B) patches similar to the reference patch (similarity measured on the noisy image, which is not shown here), (C) correlation matrices (top row) and learned bases

Figure 7-12. Mandrill image: (A) reference patch, (B) patches similar to the reference patch (similarity measured on the noisy image, which is not shown here), (C) correlation matrices (top row) and learned bases

Figure 7-13. DCT bases (8 × 8).

Figure 7-14. Barbara image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-15. Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD

Figure 7-16. Boat image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-17. Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD

Figure 7-18. Stream image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-19. Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD

Figure 7-20. Fingerprint image: (A) clean image, (B) noisy version with σ = 20, PSNR = 22, (C) output of NL-SVD, (D) output of NL-Means, (E) output of BM3D1, (F) output of BM3D2, (G) output of HOSVD

Figure 7-21. Residual with (A) NL-SVD, (B) NL-Means, (C) BM3D1, (D) BM3D2, (E) HOSVD

Figure 7-22. For σ = 20, denoised Barbara image with NL-SVD (A) [PSNR = 30.96] and DCT (C) [PSNR = 29.92]. For the same noise level, denoised boat image with NL-SVD (B) [PSNR = 30.24] and DCT (D) [PSNR = 29.95].

Figure 7-23. (A) Checkerboard image, (B) noisy version of the image with σ = 20, (C) denoised with NL-SVD (PSNR = 34) and (D) DCT (PSNR = 27). Zoom in for a better view.

Figure 7-24. Absolute difference between the true Barbara image and the denoised image produced by (A) NL-SVD, (B) BM3D1, (C) BM3D2. All three algorithms were run on the image with noise σ = 20.

Figure 7-25. A zoomed view of Barbara’s face for (A) the original image, (B) NL-SVD and (C) BM3D2. Note the shock artifacts on Barbara’s face produced by BM3D2.

Figure 7-26. Reconstructed images when Barbara (with noise σ = 20) is denoised with NL-SVD run on patch sizes (A) 4 × 4, (B) 6 × 6, (C) 8 × 8, (D) 10 × 10, (E) 12 × 12, (F) 14 × 14 and (G) 16 × 16.

Figure 7-27. Residual images when Barbara (with noise σ = 20) is denoised with NL-SVD run on patch sizes (A) 4 × 4, (B) 6 × 6, (C) 8 × 8, (D) 10 × 10, (E) 12 × 12, (F) 14 × 14 and (G) 16 × 16.

Table 7-1. Avg, max and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for DCT bases
Patch  MAP (avg)  MAP (max)  MAP (med.)  MMSE (avg)  MMSE (max)  MMSE (med.)
1   16.38   172.45   3.45    3.80    28.44   3.16
2   81.63   234.06   74.90   44.19   196.18  33.88
3   77.54   268.92   67.74   43.20   163.49  34.33
4   57.23   232.21   49.25   3.72    26.77   3.11
5   60.63   221.48   51.17   3.94    60.67   3.19
6   799.24  1192.52  795.08  277.22  468.22  269.79
7   37.46   207.22   24.61   3.85    41.91   3.19
8   60.05   272.32   65.89   33.39   98.50   28.70
9   62.18   220.53   66.90   34.27   103.88  28.23
10  63.17   200.45   51.20   29.62   55.40   29.61
11  39.59   172.35   31.14   3.84    52.33   3.15
12  42.01   250.68   32.06   7.21    313.86  3.26
13  430.12  815.37   422.66  223.45  421.27  220.23
14  425.47  768.06   416.78  221.48  409.32  214.90
15  900.20  1494.50  890.41  318.90  599.51  314.54

Table 7-2. Avg, max and median error on synthetic patches from Figure 7-4 with MAP and MMSE estimators for SVD basis of the clean synthetic patch
Patch  MAP (avg)  MAP (max)  MAP (med.)  MMSE (avg)  MMSE (max)  MMSE (med.)
1   17.04  200.25  4.39   3.76   34.98   3.10
2   18.19  157.71  4.14   3.83   24.20   3.17
3   17.13  159.67  4.60   3.77   42.49   3.07
4   61.40  234.57  51.09  3.77   15.05   3.23
5   61.63  303.55  53.95  3.77   27.48   3.07
6   24.43  229.06  11.82  3.88   37.62   3.21
7   37.41  182.07  26.01  3.71   18.87   3.10
8   17.05  153.97  4.51   4.09   41.73   3.13
9   20.75  185.96  5.37   3.98   19.89   3.19
10  41.95  200.29  20.79  23.49  120.35  9.22
11  16.43  171.91  3.88   3.73   16.82   3.19
12  16.88  200.24  4.04   4.01   50.55   3.14
13  52.93  177.58  44.35  23.66  133.74  16.56
14  56.33  291.71  44.30  23.44  141.31  17.12
15  39.89  162.01  31.06  14.52  94.97   5.34

Table 7-3. PSNR values for noise level σ = 5 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  38.339  38.268  39.131  38.609  38.339  38.981  39.146  45.347
12  37.434  37.119  38.044  37.693  37.436  37.963  38.143  45.023
11  36.578  36.721  37.179  36.594  36.263  36.939  37.141  44.183
10  36.623  36.833  37.260  36.825  36.597  37.260  37.379  44.203
9   35.623  36.377  37.283  35.858  35.643  36.261  36.641  44.398
8   35.913  35.300  36.624  36.109  36.108  36.227  36.410  45.964
7   36.314  36.751  37.055  36.387  36.168  36.861  37.076  44.341
6   37.850  37.903  38.554  37.997  37.662  38.459  38.534  44.162
5   36.449  36.678  37.008  36.676  36.365  37.025  37.187  45.078
4   34.699  34.968  35.181  34.739  34.660  35.007  35.148  45.412
3   36.921  37.455  37.707  37.007  36.602  37.478  37.540  43.672
2   35.004  35.325  35.544  35.161  35.114  35.529  35.643  45.534
1   38.646  38.434  39.336  38.770  38.390  39.122  39.224  43.876

Table 7-4. SSIM values for noise level σ = 5 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.956  0.949  0.958  0.958  0.953  0.958  0.959  0.986
12  0.961  0.957  0.963  0.963  0.958  0.963  0.964  0.989
11  0.936  0.938  0.940  0.933  0.922  0.935  0.938  0.986
10  0.946  0.946  0.949  0.946  0.940  0.948  0.950  0.988
9   0.900  0.916  0.932  0.903  0.894  0.908  0.918  0.987
8   0.986  0.984  0.988  0.987  0.987  0.987  0.987  0.999
7   0.937  0.940  0.943  0.936  0.930  0.939  0.943  0.988
6   0.941  0.939  0.945  0.941  0.934  0.943  0.943  0.981
5   0.947  0.944  0.948  0.949  0.941  0.949  0.951  0.989
4   0.953  0.958  0.958  0.952  0.949  0.955  0.958  0.995
3   0.920  0.928  0.928  0.918  0.906  0.922  0.922  0.978
2   0.959  0.961  0.962  0.960  0.959  0.962  0.964  0.995
1   0.941  0.936  0.944  0.942  0.935  0.942  0.943  0.978

Table 7-5. PSNR values for noise level σ = 10 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  35.137  34.213  35.664  35.144  34.905  35.544  35.867  41.725
12  34.032  33.044  34.386  34.459  34.050  34.536  34.882  41.177
11  33.320  32.743  33.623  33.392  32.847  33.635  33.855  39.183
10  33.235  32.674  33.493  33.377  33.082  33.781  33.993  39.882
9   33.003  32.764  33.942  33.320  32.548  33.287  33.304  37.903
8   31.631  31.464  32.386  31.614  31.938  32.131  32.427  40.652
7   33.009  32.748  33.398  33.066  32.451  33.379  33.613  39.371
6   35.166  33.965  35.460  35.336  35.015  35.576  35.825  40.577
5   32.514  32.339  32.835  32.474  32.132  32.950  33.208  40.131
4   29.989  30.262  30.486  29.484  29.886  30.353  30.534  39.226
3   34.521  33.781  34.807  34.728  34.291  34.913  35.003  39.316
2   30.380  30.760  30.931  29.807  30.262  30.831  31.099  39.530
1   36.242  34.399  36.542  36.447  36.062  36.527  36.808  40.864

Table 7-6. SSIM values for noise level σ = 10 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.928  0.887  0.931  0.927  0.925  0.929  0.935  0.975
12  0.934  0.903  0.934  0.938  0.933  0.937  0.942  0.978
11  0.879  0.866  0.883  0.886  0.862  0.884  0.888  0.956
10  0.895  0.878  0.898  0.903  0.886  0.904  0.908  0.969
9   0.818  0.822  0.853  0.832  0.789  0.822  0.819  0.935
8   0.963  0.961  0.968  0.961  0.964  0.966  0.969  0.995
7   0.873  0.862  0.879  0.878  0.850  0.879  0.885  0.964
6   0.908  0.870  0.909  0.912  0.904  0.911  0.915  0.963
5   0.886  0.869  0.884  0.890  0.870  0.889  0.895  0.971
4   0.885  0.896  0.896  0.876  0.868  0.891  0.897  0.977
3   0.879  0.858  0.882  0.885  0.872  0.883  0.882  0.946
2   0.892  0.897  0.902  0.877  0.877  0.898  0.906  0.983
1   0.911  0.862  0.913  0.914  0.909  0.913  0.916  0.961

Table 7-7. PSNR values for noise level σ = 15 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  33.262  32.223  33.597  33.341  32.825  33.506  33.949  39.502
12  32.283  31.363  32.375  32.836  32.008  32.587  33.057  38.766
11  31.454  30.566  31.706  31.704  30.899  31.710  32.039  36.910
10  31.320  30.342  31.394  31.576  30.963  31.725  32.049  37.637
9   31.854  31.355  32.271  32.201  31.603  32.095  32.132  35.364
8   29.537  29.159  30.051  29.798  29.229  29.904  30.262  37.562
7   31.296  30.617  31.508  31.651  30.703  31.604  31.865  36.835
6   33.487  32.166  33.712  33.688  33.222  33.737  34.133  38.506
5   30.388  29.835  30.482  30.568  29.759  30.700  30.973  37.013
4   27.557  27.461  27.969  27.101  27.305  27.881  28.166  35.793
3   33.138  32.107  33.199  33.326  32.852  33.375  33.594  37.534
2   28.129  28.080  28.564  27.761  27.619  28.405  28.718  35.905
1   34.628  32.817  34.748  34.848  34.375  34.771  35.270  39.248

Table 7-8. SSIM values for noise level σ = 15 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.905  0.855  0.910  0.898  0.902  0.901  0.916  0.967
12  0.910  0.876  0.909  0.916  0.909  0.913  0.923  0.969
11  0.836  0.810  0.841  0.849  0.822  0.845  0.853  0.935
10  0.853  0.821  0.852  0.866  0.844  0.864  0.874  0.955
9   0.771  0.773  0.789  0.791  0.758  0.782  0.776  0.886
8   0.940  0.934  0.946  0.943  0.931  0.944  0.949  0.990
7   0.820  0.799  0.824  0.838  0.799  0.832  0.841  0.942
6   0.881  0.837  0.884  0.883  0.880  0.883  0.893  0.949
5   0.828  0.799  0.823  0.838  0.804  0.834  0.842  0.950
4   0.823  0.826  0.835  0.814  0.793  0.831  0.842  0.956
3   0.855  0.822  0.855  0.857  0.852  0.857  0.861  0.929
2   0.825  0.818  0.837  0.814  0.785  0.831  0.845  0.964
1   0.887  0.836  0.888  0.888  0.888  0.887  0.897  0.952

Table 7-9. PSNR values for noise level σ = 20 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  31.936  30.540  32.266  32.015  31.433  32.028  32.552  37.695
12  30.878  29.421  30.762  31.533  30.543  31.026  31.660  36.603
11  30.187  28.911  30.360  30.491  29.596  30.395  30.802  35.510
10  29.961  28.387  29.929  30.299  29.422  30.252  30.698  36.066
9   31.135  29.924  31.341  31.354  30.887  31.284  31.433  34.178
8   28.053  27.424  28.454  28.563  27.389  28.403  28.794  35.318
7   30.098  28.931  30.166  30.536  29.532  30.397  30.726  35.241
6   32.240  30.473  32.371  32.411  31.903  32.375  32.950  36.975
5   28.939  27.995  28.853  29.291  28.250  29.200  29.464  34.989
4   25.976  25.933  26.372  25.720  25.543  26.260  26.582  33.434
3   32.009  30.357  32.005  32.166  31.740  32.138  32.498  36.260
2   26.800  26.375  27.062  26.722  25.892  26.918  27.192  33.485
1   33.401  30.902  33.494  33.525  33.169  33.430  34.075  37.971

Table 7-10. SSIM values for noise level σ = 20 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.885  0.802  0.893  0.869  0.885  0.875  0.899  0.959
12  0.882  0.821  0.877  0.897  0.884  0.884  0.903  0.956
11  0.801  0.753  0.803  0.814  0.789  0.809  0.824  0.922
10  0.816  0.755  0.812  0.831  0.806  0.828  0.845  0.945
9   0.747  0.723  0.755  0.761  0.740  0.755  0.754  0.859
8   0.914  0.903  0.922  0.926  0.899  0.922  0.930  0.984
7   0.778  0.736  0.776  0.800  0.761  0.793  0.807  0.924
6   0.858  0.782  0.861  0.852  0.861  0.855  0.875  0.938
5   0.775  0.729  0.768  0.792  0.753  0.784  0.796  0.930
4   0.765  0.760  0.780  0.764  0.722  0.776  0.792  0.936
3   0.835  0.769  0.835  0.830  0.836  0.831  0.843  0.918
2   0.764  0.746  0.773  0.767  0.700  0.771  0.786  0.942
1   0.867  0.777  0.869  0.859  0.871  0.862  0.880  0.944

Table 7-11. PSNR values: NL-SVD versus DCT for noise level σ = 20 on the benchmark dataset
Image #  NL-SVD  DCT
checkerboard  34.5  27.2
13  31.93  31.76
12  30.88  29.93
11  30.20  29.95
10  29.96  29.73
9   31.13  31.06
8   28.05  28.08
7   30.10  29.90
6   32.24  32.09
5   29.94  28.57
4   25.98  25.82
3   32.00  31.67
2   26.80  26.49
1   33.40  33.48

Table 7-12. PSNR values for noise level σ = 25 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  30.845  28.835  31.145  31.038  30.324  30.955  31.512  36.212
12  29.766  28.135  29.552  30.439  29.367  29.870  30.595  35.030
11  29.207  27.495  29.188  29.482  28.476  29.306  29.782  34.292
10  28.736  26.916  28.689  29.297  28.215  29.120  29.639  34.794
9   30.484  28.906  30.629  30.698  30.287  30.540  30.880  33.389
8   26.834  25.764  27.225  27.422  26.218  27.262  27.719  33.669
7   29.094  27.608  29.152  29.588  28.690  29.433  29.834  34.074
6   31.329  29.202  31.279  31.470  30.852  31.294  31.957  35.678
5   27.763  26.367  27.617  28.200  27.136  28.061  28.340  33.344
4   24.929  24.488  25.204  25.019  24.186  25.097  25.435  31.635
3   30.993  28.802  30.801  31.349  30.785  31.170  31.626  35.239
2   25.792  24.894  25.900  25.823  24.741  25.876  26.106  31.623
1   32.456  29.584  32.367  32.685  32.127  32.343  33.120  36.856

Table 7-13. SSIM values for noise level σ = 25 on the benchmark dataset
Image #  NL-SVD  NL-Means  KSVD  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.868  0.760  0.876  0.854  0.870  0.852  0.885  0.951
12  0.850  0.780  0.849  0.870  0.860  0.857  0.884  0.943
11  0.771  0.700  0.769  0.784  0.758  0.777  0.799  0.909
10  0.777  0.697  0.773  0.800  0.769  0.792  0.816  0.935
9   0.729  0.687  0.733  0.737  0.727  0.732  0.739  0.841
8   0.886  0.864  0.896  0.906  0.874  0.902  0.913  0.977
7   0.740  0.681  0.739  0.766  0.732  0.759  0.778  0.910
6   0.840  0.738  0.842  0.834  0.842  0.828  0.858  0.928
5   0.730  0.664  0.723  0.749  0.711  0.741  0.756  0.911
4   0.713  0.690  0.725  0.720  0.652  0.723  0.745  0.915
3   0.815  0.725  0.815  0.814  0.821  0.808  0.828  0.909
2   0.707  0.667  0.713  0.715  0.636  0.719  0.735  0.919
1   0.849  0.735  0.850  0.844  0.854  0.836  0.864  0.936

Table 7-14. PSNR values for noise level σ = 30 on the benchmark dataset
Image #  NL-SVD  NL-Means  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  29.875  27.680  30.079  29.437  30.101  30.711  35.098
12  28.639  26.853  29.462  28.498  28.952  29.793  33.696
11  28.305  26.368  28.650  27.656  28.466  29.017  33.380
10  27.740  25.665  28.290  27.268  28.154  28.759  33.635
9   29.997  27.865  29.976  29.770  29.947  30.420  32.770
8   25.863  24.552  26.676  25.434  26.382  26.874  32.251
7   28.357  26.576  28.798  27.993  28.654  29.145  33.160
6   30.233  28.080  30.411  30.000  30.417  31.194  34.597
5   26.778  25.176  27.278  26.248  27.109  27.353  32.041
4   24.139  23.396  24.293  23.094  24.208  24.551  30.233
3   29.996  27.438  30.150  29.824  30.164  30.673  34.259
2   24.884  23.858  25.278  24.041  25.115  25.336  30.289
1   31.549  28.388  31.385  31.166  31.267  32.130  35.861

Table 7-15. SSIM values for noise level σ = 30 on the benchmark dataset
Image #  NL-SVD  NL-Means  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.853  0.717  0.815  0.856  0.832  0.873  0.944
12  0.824  0.730  0.836  0.840  0.833  0.868  0.930
11  0.746  0.650  0.752  0.736  0.750  0.779  0.901
10  0.742  0.639  0.760  0.737  0.757  0.790  0.925
9   0.715  0.644  0.710  0.716  0.712  0.727  0.829
8   0.860  0.826  0.890  0.851  0.881  0.895  0.968
7   0.713  0.631  0.733  0.707  0.728  0.753  0.898
6   0.815  0.688  0.791  0.825  0.803  0.843  0.918
5   0.690  0.610  0.710  0.677  0.704  0.722  0.891
4   0.664  0.624  0.686  0.587  0.677  0.701  0.896
3   0.796  0.676  0.771  0.804  0.782  0.811  0.900
2   0.649  0.609  0.687  0.592  0.675  0.690  0.897
1   0.829  0.683  0.797  0.837  0.806  0.845  0.929

Table 7-16. PSNR values for noise level σ = 35 on the benchmark dataset
Image #  NL-SVD  NL-Means  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  28.905  26.542  29.322  28.625  29.259  29.897  33.834
12  27.497  25.663  28.547  27.559  27.944  28.914  32.188
11  27.419  25.330  27.903  26.869  27.664  28.287  32.523
10  26.931  24.750  27.572  26.522  27.342  28.010  32.796
9   29.436  26.895  29.533  29.247  29.313  29.913  32.197
8   24.925  23.628  25.822  24.701  25.516  26.065  31.128
7   27.692  25.732  28.116  27.354  27.930  28.440  32.378
6   29.466  27.057  29.847  29.386  29.676  30.545  33.778
5   25.948  24.237  26.477  25.531  26.312  26.590  31.071
4   23.385  22.565  23.537  22.394  23.408  23.804  29.055
3   29.101  26.321  29.469  28.992  29.302  29.900  33.358
2   24.167  23.078  24.575  23.504  24.466  24.730  29.155
1   30.824  27.275  30.822  30.462  30.450  31.349  34.869

Table 7-17. SSIM values for noise level σ = 35 on the benchmark dataset
Image #  NL-SVD  NL-Means  HOSVD  3DDCT  BM3D1  BM3D2  Oracle
13  0.838  0.674  0.810  0.843  0.814  0.860  0.937
12  0.790  0.681  0.815  0.813  0.801  0.846  0.912
11  0.716  0.600  0.727  0.709  0.718  0.755  0.892
10  0.711  0.588  0.736  0.710  0.724  0.766  0.918
9   0.702  0.602  0.698  0.705  0.691  0.715  0.821
8   0.831  0.793  0.869  0.829  0.859  0.877  0.960
7   0.687  0.588  0.707  0.684  0.698  0.728  0.888
6   0.798  0.640  0.783  0.813  0.779  0.830  0.909
5   0.655  0.564  0.678  0.647  0.669  0.693  0.875
4   0.611  0.569  0.632  0.538  0.630  0.657  0.875
3   0.778  0.631  0.761  0.789  0.759  0.797  0.892
2   0.602  0.564  0.642  0.559  0.637  0.654  0.874
1   0.814  0.632  0.790  0.824  0.782  0.830  0.922

Table 7-18. Patch-size selection for σ = 20
Image #  Best PSNR  Best scale (by PSNR)  Best ρ  Best scale (by ρ)  Best PSNR by ρ
13  32.000  8   9.648  14  31.690
12  30.990  10  9.650  11  30.958
11  30.260  8   9.640  14  29.910
10  30.030  9   9.650  11  29.988
9   31.210  8   9.713  16  30.968
8   28.120  8   9.680  6   28.110
7   30.190  8   9.639  14  30.024
6   32.350  10  9.641  14  32.190
5   29.190  5   9.644  9   28.890
4   26.166  4   9.756  6   26.044
3   32.020  8   9.639  12  31.808
2   27.020  5   9.673  6   27.020
1   33.510  10  9.635  11  33.499


CHAPTER 8
AUTOMATED SELECTION OF FILTER PARAMETERS

8.1 Introduction

Despite the vast body of literature on image denoising, relatively little work has

been done in the area of automatically choosing the filter parameters that yield optimal

filter performance. The typical denoising technique requires tuning parameters that are

critical for its optimal performance. In denoising experiments reported in contemporary

literature, the filter performance is usually measured using a full-reference image

quality measure (such as the MSE/PSNR or SSIM) between the denoised and the

original image. The parameters are picked so as to yield the optimal value of the

quality measure for a particular filter, but this requires knowledge of the original image

and is not extensible to real-world denoising situations. Hence we need criteria for

parameter selection that do not refer to the original image. In this chapter, we classify

these criteria into two types: (1) independence-based criteria that measure the degree

of independence between the denoised image and the residual, and (2) criteria that

measure how noisy the residual image is, without referring to the denoised image. We

contribute to and critique criteria of type (1), and propose criteria of type (2). Our criteria assume that the noise is i.i.d. and additive, and that a loose lower bound on the noise variance is known.

The material in this chapter is based on the author’s published work in [170].1 This

chapter is organized as follows: Section 8.2 reviews existing literature for filter parameter

selection, followed by a description of the proposed criteria in Section 8.3, experimental

results in Section 8.4 and discussion in Section 8.5.

1 Parts of the contents of this chapter have been reprinted with permission from: A. Rajwade, A. Rangarajan and A. Banerjee, ‘Automated Filter Parameter Selection using Measures of Noiseness’, Canadian Conference on Computer and Robot Vision, pages 86-93, June 2010. © 2010, IEEE.


8.2 Literature Review on Automated Filter Parameter Selection

In PDE-based denoising, the choice of stopping time for the PDE evolution is a

crucial parameter. Some researchers propose to stop the PDE when the variance of

the residual equals the variance of the noise, which is assumed to be known [86], [171].

This method ignores higher order statistics of the noise. Others use a hypothesis test

between the empirical distribution of the residual and the true noise distribution [139] for

polynomial order selection in regression-based smoothing. However, the exact variance

of the noise or its complete distribution is usually not known in practical situations.

A decorrelation-based criterion independently proposed in [172] and [173] does not

require any knowledge of the noise distribution except that the noise is independent

of the original signal. As per this criterion, the optimal filter parameter is chosen to be

one which minimizes the correlation coefficient between the denoised and the residual

images, regardless of the noise variance. This criterion however has some problems: (1)

in the limit of extreme over-smoothing or under-smoothing, the correlation coefficient is

undefined as the denoised image could become a constant image, (2) it is too global a

criterion (though using a sum of local measures is a ready alternative) and (3) it ignores

higher-order dependencies. We suggest a solution to the third issue in Section 8.3.

It should be noted that all the aforementioned criteria, as well as the ones we suggest in this chapter, are necessary but not sufficient for parameter selection. Gilboa et

al. [174] attempt to alleviate this by selecting a stopping time that seeks to maximize

the signal-to-noise-ratio (SNR) directly. Their method however requires an estimate

of the rate of change of the covariance between the residual and the noise w.r.t. the

filtering parameter. This estimate in turn requires full knowledge of the noise distribution.

The method also rests on the assumption that the covariance between the residual for any image and the actual noise can be estimated from a single noise image (generated from the same noise distribution) on which the filtering algorithm is run. This assumption


is not justified theoretically though experimental results are impressive (see [174] for

more details). Vanhamel et al. [175] propose a criterion that maximizes an estimate

of the correlation between the denoised image and the true, underlying image. This

estimate, however, can be computed only by using some assumptions that have only

experimental justification. In wavelet thresholding methods, risk-based criteria have been proposed for the optimal choice of the threshold on the wavelet coefficients. These methods, such as those in [113] or SURE (Stein’s unbiased risk estimator) from [176], again require knowledge of the noise model, including the noise variance.

Recently, Brunet et al. have developed no-reference quality estimates of the MSE

between the denoised image and the true underlying image [141]. These estimates do

not require knowledge of the original image, but they do require knowledge of the noise

variance and obtain a rough, heuristic estimate of the covariance between the residual

and the noise. Moreover the performance of these estimates has been tested only on

Gaussian noise models.

8.3 Theory

8.3.1 Independence Measures

In what follows, we shall denote the denoised image obtained by filtering a noisy

image I as D, its corresponding residual as R (note that R = I − D) and the true

image underlying I as J. As mentioned earlier, independence-based criteria have been

developed in image processing literature. In cases where a noisy signal is oversmoothed

(locally or globally), the residual image clearly shows the distinct features from the

image (referred to as ‘method noise’ in [2]). This is true even in those cases where

the noise is independent of the signal. Independence-based criteria are based on the

assumption that when the noisy image is filtered optimally, the residual would contain

mostly noise and very little signal and hence it would be independent of the denoised

image. It has been experimentally reported in [172] that the absolute correlation

coefficient (denoted CC) between D and R decreases almost monotonically as


the filter smoothing parameter is increased (in discrete steps) from a lower bound to

a certain ‘optimal’ value, after which its value increases steadily until an upper bound.

However, CC ignores anything higher than second-order dependencies. To alleviate

this problem, we propose to minimize the mutual information (MI) between D and R,

as a criterion for parameter selection. This has been proposed as a (local) measure

of noiseness earlier in [141], but it has been used in that paper only as an indicator

of areas in the image where the residual is unfaithful to the noise model, rather than

as an explicit parameter-selection criterion. In this chapter, we also propose to use

the following information-theoretic measures of correlation from [39] (see page 47) as

independence criteria:

η1(R,D) = 1 − H(R|D)/H(D) = MI(R,D)/H(D)    (8–1)

η2(R,D) = 1 − H(D|R)/H(R) = MI(R,D)/H(R).    (8–2)

Here H(X) refers to the Shannon entropy of X, and H(X|Y) refers to the conditional Shannon entropy of X given Y. Both η1 and η2 take values between 0 (full independence) and 1 (full dependence).

A problem with all these criteria (CC, MI, η1, η2) lies in the inherent probabilistic

notion of independence itself. In the extreme case of oversmoothing, the ‘denoised’

image may turn out to have a constant intensity, whereas in the case of extreme

undersmoothing (no smoothing or very little smoothing), the residual will be a constant

(zero) signal. In such cases, CC, η1 and η2 are ill-defined, whereas MI turns out to be

zero (its least possible value). What this indicates is that these criteria have the

innate tendency to favor extreme cases of under- or over-smoothing. In practical

applications, one may choose to get around this issue by choosing a local minimum of

these measures within a heuristically chosen interval in the parameter landscape from

0 to ∞, but we wish to drive home a more fundamental point about the inherent flaw in

using independence measures. Moreover, it should be noted that localized versions of


these measures (i.e. sum of local independence measures) may produce false optima if

the filtering algorithm smoothes out local regions with fine textures.
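As a concrete illustration of these measures, CC, MI, η1 and η2 can be estimated from a joint histogram of D and R. The following is a minimal NumPy sketch, not the dissertation's implementation; the 40-bin setting and the random stand-in images are illustrative assumptions:

```python
import numpy as np

def independence_measures(d, r, bins=40):
    """CC, MI(D,R), eta1 = MI/H(D) and eta2 = MI/H(R) from a joint histogram."""
    h2, _, _ = np.histogram2d(d.ravel(), r.ravel(), bins=bins)
    p = h2 / h2.sum()

    def H(q):                      # Shannon entropy in nats, ignoring empty bins
        q = q[q > 0]
        return -np.sum(q * np.log(q))

    Hd, Hr, Hdr = H(p.sum(axis=1)), H(p.sum(axis=0)), H(p.ravel())
    mi = Hd + Hr - Hdr             # MI(D,R) = H(D) + H(R) - H(D,R)
    cc = abs(np.corrcoef(d.ravel(), r.ravel())[0, 1])
    return cc, mi, mi / Hd, mi / Hr

rng = np.random.default_rng(0)
d = rng.random((64, 64))           # stand-in 'denoised' image
r = rng.random((64, 64))           # stand-in residual, independent of d
cc, mi, eta1, eta2 = independence_measures(d, r)
```

For independent inputs such as these, CC is near zero and MI is small; η1 and η2 stay within [0, 1] by construction, since MI ≤ min(H(D), H(R)).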

8.3.2 Characterizing Residual ‘Noiseness’

Given that the assumed noise model is i.i.d. and signal-independent,

we expect the residual produced by an ideal denoising algorithm to obey these

characteristics. Therefore, patches from residual images are expected to have similar

distributions if the filtering algorithm has performed well. Our criterion for characterizing

the residual ‘noiseness’ is rooted in the framework of statistical hypothesis testing. We

choose the two-sample Kolmogorov-Smirnov (KS) test to check statistical homogeneity.

The two-sample KS test-statistic is defined as

K = sup_x |F1(x) − F2(x)|    (8–3)

where F1(x) and F2(x) are the respective empirical cumulative distribution functions

(ECDF) of the two samples, computed with N1 and N2 points. Under the null hypothesis

when N1 → ∞ and N2 → ∞, the distribution of the scaled statistic √(N1N2/(N1+N2)) K tends to the Kolmogorov distribution, and is therefore independent of the underlying true CDFs themselves. The K value therefore has a special meaning in statistics. For a significance level α (the probability of falsely rejecting the null hypothesis that the two samples come from the same distribution), let Kα be the value such that P(K ≤ Kα) = 1 − α under the null hypothesis. The null hypothesis is rejected at level α if √(N1N2/(N1+N2)) K > Kα. Given a value k of the test statistic computed empirically from the samples, we term P(K > k) (under the null hypothesis) the p-value.
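The behavior of the two-sample test can be illustrated with SciPy's `ks_2samp`; the sample sizes and distributions below are arbitrary choices for the sketch:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
# Same distribution: the statistic K is small and the p-value is typically large.
same = ks_2samp(rng.normal(size=500), rng.normal(size=500))
# Different distributions: K is larger and the p-value much smaller.
diff = ks_2samp(rng.normal(size=500), rng.uniform(-1.0, 1.0, size=500))
```

`same.statistic` and `diff.statistic` are the values of K, and the corresponding `pvalue` fields are the p-values used throughout this chapter.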

Most natural images (apart from homogeneous textures) show a considerable degree

of statistical dissimilarity. To demonstrate this, we performed the following experiment

on all 300 images from the Berkeley database [61]. Each image, at four scales with a successive downsampling factor of 2/3, was tiled into non-overlapping patches of size s × s, where s ∈ {16, 24, 32}. The two-sample KS test at α = 0.05 was performed

for patches from these images. The average rejection rate was 81% which indicates


that different regions from each image have different distributions. It should be noted

that the tiling of the image into patches was very important: a KS test between sample

subsets from random (non-contiguous) locations produced very low reject rates. A

similar experiment with the same scales and patch sizes run on pure Gaussian noise

images resulted in a rejection rate of only 7% for α = 0.05. Next, Gaussian noise of

σ = 0.005 (for intensity range [0,1]) was added to each image. Each image was filtered

using the Perona-Malik filter [44] for 90 iterations with a step size of 0.05 and edgeness

criterion of γ = 40 and the residual images were computed after the last iteration. The

KS-test was performed at α = 0.05 between patch pairs from each residual image. The

resulting rejection rate was 41%, indicating strong heterogeneity in the residual values.

As structural patterns were clearly visible in all these residual images, we conjecture that statistical heterogeneity is a strong indicator of the presence of structure.

Moreover, the percentage rejection rate (denoted h), the average value of the KS statistic (i.e. K) and the average negative logarithm of the p-values from each pairwise test (denoted P) are all indicators of the 'noiseness' of a residual (the lower the value, the noisier and hence more desirable the residual). Hence these measures act as criteria for filter parameter selection.² We prefer the criteria P and K to h because they do not

require a significance level to be specified a priori.
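The computation of h, K and P over non-overlapping patches can be sketched as follows. This is an illustrative implementation, not the exact experimental protocol: the synthetic residuals, the patch size and the `max_pairs` cap on pairwise tests are assumptions made for brevity:

```python
import itertools
import numpy as np
from scipy.stats import ks_2samp

def noiseness(residual, patch=12, alpha=0.05, max_pairs=200):
    """Rejection rate h, mean KS statistic K and mean -log p-value P over
    pairs of non-overlapping patches tiled from the residual."""
    rows, cols = residual.shape
    patches = [residual[i:i + patch, j:j + patch].ravel()
               for i in range(0, rows - patch + 1, patch)
               for j in range(0, cols - patch + 1, patch)]
    pairs = list(itertools.combinations(range(len(patches)), 2))[:max_pairs]
    stats, pvals = [], []
    for a, b in pairs:
        res = ks_2samp(patches[a], patches[b])
        stats.append(res.statistic)
        pvals.append(res.pvalue)
    h = float(np.mean([p < alpha for p in pvals]))
    K = float(np.mean(stats))
    P = float(np.mean(-np.log(np.maximum(pvals, 1e-300))))
    return h, K, P

rng = np.random.default_rng(2)
noise_res = rng.normal(size=(96, 96))                # homogeneous: pure noise
struct_res = noise_res + np.linspace(0.0, 3.0, 96)   # structure leaks into residual
h0, K0, P0 = noiseness(noise_res)
h1, K1, P1 = noiseness(struct_res)
```

The homogeneous residual yields small h, K and P, while the residual contaminated by a smooth intensity ramp yields larger values of all three, matching the behavior described above.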

The advantage of the KS-based measure over MI or CC is that values of P and

K are high in cases of image oversmoothing (as the residual will then contain more

and more structure). This is unlike MI or CC which will attain false minima. This is

demonstrated in Figure 8-1 where the decrease in the values of MI or CC at high

smoothing levels is quite evident. Just like MI or CC, the KS-based criteria do not

require knowledge of the noise distribution or even the exact noise variance. However

² For computing P, we assume for the sake of simplicity that the pairwise tests between individual patches are all independent.


all these criteria could be fooled by the pathological case of zero or very low denoising.

This is because in the very initial stages of denoising (obtained by, say, running a

PDE with a very small stepsize for very few iterations), the residual is likely to be

devoid of structure and independent of the (under)smoothed image. Consequently,

all measures: MI, CC, K and P will acquire (falsely) low values. This problem can be

avoided by making assumptions of the range of values for the noise variance (or a loose

lower-bound), without requiring exact knowledge of the variance. This has been the

strategy followed implicitly in contemporary parameter selection experiments (e.g. in

[172] the PDE stepsizes are chosen to be 0.1 and 1). In all our experiments, we make

similar assumptions. The exception is that KS-based measures do not require any upper

bound on the variance to be known: just a lower bound suffices.

8.4 Experimental Results

To demonstrate the effectiveness of the proposed criteria, we performed experiments

on all the 13 images from the benchmark dataset. All images from the dataset were

down-sized from 512 × 512 to 256 × 256. We experimented with six noise levels σ²ₙ ∈ {10⁻⁴, 5 × 10⁻⁴, 10⁻³, 5 × 10⁻³, 0.01, 0.05} on an intensity range of [0,1], and

with two additive zero-mean noise distributions: Gaussian (the most commonly used

assumption) and bounded uniform (noise due to quantization). The lower-bound

assumed on the noise variance was 10−6 in all experiments. Two filtering algorithms

were tested: the non-local means (NL-Means) filter [2] and total variation (TV) [86]. The

equation for the NL-Means filter is as follows:

Î(x) = [Σ_{x_k ∈ N(x;S_R)} w_k(x) I(x_k)] / [Σ_{x_k ∈ N(x;S_R)} w_k(x)],    (8–4)

w_k(x) = exp(−‖q(x;Q_R) − q(x_k;Q_R)‖² / σ),    (8–5)

where Î(x) is the estimated smoothed intensity, N(x;S_R) is a search window of diameter S_R around point x, w_k(x) is a weight factor, q(x;Q_R) is a patch of diameter Q_R centered


at x and σ is a smoothing parameter³. In our experiments, a patch size of 12 × 12 was used, with a search window of 50 × 50. σ was chosen by running the NL-Means algorithm for 55 different σ values from the set {1:1:10, 20:20:640, 650:50:1200} (start:step:end notation). The optimal σ values were computed using the following

criteria: CC(D,R), MI (D,R), η1(D,R), η2(D,R); sum of localized versions of all

above measures on a 12 × 12 window; h, P and K using two-sample KS tests on

non-overlapping 12 × 12 patches; and hn, Pn and Kn values computed using KS-test

between the residual and the true noise samples (which we know here as these are

synthetic experiments). All information-theoretic quantities were computed using 40 bins, as the image size was 256 × 256 (the rule of thumb for the optimal number of bins for n samples is O(n^(1/3))).
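A brute-force sketch of Eqs. 8–4 and 8–5 follows. The patch and window sizes are deliberately kept small here for speed (the experiments above use 12 × 12 patches and a 50 × 50 window), and the step-edge test image is an illustrative assumption:

```python
import numpy as np

def nl_means(img, patch=3, search=7, sigma=0.5):
    """Brute-force NL-Means: each output pixel is a weighted average over a
    search window, weighted by squared patch distances (Eqs. 8-4 and 8-5)."""
    pr, sr = patch // 2, search // 2
    pad = np.pad(img, pr + sr, mode='reflect')
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            ci, cj = i + pr + sr, j + pr + sr          # center in padded image
            ref = pad[ci - pr:ci + pr + 1, cj - pr:cj + pr + 1]
            wsum = vsum = 0.0
            for di in range(-sr, sr + 1):
                for dj in range(-sr, sr + 1):
                    qi, qj = ci + di, cj + dj
                    q = pad[qi - pr:qi + pr + 1, qj - pr:qj + pr + 1]
                    w = np.exp(-np.sum((ref - q) ** 2) / sigma)   # Eq. 8-5
                    wsum += w
                    vsum += w * pad[qi, qj]
            out[i, j] = vsum / wsum                    # Eq. 8-4
    return out

rng = np.random.default_rng(3)
clean = np.zeros((16, 16))
clean[:, 8:] = 1.0                                     # step edge
noisy = clean + 0.1 * rng.normal(size=clean.shape)
den = nl_means(noisy)
```

Because patches straddling the edge receive near-zero weights, the edge is preserved while noise in flat regions is averaged away.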

The total variation (TV) filter is obtained by minimizing the following energy:

E(I) = ∫_Ω |∇I(x)| dx    (8–6)

for an image defined on the domain Ω, which gives a geometric heat equation PDE

that is iterated some T times starting from the given noisy image as an initial condition.

The stopping time T is the equivalent of the smoothing parameter here. For the TV

filter, in addition to all the criteria mentioned before, we also tested the method in [174]

(assuming knowledge of the noise distribution).
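A minimal explicit-descent sketch of the resulting flow I_t = div(∇I/|∇I|) is given below, with the gradient magnitude regularized by a small ε to avoid division by zero. The step size, ε, iteration count and test image are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

def tv_flow(img, iters=150, dt=0.01, eps=0.1):
    """Explicit gradient descent on the (eps-regularized) TV energy:
    I_t = div(grad I / |grad I|); forward differences for the gradient,
    backward differences for the divergence, periodic boundaries."""
    I = img.astype(float).copy()
    for _ in range(iters):
        gx = np.roll(I, -1, axis=1) - I
        gy = np.roll(I, -1, axis=0) - I
        mag = np.sqrt(gx ** 2 + gy ** 2 + eps ** 2)
        nx, ny = gx / mag, gy / mag
        div = (nx - np.roll(nx, 1, axis=1)) + (ny - np.roll(ny, 1, axis=0))
        I += dt * div
    return I

rng = np.random.default_rng(4)
clean = np.zeros((32, 32))
clean[8:24, 8:24] = 1.0                                # piecewise-constant square
noisy = clean + 0.1 * rng.normal(size=clean.shape)
den = tv_flow(noisy)
```

The number of iterations plays the role of the stopping time T discussed above: each step reduces the (regularized) total variation of the evolving image.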

8.4.1 Validation Method

In order to validate the σ or T estimates produced by these criteria, it is important

to see how well they are in tune with those filter parameter values that are optimal with

regard to different well-known quality measures between the denoised image and the

original image. The most commonly used quality measure is the MSE. However, as

³ Note that we use σ to denote the smoothing parameter of the filtering algorithm and σ²ₙ to denote the variance of the noise.


documented in [150] and mentioned in Chapter 6, MSE has several limitations. Hence

we also experimented with the structural similarity index (SSIM) developed in [151]; with the

L1 difference between the denoised and the original image; and with the CC,MI, η1 and

η2 values between the denoised and the original image (as well as with the sum of their

local versions).

8.4.2 Results on NL-Means

Results on NL-Means for Gaussian noise of six different variances are shown in

Tables 8-1 through 8-6. In all these tables, ∆X is the absolute difference between the 'X'

values as predicted by the criterion, and the optimal ‘X ’ value. The ‘X ’ value is defined

to be the quality measure ‘X ’ between the denoised and the true image, chosen here

to be L1 or the SSIM. dσX is the absolute difference between the σ value for NL-Means

predicted by the criterion and the σ value when the quality measure X was optimal.

The other quality measures are not shown here to save space. The last two rows of the

tables indicate the minimum and maximum of the optimal quality measure values across

all the 13 images on which the experiments were run (which gives an idea about the

range of the optimal ∆X values).

Some results on the 'stream' and 'mandrill' images are shown in Figures 8-2 and

8-3 with the corresponding residuals. Experiments were also performed on images

degraded with bounded uniform noise of total width 2 × 5 × 10⁻⁴ and 2 × 5 × 10⁻³ (on an

intensity range of [0,1]) with results shown in Tables 8-7 and 8-8.

For low and moderate noise levels, it was observed that the criteria P or K

produced errors an order of magnitude smaller than those of MI, η1 and η2 (which were the closest competitors), and even two orders of magnitude smaller than those of CC. Our observation was

that CC and information theoretic criteria tend to cause undersmoothing for NL-Means.

At high noise levels, we saw that all criteria produced a high error in prediction of the

optimal parameter. An explanation for this is that the NL-Means algorithm by itself

does not produce very good results at high noise levels, and requires high σ values


which produce highly structured residuals. For low σ values, it produces residuals that

resemble noise in the sense of various criteria, but this leads to hugely undersmoothed

estimates.

An interesting phenomenon we observed was that the same KS-test based

measures (i.e. Pn and Kn) between the residuals and the actual noise samples (which

we know, as these are synthetic experiments) often did not perform as well as the

KS-test measures (i.e. P and K ) between pairs of patches from the residual. We

conjecture that this is owing to biases inherent in the NL-Means algorithm (as in many

others - see [148]) due to which the residuals have different means and variances as

compared to the actual noise, even though the residuals may be homogeneous. We

checked experimentally that the variance of the residuals produced by NL-Means under

σ values optimal in an L1-sense was significantly different from the noise variance.

8.4.3 Effect of Patch Size on the KS Test

The KS test employed here operates on image patches. The patch-size can

be a crucial parameter: too low a patch size (say 2 × 2) will lead to reduction in the

discriminatory power of the KS test for this application and cause (false) rejection for

all filter parameters, whereas too high a patch size will lead to (false) acceptance for all

filter parameters. We chose a patch size so that the number of samples for estimating the cumulative distribution function was sufficient. Specifically, we required that the maximum absolute error between the estimated and true underlying CDFs be no more than 0.1 with probability at least 0.9. Using the Dvoretzky-Kiefer-Wolfowitz inequality, it follows that there should be at least 149 samples [177], [178]. Hence we chose a patch size of

12 × 12. However, we also performed an experiment with NL-Means where the KS-test

was performed across multiple scales from 12 to 60 in steps of 8 (for an image of size

256 × 256), and average h, P and K values were calculated. However, for the experiments described in the previous sections, we used just the single patch size of 12 × 12, as the multiscale measure did not produce significantly better results.
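The sample-size figure above follows from the DKW inequality P(sup_x |F̂(x) − F(x)| > ε) ≤ 2e^(−2nε²). A quick check: solving for n at ε = δ = 0.1 yields 150, in line with the ≈149 figure quoted above (the exact integer depends on rounding):

```python
import math

def dkw_min_samples(eps, delta):
    """Smallest n with 2*exp(-2*n*eps**2) <= delta: the DKW inequality then
    guarantees sup |ECDF - CDF| <= eps with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

n = dkw_min_samples(eps=0.1, delta=0.1)   # error <= 0.1 with probability >= 0.9
```

A 12 × 12 patch supplies 144 samples, close to this bound.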


8.4.4 Results on Total Variation

Results for total variation diffusion with Gaussian noise of variance 5 × 10⁻⁴ and 5 × 10⁻³ are shown in Tables 8-9 and 8-10. For this method, the KS-based measures

performed well in terms of errors in predicting the correct number of iterations and the

correct quality measures, but not as well as MI within the restricted stopping time range.

The results were also compared to those obtained from [174] which performed the best,

though we would like to remind the reader that method from [174] requires knowledge of

the full noise distribution. Also, in the case of total variation, the KS-based measures did

not outperform MI. An explanation for this is that the total variation method is unable to

produce homogeneous residuals for its optimal parameter set, as it is specifically tuned for piecewise-constant images. This assumption does not hold for commonly occurring natural images. In contrast, NL-Means is a filter expressly derived from

the assumption that patches in ‘clean’ natural images (and those with low or moderate

noise) have several similar patches in distant parts of the image.

8.5 Discussion and Avenues for Future Work

In this chapter, we have contributed to and critiqued independence-based criteria for

filter parameter selection and presented a criterion that measures the homogeneity of

the residual statistics. On the whole, we have contributed to the paradigm of exploiting

statistical properties of the residual images for driving the denoising algorithm. The

proposed noiseness measures require no other assumptions except that (1) the noise

should be i.i.d. and additive, and that (2) a loose lower bound on the noise variance

is known to prevent false minima with extreme undersmoothing. Unlike CC or MI, the

KS-based noiseness measures are guaranteed not to yield false minima in the case of

oversmoothing.

The KS-based noiseness criteria require averaging of the P or K values from

different patches. For future work, this can be replaced by performing k-sample versions

of the Kolmogorov-Smirnov or related tests such as the Cramér-von Mises test [179] between


individual patches versus a pooled sample containing the entire residual image. This will

produce a single K or P value for the whole image.

The assumption of i.i.d. noise may not hold in some denoising scenarios. In case

of a zero-mean Gaussian noise model with intensity dependent variances, a heuristic

approach would be to normalize the residuals suitably using feedback from the denoised

intensity values (regarding them as the ‘true’ image values) and then running the

KS-tests. The efficacy of this approach needs to be tested on denoising algorithms that

are capable of handling intensity dependent noise. In case the noise obeys a Poisson

distribution (which is neither fully additive nor multiplicative), there are two ways to

proceed: either apply a variance-stabilizing transformation [180], which converts the data into data corrupted by approximately Gaussian noise of unit variance, or else suitably change the definition of the residual itself.
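One standard instance of such a variance-stabilizing transformation for Poisson data is the Anscombe transform f(x) = 2√(x + 3/8), after which the noise is approximately Gaussian with variance one. A numerical sketch (the λ values below are arbitrary):

```python
import numpy as np

def anscombe(x):
    """Anscombe transform: maps Poisson(lam) data to approximately
    Gaussian data of unit variance (the approximation improves with lam)."""
    return 2.0 * np.sqrt(x + 3.0 / 8.0)

rng = np.random.default_rng(5)
# The variance of the transformed data stays close to 1 regardless of lam.
variances = [anscombe(rng.poisson(lam, size=100000)).var()
             for lam in (10.0, 50.0, 200.0)]
```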

Moreover, the existence of a universally optimal parameter selector is not yet

established: different criteria may perform better or worse for different denoising

algorithms or with different assumptions on the noise model. This is, as per our

survey of the literature, an unsolved problem in image processing. Lastly, despite

encouraging experimental results, there is no established theoretical relationship

between the performance of noiseness criteria for filter parameter selection and the

‘ideal’ parameters in terms of image quality criteria like MSE. A detailed study of

risk-based criteria such as those in [113] may be important in this context.


Figure 8-1. Plots of CC, MI, P and MSE on an image subjected to up to 16000 iterations of total variation denoising.

Table 8-1. (NL-Means) Gaussian noise σ²ₙ = 0.0001
Criterion   ∆L1     dσL1     ∆SSIM   dσSSIM
h           0.088   10.462   0.002    7.692
P           0.031    7.538   0.005   12.462
K           0.040    7.846   0.004   10.308
CC          0.085    9.846   0.010   17.846
MI          0.189   17.077   0.011   18.615
η1          0.189   17.077   0.011   18.615
η2          0.176   18       0.011   20.769
Local MI    0.055    8.769   0.007   16.154
hNM         0.851   31.462   0.009   16.385
PNM         0.087   13.231   0.002    4.923
KNM         0.215   19.385   0.001    5.538
Min         2.225   -        0.888   -
Max         9.147   -        0.986   -


Figure 8-2. Images with Gaussian noise with σ²ₙ = 5 × 10⁻³ denoised by NL-Means. Parameters selected by optimal noiseness measures: (a) P, (c) K, (e) CC, (g) MI; and by optimal quality measures: (i) L1, (k) SSIM, (m) MI between denoised image and residual. Corresponding residuals in (b), (d), (f), (h), (j), (l), (n). (Zoom into the PDF for a better view.)


Figure 8-3. Images with Gaussian noise with σ²ₙ = 5 × 10⁻³ denoised by NL-Means. Parameters selected by optimal noiseness measures: (a) P, (c) K, (e) CC, (g) MI; and by optimal quality measures: (i) L1, (k) SSIM, (m) MI between denoised image and residual. Corresponding residuals in (b), (d), (f), (h), (j), (l), (n). (Zoom into the PDF for a better view.)


Table 8-2. (NL-Means) Gaussian noise σ²ₙ = 0.0005
Criterion   ∆L1     dσL1     ∆SSIM   dσSSIM
h           0.029    7.692   0.003   12.769
P           0.010    4.615   0.006   17.385
K           0.014    5.385   0.004   15.077
CC          0.068    8.462   0.007   15.077
MI          0.087   14.615   0.009   19.692
η1          0.087   14.615   0.009   19.692
η2          0.232   19.923   0.015   26.538
Local MI    0.155   14.615   0.005   13.538
hn          0.436   28.154   0.003    8
Pn          0.047   10.769   0.003    9.692
Kn          0.128   16.154   0.001    5.846
Min         2.683   -        0.884   -
Max         9.383   -        0.981   -

Table 8-3. (NL-Means) Gaussian noise σ²ₙ = 0.001
Criterion   ∆L1     dσL1     ∆SSIM   dσSSIM
h           0.041    9.231   0.004   16.154
P           0.024    5.385   0.005   16.923
K           0.035    7.692   0.004   14.615
CC          0.151   16.154   0.004   12.308
MI          0.126   19.231   0.008   18.462
η1          0.126   19.231   0.008   18.462
η2          0.157   26.923   0.013   29.231
Local MI    0.191   20.308   0.003   10.308
hn          0.218   22.308   0.001    6.154
Pn          0.041   10       0.002   12.308
Kn          0.100   15.385   0.001    6.923
Min         3.069   -        0.879   -
Max         9.601   -        0.976   -


Table 8-4. (NL-Means) Gaussian noise σ²ₙ = 0.005
Criterion   ∆L1     dσL1      ∆SSIM   dσSSIM
h           0.206    33.846   0.003    24.615
P           0.207    33.846   0.003    24.615
K           0.488    43       0.005    33.769
CC          2.253    92.308   0.034    55.385
MI          2.677    79.538   0.054    67.231
η1          2.720    81.077   0.054    68.769
η2          2.119   105.231   0.053   105.231
Local MI    3.889   107.846   0.069    74
hn          1.337    38       0.032    38
Pn          1.335    33.538   0.033    36.615
Kn          1.336    33.538   0.034    42.769
Min         4.838   -         0.791   -
Max        10.695   -         0.955   -

Table 8-5. (NL-Means) Gaussian noise σ²ₙ = 0.01
Criterion   ∆L1      dσL1      ∆SSIM   dσSSIM
h           12.121   226.692   0.202   177.462
P           12.121   226.692   0.202   177.462
K           12.121   226.692   0.202   177.462
CC           8.886   207.154   0.149   157.923
MI          11.701   224.462   0.195   175.231
η1          11.701   224.462   0.195   175.231
η2           6.218   200       0.119   163.077
Local MI    12.121   226.692   0.202   177.462
hn           1.891    60.923   0.045    64
Pn           3.642    86.462   0.081   108
Kn           3.649    88       0.082   115.692
Min          6.285   -         0.735   -
Max         11.661   -         0.933   -


Table 8-6. (NL-Means) Gaussian noise σ²ₙ = 0.05
Criterion   ∆L1      dσL1       ∆SSIM   dσSSIM
h           14.704    906.154   0.183   643.846
P           11.290    838.462   0.140   576.154
K            9.597    805.385   0.118   543.077
CC          18.959   1020       0.249   757.692
MI          19.435   1026.154   0.253   763.846
η1          19.550   1027.692   0.255   765.385
η2          19.435   1026.154   0.253   763.846
Local MI    19.783   1030.769   0.258   768.462
hn          24.516   1028.462   0.305   796.923
Pn          26.721   1120.769   0.325   858.462
Kn          26.721   1120.769   0.325   858.462
Min         11.748   -          0.555   -
Max         17.478   -          0.806   -

Table 8-7. (NL-Means) Uniform noise, width = 0.001
Criterion   ∆L1     dσL1     ∆SSIM   dσSSIM
h           0.055    9.692   0.003   10.923
P           0.013    5.692   0.006   14.923
K           0.021    6.615   0.005   14.000
CC          0.087   11.077   0.009   17.231
MI          0.188   13.385   0.007   17.077
η1          0.188   13.385   0.007   17.077
η2          0.267   16.846   0.009   19.000
Local MI    0.244   17.077   0.011   21.692
hn          0.770   30.231   0.008   14.538
Pn          0.054   10.308   0.003    7.231
Kn          0.114   14.462   0.001    4.615
Min         2.339   -        0.887   -   1.195   -
Max         9.176   -        0.985   -   2.011   -


Table 8-8. (NL-Means) Uniform noise, width = 0.01
Criterion   ∆L1     dσL1     ∆SSIM   dσSSIM
h           0.034   10.769   0.005   18.462
P           0.027    9.231   0.005   20.000
K           0.042   12.308   0.005   16.923
CC          0.430   34.615   0.004   14.615
MI          0.137   16.923   0.007   27.692
η1          0.137   16.923   0.007   27.692
η2          0.125   21.538   0.011   35.385
Local MI    0.477   36.308   0.004   16.308
hn          0.025    9.231   0.004   20.000
Pn          0.020    9.231   0.006   26.154
Kn          0.025   10.769   0.006   27.692
Min         3.522   -        0.860   -   1.148   -
Max         9.835   -        0.970   -   1.834   -

Table 8-9. (TV) Gaussian noise σ²ₙ = 0.0005
Criterion             ∆L1     dtL1      ∆SSIM   dtSSIM
h                     0.558    53.462   0.006    56.538
P                     0.522    48.462   0.006    52.308
K                     0.513    46.538   0.006    50.385
CC                    3.487   365.000   0.088   371.923
MI                    0.103    20.769   0.001    23.077
η1                    0.103    20.769   0.001    23.077
η2                    2.478   267.692   0.062   274.615
Local MI              0.479    36.923   0.005    32.308
hn                    0.538    69.615   0.007    76.538
Pn                    0.523    68.846   0.007    75.769
Kn                    0.528    69.615   0.007    76.538
Gilboa et al. [174]   0.050    10.231   0.001    16.385
Min                   2.622   -         0.975   -
Max                   4.426   -         0.995   -


Table 8-10. (TV) Gaussian noise σ²ₙ = 0.005
Criterion             ∆L1     dtL1      ∆SSIM   dtSSIM
h                     0.665   129.615   0.008   102.692
P                     0.493   109.615   0.006    80.385
K                     0.430   102.308   0.006    73.077
CC                    2.156   350.769   0.073   376.923
MI                    0.422    88.846   0.012   118.077
η1                    0.422    88.846   0.012   118.077
η2                    1.849   296.923   0.063   331.538
Local MI              5.475   270.769   0.084   240.769
hn                    0.194    59.615   0.008    96.538
Pn                    0.221    74.231   0.008   115.000
Kn                    0.216    75.769   0.008   116.538
Gilboa et al. [174]   0.094    60.000   0.002    42.000
Min                   4.995   -         0.892   -
Max                  11.284   -         0.980   -


CHAPTER 9
CONCLUSION AND FUTURE WORK

9.1 List of Contributions

We have presented contributions to two major problems fundamental to image

processing: probability density estimation and image denoising. The contributions to

probability density estimation are as follows:

1. Development of a new PDF estimator for images which accounts for the fact that the image is not just a bunch of samples, but a discrete version of an underlying continuous signal.

2. Extension of the above concept for joint PDFs of two or more images, defined on 2D or 3D domains.

3. Extension of the above concepts to develop three different biased density estimators that favor the higher gradient regions or points of a single image (in 2D/3D), a pair of images (in 2D/3D) or a triple of images (in 3D).

4. Application of all the above PDF estimators to affine image registration.

5. Application of all unbiased PDF estimators to filtering of grayscale and color images, chromaticity fields and grayscale video, in a mean-shift framework.

6. Development of density estimators for unit-vector data such as chromaticity and hue in color images by making explicit use of the fact that they are obtained as transformations of color measurements that can be assumed to lie in Euclidean space.

The contributions to image denoising are as follows:

1. We have developed a non-local image denoising algorithm (NL-SVD) after a series of experiments on the patch SVD. Our technique learns SVD bases for an ensemble of patches that are similar to a reference patch located at each pixel. These spatially adaptive bases are shown to produce excellent performance on image denoising, comparable to the state of the art.

2. Our method has parameters which are obtained in a principled manner from the noise model. The method is thus elegant and efficient, as it does not need any complicated optimization procedure.

3. We have extended the NL-SVD technique to perform joint filtering of image patches, leading to the HOSVD-based filtering technique that yields even better image quality values.


4. We have also presented a new statistical criterion for automated filter parameter selection and used it to obtain the smoothing parameter in the NL-Means algorithm without reference to the true image.

9.2 Future Work

Future work on the probability density estimator has been outlined in Section 3.4.

Here, we leave behind pointers to possible future extensions of our work in image

denoising.

9.2.1 Trying to Reach the Oracle

The ultimate aim of several of the procedures reported in Chapter 7 was to obtain

the SVD bases of the underlying patch. The bases obtained by NL-SVD and HOSVD

yield excellent performance but are still far behind the oracle denoiser. Is it possible

to obtain the true bases or bases that are very close to the true bases? Are there

other bases that would yield equivalent performance? These questions remain open

problems.

9.2.2 Blind and Non-blind Denoising

In many contemporary denoising algorithms [2], [146], [134], one assumes

knowledge of the true noise variance as this allows principled selection of various

parameters. However, the noise variance is often not known in practice; this is called a 'blind denoising' scenario. In such cases, one can use knowledge about the sensor device to get an idea of the noise variance. However, environmental factors too can

affect image quality, and in such cases, one cannot merely use sensor properties. In

practice, the noise variance can be estimated from the noisy data available. One of

the most commonly used techniques for noise variance estimation computes the Haar

wavelet transform of the image. The median absolute deviation of the HH sub-band (high-frequency components in both the x and y directions), suitably scaled, is considered to be a reasonable estimate of the noise standard deviation [181]. Three training-based methods are presented in

[182]: two which make use of a Laplacian prior for natural images, and another which

measures noise variance from the variance of homogeneous regions in a noisy image.


A statistical criterion for distinguishing between homogeneous regions and regions with

edges/oriented texture is presented in [183]. Development of a robust noise variance

estimator and using it in conjunction with the denoising method presented in this thesis,

is an interesting direction for future work. Furthermore, one can also side-step the

problem of noise variance estimation as follows: our denoising algorithm can be run

assuming several different values for the noise standard deviation σ. This affects the

critical parameters for transform-domain thresholding or measurement of similarity

between patches. After denoising, one can compute one of the noiseness measures

discussed in the previous chapters and select the σ value that produced the ‘noisiest’

residual.
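The wavelet-domain estimator mentioned above can be sketched with a one-level Haar transform and the Gaussian MAD-to-σ factor 0.6745. This is an illustrative implementation, not the exact method of [181]; the smooth test image and noise level are assumptions:

```python
import numpy as np

def estimate_noise_sigma(img):
    """Robust noise std. dev. estimate: median absolute deviation of the
    one-level Haar HH (diagonal detail) sub-band, scaled by 0.6745."""
    a = img[0::2, 0::2]
    b = img[0::2, 1::2]
    c = img[1::2, 0::2]
    d = img[1::2, 1::2]
    hh = (a - b - c + d) / 2.0      # orthonormal Haar HH: preserves noise variance
    return np.median(np.abs(hh)) / 0.6745

rng = np.random.default_rng(6)
clean = np.outer(np.linspace(0, 1, 256), np.linspace(0, 1, 256))  # smooth image
sigma_true = 0.05
noisy = clean + sigma_true * rng.normal(size=clean.shape)
sigma_hat = estimate_noise_sigma(noisy)
```

The median makes the estimate robust to the few large HH coefficients produced by genuine image structure, which is why the HH sub-band (dominated by noise in natural images) is used.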

9.2.3 Challenging Denoising Scenarios

Our denoising algorithm has been tested thoroughly on (i.i.d. and additive)

zero-mean Gaussian noise at different values of σ. Most contemporary algorithms

from the literature have also been tested only on Gaussian noise. This model is known

to hold true for thermal noise and also for film grain noise under some conditions [184].

However, there exist several other noise models such as the negative exponential model

which affects images acquired through synthetic aperture radar, Poisson noise which

is a valid model for images acquired with cameras having low shutter speed or under

poor illumination, or speckle noise in ultrasound [184]. The patch similarity measure,

the relative behavior of the true signal and the noise instances in the transform domain,

and the choice of norms or energy criteria to optimize for suitable denoising bases, are

all affected by the assumed noise model. In the case of distributions like Poisson which

are not really additive, characterization of the behavior of the noise instances in the

transform domain poses a difficult problem. To complicate matters further, the noise

affecting the image may be intensity dependent or drawn from noise distributions that

are spatially varying. In fact, the Poisson model is one such model; the noise induced by lossy

compression algorithms is another. All these problems present rich avenues for future


research. Ultimately, actual camera noise is the cumulative effect of several factors:

shutter speed, ambient illumination, stability of the camera taking the picture, motion

of the objects in the scene, the behavior of the electronic circuitry inside the camera,

and the lossy compression algorithm to store the images. A careful study of all these

factors and the interplay between them is an important open problem in practical image

processing.


APPENDIX A
DERIVATION OF MARGINAL DENSITY

In this section, we derive the expression for the marginal density of the intensity of a

single 2D image. We begin with Eq. (2–27) derived in Section 2.2.1:

p(α) = (1/A) ∫_{I(x,y)=α} |det [ ∂x/∂I  ∂y/∂I ; ∂x/∂u  ∂y/∂u ]| du.    (A–1)

Consider the following two expressions that appear while performing a change of

variables and applying the chain rule:

[dx dy] = [dI du] [ ∂x/∂I  ∂y/∂I ; ∂x/∂u  ∂y/∂u ].    (A–2)

[dI du] = [dx dy] [ ∂I/∂x  ∂u/∂x ; ∂I/∂y  ∂u/∂y ] = [dx dy] [ I_x  u_x ; I_y  u_y ].    (A–3)

Taking the inverse in the latter, we have

[dx dy] = [dI du] · (1/(I_x u_y − I_y u_x)) [ u_y  −u_x ; −I_y  I_x ].    (A–4)

Comparing the individual matrix coefficients, we obtain

det [ ∂x/∂I  ∂y/∂I ; ∂x/∂u  ∂y/∂u ] = (I_x u_y − u_x I_y)/(I_x u_y − I_y u_x)² = 1/(I_x u_y − I_y u_x).    (A–5)

Now, the unit vector u is clearly perpendicular to the gradient ∇I, i.e. we have the following:

uy =Ix√I 2x + I

2y

, and (A–6)

ux =−Iy√I 2x + I

2y

. (A–7)

224

This finally gives us

∣∣∣∣∣∣∣∂x∂I

∂y∂I

∂x∂u

∂y∂u

∣∣∣∣∣∣∣ =1√I 2x + I

2y

. (A–8)

Hence we arrive at the following expression for the marginal density:

p(α) =1

A

∫I (x ,y)=α

du√I 2x + I

2y

. (A–9)

This is the same expression as in Eq. (2–28).
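Expression (A–9) lends itself to a quick numerical sanity check via the coarea formula: the image area whose intensity falls in a thin band around α, divided by the band width, approximates the level-curve integral. A minimal sketch, assuming an illustrative synthetic test image \(I(x,y)=x\) on the unit square, for which \(|\nabla I| = 1\), every level curve has unit length, \(A = 1\), and hence \(p(\alpha) = 1\):

```python
import numpy as np

# Synthetic test image (an illustrative assumption): I(x, y) = x on the
# unit square, sampled on an n x n grid.  Here |grad I| = 1 everywhere,
# every level curve has length 1, and the total area A = 1, so the
# marginal density is p(alpha) = 1 for alpha in [0, 1].
n = 1000
x = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, x)
I = X

# Coarea-based estimate of p(alpha): the fraction of image area whose
# intensity lies in the thin band [alpha - eps, alpha + eps], divided
# by the band width 2*eps, approximates the level-set integral (A-9).
alpha, eps = 0.5, 0.05
band = np.abs(I - alpha) < eps
p_est = band.mean() / (2 * eps)   # approaches p(alpha) = 1 as eps -> 0
```

The same band-counting estimate applied to a general image approximates (A–9) without explicitly tracing the isocontours.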


APPENDIX B
THEOREM ON THE PRODUCT OF A CHAIN OF STOCHASTIC MATRICES

The specific theorem from [133] on the product of a chain of stochastic matrices is

reproduced here (verbatim) for completeness:

Let \(\Omega\) be an arbitrary set and let, for each \(\omega \in \Omega\),

\[
P^{\omega} = \begin{bmatrix} p^{\omega}_{11} & \ldots & p^{\omega}_{1N} \\ \vdots & \ddots & \vdots \\ p^{\omega}_{N1} & \ldots & p^{\omega}_{NN} \end{bmatrix} \tag{B–1}
\]

be a row-stochastic matrix, i.e. a matrix with \(\sum_j p^{\omega}_{ij} = 1\) and \(p^{\omega}_{ij} \geq 0\) for all \((i,j)\). Then

suppose all matrices \(P^{\omega}\) satisfy the condition that there exists a constant \(c > 0\) such that

\(\sum_j c^{\omega}_{j\min} \geq c\), where \(c^{\omega}_{j\min}\) denotes the minimum value of the elements in the \(j\)th column

of \(P^{\omega}\). Let \(\omega = \omega_1, \omega_2, \ldots\) be an arbitrary sequence of elements from \(\Omega\). Then the limit

\(M^{\omega} = \lim_{n\to\infty} P^{\omega_n} P^{\omega_{n-1}} \cdots P^{\omega_1}\) exists and is a matrix with identical rows given as

\[
M^{\omega} = \begin{bmatrix} \mu^{\omega}_1 & \ldots & \mu^{\omega}_N \\ \vdots & & \vdots \\ \mu^{\omega}_1 & \ldots & \mu^{\omega}_N \end{bmatrix}. \tag{B–2}
\]

Moreover, if \(M^{\omega}_n = P^{\omega_n} P^{\omega_{n-1}} \cdots P^{\omega_1}\) denotes the \(n\)-term partial product, then for any \(i\),

\[
\frac{1}{2} \sum_{j=1}^{N} \left| M^{\omega}_n(i,j) - \mu^{\omega}_j \right| \leq (1-c)^n, \quad n \geq 0, \tag{B–3}
\]

for some probability vector \((\mu^{\omega}_1, \mu^{\omega}_2, \ldots, \mu^{\omega}_N)\). The convergence rate is thus upper

bounded by \((1-c)^n\).
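The contraction asserted by the theorem is easy to observe numerically. The sketch below is illustrative only: it assumes a small set Ω of two arbitrarily chosen positive row-stochastic 4 × 4 matrices, forms a long left-product over a random sequence from Ω, and checks that the rows of the product coincide, as guaranteed by (B–3):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_row_stochastic(N):
    # Strictly positive entries guarantee positive column minima, and
    # hence a constant c > 0 as required by the theorem's hypothesis.
    P = rng.uniform(0.1, 1.0, size=(N, N))
    return P / P.sum(axis=1, keepdims=True)  # normalize rows to sum to 1

N = 4
omega = [random_row_stochastic(N) for _ in range(2)]  # the set Omega
c = min(P.min(axis=0).sum() for P in omega)           # sum of column minima

# Left-product P^{omega_n} P^{omega_{n-1}} ... P^{omega_1} over an
# arbitrary (here random) sequence of matrices drawn from Omega.
M = np.eye(N)
for _ in range(200):
    M = omega[rng.integers(len(omega))] @ M

# After n = 200 steps the rows agree up to roughly (1 - c)^n.
row_spread = np.abs(M - M[0]).max()
```

Each row of the final product is (numerically) the same probability vector, matching the limit matrix in (B–2).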


REFERENCES

[1] Tina Is No Acronym (TINA) Image Database, Available from http://www.tina-vision.net/ilib.php, 2008, University of Manchester and University of Sheffield, UK.

[2] A. Buades, B. Coll, and J.-M. Morel, “A review of image denoising algorithms, with a new one,” Multiscale Modeling and Simulation, vol. 4, no. 2, pp. 490–530, 2005.

[3] B. Silverman, Density Estimation for Statistics and Data Analysis. London, UK:Chapman and Hall, 1986.

[4] J. Simonoff, Smoothing Methods in Statistics. Berlin, Germany: Springer Verlag, 1996.

[5] C. Bishop, Pattern Recognition and Machine Learning. Springer Verlag, 2006.

[6] D. Herrick, G. Nason, and B. Silverman, “Some new methods for wavelet density estimation,” Sankhya, vol. 63, pp. 394–411, 2001.

[7] A. Peter and A. Rangarajan, “Maximum likelihood wavelet density estimation withapplications to image and shape matching,” IEEE Trans. Image Process., vol. 17,no. 4, pp. 458–468, April 2008.

[8] D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard, “Density estimation bywavelet thresholding,” Ann. Stat., vol. 24, pp. 508–539, 1996.

[9] A. Rajwade, A. Banerjee, and A. Rangarajan, “New method of probability densityestimation with application to mutual information based image registration,”in IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, 2006, pp.1769–1776.

[10] ——, “Continuous image representations avoid the histogram binning problem in mutual information based image registration,” in IEEE Int. Symp. Biomedical Imaging, 2006, pp. 840–843.

[11] ——, “Probability density estimation using isocontours and isosurfaces:applications to information-theoretic image registration,” IEEE Trans. PatternAnal. Mach. Intell., vol. 31, no. 3, pp. 475–491, 2009.

[12] T. Kadir and M. Brady, “Estimating statistics in arbitrary regions of interest,” inBritish Mach. Vision Conf., 2005, pp. 589–598.

[13] N. Joshi and M. Brady, “Nonparametric mixture model based evolution of levelsets,” in Int. Conf. Computing: Theory and Applications (ICCTA), 2007, pp.618–622.

[14] E. Hadjidemetriou, M. Grossberg, and S. Nayar, “Histogram preserving imagetransformations,” Int. J. Comput. Vis., vol. 45, no. 1, pp. 5–23, 2001.


[15] J. Boes and C. Meyer, “Multi-variate mutual information for registration,” in Med.Image Comput. Computer-Assisted Intervention, ser. LNCS, vol. 1679. Springer,1999, pp. 606–612.

[16] J. Zhang and A. Rangarajan, “Multimodality image registration using an extensibleinformation metric,” in Inf. Process. Med. Img., ser. LNCS, vol. 3565. Springer,2005, pp. 725–737.

[17] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications. Berlin, Germany: Springer Verlag, 1997.

[18] L. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, pp. 259–268, 1992.

[19] D. L. Collins et al., “Design and construction of a realistic digital brain phantom,”IEEE Trans. Med. Imag., vol. 17, no. 3, pp. 463–468, 1998.

[20] J. Pluim, J. Maintz, and M. Viergever, “Mutual information based registration ofmedical images: A survey,” IEEE Trans. Med. Imag., vol. 22, no. 8, pp. 986–1004,2003.

[21] H. Chen, M. Arora, and P. Varshney, “Mutual information-based image registrationfor remote sensing data,” J. Remote Sensing, vol. 24, no. 18, pp. 3701–3706,2003.

[22] P. Viola and W. Wells, “Alignment by maximization of mutual information,” Int. J.Comput. Vis., vol. 24, no. 2, pp. 137–154, 1997.

[23] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens,“Multimodality image registration by maximization of mutual information,” IEEETrans. Med. Imag., vol. 16, no. 2, pp. 187–198, 1997.

[24] F. Maes, D. Vandermeulen, and P. Suetens, “Medical image registration usingmutual information,” Proc. IEEE, vol. 91, no. 10, pp. 1699–1722, 2003.

[25] M. Rao, Y. Chen, B. Vemuri, and F. Wang, “Cumulative residual entropy: A newmeasure of information,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1220–1228,2004.

[26] F. Wang and B. Vemuri, “Non-rigid multi-modal image registration usingcross-cumulative residual entropy,” Int. J. Comput. Vis., vol. 74, no. 2, pp.201–215, 2007.

[27] F. Wang, B. Vemuri, M. Rao, and Y. Chen, “Cumulative residual entropy, a newmeasure of information and its application to image alignment,” in IEEE Int. Conf.Computer Vision, 2003, pp. 548–553.


[28] J. Beirlant, E. Dudewicz, L. Gyorfi, and E. C. van der Meulen, “Nonparametricentropy estimation: An overview,” Int. J. Math. Stat. Sci., vol. 6, no. 1, pp. 17–39,June 1997.

[29] P. Viola, “Alignment by maximization of mutual information,” Ph.D. dissertation, Massachusetts Institute of Technology, 1995.

[30] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis, “Improved fast Gausstransform and efficient kernel density estimation,” in IEEE Int. Conf. ComputerVision, vol. 1, 2003, pp. 464–471.

[31] M. Leventon and W. Grimson, “Multi-modal volume registration using joint intensitydistributions,” in Med. Image Comput. Computer-Assisted Intervention, vol. 1496,1998, pp. 1057–1066.

[32] T. Downie and B. Silverman, “A wavelet mixture approach to the estimation ofimage deformation functions,” Sankhya Series B, vol. 63, pp. 181–198, 2001.

[33] B. Ma, A. Hero, J. Gorman, and O. Michel, “Image registration with minimumspanning tree algorithm,” in IEEE Int. Conf. Image Process., vol. 1, 2000, pp.481–484.

[34] J. Costa and A. Hero, “Entropic graphs for manifold learning,” in IEEE AsilomarConf. Sign., Sys. and Comp., vol. 1, 2003, pp. 316–320.

[35] M. Sabuncu and P. Ramadge, “Gradient based optimization of an EMST imageregistration function,” in IEEE Int. Conf. Acoust., Speech, Sig. Proc., vol. 2, 2005,pp. 253–256.

[36] N. Dowson, R. Bowden, and T. Kadir, “Image template matching using mutualinformation and NP-Windows,” in Int. Conf. Pattern Recognition, vol. 2, 2006, pp.1186–1191.

[37] B. Karacali, “Information theoretic deformable registration using local imageinformation,” Int. J. Comput. Vis., vol. 72, no. 3, pp. 219–237, 2007.

[38] P. Thevenaz and M. Unser, “Optimization of mutual information for multiresolutionimage registration,” IEEE Trans. Image Process., vol. 9, no. 12, pp. 2083–2099,2000.

[39] T. Cover and J. Thomas, Elements of Information Theory. New York, USA: WileyInterscience, 1991.

[40] J. Zhang and A. Rangarajan, “Affine image registration using a new informationmetric,” in IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2004, pp.848–855.

[41] W. Feller, “On the Kolmogorov-Smirnov limit theorems for empirical distributions,”The Annals of Mathematical Statistics, vol. 19, no. 2, pp. 177–189, 1948.


[42] R. Shekhar and V. Zagrodsky, “Mutual information-based rigid and nonrigidregistration of ultrasound volumes,” IEEE Trans. Med. Imag., vol. 21, no. 1, pp.9–22, 2002.

[43] L. Devroye, L. Gyorfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Berlin, Germany: Springer Verlag, 1996.

[44] P. Perona and J. Malik, “Scale-space and edge detection using anisotropicdiffusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629–639,1990.

[45] D. Tschumperle and R. Deriche, “Vector-valued image regularization with PDEs :A common framework for different applications,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 27, no. 4, pp. 506–517, 2005.

[46] B. Tang and G. Sapiro, “Color image enhancement via chromaticity diffusion,”IEEE Trans. Image Process., vol. 10, pp. 701–707, 1999.

[47] P. Saint-Marc, J. Chen, and G. Medioni, “Adaptive smoothing: a general tool forearly vision,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 6, pp. 514–520,1991.

[48] K. Plataniotis and A. Venetsanopoulos, Color image processing and applications.New York, USA: Springer Verlag, 2000.

[49] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in IEEEInt. Conf. Computer Vision, 1998, pp. 839–846.

[50] Y. Cheng, “Mean shift, mode seeking and clustering,” IEEE Trans. Pattern Anal.Mach. Intell., vol. 17, no. 8, pp. 790–799, 1995.

[51] D. Comaniciu and P. Meer, “Mean shift: a robust approach toward feature spaceanalysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619,2002.

[52] A. Rajwade, A. Banerjee, and A. Rangarajan, “Image filtering driven by levelcurves,” in Int. Conf. Energy Min. Methods Computer Vision Pattern Recognition,2009, pp. 359–372.

[53] T. Chan and J. Shen, Image Processing and Analysis: Variational, PDE, wavelets,and stochastic methods. SIAM, 2005.

[54] D. Barash and D. Comaniciu, “A common framework for nonlinear diffusion,adaptive smoothing, bilateral filtering and mean shift,” Image Vis. Comput., vol. 22,pp. 73–81, 2004.

[55] A. Buades, B. Coll, and J.-M. Morel, “Neighborhood filters and PDEs,” NumerischeMathematik, vol. 105, no. 1, pp. 1–34, 2006.


[56] R. Subbarao and P. Meer, “Discontinuity preserving filtering over analyticmanifolds,” in IEEE Conf. Computer Vision and Pattern Recognition, 2007, pp.1–6.

[57] J. van de Weijer and R. van den Bloomgard, “Local mode filtering,” in IEEE Conf.Computer Vision and Pattern Recognition, vol. 2, 2001, pp. 428–436.

[58] N. Sochen, R. Kimmel, and R. Malladi, “A general framework for low level vision,”IEEE Trans. Image Process., vol. 7, pp. 310–318, 1998.

[59] D. Comaniciu, “An algorithm for data-driven bandwidth selection,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 25, pp. 281–288, 2003.

[60] D. Comaniciu, V. Ramesh, and P. Meer, “The variable bandwidth mean shiftand data-driven scale selection,” in IEEE Int. Conf. Computer Vision, 2001, pp.438–445.

[61] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmentednatural images and its application to evaluating segmentation algorithms andmeasuring ecological statistics,” in IEEE Int. Conf. Computer Vision, vol. 2, 2001,pp. 416–423.

[62] S. Lansel, “About DenoiseLab,” Available from http://www.stanford.edu/~slansel/DenoiseLab/documentation.htm, 2006.

[63] Z. Wang, E. Simoncelli, and A. Bovik, “Multi-scale structural similarity for imagequality assessment,” in IEEE Asilomar Conf. Signals, Sys. Comp., 2003, pp.1398–1402.

[64] O. Subakan, J. Bing, B. Vemuri, and E. Vallejos, “Feature preserving imagesmoothing using a continuous mixture of tensors,” in IEEE Int. Conf. ComputerVision, 2007, pp. 1–6.

[65] H. Takeda, S. Farsiu, and P. Milanfar, “Kernel regression for image processing andreconstruction,” IEEE Trans. Image Process., vol. 16, no. 2, pp. 349–366, 2007.

[66] K. Mardia and P. Jupp, Directional Statistics. Chichester, UK: Wiley Interscience,2000.

[67] P. Kim and J. Koo, “Directional mixture models and optimal estimation of themixing density,” Can. J. Stat., pp. 383–398, 1998.

[68] A. Banerjee, I. Dhillon, J. Ghosh, and S. Sra, “Clustering on the unit hypersphereusing von Mises-Fisher distributions,” J. Mach. Learning Res., vol. 6, pp.1345–1382, 2005.

[69] T. McGraw, B. Vemuri, R. Yezierski, and T. Mareci, “Von Mises-Fisher mixturemodel of the diffusion ODF,” in IEEE Int. Symp. Biomedical Imaging, 2006, pp.65–68.


[70] A. Prati, S. Calderara, and R. Cucchiara, “Using circular statistics for trajectoryshape analysis,” in IEEE Conf. Computer Vision and Pattern Recognition, June2008, pp. 1–6.

[71] K. Hara, K. Nishino, and K. Ikeuchi, “Multiple light sources and reflectanceproperty estimation based on a mixture of spherical distributions,” IEEE Int. Conf.Computer Vision, vol. 2, pp. 1627–1634, Oct. 2005.

[72] C. Han, B. Sun, R. Ramamoorthi, and E. Grinspun, “Frequency domain normalmap filtering,” ACM Trans. Graphics, vol. 26, no. 3, pp. 28–37, 2007.

[73] O. Eugeciouglu and A. Srinivasan, “Efficient nonparametric density estimationon the sphere with applications in fluid mechanics,” SIAM Journal on ScientificComputing, vol. 22, no. 1, pp. 152–176, 2000.

[74] A. Papoulis, Probability, Random Variables and Stochastic Processes. McGrawHill, 1984.

[75] H. Schaeben, “Normal orientation distributions,” Textures and Microstructures,vol. 19, pp. 197–202, 1992.

[76] A. Bijral, M. Breitenbach, and G. Grudic, “Mixture of Watson distributions: Agenerative model for hyperspherical embeddings,” in AI and Statistics, 2007, pp.1–8.

[77] T. Downs and A. L. Gould, “Some relationships between the normal and vonMises distributions,” Biometrika, vol. 54, no. 3, pp. 684–687, 1967.

[78] B. Presnell, S. Morrison, and R. Littell, “Projected multivariate linear models fordirectional data,” J. Am. Stat. Assoc., vol. 93, no. 443, pp. 1068–1077, 1998.

[79] T. Gevers and H. Stokman, “Robust histogram construction from color invariantsfor object recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp.113–118, 2003.

[80] B. Pelletier, “Kernel density estimation on Riemannian manifolds,” Stat. Prob.Letters, vol. 73, pp. 297–304, 2005.

[81] D. Donoho and T. Weissman, “Recent trends in denoising tutorial, ISIT 2007,” Available from http://www.stanford.edu/~slansel/tutorial/summary.htm, 2007.

[82] B. M. ter Haar Romeny, Geometry-driven diffusion in computer vision. Utrecht,Netherlands: Kluwer, 1994.

[83] J. Weickert, Anisotropic Diffusion in Image Processing. Stuttgart, Germany:Teubner, 1998.

[84] M. Black, G. Sapiro, D. Marimont, and D. Heeger, “Robust anisotropic diffusion,”IEEE Trans. Image Process., vol. 7, no. 3, pp. 421–432, 1998.


[85] F. Catte, P. Lions, J. Morel, and T. Coll, “Image selective smoothing and edgedetection by nonlinear diffusion,” SIAM J. Numer. Anal., vol. 29, no. 1, pp.182–193, 1992.

[86] L. Rudin and S. Osher, “Total variation based image restoration with free local constraints,” in IEEE Int. Conf. Image Process., 1994, pp. 31–35.

[87] T. Le, R. Chartrand, and T. Asaki, “A variational approach to reconstructingimages corrupted by Poisson noise,” J. Math. Imag. Vis., vol. 27, pp. 257–263,2007.

[88] G. Gilboa, N. Sochen, and Y. Zeevi, “Image enhancement and denoising bycomplex diffusion processes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26,no. 8, pp. 1020–1036, 2004.

[89] D. Seo and B. Vemuri, “Complex diffusion on scalar and vector valued imagegraphs,” in Int. Conf. Energy Min. Methods Comput. Vision Pattern Recognition,2009, pp. 98–111.

[90] Y. You and M. Kaveh, “Fourth order partial differential equations for noise removal,”IEEE Trans. Image Process., vol. 9, no. 10, pp. 1723–1730, 2000.

[91] M. Hajiaboli, “An anisotropic fourth-order partial differential equation for noiseremoval,” in Scale Space and Variational Methods in Computer Vision, 2009, pp.356–367.

[92] P. Mrazek, “Monotonicity enhancing nonlinear diffusion,” J. Visual Commun. ImageRepresentation, vol. 13, no. 1, pp. 313–323, 2000.

[93] A. Savitzky and M. Golay, “Smoothing and differentiation of data by simplified leastsquares procedures,” Anal. Chem., vol. 36, no. 8, pp. 1627–1639, 1964.

[94] W. Press, A. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes in C(2nd ed.): the art of scientific computing. New York, NY, USA: CambridgeUniversity Press, 1992.

[95] J. Fan and I. Gijbels, Local polynomial modeling and its application. London, UK: Chapman and Hall, 1996.

[96] S. M. Smith and J. M. Brady, “SUSAN - a new approach to low level imageprocessing,” Int. J. Comput. Vis., vol. 23, pp. 45–78, 1995.

[97] V. Katkovnik, A. Foi, K. Egiazarian, and J. Astola, “Directional varying scaleapproximations for anisotropic signal processing,” in Eur. Signal Process. Conf.,2004, pp. 1–6.

[98] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a densityfunction, with applications in pattern recognition,” IEEE Trans. Inf. Theory, vol. 21,no. 1, pp. 32–40, 1975.


[99] M. Elad, “On the origin of the bilateral filter and ways to improve it,” IEEE Trans.Image Process., vol. 11, no. 10, pp. 1141–1151, 2002.

[100] N. Sochen, R. Kimmel, and A. Bruckstein, “Diffusions and confusions in signal andimage processing,” J. Math. Imag. Vis., vol. 14, pp. 195–209, 2001.

[101] O. Subakan, “Continuous mixture models for feature preserving smoothing andsegmentation,” Ph.D. dissertation, University of Florida, 2009.

[102] B. Jian, B. Vemuri, E. Ozarslan, P. Carney, and T. Mareci, “A novel tensordistribution model for the diffusion-weighted MR signal,” Neuroimage, vol. 37,no. 1, pp. 164–176, 2007.

[103] D. Tschumperle, “Fast anisotropic smoothing of multi-valued images usingcurvature-preserving PDEs,” Int. J. Comput. Vis., vol. 68, no. 1, pp. 65–82, 2006.

[104] R. Frankot and R. Chellappa, “A method for enforcing integrability in shape fromshading algorithms,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 10, no. 4, pp.439–451, 1988.

[105] A. Agrawal and R. Raskar, “Short course (ICCV 2007): Gradient domain manipulation techniques in vision and graphics,” http://www.umiacs.umd.edu/~aagrawal/ICCV2007Course/index.html, 2007.

[106] H. Wang, Y. Chen, T. Fang, J. Tyan, and N. Ahuja, “Gradient adaptive imagerestoration and enhancement,” in Int. Conf. Image Proc., 2006, pp. 2893–2896.

[107] P. Hancock, R. Baddeley, and L. Smith, “The principal components of naturalimages,” Network: Computation in Neural Systems, vol. 3, pp. 61–72, 1992.

[108] R. Coifman and D. Donoho, “Translation-invariant denoising,” Yale University, Tech.Rep., 1995.

[109] L. Yaroslavsky, K. Egiazarian, and J. Astola, “Transform domain image restorationmethods: review, comparison and interpretation,” in SPIE Proceedings Series,Nonlinear Processing and Pattern Analysis, 2001, pp. 1–15.

[110] W. Freeman and E. Adelson, “The design and use of steerable filters,” IEEE Trans.Pattern Anal. Mach. Intell., vol. 13, no. 9, pp. 891–906, 1991.

[111] J.-L. Starck, E. Candes, and D. Donoho, “The curvelet transform for imagedenoising,” IEEE Trans. Image Process., vol. 11, no. 6, pp. 670–684, 2002.

[112] J. Fan and R. Li, “Variable selection via nonconcave penalized likelihood and itsoracle properties,” J. Am. Stat. Assoc., vol. 96, no. 456, pp. 1348–1360, December2001.

[113] D. Donoho and I. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,”Biometrika, vol. 81, pp. 425–455, 1993.


[114] J. Mairal, G. Sapiro, and M. Elad, “Learning multiscale sparse representations forimage and video restoration,” Multiscale Modeling and Simulation, vol. 7, no. 1, pp.214–241, 2008.

[115] L. Sendur and I. Selesnick, “Bivariate shrinkage functions for wavelet-baseddenoising exploiting interscale dependency,” IEEE Trans. Signal Process., vol. 50,no. 11, pp. 2744–2756, 2002.

[116] E. Simoncelli, “Bayesian denoising of visual images in the wavelet domain,” inLecture Notes in Statistics, vol. 141, 1999, pp. 291–308.

[117] J. Portilla, V. Strela, M. Wainwright, and E. Simoncelli, “Image denoising usingscale mixtures of Gaussians in the wavelet domain,” IEEE Trans. Image Process.,vol. 12, no. 11, pp. 1338–1351, 2003.

[118] A. Hyvarinen, P. Hoyer, and E. Oja, “Image denoising by sparse code shrinkage,”in Intelligent Signal Processing, 1999, pp. 1–6.

[119] J. Huang and D. Mumford, “Statistics of natural images and models,” IEEE Conf.Computer Vision and Pattern Recognition, vol. 1, pp. 1541–1548, 1999.

[120] Y. Hel-Or and D. Shaked, “A discriminative approach for wavelet denoising,” IEEETrans. Image Process., vol. 17, no. 4, pp. 443–457, 2008.

[121] A. Buades, B. Coll, and J.-M. Morel, “Nonlocal image and movie denoising,” Int. J.Comput. Vis., vol. 76, no. 2, pp. 123–139, 2008.

[122] T. Brox, O. Kleinschmidt, and D. Cremers, “Efficient nonlocal means for denoisingof textural patterns,” IEEE Trans. Image Process., vol. 17, no. 7, pp. 1083–1092,2008.

[123] M. Ghazel, G. Freeman, and E. Vrscay, “Fractal image denoising,” IEEE Trans.Image Process., vol. 12, no. 12, pp. 1560–1578, 2003.

[124] S. Awate and R. Whitaker, “Unsupervised, information-theoretic, adaptive imagefiltering for image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28,no. 3, pp. 364–376, 2006.

[125] K. Popat and R. Picard, “Cluster-based probability model and its application toimage and texture processing,” IEEE Trans. Image Process., vol. 6, no. 2, pp.268–284, 1997.

[126] A. Efros and T. Leung, “Texture synthesis by nonparametric sampling,” in IEEE Int.Conf. Computer Vision, 1999, pp. 1033–1038.

[127] J. D. Bonet, “Noise reduction through detection of signal redundancy,” MIT, AI Lab,Tech. Rep., 1997.


[128] D. Zhang and Z. Wang, “Restoration of impulse noise corrupted images usinglong-range correlation,” IEEE Signal Process. Letters, vol. 5, no. 1, pp. 4–6, 1998.

[129] ——, “Image information restoration based on long-range correlation,” IEEE Trans.Circuit Syst. Video Technol., vol. 12, no. 5, pp. 331–341, 2002.

[130] S. Kindermann, S. Osher, and P. Jones, “Deblurring and denoising of images bynonlocal functionals,” SIAM Interdisc. J., vol. 4, no. 4, pp. 1091–1115, 2005.

[131] M. Ebrahimi and E. Vrscay, “Self-similarity in imaging, 20 years after ‘fractalseverywhere’,” in Int. Workshop Local Non-Local Approx. Image Process., 2008,pp. 165–172.

[132] O. Kleinschmidt, T. Brox, and D. Cremers, “Nonlocal texture filtering with efficienttree structures and invariant patch similarity measures,” in Int. Workshop LocalNon-Local Approx. Image Process., 2008, pp. 1–8.

[133] O. Stenflo, “Perfect sampling from the limit of deterministic products of stochasticmatrices,” Electronic Commun. Prob., vol. 13, pp. 474–481, 2008.

[134] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse3-d transform-domain collaborative filtering,” IEEE Trans. Image Process., vol. 16,no. 8, pp. 2080–2095, 2007.

[135] K. Hirakawa and T. Parks, “Image denoising using total least squares,” IEEETrans. Image Process., vol. 15, no. 9, pp. 2730–2742, 2006.

[136] L. Dascal, M. Zibulevsky, and R. Kimmel, “Signal denoising by constraining theresidual to be statistically noise-similar,” Technion, Israel, Tech. Rep., 2008.

[137] E. Tadmor, S. Nezzar, and L. Vese, “A multiscale image representation usinghierarchical (BV,L2) decompositions,” Multiscale modelling and simulation, vol. 2,pp. 554–579, 2004.

[138] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, “An iterative regularizationmethod for total variation-based image restoration,” Multiscale modelling andsimulation, vol. 4, pp. 460–489, 2005.

[139] J. Polzehl and V. Spokoiny, “Image denoising: a pointwise adaptive approach,”Ann. Stat., vol. 31, no. 1, pp. 30–57, 2003.

[140] J. Kleinberg and E. Tardos, Algorithm Design. Boston, USA: Addison-WesleyLongman, 2005.

[141] D. Brunet, E. Vrscay, and Z. Wang, “The use of residuals in image denoising,” inInt. Conf. Image Anal. Recognition, 2009, pp. 1–12.


[142] Y. Chen, H. Wang, T. Fang, and J. Tyan, “Mutual information regularized bayesianframework for multiple image restoration,” in IEEE Int. Conf. Computer Vision,2005, pp. 190–197.

[143] B. Olshausen and D. Field, “Emergence of simple-cell receptive-field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.

[144] M. Lewicki, T. Sejnowski, and H. Hughes, “Learning overcompleterepresentations,” Neural Computation, vol. 12, pp. 337–365, 1998.

[145] M. Lewicki and B. Olshausen, “A probabilistic framework for the adaptation andcomparison of image codes,” J. Opt. Soc. Am., vol. 16, pp. 1587–1601, 1999.

[146] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparserepresentation,” in IEEE Conf. Computer Vision and Pattern Recognition, vol. 1,2006, pp. 17–22.

[147] M. Aharon, M. Elad, and A. Bruckstein, “The K-SVD: an algorithm for designing ofovercomplete dictionaries for sparse representation,” IEEE Trans. Signal Process.,vol. 54, no. 11, pp. 4311–4322, 2006.

[148] P. Chatterjee and P. Milanfar, “Clustering-based denoising with locally learneddictionaries,” IEEE Trans. Image Process., vol. 18, no. 7, pp. 1438–1451, 2009.

[149] J. Yang, J. Wright, T. Huang, and Y. Ma, “Image super-resolution as sparserepresentation of raw image patches,” in IEEE Int. Conf. Comp. Vis. Pattern Rec.,2008, pp. 1–8.

[150] Z. Wang and A. Bovik, “Mean squared error: Love it or leave it? A new look atsignal fidelity measures,” IEEE Signal Process. Mag., vol. 26, no. 1, pp. 98–117,2009.

[151] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment:From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13,no. 4, pp. 600–612, 2004.

[152] H. Andrews and C. Patterson, “Singular value decompositions and digital imageprocessing,” IEEE Trans. Acoust., Speech and Signal Process., vol. 24, no. 1, pp.425–432, 1976.

[153] ——, “Singular value decomposition (SVD) image coding,” IEEE Trans. Commun.,vol. 24, no. 4, pp. 425–432, 1976.

[154] L. Trefethen and D. Bau, Numerical Linear Algebra. SIAM: Society for Industrialand Applied Mathematics, 1997.

[155] A. Rangarajan, “Learning matrix space image representations,” in Int. Conf.Energy Min. Methods Computer Vision Pattern Recognition, 2001, pp. 153–168.


[156] D. Tschumperle and R. Deriche, “Orthonormal vector sets regularization withPDEs and applications,” Int. J. Comput. Vis., vol. 50, pp. 237–252, 2002.

[157] J. Ye, “Generalized low rank approximations of matrices,” Mach. Learning, vol. 61,no. 1, pp. 167–191, 2005.

[158] C. Ding and J. Ye, “Two-dimensional singular value decomposition (2DSVD) for 2Dmaps and images,” in SIAM Int. Conf. Data Mining, 2005, pp. 32–43.

[159] N. Kwak, “Principal component analysis based on L1-norm maximization,” IEEETrans. Pattern Anal. Mach. Intell., vol. 30, no. 9, pp. 1672–1680, 2008.

[160] G. Heo, P. Gader, and H. Frigui, “RKF-PCA: Robust kernel fuzzy PCA,” NeuralNetworks, vol. 22, no. 5-6, pp. 642–650, 2009.

[161] A. Efros and W. Freeman, “Image quilting for texture synthesis and transfer,” inSIGGRAPH: Annual Conf. Computer graphics and interactive techniques, 2001,pp. 341–346.

[162] A. Hyvarinen, J. Hurri, and P. Hoyer, Natural Image Statistics: A ProbabilisticApproach to Early Computational Vision. Springer, Heidelberg, 2009.

[163] A. Rosenfeld and A. Kak, Digital Picture Processing. Orlando, USA: AcademicPress, 1982.

[164] L. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychome-trika, vol. 31, no. 3, pp. 279–311, 1966.

[165] L. de Lathauwer, “Signal processing based on multilinear algebra,” Ph.D.dissertation, Katholieke Universiteit Leuven, Belgium, 1997.

[166] M. Vasilescu and D. Terzopoulos, “Multilinear analysis of image ensembles:Tensorfaces,” in Int. Conf. Pattern Recognition, 2002, pp. 511–514.

[167] K. Gurumoorthy, A. Rajwade, A. Banerjee, and A. Rangarajan, “Beyond SVD:Sparse projections onto exemplar orthonormal bases for compact imagerepresentation,” in Int. Conf. Pattern Recognition, 2008, pp. 1–4.

[168] ——, “A method for compact image representation using sparse matrix and tensorprojections onto exemplar orthonormal bases,” IEEE Trans. Image Process.,vol. 19, no. 2, pp. 322–334, 2010.

[169] D. Muresan and T. Parks, “Adaptive principal components and image denoising,”in IEEE Int. Conf. Image Process., 2003, pp. 101–104.

[170] A. Rajwade, A. Rangarajan, and A. Banerjee, “Automated filter parameterselection using measures of noiseness,” in Can. Conf. Comput. Robot Vision,2010, pp. 86–93.


[171] J. Weickert, “Coherence enhancing diffusion filtering,” Int. J. Comput. Vis., vol. 31,no. 3, pp. 111–127, 1999.

[172] P. Mrazek and M. Navara, “Selection of optimal stopping time for nonlineardiffusion filtering,” Int. J. Comput. Vision, vol. 52, no. 2-3, pp. 189–203, 2003.

[173] J.-F. Aujol, G. Gilboa, T. Chan, and S. Osher, “Structure-texture imagedecomposition–modeling, algorithms and parameter selection,” Int. J. Comput.Vis., vol. 67, no. 1, pp. 111–136, 2006.

[174] G. Gilboa, N. Sochen, and Y. Zeevi, “Estimation of optimal PDE-based denoisingin the SNR sense,” IEEE Trans. Image Process., vol. 15, no. 8, pp. 2269–2280,Aug. 2006.

[175] I. Vanhamel, C. Mihai, H. Sahli, A. Katartzis, and I. Pratikakis, “Scale selection forcompact scale-space representation of vector-valued images,” Int. J. Comput. Vis.,vol. 84, no. 2, pp. 194–204, 2009.

[176] D. Donoho and I. Johnstone, “Adapting to unknown smoothness via waveletshrinkage,” J. Am. Stat. Assoc., vol. 90, no. 432, pp. 1200–1224, 1995.

[177] A. Dvoretzky, J. Kiefer, and J. Wolfowitz, “Asymptotic minimax character of thesample distribution function and of the classical multinomial estimator,” Ann. Math.Stat., vol. 27, no. 3, pp. 642–669, 1956.

[178] Wikipedia, “Dvoretzky–Kiefer–Wolfowitz inequality,” Available from http://en.wikipedia.org/wiki/Dvoretzky-Kiefer-Wolfowitz_inequality, 2010.

[179] J. Kiefer, “K-sample analogues of the Kolmogorov-Smirnov and Cramer-VonMises tests,” Ann. Math. Stat., vol. 30, no. 2, pp. 420–447, 1959.

[180] F. Anscombe, “The transformation of Poisson, binomial and negative-binomialdata,” Biometrika, vol. 35, pp. 246–254, 1948.

[181] D. Donoho, “Denoising by soft thresholding,” IEEE Trans. Inf. Theory, vol. 41,no. 3, pp. 613–627, 1995.

[182] A. D. Stefano, P. White, and W. Collis, “Training methods for image noise levelestimation on wavelet components,” EURASIP J. Appl. Signal Process., vol. 2004,no. 16, pp. 2400–2407, 2004.

[183] X. Zhu and P. Milanfar, “A no-reference image content metric and its application todenoising,” in IEEE Int. Conf. Image Process., 2010, pp. 1–4.

[184] C. Boncelet, “Image noise models,” in Handbook of Image and Video Processing,A. Bovik, Ed. New York, USA: Academic Press, 2005, pp. 397–410.


BIOGRAPHICAL SKETCH

Ajit Rajwade was born and brought up in the city of Pune, India. He completed

his bachelor’s degree in computer engineering from the Government College of

Engineering, Pune (affiliated to the University of Pune) in 2001, his master’s degree

in computer science from McGill University, Montreal, Canada in 2004, and his doctoral

degree in computer engineering from the University of Florida in 2010. His research

interests are in computer vision and image processing, and computational geometry.
