
www.as-se.org/ssms Studies in Surveying and Mapping Science (SSMS) Volume 1 Issue 1, March 2013


Additive and Multiplicative Noise Removal Framework for Large Scale Color Satellite Images on OpenMP and GPUs

Banpot Dolwithayakul1, Chantana Chantrapornchai2, Noppadol Chumchob3

1,2 Department of Computing, Faculty of Science, Silpakorn University, Nakhon-Pathom, Thailand

3 Department of Mathematics, Faculty of Science, Silpakorn University, Nakhon-Pathom, and Centre of Excellence in Mathematics, CHE, Si Ayutthaya Rd., Bangkok, Thailand

*[email protected]; [email protected]; 3 [email protected]

Abstract

Satellite images are usually contaminated with multiplicative noise and some additive noise [1, 2]. Because the images are large, removing both types of noise in real time is time consuming. Many-core processors such as GPUs can reduce the denoising time. However, given the limited GPU memory and the cost of memory transfers, a proper design is required for denoising large images. In this paper, we introduce a novel method for removing both additive and multiplicative noise on multiple GPUs. The method extends [8] to large-image denoising. It considers the proper fitting of data to the GPU memory, memory utilization, and thread utilization on both the CPU and the GPUs. A speedup of up to 87.29 times over the sequential computation is achieved on a 4096×4096 color satellite image.

Keywords

Image Denoising; Satellite Image; GPU; Fixed-point Iterative Method; Parallel Computing; High Performance Computing

Introduction

In image processing, noise in images is usually categorized into two models: additive and multiplicative noise. The former is the additive Gaussian white noise usually found in images acquired via digital devices. This noise model has been investigated for a long time in previous research. There are a variety of algorithms for removing additive noise, for example, the nonlinear total variation method of Rudin, Osher and Fatemi [4]. The additive noise model is usually written as Equation (1)

$$z = u + \eta. \tag{1}$$

Here, $z$ is the corrupted image, $u$ is the original image and $\eta$ is the noise on the image.

Next, the so-called multiplicative noise (a.k.a. speckle noise) is found in the images obtained from synthetic aperture radar (SAR), ultrasound and sonar. The multiplicative noise is in the form of Equation (2)

$$z = u\eta. \tag{2}$$

In recent research, Hirakawa and Parks [9] and Lukin et al. [10] concluded that some images may not contain purely additive or purely multiplicative noise. The authors of [9] concluded that both noise models should be combined into a general case, as expressed by Equation (3)

$$z = u + (k_0 + k_1 u)\eta, \tag{3}$$

where $k_0$ and $k_1$ are parameters indicating the amounts of additive and multiplicative noise in the image.
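To make the combined noise model concrete, the following minimal sketch corrupts a few pixel values, assuming the general model has the form z = u + (k0 + k1*u)*eta with zero-mean Gaussian eta. The function name `add_mixed_noise`, the parameter names and the choice of a Gaussian law are illustrative assumptions, not part of the original method.

```python
import random


def add_mixed_noise(u, k0, k1, sigma=1.0, seed=0):
    """Return z = u + (k0 + k1 * u) * eta for a flat list of pixel values u.

    k0 scales the signal-independent (additive) part and k1 the
    signal-dependent (multiplicative) part. eta is drawn from a zero-mean
    Gaussian, which is an assumption made for this illustration.
    """
    rng = random.Random(seed)
    z = []
    for pixel in u:
        eta = rng.gauss(0.0, sigma)
        z.append(pixel + (k0 + k1 * pixel) * eta)
    return z


clean = [10.0, 50.0, 100.0, 200.0]
purely_additive = add_mixed_noise(clean, k0=5.0, k1=0.0)   # perturbation independent of u
signal_dependent = add_mixed_noise(clean, k0=0.0, k1=0.1)  # perturbation grows with u
```

With the same seed, both calls draw the same noise samples, so the two outputs differ only in how strongly each sample is scaled by the pixel value.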

The novel and robust algorithm proposed by N. Chumchob, K. Chen and C. Brito-Loeza [3] removes both types of noise efficiently by combining two techniques: the ROF model [4] and the JY model [5].

In this paper, we use this method as the main technique for removing both types of noise from satellite images. The main challenge in this work is to remove the noise in real time, since the considered satellite images are quite large. We take advantage of many-core technology and design an efficient parallel denoising method for such images.

In general, denoising satellite images is a time-consuming process. The GPU may be used to reduce the overall computation time. However, it is well known that memory transfers between a host and GPU devices cost many cycles. Moreover, satellite images are usually too large to fit as a whole in the available memory of a GPU. Thus, proper computation and memory management strategies are required to overcome these problems.

In this paper, we propose a new parallel denoising algorithm for large color satellite images. The algorithm distributes the work to GPU threads and CPU threads as soon as they become available. It also deallocates each finished, denoised color channel so that its memory space can be reused for new data arriving from the CPU.

The rest of this paper is organized as follows. The Backgrounds section explains the algorithm used for removing both additive and multiplicative noise. The Proposed Strategy section presents our strategy. The next section reports our experimental results, and the last section gives the conclusions, discussion and future work.

Backgrounds

Noise Removal Algorithm

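In order to remove both additive and multiplicative noise from color images, we assume that the channels (red, green, blue) are independent and that each one follows the variational model

$$\arg\min_{u_1,u_2,u_3}\left\{ J_{\alpha_1,\alpha_2}(u_1,u_2,u_3) = \sum_{l=1}^{3}\int_\Omega |\nabla u_l|\,d\Omega + \frac{\alpha_1}{2}\sum_{l=1}^{3}\int_\Omega (u_l - z_l)^2\,d\Omega + \alpha_2\sum_{l=1}^{3}\int_\Omega \left(u_l + z_l e^{-u_l}\right)d\Omega \right\}. \tag{4}$$

Here $u_l : \Omega \subset \mathbb{R}^2 \to [0,255]$, $\alpha_1 > 0$ and $\alpha_2 > 0$ are the regularization parameters fitted for additive and multiplicative noise removal respectively, and $\Omega = [0,a]\times[0,b]$ is the image domain. By using the Euler-Lagrange equations, the variational model is written as Equation (5)

$$-\nabla\cdot\big(D(u_l)\nabla u_l\big) + \alpha_1(u_l - z_l) + \alpha_2\big(1 - z_l e^{-u_l}\big) = 0 \ \text{in } \Omega, \qquad \frac{\partial u_l}{\partial \boldsymbol{n}} = 0 \ \text{on } \partial\Omega, \tag{5}$$

where $D(u_l) = 1/|\nabla u_l|_\beta$, $|\nabla u_l|_\beta = \sqrt{|\nabla u_l|^2 + \beta}$, and $\beta > 0$ is a small constant to avoid the singularity. Using the finite difference method, we discretize the domain into grid cells, where $h$ is the distance between grid points. Each cell has size $1 \times 1$ ($h_x = h_y = 1$). The discrete equation at the grid point $(i, j)$ of the discrete domain is written as follows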

(), + 1(), (), + 2 1 (),, () =

, , (6)

where

, = + , + , +

, ,

. (7)

There are several methods for solving Equation (6). For example, the time marching technique is a simple iterative technique that uses a synthetic time variable. However, this method has a very slow convergence rate and is not suitable for parallel computation because of the data dependency in each iteration; see [3] and the references therein.

Asynchronous Parallel Gauss-Seidel [8]

In this paper, we combine our previous work in [8] with the so-called local fixed-point method proposed in [3]. The method in [8] is state-based: it consists of 4 states that let each thread work independently. The states are described as follows:

Waiting State - A thread in this state keeps searching for its assigned job in the job table.

Working State - The thread computes the Gauss-Seidel update on its current cell and changes to the next state.

Validation State - The thread validates and waits for data dependencies to be resolved, to ensure the correctness of the algorithm, before updating the data on the current cell.

Shifting State - The thread decides whether to shift to the cell on its right or to move back to the waiting state.

This asynchronous approach outperforms the earlier Sliding Window Gauss-Seidel algorithm on multi-core processors [8]. However, it uses additional memory to store the job table and a 2-dimensional matrix holding the iteration number and the state of each thread.
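The four states can be read as a small per-thread state machine. The sketch below is one possible reading, not the implementation of [8]: the transition order (Working to Validation to Shifting) follows the order in which the states are listed, and the boolean inputs `job_found`, `dependencies_resolved` and `has_right_cell` are hypothetical names introduced only for this illustration.

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()
    WORKING = auto()
    VALIDATION = auto()
    SHIFTING = auto()


def next_state(state, job_found=False, dependencies_resolved=False,
               has_right_cell=False):
    """Return the state a thread moves to after one step."""
    if state is State.WAITING:
        # Keep polling the job table until a job is assigned.
        return State.WORKING if job_found else State.WAITING
    if state is State.WORKING:
        # The Gauss-Seidel update has been computed; it still needs validation.
        return State.VALIDATION
    if state is State.VALIDATION:
        # Stay here until the data dependencies are resolved, then commit.
        return State.SHIFTING if dependencies_resolved else State.VALIDATION
    # SHIFTING: continue with the cell on the right or return to the job table.
    return State.WORKING if has_right_cell else State.WAITING
```

Because each thread advances only on its own inputs, no global barrier appears in the transition function; synchronization is confined to the validation step.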



Compute Unified Device Architecture

Compute Unified Device Architecture (CUDA) is NVIDIA's architecture for Single Instruction Multiple Data (SIMD) computing. A graphics card with CUDA can be used as a general-purpose processor rather than for graphics processing only, which is called General-Purpose computing on Graphics Processing Units (GPGPU).

To program on CUDA, a developer has to specify the number of threads for the computation. Threads executed in a kernel must be organized into groups of threads with shared data, called thread blocks. A group of blocks forms a grid. Creating, organizing and destroying threads on the GPU consumes few resources. This allows developers to manage hundreds of threads quickly and effectively.

CUDA uses 4 levels of memory. The first level, which resides on the GPU, is called "global memory". Global memory access is the slowest for a GPU: hundreds of clock cycles are needed to access this kind of memory. The next level is called "shared memory", the fastest memory that a user can allocate and manage on a GPU device. Reading and writing through shared memory takes approximately 40 clock cycles. The other two levels are local and texture memory. Both have a large memory space and can be allocated by users, but they require more cycles than shared memory.

In this paper, we used CUDA architecture for our experiment. However, the usage of OpenCL which is hardware independent is also possible. Our implementation can be extended to the OpenCL framework in the future.

Proposed Strategy

To design the strategy effectively, we first measure the time for each step of the local fixed-point method. We divide the computation into 5 parts and, as an example, measure the sequential computation time used for denoising a 1024×1024 image. The results are shown in FIG. 1.

FIG. 1 APPROXIMATION OF COMPUTATION TIME ON EACH PART OF THE DENOISING ALGORITHM

From FIG. 1, we found that the most time-consuming part of each iteration is the nonlinear Gauss-Seidel step. Thus, we design our method to compute Gauss-Seidel in parallel on GPUs.

The strategy for denoising images on two GPUs is shown in FIG. 2.

FIG. 2 THE PROPOSED STRATEGY FOR DENOISING SATELLITE IMAGES ON TWO GPUS.

Here is a brief explanation of our strategy, which consists of 5 parts as follows:

1) Initialize

At the beginning of computation, the CPU will partially read the satellite image from disk.

2) Image Decomposition

The CPU divides the main image into chunks of the size specified by $n$. To denoise the satellite image seamlessly, each chunk also needs the neighboring boundary pixels in up to four directions. Thus, the chunk size is $(n+1)\times(n+1)$ for the chunks at the four corners, as in FIG. 3(a), $(n+1)\times(n+2)$ or $(n+2)\times(n+1)$ for the other chunks on the image boundary, and $(n+2)\times(n+2)$ for chunks located elsewhere, as in FIG. 3(b).



3) CPU to GPU Data transfer

G threads on the CPU work in parallel via OpenMP to send data to the GPUs, where G is the number of graphics cards. The CPU transfers the fetched chunks to both GPUs at the same time.

4) Denoise Process

Each CPU thread invokes a CUDA kernel. The denoising process is done on the GPUs by using the asynchronous Gauss-Seidel technique proposed in [8] for the local fixed-point technique. After a thread finishes denoising a chunk, it transfers the chunk back to the main memory and deallocates the finished chunk on the GPU to free space in the global memory.

5) Finalize

The CPU will combine all denoised chunks from the GPUs to create a new large denoised image and save it to the disk afterward.
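The dispatch pattern of the five steps above, one host thread per device pulling chunks as soon as it is free, can be sketched as a small work-queue program. This is only an illustration: Python threads stand in for the OpenMP threads, and `fake_denoise` is a hypothetical placeholder for the transfer to the GPU, the CUDA kernel launch, the transfer back and the release of device memory.

```python
import queue
import threading


def fake_denoise(chunk):
    """Hypothetical stand-in for the GPU denoising of one chunk."""
    return [value * 0.5 for value in chunk]


def run_workers(chunks, num_devices=2):
    """Process all chunks with one host thread per device and return the
    results in the original chunk order."""
    jobs = queue.Queue()
    for index, chunk in enumerate(chunks):
        jobs.put((index, chunk))
    results = [None] * len(chunks)

    def worker(device_id):
        # device_id only identifies which device this host thread would drive.
        while True:
            try:
                index, chunk = jobs.get_nowait()
            except queue.Empty:
                return  # no chunks left for this device
            # Each index is written by exactly one thread, so no lock is needed.
            results[index] = fake_denoise(chunk)

    threads = [threading.Thread(target=worker, args=(device,))
               for device in range(num_devices)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


denoised = run_workers([[2.0, 4.0], [6.0], [8.0, 10.0]])
```

Storing each result at the index of its chunk lets the final step reassemble the image regardless of which device finished first.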

The satellite image decompositions are illustrated in FIG. 3.

FIG. 3 SATELLITE IMAGE DECOMPOSITION FOR FIRST CHUNK (A) AND FIFTH CHUNK (B) FOR 3×3 CHUNK SIZE ON 9×9 IMAGE SIZE

In FIG. 3, the black area indicates the chunk and the red dashed line indicates the actual data that must be extracted from the image for each thread. For a chunk that is not on the image boundary, this data has the size $(n+2)\times(n+2)$, where $n$ is the size of a chunk in one dimension.
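The decomposition with one extra border pixel per available side can be sketched as follows, using the 9×9 image and 3×3 chunks of FIG. 3. The function name `chunk_bounds`, the row-major chunk order and the assumption that the image size is a multiple of the chunk size are choices made for this illustration.

```python
def chunk_bounds(image_size, n):
    """Yield (row0, row1, col0, col1) half-open index ranges, one per chunk,
    each widened by one pixel wherever a neighbouring pixel exists."""
    chunks_per_side = image_size // n
    for r in range(chunks_per_side):
        for c in range(chunks_per_side):
            row0 = max(r * n - 1, 0)
            row1 = min((r + 1) * n + 1, image_size)
            col0 = max(c * n - 1, 0)
            col1 = min((c + 1) * n + 1, image_size)
            yield (row0, row1, col0, col1)


bounds = list(chunk_bounds(9, 3))
first = bounds[0]   # corner chunk: 4 x 4 pixels, i.e. (n+1) x (n+1)
second = bounds[1]  # boundary chunk: 4 x 5 pixels, i.e. (n+1) x (n+2)
fifth = bounds[4]   # centre chunk: 5 x 5 pixels, i.e. (n+2) x (n+2)
```

Clamping the ranges to the image extent is what produces the three different chunk sizes without any special-case code.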

Experiment Results

We implemented our method on two NVIDIA GTX-560 GPUs with 384 stream processors and 2GB of memory, hosted on a 4-core Intel Core i5 CPU with 8GB of main memory. We use the 64-bit version of Fedora 16 Linux with the GCC C++ 4.6.0 compiler with gdb enabled and the OpenCV 2.3 library for image manipulation. The experiments were made with real satellite photos of the Dindang district in Bangkok, Thailand, captured by the IKONOS satellite. Our results consist of 3 parts as follows:

Performance Evaluation

We measure the computational time for varying chunk sizes, using 256 threads on each GPU, as shown in FIG. 4.

FIG. 4 TOTAL COMPUTATIONAL TIME VARYING THE CHUNK SIZE

FIG. 4 clearly shows that a smaller chunk size increases the overall computation time, because each chunk must also process the data on its border. Smaller chunks therefore imply additional border cells to be denoised.

The total number of cells that must be computed on the 2048×2048 image for varying chunk sizes is illustrated in FIG. 5.
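The growth in work with smaller chunks can be estimated directly from the chunk sizes given in the Image Decomposition step. Along one dimension, every internal chunk boundary adds two border pixels, so an image of side N split into N/n chunks per side covers N + 2(N/n - 1) pixels per dimension. The sketch below evaluates this count; it assumes N is a multiple of n and is not claimed to reproduce the exact values plotted in FIG. 5.

```python
def total_cells(image_size, n):
    """Total number of cells over all chunks, including the border pixels
    that neighbouring chunks duplicate."""
    chunks_per_side = image_size // n
    per_dimension = image_size + 2 * (chunks_per_side - 1)
    return per_dimension ** 2


for n in (32, 64, 128, 256, 512, 1024, 2048):
    cells = total_cells(2048, n)
    overhead = cells / 2048 ** 2 - 1.0
    print(f"chunk {n:4d}: {cells} cells, {overhead:.2%} extra")
```

For the 9×9 image of FIG. 3 with 3×3 chunks the count is 169 cells: four corner chunks of 16, four boundary chunks of 20 and one centre chunk of 25.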

However, we have made some modification to our strategy by dividing threads on the GPUs into groups and denoised several smaller chunks at the same time.



We have tried varying the number of threads in the group and chunk size. The results are shown in FIG. 6.

FIG. 5 THE NUMBER OF CELLS NEED TO BE COMPUTED VARYING DOMAIN SIZE ON 2048×2048 IMAGE SIZE

FIG. 6 TOTAL COMPUTATIONAL TIME VARYING THE NUMBER OF GROUPS AND CHUNK SIZE

The computation is repeated until there is no difference between the images of the previous and current iterations.

FIG. 6 shows that dividing the threads into groups, and letting each group work on its own chunk at the same time, decreases the total computation time in most cases. For the larger chunk sizes, dividing the threads into a large number of groups (each containing fewer threads) increases the computation time. We found that the proper number of groups for the 256×256 chunk size is 8 (32 threads per group).

FIG. 6 also indicates that using a large chunk size (256×256 or 512×512) with more than 8 groups decreases the overall performance, because a larger number of groups means fewer threads assigned to each chunk. A large chunk size requires an appropriate number of threads, given the number of GPU cores, to reach the optimum performance.
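The grouping of threads can be illustrated with a simple index mapping. The sketch assumes that groups are formed from contiguous thread ids, which the text does not specify; `group_layout` is a hypothetical helper name.

```python
def group_layout(total_threads, num_groups):
    """Map each thread id to (group index, lane within the group),
    assuming contiguous thread ids form a group."""
    threads_per_group = total_threads // num_groups
    return [(tid // threads_per_group, tid % threads_per_group)
            for tid in range(threads_per_group * num_groups)]


layout = group_layout(256, 8)  # 8 groups of 32 threads, one chunk per group
```

With 256 threads, moving from 8 to 16 groups halves the threads available to each chunk from 32 to 16, which matches the trade-off described above for large chunks.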

We define speedup as $t_{seq}/t_{gpu}$, where $t_{seq}$ is the total computation time of the sequential version and $t_{gpu}$ is the total computation time on the GPUs. The speedup for the 256×256 chunk size with 8 groups of threads, varying the total number of threads, is displayed in FIG. 7.

FIG. 7 SPEEDUP COMPARED WITH THE SEQUENTIAL COMPUTATION ON THE 4096X4096 IMAGE SIZE

From FIG. 7, the maximum speedup achieved is 87.29 times compared with the sequential computation, using 512 threads with a chunk size of 256×256 and 8 groups of threads.

Denoised Images Quality Evaluation

The noisy satellite images and denoised satellite images are shown in FIG. 8 and FIG. 9.

FIG. 8 EXAMPLE OF THE ORIGINAL SATELLITE IMAGE (A) AND DENOISED IMAGE (B) AT THE DINDANG DISTRICT SECTION 1 BY IKONOS SATELLITE.

FIG. 9 EXAMPLE OF ORIGINAL SATELLITE IMAGE (A) AND DENOISED IMAGE (B) AND CLOSE-UP ZOOM FOR NOISY (C) AND DENOISED (D) AT THE DINDANG DISTRICT SECTION 2 BY IKONOS SATELLITE.

Memory Space Complexity Evaluation

We also evaluate the average memory space complexity of the overall computation. We define $F$ as the space required by a floating-point number (usually 4 bytes for GCC) and $N$ as the image size in one dimension. In the normal computation, the memory required by the algorithm for a 3-channel RGB color image is given in Equation (8)

$$O(3FN^2). \tag{8}$$

Our proposed method divides the satellite image into small chunks. We define $c$ as the number of chunks and $G$ as the number of GPUs. Each chunk uses memory space in the order of $(n+2)^2$, where $n$ is the size of a chunk in one dimension. The space complexity required is displayed in Equation (9), assuming each GPU processes one chunk at a time.

$$O\big((n+2)^2 \cdot 3GF\big). \tag{9}$$

On our system, we use two GPUs ($G=2$) and a chunk size of 256×256, with 8 groups of threads working at the same time on each GPU card. We can rewrite Equation (9) for our environment as Equation (10)

$$O(532512\,F). \tag{10}$$

This means the memory usage is constant for any size of the original image.
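The constants in these estimates can be checked with a few lines of arithmetic. The constant 532512 in Equation (10) equals (256 + 2)^2 × 8, which suggests it counts eight chunks resident at the same time; this reading is an inference from the numbers and is not stated in the text. The helper names below are hypothetical.

```python
FLOAT_BYTES = 4  # F, the size of one floating-point value assumed in the text


def full_image_floats(image_size, channels=3):
    """Number of floats needed to hold the whole image, as in Equation (8)."""
    return channels * image_size ** 2


def resident_chunk_floats(n, resident_chunks):
    """Number of floats for several chunks of (n + 2) x (n + 2) cells each."""
    return (n + 2) ** 2 * resident_chunks


whole = full_image_floats(4096)           # 50331648 floats for a 4096 x 4096 RGB image
chunked = resident_chunk_floats(256, 8)   # 532512 floats
print(whole * FLOAT_BYTES, chunked * FLOAT_BYTES)
```

Under this reading, the chunked footprint does not change when the image side N grows, while the whole-image footprint grows with N squared.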

Conclusions and Future Works

We propose a new strategy for improving the quality of satellite images by removing both additive and multiplicative noise from color images. Our method can work on both shared memory and distributed memory systems by decomposing the image into small chunks that fit in each GPU's memory. The results show that our strategy achieves a speedup of up to 87.29 times compared with the sequential computation. The quality of the denoised images is visually satisfactory.

However, color satellite images have three channels, and this noise removal technique denoises each channel individually. In fact, the multiplicative and additive noise on different channels may be dependent. We need to improve the mathematical model and investigate the dependency of the noise across channels in the future.

Our framework is tested on the CUDA architecture, but OpenCL [13] could be used in the same way. The framework can be easily extended to other distributed memory models such as the Message Passing Interface (MPI) or a cloud implementation. We will further investigate the data transfer time of these implementations.

Additionally, we will integrate other satellite image improvements, such as cloud-fog removal [11] and striping noise removal [12], into our approach to further improve the image quality.

ACKNOWLEDGMENT

This work is supported in part by the Thailand Research Fund through the Royal Golden Jubilee Ph.D. Program, contract no. PHD/0275/2551.

We would like to thank Dr. Ornprapa P. Robert for providing us sample satellite images from IKONOS satellites.

REFERENCES

A. Munshi, OpenCL Parallel Computing on the GPU and CPU, International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH 2008), 2008.

B. Dolwithayakul, C. Chantrapornchai, N. Chumchob, An efficient asynchronous approach for Gauss-Seidel iterative solver for FDM/FEM equations on multi-core processors, Proceedings of International Joint Conference on Computer Science and Software Engineering (JCSSE 2012), 2012, pp. 357–361.

C. R. Vogel and M. E. Oman, Fast, robust total variation-based reconstruction of noisy, blurred images, IEEE Transactions on Image Processing, Vol. 7, 1998, pp. 813–824.

C. R. Vogel and M. E. Oman, Iterative methods for total variation denoising, SIAM Journal on Scientific Computing, Vol. 17, 1996, pp. 227–238.

E. Choi and M. G. Kang, Striping Noise Removal of Satellite Images by Nonlinear Mapping, Lecture Notes in Computer Science, Vol. 4142, 2006, pp. 722–729.

K. Hirakawa and T. W. Parks, Image denoising using total least squares, IEEE Transactions on Image Processing, Vol. 15(9), 2006, pp. 2730–2742.

L. Rudin, S. Osher and E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D, Vol. 60, 1992, pp. 259–268.

N. Chumchob, K. Chen and C. Brito-Loeza, A new variational model for removal of combined additive and multiplicative noise and a fast algorithm for its numerical approximation, International Journal of Computer Mathematics, 2012, pp. 1–12.

N. N. Ponomarenko, S. K. Abramov, O. Pogrebnyak, K. O. Egiazarian, V. V. Lukin, D. V. Fevralev, and J. T. Astola, Discrete cosine transform-based local adaptive filtering of images corrupted by nonstationary noise, J. Electron. Imaging, Vol. 19, 2010.

S. S. Al-amri, N. V. Kalyankar and S. D. Khamitkar, A Comparative Study of Removal Noise from Remote Sensing Image, IJCSI International Journal of Computer Science, Vol. 7, 2010, pp. 32–36.

Y. Iikura, Estimation of noise component in satellite images and its application, Geoscience and Remote Sensing Symposium, 1995, pp. 102–104.

Y. Li, J. Chen, Y. Wang and R. Lu, An Effective Approach to Remove Cloud-fog Cover and Enhance Remotely Sensed Imagery, Proceedings of Geoscience and Remote Sensing Symposium, 2005, pp. 4252–4255.

Z. Jin and X. Yang, Analysis of a new variational model for multiplicative noise removal, Journal of Math. Anal. Appl., Vol. 362, 2010, pp. 415–426.
