@IJMTER-2016, All rights Reserved 678
OPTIMIZED IMAGE STORAGE ON CLOUD USING BLOCK
WISE DEDUPLICATION TECHNIQUE
Geetha M.V.1 and Padma Priya M.K.2
1M.Tech (Department of Computer Science and Engineering), New Horizon College of Engineering, Bangalore, India
2Asst. Professor (Department of Computer Science and Engineering), New Horizon College of Engineering, Bangalore, India
Abstract— Big data is a technology that deals with very large volumes of data; as the volume grows, accessing and managing the data or information becomes very difficult. The central theme of this work is to make big data small by using deduplication and MapReduce techniques, thereby making big data easier to access and manage. To achieve this goal, the data is subjected to block-level deduplication implemented with MapReduce. Deduplication is a technique for identifying duplicate images and eliminating repeated copies of data in storage. In this paper we propose duplicate-image identification at the block level using the MapReduce technique, which improves the efficiency and reliability of the system. MapReduce is a simple, parallel computing technique commonly used for analyzing large volumes of data. A traditional deduplication system works only if the second image has exactly the same underlying bits as the first; in many practical applications where storage restrictions are present, users upload modified images that vary in quality or resolution.
Keywords—Deduplication; MapReduce technique; Big data; Hashing function; cloud.
I. INTRODUCTION
Big data is a technology that deals with huge volumes of data, petabytes and more, including both structured and unstructured data. MapReduce is a programming model consisting of Mapper and Reducer classes, which here are applied to an uploaded image file. The image is split into blocks based on the packet size, and the blocks are recombined when the image file is downloaded. A MapReduce program is a combination of a Mapper class and a Reducer class: the map function works on (key, value) pairs and generates a set of intermediate results, and the reduce function merges all these intermediate results to obtain the final output, as shown in Figure 1.
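The (key, value) flow just described can be illustrated with a minimal single-process sketch (plain Python, no Hadoop; the word-count task and all names are illustrative, not taken from the paper):

```python
from collections import defaultdict

def mapper(document):
    # Map phase: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: merge all intermediate values that share the same key.
    return (key, sum(values))

def run_mapreduce(documents):
    # Shuffle step: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    # One reducer call per distinct intermediate key.
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data is big", "data is data"])
# counts -> {"big": 2, "data": 3, "is": 2}
```

In a real framework the grouping and the reducer calls run in parallel across machines; the sketch only shows the dataflow.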
International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 03, Issue 04, [April– 2016] ISSN (Online):2349–9745; ISSN (Print):2393-8161
Deduplication is a technique that minimizes the required storage capacity by eliminating repeated image blocks, so that only a single copy of each image block is stored in the cloud. In many situations users upload the same image to the cloud many times. As a result, the same image is stored repeatedly, storage consumption grows, and redundant copies of the image accumulate, which makes retrieving information from big data very difficult.
MapReduce is one of the simplest and best-known parallel computing techniques for analyzing huge volumes of data. Its main idea is to hide the way in which partitioning takes place, so that the programmer can focus on the data processing itself. In the map phase, each input record is turned into a key/value pair, generating a set of intermediate keys and values; the reduce function then merges all values that share the same intermediate key. A large number of real-world tasks can be expressed effectively in this model, and its main advantage is that programs written in it are automatically parallelized, which increases the speed of execution. MapReduce uses the Google File System (GFS) as its base storage layer, from which data can be read and to which data can be written. GFS uses chunk-based data partitioning, and fault tolerance is provided through replication and data-partitioning algorithms. Apache Hadoop is an open-source implementation of MapReduce.
II. RELATED WORK
[1] elaborates a cost-cutting solution that uses MapReduce in place of conventional model-building algorithms in statistical machine translation. Without MapReduce, the same task can be accomplished by parallelizing it by hand, but that increases hardware costs and, in turn, software costs. On a 20-machine cluster the system gives excellent performance without the hardware cost burden. K-nearest-neighbor processing on large datasets has a great impact on system performance.
[2] takes the above problem as the basis of their work and proposes a technique that combines MapReduce with locality-sensitive hashing (LSH). The combination performs well because the mapping phase of MapReduce naturally accommodates the hashing principle of LSH. The authors also briefly explain various problems of MapReduce and LSH. To evaluate the performance of the system, both a flickr.com dataset and synthetic datasets are considered; evaluation on a real compute cluster is left as future work by the authors.
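As an aside, the core LSH idea that [2] couples with the mapping phase — similar vectors receiving the same hash, and therefore landing in the same bucket or reducer — can be sketched with random-hyperplane signatures (a generic illustration of LSH, not the authors' implementation; all names and numbers are hypothetical):

```python
import random

def lsh_signature(vector, hyperplanes):
    # One bit per hyperplane: the sign of the dot product. Nearby vectors
    # fall on the same side of most hyperplanes, so they tend to share
    # signatures and get grouped together, like keys in a shuffle phase.
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

random.seed(42)
DIM, N_PLANES = 4, 8
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)]
               for _ in range(N_PLANES)]

# Group vectors by signature, as a MapReduce shuffle would group by key.
vectors = {"a": [1.0, 2.0, 3.0, 4.0],
           "a_scaled": [2.0, 4.0, 6.0, 8.0],   # same direction as "a"
           "b": [-4.0, 3.0, -2.0, 1.0]}
buckets = {}
for name, vec in vectors.items():
    buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
```

Vectors pointing in the same direction are guaranteed the same signature; genuinely similar vectors share one with high probability, which is what makes LSH a good fit for a map-side grouping key.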
[3] discusses techniques for removing near-duplicate images from an image dataset. To apply a traditional deduplication technique, the proposed system represents each image as visual words. Because the visual-word representation loses all geometric features, the result may have a higher false-positive rate as the size of the dataset increases. To increase the discriminability of the visual words, local image features are used to group them, and Difference of Gaussians is used for feature-point detection.
[4] presents scalable data-partitioning techniques for processing data streams. Traditional schemes for splitting data fail to achieve a high degree of scalability, which degrades system performance and increases the time and cost complexity of the system. To overcome this problem, [4] proposes two alternative partitioning techniques: batch-based partitioning and pane-based partitioning. Of the two, pane-based partitioning gives the better results. The experimental performance of the system is demonstrated against the Linear Road benchmark, and the approach also places less load on the node that splits the data. The current work does not address the fault tolerance of the system; this issue is left as future work by the authors.
As the world faces the problem of managing huge amounts of data, many techniques are being proposed to cope with it. A recent survey conducted by IBM shows that approximately 2.5 quintillion bytes of data are generated daily, in many formats such as images, videos, social-media opinions, sensor data, and transactional data. Dealing with this data manually is practically infeasible, and hence over the last decade MapReduce has evolved as a promising framework for handling it.
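For concreteness, the pane-based partitioning summarized for [4] can be sketched roughly as follows: each event is aggregated once into a small pane, and a full sliding window is then assembled from the panes it covers (a generic illustration with sum as the aggregate; the window sizes and names are illustrative, not taken from the paper):

```python
from math import gcd

def pane_sums(stream, window, slide):
    # Pane-based partitioning: each element is aggregated exactly once into
    # its pane; a window of size `window` sliding by `slide` decomposes into
    # panes of size gcd(window, slide).
    pane = gcd(window, slide)
    sums = {}
    for t, value in stream:
        sums[t // pane] = sums.get(t // pane, 0) + value
    return pane, sums

def window_sum(pane, sums, start, window):
    # Combine the pre-aggregated panes covering [start, start + window).
    first = start // pane
    count = window // pane
    return sum(sums.get(first + i, 0) for i in range(count))

stream = [(t, 1) for t in range(12)]          # one event per time unit
pane, sums = pane_sums(stream, window=8, slide=4)
# pane == 4; each window sum is assembled from two panes of 4 events each
```

Because panes are shared between overlapping windows, each event is touched once instead of once per window, which is where the scalability gain comes from.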
[5] gives a detailed survey of the family of MapReduce frameworks. Because the main advantage of MapReduce is building scalable applications, it has been adopted at many levels, from academia to industry. The authors present the complete theory behind MapReduce: the reasons to use it, the ways in which it can be used, the databases supporting MapReduce, and so on.
III. PROPOSED SYSTEM
3.1 System Architecture
In our proposed system architecture, the client or user uploads an image file through the network to the web server. In the block process, the uploaded image file is split into blocks based on the size of the image. A hashing technique is applied to each block, producing a hash code, and the logical block address (LBA) of each block is generated by the LBA technique. In the multi-level hash index, the hash code of each newly generated block is compared against the hash codes of the existing blocks. If a block already exists, we simply map to the existing block rather than storing it in the cloud again, thereby achieving deduplication. Finally, only non-redundant copies of blocks are stored in the cloud, as shown in Figure 2.
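A minimal single-machine sketch of this block-level flow (SHA-256 as the hashing technique, an in-memory dict standing in for the cloud store, and the returned list of hashes playing the role of the LBA map; the block size and all names are illustrative):

```python
import hashlib

class BlockStore:
    """Content-addressed store: each unique block is kept exactly once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}          # hash -> block bytes (the "cloud" storage)

    def put(self, data):
        # Split into fixed-size blocks, hash each block, and store only
        # blocks whose hash has not been seen before. The returned recipe
        # (ordered list of hashes) lets the file be reassembled later.
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:     # duplicate blocks are skipped
                self.blocks[digest] = block
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        # Download path: reassemble the file from its block recipe.
        return b''.join(self.blocks[d] for d in recipe)

store = BlockStore(block_size=4)
r1 = store.put(b"AAAABBBBAAAA")   # block "AAAA" occurs twice, stored once
# len(store.blocks) == 2, yet store.get(r1) returns the full original file
```

A production system would persist the hash index and blocks in cloud storage and use a multi-level index for lookup, but the deduplication logic is the same.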
3.2 Decomposition Description
A flow chart is a pictorial or graphical representation of an algorithm using various kinds of boxes, arrows, etc., as shown in Figures 3 and 4.
The steps and operations performed while uploading an image file into the cloud are shown in Figure 3.
The steps and operations performed while downloading a requested image file are shown in Figure 4.
Figure 4: Flow chart of File Download
IV. IMPLEMENTATION RESULTS
The implementation includes several modules namely; Login, Show Profile, Upload file,
download file and transactions. Once the user enters with a valid user name and password, he can log
in to a home page, where he is allowed to perform four operations namely; view profile, image
uploading, image downloading, and view transactions as shown in figure 5.
Figure 5: Home page to perform various operations
4.1 Show Profile Module
Here, based on the user name from the login details, the user's profile details are fetched from the database and displayed, as shown in Figure 6.
4.2 Upload File Module
Here, the user chooses an image file and clicks the submit button to upload the file to the cloud. He can delete a file by selecting the particular image file and clicking the delete button, as shown in Figure 7.
Figure 7: Upload file
4.3 Download File Module
Here, the user can download an image file from the cloud by selecting the particular image file and clicking the download button, as shown in Figure 8.
Figure 8: Download file
4.4 Transactions Module
Here, the transaction details are displayed based on the user name of the particular user: the login, logout, profile-view, uploaded-file, deleted-file, and downloaded-file details of that user, along with the date and time, as shown in Figure 9.
Figure 9: View transactions
V. CONCLUSION
In this paper, we propose block-level image deduplication using the MapReduce model, which makes accessing and managing big data easier, makes information retrieval faster, reduces access time, and makes big data smaller. The MapReduce technique is used to speed up the duplicate-image detection process at the block level. The system reduces the time required to identify duplicate images in cloud storage and avoids redundant copies of the same images by storing a single copy of each image block.
REFERENCES
[1] Dyer, Christopher, et al. "Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce." Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2008.
[2] Stupar, Aleksandar, Sebastian Michel, and Ralf Schenkel. "RankReduce: processing k-nearest neighbor queries on top of MapReduce." Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval. 2010.
[3] Wen, Tzay-Yeu. "Large Scale Image Deduplication."
[4] Balkesen, Cagri, and Nesime Tatbul. "Scalable data partitioning techniques for parallel sliding window processing
over data streams." International Workshop on Data Management for Sensor Networks (DMSN). 2011.
[5] Sakr, Sherif, Anna Liu, and Ayman G. Fayoumi. "The family of MapReduce and large-scale data processing systems." ACM Computing Surveys (CSUR) 46.1 (2013): 11.