@IJMTER-2016, All rights Reserved 678
OPTIMIZED IMAGE STORAGE ON CLOUD USING BLOCK
WISE DEDUPLICATION TECHNIQUE
Geetha M.V.1 and Padma Priya M.K.2
1M.Tech (Department of Computer Science and Engineering), New Horizon College of Engineering, Bangalore, India
2Asst. Professor (Department of Computer Science and Engineering), New Horizon College of Engineering, Bangalore, India
Abstract— Big data is a technology that deals with very large volumes of data; as the volume grows, accessing and managing the data or information becomes very difficult. The central theme of this work is to make big data small by using deduplication and MapReduce techniques, thereby making big data easier to access and manage. To achieve this goal, the data is subjected to block-level deduplication implemented with MapReduce. Deduplication is a technique for identifying duplicate images and eliminating repeated copies of data in storage. In this paper we propose duplicate-image identification at the block level using the MapReduce technique, which improves the efficiency and reliability of the system. MapReduce is a simple, parallel computing technique commonly used for analyzing large volumes of data. A traditional deduplication system works only if the second image has exactly the same underlying bits as the first; in many practical applications where storage restrictions are present, users upload modified images that vary in quality or resolution.
Keywords—Deduplication; MapReduce technique; Big data; Hashing function; cloud.
I. INTRODUCTION
Big data is a technology that deals with huge volumes of data, petabytes and more, including both structured and unstructured data. MapReduce is a programming model consisting of Mapper and Reducer classes, which here are applied to an uploaded image file. The image is split into blocks based on the packet size, and the blocks are recombined when the image file is downloaded. A MapReduce program is a combination of a Mapper class and a Reducer class: the map function works on (key, value) pairs and generates a set of intermediate results, and the reduce function merges all these intermediate results to obtain the final output, as shown in Figure 1.
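The (key, value) flow just described can be illustrated with a minimal single-process sketch (plain Python, no Hadoop; the word-count task and all names are illustrative, not taken from the paper):

```python
from collections import defaultdict

def mapper(document):
    # Map phase: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce phase: merge all intermediate values that share the same key.
    return (key, sum(values))

def run_mapreduce(documents):
    # Shuffle step: group the intermediate pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            groups[key].append(value)
    # One reducer call per distinct intermediate key.
    return dict(reducer(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data is big", "data is data"])
# counts -> {"big": 2, "data": 3, "is": 2}
```

In a real framework the grouping and the reducer calls run in parallel across machines; the sketch only shows the dataflow.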
International Journal of Modern Trends in Engineering and Research (IJMTER) Volume 03, Issue 04, [April– 2016] ISSN (Online):2349–9745; ISSN (Print):2393-8161
Deduplication is a technique that minimizes the required storage capacity by eliminating repeated image blocks, so that only a single copy of each image block is stored in the cloud. In many situations users upload the same image to the cloud many times. As a result, the same image is stored repeatedly, storage consumption grows, and redundant copies of the image accumulate, which makes retrieving information from big data very difficult.
MapReduce is one of the simplest and best-known parallel computing techniques for analyzing huge volumes of data. Its main idea is to hide the way in which partitioning takes place, so that the programmer can focus on the data processing itself. In the map phase, each input record is turned into a key/value pair, generating a set of intermediate keys and values; the reduce function then merges all values that share the same intermediate key. A large number of real-world tasks can be expressed effectively in this model, and its main advantage is that programs written in it are automatically parallelized, which increases the speed of execution. MapReduce uses the Google File System (GFS) as its base storage layer, from which data can be read and to which data can be written. GFS uses chunk-based data partitioning, and fault tolerance is provided through replication and data-partitioning algorithms. Apache Hadoop is an open-source implementation of MapReduce.
II. RELATED WORK
[1] elaborates a cost-cutting solution that uses MapReduce in place of conventional model-building algorithms in statistical machine translation. Without MapReduce, the same task can be accomplished by parallelizing it by hand, but that increases hardware costs and, in turn, software costs. On a 20-machine cluster the system gives excellent performance without the hardware cost burden. K-nearest-neighbor processing on large datasets has a great impact on system performance.
[2] takes the above problem as the basis of their work and proposes a technique that combines MapReduce with locality-sensitive hashing (LSH). The combination performs well because the mapping phase of MapReduce naturally accommodates the hashing principle of LSH. The authors also briefly explain various problems of MapReduce and LSH. To evaluate the performance of the system, both a flickr.com dataset and synthetic datasets are considered; evaluation on a real compute cluster is left as future work by the authors.
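As an aside, the core LSH idea that [2] couples with the mapping phase — similar vectors receiving the same hash, and therefore landing in the same bucket or reducer — can be sketched with random-hyperplane signatures (a generic illustration of LSH, not the authors' implementation; all names and numbers are hypothetical):

```python
import random

def lsh_signature(vector, hyperplanes):
    # One bit per hyperplane: the sign of the dot product. Nearby vectors
    # fall on the same side of most hyperplanes, so they tend to share
    # signatures and get grouped together, like keys in a shuffle phase.
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

random.seed(42)
DIM, N_PLANES = 4, 8
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)]
               for _ in range(N_PLANES)]

# Group vectors by signature, as a MapReduce shuffle would group by key.
vectors = {"a": [1.0, 2.0, 3.0, 4.0],
           "a_scaled": [2.0, 4.0, 6.0, 8.0],   # same direction as "a"
           "b": [-4.0, 3.0, -2.0, 1.0]}
buckets = {}
for name, vec in vectors.items():
    buckets.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
```

Vectors pointing in the same direction are guaranteed the same signature; genuinely similar vectors share one with high probability, which is what makes LSH a good fit for a map-side grouping key.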
[3] discusses techniques for removing near-duplicate images from an image dataset. To apply a traditional deduplication technique, the proposed system represents each image as visual words. Because the visual-word representation loses all geometric features, the result may have a higher false-positive rate as the size of the dataset increases. To increase the discriminability of the visual words, local image features are used to group them, and Difference of Gaussians is used for feature-point detection.
[4] presents scalable data-partitioning techniques for processing data streams. Traditional schemes for splitting data fail to achieve a high degree of scalability, which degrades system performance and increases the time and cost complexity of the system. To overcome this problem, [4] proposes two alternative partitioning techniques: batch-based partitioning and pane-based partitioning. Of the two, pane-based partitioning gives the better results. The experimental performance of the system is demonstrated against the Linear Road benchmark, and the approach also places less load on the node that splits the data. The current work does not address the fault tolerance of the system; this issue is left as future work by the authors.
As the world faces the problem of managing huge amounts of data, many techniques are being proposed to cope with it. A recent survey conducted by IBM shows that approximately 2.5 quintillion bytes of data are generated daily, in many formats such as images, videos, social-media opinions, sensor data, and transactional data. Dealing with this data manually is practically infeasible, and hence over the last decade MapReduce has evolved as a promising framework for handling it.
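For concreteness, the pane-based partitioning summarized for [4] can be sketched roughly as follows: each event is aggregated once into a small pane, and a full sliding window is then assembled from the panes it covers (a generic illustration with sum as the aggregate; the window sizes and names are illustrative, not taken from the paper):

```python
from math import gcd

def pane_sums(stream, window, slide):
    # Pane-based partitioning: each element is aggregated exactly once into
    # its pane; a window of size `window` sliding by `slide` decomposes into
    # panes of size gcd(window, slide).
    pane = gcd(window, slide)
    sums = {}
    for t, value in stream:
        sums[t // pane] = sums.get(t // pane, 0) + value
    return pane, sums

def window_sum(pane, sums, start, window):
    # Combine the pre-aggregated panes covering [start, start + window).
    first = start // pane
    count = window // pane
    return sum(sums.get(first + i, 0) for i in range(count))

stream = [(t, 1) for t in range(12)]          # one event per time unit
pane, sums = pane_sums(stream, window=8, slide=4)
# pane == 4; each window sum is assembled from two panes of 4 events each
```

Because panes are shared between overlapping windows, each event is touched once instead of once per window, which is where the scalability gain comes from.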
[5] gives a detailed survey of the family of MapReduce frameworks. Because the main advantage of MapReduce is building scalable applications, it has been adopted at many levels, from academia to industry. The authors present the complete theory behind MapReduce: the reasons to use it, the ways in which it can be used, the databases supporting MapReduce, and so on.
III. PROPOSED SYSTEM
3.1 System Architecture
In our proposed system architecture, the client or user uploads an image file through the network to the web server. In the block process, the uploaded image file is split into blocks based on the size of the image. A hashing technique is applied to each block, producing a hash code, and the logical block address (LBA) of each block is generated by the LBA technique. In the multi-level hash index, the hash code of each newly generated block is compared against the hash codes of the existing blocks. If a block already exists, we simply map to the existing block rather than storing it in the cloud again, thereby achieving deduplication. Finally, only non-redundant copies of blocks are stored in the cloud, as shown in Figure 2.
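A minimal single-machine sketch of this block-level flow (SHA-256 as the hashing technique, an in-memory dict standing in for the cloud store, and the returned list of hashes playing the role of the LBA map; the block size and all names are illustrative):

```python
import hashlib

class BlockStore:
    """Content-addressed store: each unique block is kept exactly once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}          # hash -> block bytes (the "cloud" storage)

    def put(self, data):
        # Split into fixed-size blocks, hash each block, and store only
        # blocks whose hash has not been seen before. The returned recipe
        # (ordered list of hashes) lets the file be reassembled later.
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:     # duplicate blocks are skipped
                self.blocks[digest] = block
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        # Download path: reassemble the file from its block recipe.
        return b''.join(self.blocks[d] for d in recipe)

store = BlockStore(block_size=4)
r1 = store.put(b"AAAABBBBAAAA")   # block "AAAA" occurs twice, stored once
# len(store.blocks) == 2, yet store.get(r1) returns the full original file
```

A production system would persist the hash index and blocks in cloud storage and use a multi-level index for lookup, but the deduplication logic is the same.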
3.2 Decomposition Description
A flow chart is a pictorial or graphical representation of an algorithm using various kinds of boxes, arrows, etc., as shown in Figures 3 and 4.
The steps and operations performed while uploading an image file into the cloud are shown in Figure 3.
The steps and operations performed while downloading a requested image file are shown in Figure 4.
Figure 4: Flow chart of File Download
IV. IMPLEMENTATION RESULTS
The implementation includes several modules namely; Login, Show Profile, Upload file,
download file and transactions. Once the user enters with a valid user name and password, he can log
in to a home page, where he is allowed to perform four operations namely; view profile, image
uploading, image downloading, and view transactions as shown in figure 5.
Figure 5: Home page to perform various operations
4.1 Show Profile Module
Here, based on the user name from the login details, the user's profile details are fetched from the database and displayed, as shown in Figure 6.
4.2 Upload File Module
Here, the user chooses an image file and clicks the submit button to upload the file to the cloud. He can delete a file by selecting the particular image file and clicking the delete button, as shown in Figure 7.
Figure 7: Upload file
4.3 Download File Module
Here, the user can download an image file from the cloud by selecting the particular image file and clicking the download button, as shown in Figure 8.
Figure 8: Download file
4.4 Transactions Module
Here, the transaction details are displayed based on the user name of the particular user: the login, logout, profile-view, uploaded-file, deleted-file, and downloaded-file details of that user, along with the date and time, as shown in Figure 9.
Figure 9: View transactions
V. CONCLUSION
In this paper, we propose block-level image deduplication using the MapReduce model, which makes accessing and managing big data easier, makes information retrieval faster, reduces access time, and makes big data smaller. The MapReduce technique is used to speed up the duplicate-image detection process at the block level. The system reduces the time required to identify duplicate images in cloud storage and avoids redundant copies of the same images by storing a single copy of each image block.
REFERENCES
[1] Dyer, Christopher, et al. "Fast, easy, and cheap: Construction of statistical machine translation models with MapReduce." Proceedings of the Third Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2008.
[2] Stupar, Aleksandar, Sebastian Michel, and Ralf Schenkel. "RankReduce: processing k-nearest neighbor queries on top of MapReduce." Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval. 2010.
[3] Wen, Tzay-Yeu. "Large Scale Image Deduplication."
[4] Balkesen, Cagri, and Nesime Tatbul. "Scalable data partitioning techniques for parallel sliding window processing
over data streams." International Workshop on Data Management for Sensor Networks (DMSN). 2011.
[5] Sakr, Sherif, Anna Liu, and Ayman G. Fayoumi. "The family of MapReduce and large-scale data processing systems." ACM Computing Surveys (CSUR) 46.1 (2013): 11.