Private Content Based Image Information Retrieval using Map-Reduce
A Dissertation
submitted in fulfillment for the award of
the degree
Master of Technology,
Computer Engineering
Submitted by
Arpit D. Dongaonkar
MIS No: 121122006
Under the guidance of
Prof. Sunil B. Mane
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013
DEPARTMENT OF COMPUTER ENGINEERING
AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation entitled
“Private Content Based Image Information Retrieval using
Map-Reduce”
has been successfully completed
by
Mr. Arpit Dilip Dongaonkar
MIS No. 121122006
as a fulfillment of End Semester evaluation of
Master of Technology.
SIGNATURE
Prof. Sunil B. Mane
Project Guide
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

SIGNATURE
Dr. J. V. Aghav
Head
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.
Dedicated to
My Mother
Sunita D. Dongaonkar,
My Father
Prof. Dilip R. Dongaonkar
for their love and everything, that they have given me.......
and also to
All family members
for supporting me.......
Acknowledgements
I would like to express my sincere gratitude towards Prof. Sunil B. Mane, my research guide, for his patient guidance, enthusiastic encouragement and useful critiques of this research work. Without his invaluable guidance, this work would never have been successful.
I would also like to thank Vikas Jadhav, my classmate, for continuous and detailed knowledge sharing on Hadoop and related topics, from installation to operation. I would also like to thank all my M.Tech friends for their unconditional support and help.
Last but not least, my sincere thanks to all my teachers who directly or indirectly helped me to learn new technologies and complete my post-graduation successfully.
Arpit Dilip Dongaonkar
College of Engineering, Pune
Abstract
Today, the role of information technology is changing, and this change is driving the development of real-life software applications: information technology has entered every part of our day-to-day life, from traffic signals and power grids to everyday financial transactions. But this technology comes at a cost. As it touches every layer of life and community, there is a need to make it available to every layer of the community.
Cloud computing is an emerging technology that reduces this cost while providing an efficient information technology platform and service. The cloud reduces the load on local disks, operating systems and, eventually, on CPU processing. Cloud providers offer infrastructure, platforms, databases etc. on which users can install an operating system, install required software and store their data. But this service comes at a price: cloud providers charge on a pay-per-hour basis, so any application deployed on the cloud must be fast, efficient and secure.
As the Internet expands, the amount of data created over it grows exponentially, and a large part of this data consists of images. Today, huge quantities of varied image data are produced through digital cameras, mobile phones, photo-editing software, etc. Many of these images are private to a particular user, and such digital image data should be secured so that no one but the owner can access it.
Some domains use content-based image retrieval applications. These applications should be fast enough to carry out user functionality efficiently, and secure enough to protect user data. If an application is to be deployed on the cloud, it should also be supported by the cloud's structure. One technology that is supported by the cloud and performs distributed parallel computing is "Map Reduce".
I propose a system that allows a user to upload images and to search for and retrieve some of his personal images based on one particular query image, securely and efficiently, from his continuously expanding image database. The system incorporates an encryption technique for security and the Map-Reduce technique for efficient upload, search and retrieval over a large dataset of user images.
The proposed solution is a system to securely upload, search and query images in massive image storage using a content-based image retrieval scheme built on the Map-Reduce technique, together with an evaluation comparing the proposed solution with existing solutions.
Keywords - Information technology, Cloud computing, secured personal data.
Contents

List of Tables
List of Figures

1 INTRODUCTION
1.1 Private Information Retrieval
1.2 Image Retrieval
1.2.1 Content Based Image Retrieval
1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique

2 Literature Survey
2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.1 System Model
2.2.2 Future Scope/Issues
2.3 An Oblivious Image Retrieval Protocol [3]
2.3.1 System Model
2.3.2 Future Scope/Issues
2.4 Distributed Image Retrieval System Based on MapReduce [4]
2.4.1 System Model
2.4.2 Future Scope/Issues
2.5 Summary

3 Motivation
3.1 Problem Definition
3.2 Scope of Research
3.3 Objectives
3.4 System Requirement Specifications

4 System Design
4.1 Complete Architecture
4.1.1 Upload Images
4.1.2 Image Retrieval

5 Implementation of System
5.1 Log In Screen
5.2 Create Account
5.3 User Operation Screen
5.3.1 Changes in Hadoop
5.3.2 Upload Process
5.3.3 Image Retrieval

6 Experiments, Testing and Result Analysis
6.1 Cluster Configuration
6.2 Experiments and Testing
6.3 Result Analysis
6.4 Error Check in Application
6.5 Comparison of Different Systems

7 System Output
7.1 Input Images

8 Conclusions and Future Scope
8.1 Conclusion
8.2 Future Scope

A Publication Status
List of Tables

6.1 Cluster Configuration
6.2 Second Cluster Configuration
List of Figures

1.1 Map Reduce Process
4.1 System Architecture
4.2 Upload Process
4.3 Image Retrieval Process
5.1 LogIn Screen
5.2 Create Account Screen
5.3 Account Creation Screen - Validation
5.4 User Operations Screen
5.5 Upload Screen
5.6 Uploading Screen
5.7 Process behind Upload
5.8 Search Screen
5.9 Searching Screen
5.10 Process behind Search Screen
6.1 One Node Cluster Difference at Upload MapReduce
6.2 One Node Cluster Difference at Retrieval MapReduce
6.3 Three Node Cluster Difference at Upload MapReduce
6.4 Three Node Cluster Difference at Retrieval MapReduce
6.5 Performance of Upload MapReduce at Cluster
6.6 Performance of Retrieval MapReduce at Cluster
6.7 Scale Up between Different Cluster Nodes
6.8 Scale Up between Different Cluster Nodes
6.9 Error Checking in Email
6.10 Error Check at Upload Images
6.11 Error Check at Search Images
6.12 Error Check at Search Images
7.1 Input Set of Images Uploaded
7.2 Input Search Image
7.3 Output of System
Chapter 1
INTRODUCTION
1.1 Private Information Retrieval
Today, most data on the Internet resides in database servers connected to it. When a user queries a database, he/she should receive only the requested information. However, many curious users query databases, intentionally or unintentionally, to obtain the private information of other users. A lot of research is devoted to protecting databases from such curious users. Private information retrieval is a research area that addresses security against this type of intrusion.
A private information retrieval (PIR) protocol allows a user to retrieve an item from a
server in possession of a database without revealing which item he/she is retrieving.
Over time, different improvements to PIR have been proposed that make it faster, more efficient and more secure. Techniques such as oblivious transfer have also been implemented in PIR; oblivious transfer restricts the user from learning anything about other database items [3]. PIR also maintains the privacy of the queries made to the database.
1.2 Image Retrieval
An image retrieval system is designed to browse, search and retrieve images from a large database of digital images. Most traditional and common methods of image retrieval add metadata such as captions, keywords or descriptions to the images, so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, a large amount of research has been done on automatic image annotation [2]. Annotated images are then searched based on the images' metadata.
Image meta search: search for images based on associated metadata such as keywords, text, etc.
Content-based image retrieval (CBIR): the application of computer vision to image retrieval. CBIR aims to avoid the use of textual descriptions and instead retrieves images based on similarities in their content (textures, colors, shapes etc.) to a user-supplied query image or user-specified image features.
1.2.1 Content Based Image Retrieval
Image retrieval has been an active research area since the 1970s. In the beginning, research concentrated on text-based search only. It was a new framework at that time, which used the names of image files as the search criterion. In this framework, images first had to be annotated, and were then retrieved through a database management system [5]. This framework had two limitations: first, the size of image data that can fit into a database, and second, the extensive manual annotation work. There was a need for retrieval research that would not be limited by manual entry and large data sizes. This led to the content-based image retrieval technique.
Content-based image retrieval (CBIR) is also known as query by image content (QBIC) [6]. CBIR is a technique in which the content of an image is used as the matching criterion instead of the image's metadata such as keywords, tags or any name associated with the image. This provides a much closer match than text-based image retrieval [7], [8], [9], [10].
The term 'content' in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself [6], [8], [10]. Image content can be categorized into visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationships, etc. Domain-specific visual content, such as human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either from textual annotation or by complex inference procedures based on visual content [8].
COLOR: Image retrieval based on color actually means retrieval based on color descriptors. The most commonly used color descriptors are the color histogram, color coherence vector, color correlogram and color moments [9]. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images using similarity distance measures [8], [11]. It captures color proportions by region and is independent of image size, format or orientation [6]. Color moments involve calculating the first-order moment (mean), the second (variance) and the third (skewness) of an image [8].
TEXTURE: The texture of an image consists of the visual patterns the image possesses and how they are spatially defined. Textures are represented by texels, which are grouped into a number of sets depending on how many textures are detected in the image. These sets define not only the texture but also where in the image the texture is located. Statistical methods such as co-occurrence matrices can be used to quantitatively measure the arrangement of intensities in a region [8], [12].
SHAPE: Shape here does not mean the shape of an image, but the shape of a particular region or object within it. Segmentation and edge detection are prominent techniques used in shape detection [6].
Several components, such as color intensity, entropy and the image mean, are calculated from image content and are useful in creating a feature vector. The feature vector of every image is calculated and stored in the database. When a user wants to retrieve a set of images, he queries the system with an image. The feature vector of the query image is matched against the vectors of the stored images, and images whose vectors are similar to the query vector are retrieved. The user then selects the required image from the displayed set.
1.3 Similarity Measurement
A similarity measurement is selected to determine how similar two vectors are. The problem reduces to computing the discrepancy between two vectors x, y ∈ R^d. Three distance measurements, the Euclidean, Mahalanobis and chord distances, are reviewed below [11].
1.3.1 Euclidean Distance
The Euclidean distance between x, y ∈ R^d is computed by [11]

\delta(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}
1.3.2 Mahalanobis Distance
The Mahalanobis distance between two vectors x and y with respect to the training patterns x_i is computed by [11]

\delta(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}

where the mean vector u and the sample covariance matrix S from the sample {x_i | 1 ≤ i ≤ n} of size n are computed by [11]

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - u)(x_i - u)^t \quad \text{with} \quad u = \frac{1}{n} \sum_{i=1}^{n} x_i
1.3.3 Chord Distance
The chord distance between two vectors x and y measures the distance between the projections of x and y onto the unit sphere, and can be computed by [11]

\delta_3(x, y) = \left\| \frac{x}{r} - \frac{y}{s} \right\|_2, \quad \text{where } r = \|x\|_2 \text{ and } s = \|y\|_2
1.4 Hadoop
Apache Hadoop is a collection of open-source software projects for reliable, scalable, distributed computing. The Hadoop software libraries provide a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single machine up to thousands of machines, each offering local computation and storage [13].
Some modules of Hadoop are listed below [13]:
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
3. Hadoop YARN : A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data
sets.
5. HBase: A scalable, distributed database that supports structured data storage for
large tables.
1.4.1 MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of commodity computers (nodes). The computation performed on the nodes is independent of their physical location: the nodes may all be on the same network and use similar hardware (called a cluster), or they may be spread across geographically and administratively distributed systems and use more heterogeneous hardware (called a grid). It is also independent of the location of the data, which can be stored on any node of the cluster.

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.
The map function, written by the programmer, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all values associated with the same key and passes them to the reduce function.
The reduce function, also written by the programmer, accepts an intermediate key together with the values generated for it by the map function. It merges these values to form a possibly smaller set of values, typically just zero or one output value per key, which forms the actual output of the user program.
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(k3, v3)
Computational processing can occur on data stored either in a file system (unstruc-
tured/HDFS) or in a database (structured/HBase).
Map: The master node takes the input, divides it into smaller sub-problems, and dis-
tributes them to worker nodes. A worker node may do this again in turn, leading to a
multi-level tree structure. The worker node processes the smaller problem, and passes
the answer back to its master node [14].
Reduce: The master node then collects the answers to all the sub-problems and com-
bines them in some way to form the output – the answer to the problem it was originally
trying to solve [14].
MapReduce allows the map and reduce operations to be distributed. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or the reduction function is associative.
Figure 1.1: Map Reduce Process.
Example [15]
The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

map(String key_file, String value_line):
  // key_file: text file name, value_line: text file contents
  for each word m in value_line:
    EmitIntermediate(m, "1");

reduce(String key_file, Iterator value_line):
  // key_file: a word from the text file, value_line: a list of counts for that word
  int total_count = 0;
  for each count t in value_line:
    total_count += ParseInt(t);
  Emit(AsString(total_count));
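The word-count job above can be simulated end to end in a few lines of Python. The mapper, shuffle and reducer here are ordinary in-process functions (all names hypothetical); a real Hadoop job would implement them through the MapReduce API and run them on distributed nodes.

```python
from collections import defaultdict

def map_word(key_file, value_line):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word, 1) for word in value_line.split()]

def reduce_word(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(splits):
    """Drive the job: map every split, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for name, text in splits.items():          # map phase
        for word, count in map_word(name, text):
            groups[word].append(count)         # shuffle: group values by key
    return dict(reduce_word(w, c) for w, c in groups.items())

counts = run_job({"doc1.txt": "map reduce map", "doc2.txt": "reduce hadoop"})
# counts == {"map": 2, "reduce": 2, "hadoop": 1}
```

Because each call to `map_word` touches only its own split and each call to `reduce_word` touches only one key's group, both phases can be distributed across nodes exactly as described above.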
1.4.2 HDFS
The Hadoop Distributed File System (HDFS) [16] is another module of the Hadoop project. It is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and works on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets [16].
1.4.3 HBase
HBase works on top of HDFS. It is a distributed, column-oriented database built on HDFS [17], and is the Hadoop application to use when real-time read/write random access to very large datasets is required.
HBase scales linearly simply by adding nodes. It is not relational and does not support SQL, but given the right problem space it can do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware [17].
1.5 Oblivious Technique
The oblivious retrieval technique is designed to achieve both user and database privacy. It is built on homomorphic encryption, which allows addition and multiplication operations to be performed on encrypted data without revealing the underlying plaintext. In this application, homomorphic encryption is applied to the image feature data.

Today, many website firms outsource their data to external storage servers. In this case, it is necessary to maintain the privacy of the user with respect to the external server, and vice versa. The oblivious technique is one attempt to provide this kind of security: the privacy of the user with respect to the server, and of the server with respect to the user, is maintained through homomorphic encryption.

Homomorphic encryption is an asymmetric cryptographic technique that works on a public/private key pair. It allows specific types of computation to be carried out on ciphertext, producing an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext.
For instance, one person could add two encrypted numbers and another person could then decrypt the result, without either of them being able to learn the values of the individual numbers.
Using the user's public key, the feature vector is stored in encrypted form in the database. When the user issues a query, it reaches the database in encrypted form, and images are searched and retrieved on the basis of similar encrypted feature vectors. The encrypted image data is sent back to the user, who decrypts it with his private key.

The cryptographic technique used in this system is the Paillier cryptosystem. In the Paillier cryptosystem, the homomorphic property is supported as follows.
Homomorphic addition of plaintexts: the product of two ciphertexts decrypts to the sum of their corresponding plaintexts,

D(E(m1, r1) * E(m2, r2) mod n^2) = (m1 + m2) mod n
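A toy Paillier round trip illustrating this property is sketched below, using tiny fixed primes (p = 61, q = 53) and the common generator choice g = n + 1. This is illustration only; real deployments use keys of 2048 bits or more.

```python
import math
import random

def paillier_keygen(p=61, q=53):
    """Toy Paillier key generation; p and q must be primes with
    gcd(pq, (p-1)(q-1)) = 1."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)       # Carmichael function lambda(n)
    g = n + 1                          # standard simple choice of generator
    mu = pow(lam, -1, n)               # with g = n + 1, mu = lambda^-1 mod n
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:         # blinding factor r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)             # x = 1 + m*lam*n (mod n^2)
    return ((x - 1) // n * mu) % n     # L(x) * mu mod n recovers m

pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 37), encrypt(pub, 5)
# Homomorphic addition: multiplying ciphertexts adds the plaintexts.
assert decrypt(pub, priv, (c1 * c2) % (pub[0] ** 2)) == 42
```

The server can thus combine encrypted feature values (for example, performing the homomorphic subtractions used by the oblivious protocol) without ever seeing the plaintext features.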
Homomorphic Encryption Algorithms:
1. Simple RSA Algorithm
2. ElGamal Cryptosystem
3. Paillier Cryptosystem
Chapter 2
Literature Survey
2.1 PIRMAP [1]
Private Information Retrieval (PIR) allows bits to be retrieved from a database in a way that hides the user's access pattern from the server. This paper presents PIRMAP, a practical, highly efficient protocol for PIR in MapReduce, a widely supported cloud computing API. PIRMAP focuses especially on the retrieval of large files from the cloud, where it achieves optimal communication complexity (O(l) for retrieval of an l-bit file) with query times significantly faster than previous schemes.
2.1.1 PIR
When a user connects to a cloud, there is a series of interactions between the user and the server. The user connects to the cloud to retrieve private information stored there, which may take the form of different files.
Suppose the user wants to retrieve x of his n files, chosen freely, without the server learning which files were chosen. The user must therefore issue a query such that he receives the required files from the cloud while the transaction remains private.
2.1.2 MapReduce in PIRMAP
This information retrieval technique uses MapReduce to parallelize the process, which reduces the computation time needed to retrieve a file from the cloud and makes the system easy to use.

The first phase is called the "Map" phase. MapReduce automatically splits the input computation equally among the available nodes in the cloud data center, and each node then runs a function called map on its respective piece (called an InputSplit). It is important to note that the splitting actually occurs when the data is uploaded into the cloud. This means that each "mapper" node has local access to its InputSplit as soon as computation is started, avoiding a lengthy copying and distribution period. The map function runs a user-defined computation on each InputSplit and outputs (emits) a number of key-value pairs that go into the next phase.

The second phase, "Reduce", takes as input all of the key-value pairs emitted by the mappers and sends them to "reducer" nodes in the data center. Specifically, each reducer node receives a single key, along with the sequence of values emitted by the mappers that share that key. The reducers then take each set and combine it in some way, emitting a single value for each key [1].
2.1.3 Working of PIRMAP
PIRMAP is an extension of the PIR protocol of Kushilevitz and Ostrovsky, targeting the retrieval of large files in a parallelization-aggregation computation framework such as MapReduce. PIRMAP can be used with any additively homomorphic encryption scheme; an overview follows.

Upload: In the following, we assume that the cloud user has already uploaded his files into the cloud using the interface provided by the cloud provider.

Query: In keeping with standard PIR notation, the data set holds n files, each of which is l bits in length. There is also an additional parameter k, the block size of the chosen cipher. For ease of presentation, we consider the case where all files are the same length; PIRMAP can easily be extended to accommodate variable-length files by padding, or by prepending each file with a few bytes that specify its length.
2.1.4 Future Scope/Issues
This implementation is specifically for files containing textual data; it does not support multimedia data such as image, video or audio files.
2.2 Map/Reduce in CBIR Application [2]
Initially, image retrieval depended mainly on text-based retrieval. This approach is widely used; the mainstream search engines Google, Baidu, Yahoo, etc. mainly used it to search for images. In this technique, the name of the image file is compared in order to retrieve the image from the database.
It has drawbacks, however: researchers must manually mark all images with text, and this mark text cannot objectively and accurately describe the visual information in the images.
After the 1990s, content-based image retrieval (hereafter CBIR) emerged. Unlike TBIR, it extracts visual features from images automatically and then retrieves images by those visual features. Since CBIR is intuitive and efficient, and can be widely applied in information retrieval, medical diagnosis, trademark and intellectual property protection, crime prevention and other areas, it has very high applied value.
2.2.1 System Model
Selecting an Algorithm
The technique used in this paper is color feature-based image retrieval. This mainly involves the following algorithms:
color histogram, hue histogram, color moments, color entropy, etc. [15]
Since the color histogram algorithm is widely used, and its feature extraction and similarity matching are comparatively easy, it is chosen as the target color algorithm.
Color Histogram
The color histogram of an image gives extensive information about its structure: the number of pixels, their colors in RGB format, etc. It is therefore easy to calculate the mean, entropy and median of an image, which can be used as image features in feature extraction. These features can be compared with the queried image's features to retrieve all similar images from the database.
Feature Calculation and Similarity Matching
The feature vector of each uploaded image is stored in the database. This feature vector is matched with the feature vector of an input image: both feature vectors are used to calculate a similarity coefficient, the Pearson correlation coefficient. MapReduce is used in the similarity matching phase only, to provide fast results.
The feature vector of the input image is calculated at query time and matched against the images in the image library; the images that match are retrieved.
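A minimal sketch of this matching step, assuming the feature vectors are simple color histograms (the names and data below are hypothetical): because the Pearson coefficient is scale-invariant, two histograms with the same proportions score a perfect 1.0.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two feature vectors; 1.0 means
    the vectors have identical shape (up to scale and offset)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical 4-bin color histograms; the query has the same proportions
# as "sunset", so it correlates perfectly with it despite the smaller scale.
library = {"sunset": [8, 2, 1, 1], "forest": [1, 1, 8, 2]}
query = [4, 1, 0.5, 0.5]
scores = {name: pearson(hist, query) for name, hist in library.items()}
best = max(scores, key=scores.get)  # "sunset"
```

In the paper's system, only this scoring step runs inside MapReduce: each mapper scores its share of the library against the query vector, and the reducer collects the top matches.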
2.2.2 Future scope/Issues
The MapReduce technique is used only at the similarity matching stage of the system. It could also be used efficiently in the feature extraction process, by splitting an image into parts at the map stage and then combining them at the reduce stage.
2.3 An Oblivious Image Retrieval Protocol [3]
Today, many website providers hold large collections of images. For example, google.com and facebook.com have large image databases to which each user uploads his private data. This data should be protected from external and internal threats, which may include theft of the image data, modification of the data, deletion of an individual's private data, or its misuse for some particular purpose.

As the size of this data increases day by day, companies find it hard to manage and secure such large volumes. Their data center servers are becoming overloaded, and to reduce the strain on their storage servers they are looking at another option: external storage servers.

These external storage servers are maintained and managed by other companies; that is, the data is outsourced to reduce the overhead on internal servers. In this case, it becomes very important to protect and secure customer data on the external, outsourced database servers. The Oblivious Image Retrieval Protocol addresses outsourced image databases: it is a technique to securely query an image database for the required image data, retrieve the matched images from the database, and securely transfer the retrieved data to the user.
2.3.1 System Model
The system assumes that the feature vectors of all images are already stored, in encrypted
form, in the database, and that all query operations are done on encrypted data. The
protocol has two main parts: a privacy-preserving querying mechanism and oblivious
transfer of the decryption keys.
The privacy-preserving protocol of the paper works as follows. First, to query the image
set, the user generates an encryption of the query feature vector set using a homomorphic
public-key encryption technique and sends it to the database server. The query feature
vector is distorted by the user with a constant random vector to prevent any statistical
inference by the database server. Second, the database server takes the encrypted query
feature vector and performs a homomorphic subtraction with a random feature vector,
and subtracts the same random feature vector from the database image features. Before
sending the results back to the user, the database server permutes the subtracted feature
vectors so that the user cannot learn the relative indexing structure of the database
images; it includes pseudo-identifiers for the permuted images to allow the user to identify
his choices. Third, upon receiving the server response, the user removes the distorting
random constant from the server response and calculates the Euclidean norm of the
numerical difference between the query image feature vector and the database image
feature vectors [3].
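The three blinding steps above can be sketched with plain integers standing in for Paillier-style ciphertexts; all values and method names here are illustrative, not from the paper's implementation, and the real protocol performs the same arithmetic homomorphically on vectors of encrypted features.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ObliviousSketch {
    // User side: distort the query feature with a constant random value c.
    static int distort(int feature, int c) { return feature + c; }

    // User side: remove the distortion c from a value returned by the server.
    static int unblind(int serverValue, int c) { return serverValue - c; }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int queryFeature = 37;
        int c = rnd.nextInt(100);
        int distortedQuery = distort(queryFeature, c);   // sent to the server

        // Server side: subtract the same random r from the distorted query and
        // from every database feature, then permute the database results.
        int r = rnd.nextInt(100);
        int serverQuery = distortedQuery - r;            // returned to the user
        int[] dbFeatures = {35, 80, 12};
        List<Integer> masked = new ArrayList<>();
        for (int f : dbFeatures) masked.add(f - r);
        Collections.shuffle(masked, rnd);                // hides database ordering

        // User side: recovered equals queryFeature - r, so each difference
        // |recovered - m| equals |queryFeature - f| for some database feature f.
        int recovered = unblind(serverQuery, c);
        for (int m : masked) {
            System.out.println("distance component: " + Math.abs(recovered - m));
        }
    }
}
```

The point of the sketch is the algebra: the two random masks c and r cancel out of every difference, so the user learns distances to database features without the server ever seeing the raw query, and the shuffle hides which database entry produced which distance.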
2.3.2 Future Scope/Issues
The oblivious image retrieval technique can be implemented using the map reduce
technique; map reduce can be applied at the content based image retrieval stage.
2.4 Distributed Image Retrieval System Based on
MapReduce [4]
A Distributed Image Retrieval System (DIRS) is a system in which images are retrieved
in a content based way, and retrieval over massive image data storage is carried out in
parallel using the MapReduce distributed computing model.
2.4.1 System Model
In this system, images are uploaded to HDFS in one map reduce process, which extracts
image features using the LIRE library and stores the images and their feature vectors in
an HBase table.
Images are also retrieved in parallel. In the feature matching process, the CBIR system
calculates the similarity between the sample image and the target images, and then returns
those images matching the sample image most closely.
Before the MapReduce job is started, the sample image is added to the Distributed Cache
so that every map task can access it. Each map task reads the image from the Distributed
Cache, extracts its visual features, and then compares the feature vectors with those of
the target images from HBase. All matched images are returned to the user.
2.4.2 Future Scope/Issues
The distributed image retrieval technique needs a security mechanism to protect the data.
2.5 Summary
After the literature survey, I found that each paper leaves some future scope. I want to
combine all three future scopes into one application that will upload, search and retrieve
image files from a distributed database for a particular user, and will do so in a secure
manner, eventually maintaining user and data privacy. This will be carried out using
map reduce technology, which will optimize the retrieval time.
Chapter 3
Motivation
With the explosive growth of digital media data, there is a huge demand for new tools
and systems that enable the average user to search, access, process, manage, author and
share digital media content more efficiently and more effectively. This shift has moved
people's interest from pure work toward work combined with entertainment.
Examples:
Business phones to smartphones,
Watching movies/T.V.,
Listening to music,
Social media: connect and share.
There are many types of multimedia content present today. A multimedia information
system can store and retrieve text, 2D grey-scale and color images, 1D time series,
digitized voice or music, and video, and there is a need for an efficient, robust, secure
solution to retrieve this kind of multimedia information.
We consider image retrieval over audio or video retrieval because image retrieval systems
today have many applications. Image retrieval techniques have undergone several changes
and have become more and more advanced over time. In the early years of image retrieval,
images were searched on a text basis, with each image annotated with particular text.
This technique was troublesome because of its complexity. Today, content based image
retrieval provides an extensive search facility over a database, retrieving all images whose
content is similar to that of the queried image.
CBIR technology has been used in several applications such as fingerprint identification,
biodiversity information systems, digital libraries, crime prevention, medicine and
historical research, among others; it covers a wide range of domains.
Today many websites such as google.com and facebook.com hold a large number of images
on their servers. Many users access these websites regularly to upload, search and download
images, so there is a need for a fast and secure technique to upload, search and retrieve
these images on user demand.
3.1 Problem Definition
The purpose of this research work is to create an efficient private content based image
upload, search and retrieval system using map reduce. The system will be built from
easily available APIs with cloud support, so that it can easily be ported to the cloud
for use.
3.2 Scope of Research
The scope of this research is not limited to one particular domain; it spans multiple
domains. The domains in which extensive research can be done are Private Information
Retrieval, Content Based Image Retrieval, and Oblivious Transfer, i.e. security. Research
papers in this area suggest content based image retrieval techniques over cloud computing
as well as secure transfer techniques for images, but no single paper addresses both fast
(cloud-based) and secure retrieval of images from servers using a Private Information
Retrieval technique.
This system can evolve in many ways: by changing the CBIR algorithm, by using different
encryption techniques, or by using different storage methods over HDFS, it can be moved
to a new, more efficient and more secure level.
3.3 Objectives
1. To implement CBIR in Map-Reduce
2. To select and implement homomorphic encryption technique for secure retrieval
3. To evaluate proposed solution
3.4 System Requirement Specifications
Hardware:
At minimum, one desktop system with an Intel or AMD processor and 2 GB of RAM.
Newer hardware will provide better results: a cluster of 5 nodes with high-frequency Intel
Core series processors (i3, i5, i7) and 4 GB of RAM will definitely perform better than a
5-node cluster with older, lower-frequency processors and less RAM. The cluster should
have nodes of identical configuration; if nodes differ in RAM capacity or processor
frequency, the cluster will not perform up to capacity.
Software:
1. Operating System: Any open source Linux operating system.
2. Coding Language: Hadoop map reduce technology with the Java language (JDK)
3. IDE: Eclipse environment
Chapter 4
System Design
4.1 Complete Architecture
Figure 4.1 shows the complete architecture of the system.
Figure 4.1: System Architecture
It provides a web based interface to receive the user's request. After receiving a request,
the system performs the corresponding operation on the images, whether upload or
retrieval. The core of the system is divided into two phases: one uploads images, the
other searches for and retrieves similar images. Both modules are deployed on a Hadoop
cluster, which follows the MapReduce paradigm.
The upload process reads input data from HDFS, processes it using map and reduce
functions, and then writes the results back to HDFS. Similarly, the image retrieval process
uses map and reduce functions to retrieve images similar to the queried image. These
images are delivered to the user on the local file system.
4.1.1 Upload Images
Upload Images is an independent process through which a user uploads his image or photo
information to the system's database. Uploading is carried out with Map/Reduce
techniques, which parallelize the process. Upload consists of three sub-processes: first the
splitting of an image, second feature extraction, and third encryption and storage. The
upload process is parallelized per image with Map Reduce. When a user uploads his
private photos or images to the database, he might want to upload one image or many
images at a single point of time. If he wants to upload many images at once, he only
provides the path of the folder where the images are stored, and the map stage parallelizes
the upload instead of uploading one image at a time. The architecture of the upload
process is shown in figure 4.2.
For better understanding, the upload process is explained in three distinct phases: image
splitting, image feature extraction, and encryption.
Phase 1: An image to be uploaded is first split into 16 small images. This step disperses
the data across nodes so that an attacker cannot trace the data from one location.
Map/Reduce works on its distributed file system, the Hadoop Distributed File System
(HDFS) [18], [19], [13], [16], built specifically for parallelization. All image files are stored
in this file system. Information about all split images of every image is stored in an
HBase [17] table; HBase works on top of HDFS. The table in HBase stores the path of
each image along with its other features.

Figure 4.2: Upload Process
Phase 2: These images are passed to the image processing part. In this phase, the color
histogram of each image is calculated. The histogram provides feature values such as
entropy, intensity, mean, median, etc. These values are used for the similarity
calculation [11] between the queried image and the images from HDFS.
Phase 3: The calculated features are given to the homomorphic encryption system, which
stores the vector in the HBase table in encrypted form. The homomorphic encryption
system allows additive and multiplicative operations to be carried out on encrypted data
without changing the original data; i.e., multiplying all 16 encrypted feature vectors of a
particular image yields the sum of that image's feature vectors in plaintext form.
The advantage of this appears at retrieval time. When the user queries the system, the
encrypted feature vector of the queried image is used against the combined feature vector
of all split images of each particular image.
The algorithm used for this encryption is the Paillier cryptosystem. It uses the user's
private and public key pair. The feature vector of an image is encrypted with the user's
public key and entered into HDFS. The user's public key is available to the database
server, whereas the private and public keys are both held by the user. The encrypted
feature vectors are stored in files on HDFS; each split image has a corresponding feature
vector, stored in the same row along with the encrypted image name.
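A minimal, textbook-style sketch of the Paillier operations described above is given below (with small 64-bit primes for illustration only, not production key sizes); it demonstrates the property used here, namely that multiplying two ciphertexts yields an encryption of the sum of the plaintexts.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Minimal Paillier sketch (textbook parameters, not production-ready).
public class PaillierDemo {
    final BigInteger n, nsquare, g, lambda, mu;

    PaillierDemo(int bits) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        nsquare = n.multiply(n);
        g = n.add(BigInteger.ONE);                       // g = n + 1 is a common choice
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = lambda.modInverse(n);                       // valid for g = n + 1
    }

    BigInteger encrypt(BigInteger m) {
        SecureRandom rnd = new SecureRandom();
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), rnd); }
        while (r.compareTo(n) >= 0 || !r.gcd(n).equals(BigInteger.ONE));
        // c = g^m * r^n mod n^2
        return g.modPow(m, nsquare).multiply(r.modPow(n, nsquare)).mod(nsquare);
    }

    BigInteger decrypt(BigInteger c) {
        // m = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n
        BigInteger u = c.modPow(lambda, nsquare);
        return u.subtract(BigInteger.ONE).divide(n).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(64);
        BigInteger a = BigInteger.valueOf(120);          // e.g. mean red value of one split
        BigInteger b = BigInteger.valueOf(95);           // mean red value of another split
        // Multiplying ciphertexts adds the plaintexts: the property the system relies on.
        BigInteger product = ph.encrypt(a).multiply(ph.encrypt(b)).mod(ph.nsquare);
        System.out.println(ph.decrypt(product));         // 215
    }
}
```

In the system, the same multiplication is applied to all 16 encrypted split-image vectors, yielding the encrypted sum that the retrieval map stage emits.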
4.1.2 Image Retrieval
Figure 4.3: Image Retrieval Process
This phase of the system provides the user a GUI to search and retrieve images. Here the
user uploads one image for which he wants to find similar images. The processes of the
upload phase are also carried out on this image: the feature vector of each split is
calculated from the color histogram, and all feature vectors of the split images are
encrypted with the public key of the user. These encrypted feature vectors of the queried
image are sent to the server. The procedure of searching, comparing and retrieving images
is also one Map Reduce process. In this process, when the user uploads an image, the
encrypted feature vectors of the queried image, i.e. the feature vectors of all its split image
files, go to the Map process along with all image feature vectors in the HBase table.
Map Stage: The map process multiplies all feature vectors of a particular image and emits
the total feature vector of that image, in parallel for all images present in the system. One
mapper is created per image, emitting the total feature vector after combining the 16
split vectors.
Reduce Stage: The vectors of all images are collected in the reduce step, and each vector
is compared with the feature vector of the queried image. Comparison is done on the
basis of similarity measures, calculated from standard formulas such as Euclidean distance
or the Pearson correlation coefficient [11]. In the proposed protocol, the Euclidean
distance between the color histograms of the images is calculated.
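The Euclidean distance applied in the reduce stage is the standard formula; the following small sketch uses hypothetical RGB-mean vectors for the query and one stored image.

```java
public class EuclideanDistance {
    // Euclidean distance between two feature vectors (e.g. color-histogram means):
    // the square root of the sum of squared per-component differences.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] queryMeans  = {120.0, 95.5, 60.2}; // hypothetical RGB means of the query
        double[] storedMeans = {118.0, 97.5, 59.2}; // hypothetical RGB means of a stored image
        System.out.println(distance(queryMeans, storedMeans)); // 3.0
    }
}
```

A smaller distance means a closer match, so the reducer can rank database images by this value.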
Chapter 5
Implementation of System
A graphical user interface (GUI) is the most important part of any application: it allows
the user to interact with the system. The success of an application depends on how well
it is designed and how simple it is to use. Although this application was designed bottom
up, i.e. the main Map-Reduce code first and the GUI last, I will introduce the GUI first,
because when any user starts the application, he or she interacts with the GUI only.
The complete application, including the GUI, was created in Eclipse [20], an open source
integrated development environment (IDE) that helps develop applications easily. This
IDE comes with integrated J2EE, or Java EE [21], the Java Platform, Enterprise Edition,
which is used to build enterprise web applications. Java EE includes several API
specifications, such as JDBC, Enterprise JavaBeans, Connectors, servlets, JavaServer
Pages (JSPs) [12] and several web service technologies. This allows developers to create
enterprise applications that are portable and scalable, and that integrate with legacy
technologies [21].
I created the GUI with JSPs, which enable web developers and designers to rapidly
develop and easily maintain information-rich, dynamic web pages. Java code can be
integrated with HTML in JSPs. JSP technology separates the user interface from content
generation, enabling designers to change the overall page layout without altering the
underlying dynamic content [22]. In this application, the JSPs contain all the GUI code
and the Java code contains all the business logic of the application. All map reduce
processes can run only through Java code, as there are specific commands that launch a
Map Reduce process.
The application is designed so that, if any process required for a MapReduce program to
run is not present, it automatically restarts all processes in the system and makes sure
they are up and running. This is done only for the local deployment of the Hadoop
cluster, because when the server is restarted all data is formatted. In an actual industry
deployment this case will not arise, because all processes will always be running.
5.1 Log In Screen
As the title suggests, this application is private content based image information retrieval,
so every user has a user id and password with which he or she logs in to his or her account
to upload or retrieve images. The Log In screen lets the user maintain privacy: databases
are often outsourced by companies, in which case the data of many users can be present
on the same server/cluster. This technique ensures that a user can see only his own data
and not that of others.
Figure 5.1: LogIn Screen
5.2 Create Account
The create account form appears after clicking the "Create Account" link from sign up.
To access the application, a user has to create an account with the web site; it is much
like a sign-up form.
Figure 5.2: Create Account Screen
The user account is created from the user's email id. Validation is done on the email id
field: if the user enters a wrong email id, the form asks for a correct one again. After a
valid email id, the user account is created along with security keys, and the user is notified
of the User ID to log in with. This User ID creates his account, i.e. a folder in HDFS.
Login information, i.e. user ids and passwords, is stored in an HBase table, which is also
part of Hadoop. HBase is a scalable database that allows tables to be created and data
stored in them, and it can expand at run time.
Figure 5.3: Account Creation Screen - Validation
After successful creation of the account, the user gets a message that the account has been
created and is ready to use. The user then goes to the LogIn screen to log in.
5.3 User Operation Screen
After logging in, the user sees a screen with three links: one to log out, one to upload
images, and one to retrieve images similar to one particular image from the database.
Figure 5.4: User Operations Screen
5.3.1 Changes in Hadoop
The conventional setup of Hadoop does not allow a process to write output files to an
existing output path. A user of this system will not upload files only once; he will use
the system any time, anywhere, so the system must upload a user's images to one
particular location. Whenever the upload map reduce process runs, all encrypted feature
vectors must go to the same folder belonging to that user, rather than a newly allocated
folder each time. To remove this limitation, I modified the Hadoop code so that the
output of every map reduce process goes into the user-specific folder and not to any
other location.
Modified Files: FileOutputFormat.java, MultipleOutputFormat.java,
MultipleTextOutputFormat.java
5.3.2 Upload Process
When the user selects the upload images link, he is directed to a page where he can specify
the folder path of the images. All images present in the folder are first uploaded to HDFS
and then used in the upload Map Reduce. At each upload, a Map-Reduce process runs:
it splits an image into 16 small images and calculates the feature vector of each small
image, i.e. the color moment of each split image.
This feature vector is a color moment of an image calculated from its color histogram. A
color histogram identifies the proportion of pixels within an image holding specific values,
which can be used to find the similarity between two images using similarity distance
measures [8], [11]. It identifies color proportion by region and is independent of image
size, format or orientation [6]. Color moments comprise the first order moment (mean),
the second (variance) and the third (skewness) of an image [8]. All moments are encrypted
with the public key of the user and then stored in a file. Whenever the user uploads
images to the database, they go into the same folder.
Hadoop does not normally support using the same output folder in different Map Reduce
processes, so I modified Hadoop so that in each upload Map-Reduce the files go into the
same folder.
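The three color moments can be computed per channel as in the following sketch (skewness taken here as the cube root of the third central moment); the sample red-channel values are hypothetical.

```java
public class ColorMoments {
    // First three color moments of one channel: mean, standard deviation, skewness.
    static double[] moments(double[] channel) {
        int n = channel.length;
        double mean = 0;
        for (double v : channel) mean += v;
        mean /= n;
        double m2 = 0, m3 = 0;
        for (double v : channel) {
            double d = v - mean;
            m2 += d * d;       // accumulates the second central moment
            m3 += d * d * d;   // accumulates the third central moment
        }
        m2 /= n;
        m3 /= n;
        double std = Math.sqrt(m2);
        double skew = Math.cbrt(m3);   // cube root of the third central moment
        return new double[]{mean, std, skew};
    }

    public static void main(String[] args) {
        double[] redChannel = {10, 20, 30, 40}; // hypothetical red values of one split image
        double[] m = moments(redChannel);
        System.out.println(m[0] + " " + m[1] + " " + m[2]);
    }
}
```

The same computation is repeated for the green and blue channels, giving the nine-value moment vector per split image.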
Figure 5.5: Upload Screen
A Map Reduce process takes some time to complete; without a proper message, the user
might think the images are already uploaded and shut down the process. So the user gets
a message showing that the files are being uploaded, and while the map reduce job
executes he sees the following screen until the process completes.
Figure 5.6: Uploading Screen
There is also a chance of file name collisions: files whose names match files already present
in HDFS are renamed at upload time. First, all of the user's file names are fetched from
HDFS and collected into one string; the name of the image about to be uploaded is
matched against this string, and if the string contains the image name, the file is renamed.
The program then checks again whether the generated name is present in the string, and
if so, the file is renamed again.
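The renaming loop described above can be sketched as follows, with a plain `Set` standing in for the list of file names fetched from HDFS; the method name `uniqueName` and the suffix scheme are mine, not from the application code.

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueName {
    // Keep appending a numeric suffix until the candidate no longer clashes
    // with names already stored (in the system, names already in HDFS).
    static String uniqueName(String name, Set<String> existing) {
        String candidate = name;
        int suffix = 1;
        while (existing.contains(candidate)) {
            int dot = name.lastIndexOf('.');
            String base = (dot < 0) ? name : name.substring(0, dot);
            String ext = (dot < 0) ? "" : name.substring(dot);
            candidate = base + "_" + suffix + ext;
            suffix++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>();
        stored.add("beach.jpg");
        stored.add("beach_1.jpg");
        System.out.println(uniqueName("beach.jpg", stored)); // beach_2.jpg
    }
}
```

The re-check after renaming matches the text above: a generated name that itself collides is renamed again before the upload proceeds.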
Map Reduce Algorithm
The map reduce technique is used to create the feature vectors of images. The map task
splits the input image file into small chunks of fixed size, calculates each chunk's feature
vector, encrypts it using the homomorphic encryption technique and stores it in the
database. The reducer gets sorted input from the mapper, combines all split encrypted
feature vectors, and emits the image id and the combined list.
Algorithm 5.1 Upload Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by the name/feature key, the execution
framework groups vectors by name along with the name/feature key, and the reducers
write names and encrypted feature values to files.

imgs ← new BufferedImage[]
random ← new BigInteger
n, g, r, nsquare ← new BigInteger
dMeanR, dMeanG, dMeanB ← new double

1: procedure printPixelARGB(int pixel)
2:     dMeanR ← dMeanR + red pixel value
3:     dMeanG ← dMeanG + green pixel value
4:     dMeanB ← dMeanB + blue pixel value

1: procedure Encryption(int meanVal)
2:     return BigInteger encrypted(meanVal)

1: class Mapper
2:     procedure Map(Imgid iId, Img im)
3:         m0, m1, m2 ← new BigInteger
4:         imgs ← splitImages
5:         for all img i ∈ imgs[count] do
6:             for all pixel p ∈ img i do
7:                 printPixelARGB(p)
8:             m0 ← Encryption(dMeanR)
9:             m1 ← Encryption(dMeanG)
10:            m2 ← Encryption(dMeanB)
11:            Emit(Imgid iId, encrypted vals ⟨iId i, m0, m1, m2⟩)

1: class Reducer
2:     procedure Reduce(Imgid iId, encrypted vals [⟨iId 0, m0, m1, m2⟩, ⟨iId 1, m0, m1, m2⟩, ...])
3:         iIdL ← new List
4:         for all encrypted vals ⟨iId i, m⟩ ∈ encrypted vals do
5:             Append(iIdL, ⟨iId i, m⟩)
6:         Emit(Imgid iId, iIdL)
The following figure shows the working of the Map Reduce process in Eclipse, i.e. behind the screen.
Figure 5.7: Process behind Upload
5.3.3 Image Retrieval
When the user clicks the search images link, he is directed to a page where he can specify
the file path of an image, on the basis of which images are to be retrieved from the
database. The comparison is done between the features of the images stored in HDFS and
the features of the image in the search process. In the upload process, feature vectors are
calculated and stored in HDFS; these feature vectors are used for the comparison.
Figure 5.8: Search Screen
When the user selects an image, it passes through several steps similar to those of the
upload process, i.e. image feature calculation and encryption. The image is first split into
16 parts, as in the upload phase; this keeps the loss of pixels the same in both cases. The
features are encrypted with the same public key of the user, so at the similarity matching
phase the features match exactly.
At the upload stage, the features of the split images are stored in HDFS, so at similarity
measurement time they must be combined back into one feature vector. Here the Paillier
cryptosystem comes into the picture: Paillier allows addition and multiplication
operations to be performed on encrypted data without changing the original data.
The similarity distance is calculated between the stored image feature vectors and the
feature vector of the image being searched for. A similarity distance is a mathematical
formula that gives the relation between images; different measures can be used for the
calculation [8], [11]. Euclidean distance measurement is used here for the similarity
between images. The distance is calculated between the means and indicates how similar
two images are. This comparison is done in a map reduce job to make it parallel.
While the map reduce process is running, the user sees the following screen.
Figure 5.9: Searching Screen
The background processes can be seen in the following screen.
Figure 5.10: Processes behind Search Screen
Map Reduce Algorithm
Algorithm 5.2 Image Retrieval Map Reduce Process Algorithm

Mappers emit encrypted feature values keyed by opName featureId, and the execution
framework groups values by that key. The combiner receives the encrypted feature vector
of the queried image, calculates the difference between the combined values from the
datastore and the queried image's values, and emits it. The reducers sum the received
differences and write names and the square roots of the values to files.

upload img m1 ← new BigInteger
upload img m2 ← new BigInteger
upload img m3 ← new BigInteger

1: class Mapper
2:     procedure Map(LongWritable key, Text value)
3:         op name ← new String
4:         m1, m2, m3 ← new BigInteger
5:         line ← line from file
6:         arr tokens ← tokenizer(line)
7:         while arr tokens has tokens do
8:             m1 ← arr tokens.nextToken
9:             m2 ← arr tokens.nextToken
10:            m3 ← arr tokens.nextToken
11:            Emit(opName m1, encrypted vals m1)
12:            Emit(opName m2, encrypted vals m2)
13:            Emit(opName m3, encrypted vals m3)

1: class Combiner
2:     procedure Configure(JobConf conf)
3:         upload img m1 ← conf.mean of upload1
4:         upload img m2 ← conf.mean of upload2
5:         upload img m3 ← conf.mean of upload3
6:     procedure Reduce(Text key, values)
7:         subtraction of features ← new Integer
8:         sum ← 0
9:         for all va ∈ values do
10:            sum ← sum + va
11:        if key contains m1 then
12:            subtraction of features ← sum − upload img m1
13:        if key contains m2 then
14:            subtraction of features ← sum − upload img m2
15:        if key contains m3 then
16:            subtraction of features ← sum − upload img m3
17:        Emit(opName, subtracted encrypted vals subtraction of features)

1: class Reducer
2:     procedure Reduce(Text key, values)
3:         final root ← new Double
4:         sum ← 0
5:         for all va ∈ values do
6:             sum ← sum + va
7:         final root ← sqrt(sum)
8:         Emit(opName, subtracted encrypted vals final root)
Chapter 6
Experiments, Testing and Result Analysis
6.1 Cluster Configuration
The application is web based, so a web server is needed to run it. Apache Tomcat is an
open source web server and servlet container developed by the Apache Software
Foundation (ASF). Tomcat implements the Java Servlet and JavaServer Pages (JSP)
specifications from Sun Microsystems, and provides a "pure Java" HTTP web server
environment for Java code to run [23], [24].
Initially, Hadoop was installed on a single system, i.e. a single node cluster, which was
used mainly for application development. After development was completed, the single
node cluster was gradually increased to two, three, four and then five nodes.
In a single node cluster, the master node and slave node are the same, i.e. only one node,
whereas a multi node cluster has one master and several slaves (more master nodes are
possible). Each node of the cluster has Hadoop's Map Reduce installed, but HBase is
limited to the master only and is used to store users' login credentials.
The configuration of each node and the role it plays in the cluster is shown in Table 6.1.
The master node runs all the major processes: the Namenode, Datanode, Jobtracker,
Tasktracker and secondary namenode processes, and the HMaster process of HBase.
Slave nodes run only the Datanode and Tasktracker processes. One more small cluster
of 3 nodes was created just to check performance.
Nodes | Description         | Role                         | Memory (GB) | CPU                   | CPU Freq. (GHz) | Disk (GB) | OS
------|---------------------|------------------------------|-------------|-----------------------|-----------------|-----------|-------------
1     | Laptop              | Single Node Cluster          | 4           | Intel Core i3-370M    | 2.4             | 40        | Ubuntu 12.04
1     | Desktop Workstation | 1st Node of Cluster (Master) | 4           | Dual Core AMD Opteron | 1.8             | 48        | Ubuntu 12.04
2     | Desktop Workstation | 2nd Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 2.6             | 50        | Ubuntu 12.04
3     | Desktop Workstation | 3rd Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 2.0             | 60        | Ubuntu 12.04
4     | Desktop Workstation | 4th Node of Cluster (Slave)  | 4           | Dual Core AMD Opteron | 1.8             | 150       | Ubuntu 12.04
5     | Desktop Workstation | 5th Node of Cluster (Slave)  | 4           | Dual Core AMD Opteron | 2.0             | 50        | Ubuntu 12.04
6     | Desktop Workstation | 6th Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 1.8             | 150       | Ubuntu 12.04

Table 6.1: Cluster Configuration
Nodes | Description | Role               | Memory (GB) | CPU     | CPU Freq. (GHz) | Disk (GB) | OS
------|-------------|--------------------|-------------|---------|-----------------|-----------|-------------
3     | Desktop     | 1 Master, 2 Slaves | 4           | Core i5 | 3.10            |           | Ubuntu 12.04

Table 6.2: Second Cluster Configuration
6.2 Experiments and Testing
Experiments require a database, in this case an image database. Initial experiments were
done on static data: when the application's main processes were developed, i.e. Map
Reduce for upload and retrieval, they were tested on an image data set of 10 to 15 files.
First, the feature vectors of these images were created through Java code and uploaded
to HDFS, and that data set was used in the processes. Then code was written to
automatically upload files to HDFS and perform all operations on the encrypted image
files directly.
Experiments were carried out at the upload and retrieval stages with a repository of
images, stored in data sets of sizes 10, 20, ... up to 100, then 200, 300, ... 1000, and
2000, ... 5000. Many of the images were repeated; all had the same file format (.jpg) and
the same size, 640 by 480 pixels.
Experiments were repeated with a set of 25 images of different formats, such as .jpg, .gif,
.bmp and .png, and of different sizes. These processes take time to complete, and these
timings are used to measure the performance of the system, which is calculated at both
the upload and retrieval stages.
6.3 Result Analysis
At the start of testing, experiments were done on two machines, both having the same
single node cluster setup of Hadoop but different processors, speeds, RAM, etc. Readings
of the upload and retrieval processes were taken from both machines for different sets of
images. The observations show that a map reduce process depends on the processor
version, its speed/frequency, and the RAM present in the system; the following results
highlight this fact. The results in the figures compare the performance of the map reduce
upload and retrieval processes on one node clusters installed on a laptop and on a desktop,
whose configurations are given in the table above. Figure 6.1 shows the total time taken
to upload x images to the system, where the horizontal axis is the number of images x
and the vertical axis is time in milliseconds.
Figure 6.1: One Node Cluster Difference at Upload MapReduce
Figure 6.2 shows the same experiment for the retrieval image map reduce process, with
the same pattern of observations as the upload process. Retrieval takes less time than
upload, since the upload process involves more computation.
Figure 6.2: One Node Cluster Difference at Retrieval MapReduce
Similarly, the difference between the two three-node clusters was also measured, as seen
in the following figures; the configuration of the second cluster is shown in Table 6.2.
Comparing the two three-node clusters, we can clearly see that processor version, speed
and memory availability directly affect the Map Reduce process. Figure 6.3 shows the
change in the upload Map Reduce.
Figure 6.3: Three Node Cluster Difference at Upload MapReduce
Figure 6.4 shows the change in the retrieval Map Reduce.
Figure 6.4: Three Node Cluster Difference at Retrieval MapReduce
After taking results on a one node cluster, the cluster size was gradually increased to
two, three, four, five and six nodes. Results were taken at each cluster size and are
displayed in Figure 6.5.
Figure 6.5: Performance of Upload MapReduce at Cluster
Figure 6.6 shows the same experiment for the retrieval MapReduce process.
Figure 6.6: Performance of Retrieval MapReduce at Cluster
Readings were taken by running the processes on gradually increasing cluster sizes. As before, retrieval takes less time than upload because the upload process involves more computation than retrieval.
As the figures show, the readings at cluster size 3 are lower than at cluster size 4, although the 4-node readings should have been lower. This happened because of the addition of a bad node: the node that was added had only 2 GB of RAM, a slower processor, and slow LAN connectivity, so the readings increased instead of decreasing. All other readings decrease gradually as the cluster size is increased.
Figures 6.7 and 6.8 show the scale-up of the time spent by each process. These readings are the differences between the times measured on consecutive cluster sizes, again in milliseconds. A negative value marks the bad-node entry, i.e. the increase in cluster size from 3 to 4.
Figure 6.7: Scale Up between Different Cluster Nodes
Figure 6.8: Scale Up between Different Cluster Nodes
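The scale-up values plotted in figures 6.7 and 6.8 are simply the differences between job times on consecutive cluster sizes, with a negative value flagging the bad-node step. A minimal sketch of that calculation follows; the timing values used here are illustrative, not the measured readings:

```java
public class ScaleUp {
    // Difference between consecutive readings; times[i] is the
    // job time in milliseconds on a cluster of i+1 nodes.
    // A positive entry means the larger cluster was faster,
    // a negative entry means it was slower (the bad-node case).
    static long[] scaleUp(long[] times) {
        long[] diff = new long[times.length - 1];
        for (int i = 1; i < times.length; i++) {
            diff[i - 1] = times[i - 1] - times[i];
        }
        return diff;
    }

    public static void main(String[] args) {
        // Illustrative times for 1..6 nodes; the 4-node value is worse
        // than the 3-node one, mimicking the bad node in the experiment.
        long[] times = {9000, 7000, 5500, 6000, 5000, 4500};
        for (long d : scaleUp(times)) System.out.println(d);
    }
}
```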
6.4 Error Check in Application
The following screens show the different error-checking schemes used throughout the application life cycle.
Figure 6.9: Error Checking in Email
This error check is performed on the upload screen, when the user types a path to a folder that does not exist.
Figure 6.10: Error Check at Upload Images
This error check is performed on the search screen, when the image file path typed by the user, on the basis of which the search is to be performed, does not exist.
Figure 6.11: Error check at Search Images
6.5 Comparison of Different Systems
The following table compares different commercially available image retrieval systems; some are applications and some are user libraries. The comparison shows the differences between the available systems and my system on some major criteria.
Figure 6.12: Comparison of Different Systems
Chapter 7
System Output
7.1 Input Images
The following figure shows an example of images uploaded by a system user. It contains images of different formats and variable sizes, with formats such as .jpg, .png, .gif and .bmp. The system does not support upload and retrieval of images with the .tiff extension.
The following is the set of images uploaded to the system.
Figure 7.1: Input set of Images Uploaded
The image shown next is the query image on the basis of which similar images are retrieved.
Figure 7.2: Input Search Image
The following is the output of the system. Depending upon the mean value of the image, all similar images are retrieved from the database: the system outputs all images whose mean values lie close to the mean of the input image, with a variation of five from the input mean allowed.
Figure 7.3: Output of System
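The mean-based matching described above can be sketched as follows. The pixel arrays and method names here are illustrative assumptions, as is treating the stated variation of five as a simple absolute tolerance on the mean; this is a sketch of the idea, not the system's actual code:

```java
public class MeanSimilarity {
    static final double TOLERANCE = 5.0; // allowed variation from the query mean

    // Mean of pixel intensity values (0-255).
    static double mean(int[] pixels) {
        long sum = 0;
        for (int p : pixels) sum += p;
        return (double) sum / pixels.length;
    }

    // A stored image matches when its mean lies within TOLERANCE of the query mean.
    static boolean isSimilar(double queryMean, double storedMean) {
        return Math.abs(queryMean - storedMean) <= TOLERANCE;
    }

    public static void main(String[] args) {
        double query = mean(new int[]{100, 110, 120});                        // mean 110.0
        System.out.println(isSimilar(query, mean(new int[]{104, 114, 124}))); // true  (mean 114.0)
        System.out.println(isSimilar(query, mean(new int[]{10, 20, 30})));    // false (mean 20.0)
    }
}
```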
Chapter 8
Conclusions and Future Scope
8.1 Conclusion
We experimented with and evaluated the proposed system on three aspects: its similarity retrieval, its execution time complexity, and its security.
On the similarity measurement front, the application successfully uploads, searches and retrieves similar images with a precision of nearly 89 to 94% for image data of the same format and the same dimensions. Varying the dimensions reduces the similarity precision by about 20%, but the application is not affected by format changes and delivers the same high similarity.
For execution time complexity, the application was tested on desktop systems with different configurations. Systems with newer processors and more memory deliver higher performance, improving time efficiency by nearly 50%. The throughput of the system increased with the use of recent multi-core processors running at high frequency, and main memory also played a crucial role in increasing performance.
The Paillier cryptosystem used in this application is highly secure. It is used to achieve user privacy and database privacy with respect to each other. The system achieves high security because the retrieval process is carried out entirely on encrypted data, secured with the user's random key, so it does not lag on the security front. It could be made more secure still by using different keys provided by secure key management servers.
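The property that makes retrieval over encrypted data possible is Paillier's additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The following sketch illustrates this with the textbook scheme (g = n + 1) at a toy key size; the thesis does not specify its key generation or parameters, so none of this should be read as the application's actual configuration:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierDemo {
    final BigInteger n, n2, lambda, mu;
    final BigInteger g; // standard choice g = n + 1
    final SecureRandom rnd = new SecureRandom();

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        n2 = n.multiply(n);
        g = n.add(BigInteger.ONE);
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.divide(p1.gcd(q1)).multiply(q1); // lcm(p-1, q-1)
        mu = lambda.modInverse(n);                   // valid because g = n + 1
    }

    // c = g^m * r^n mod n^2, with random r coprime to n.
    BigInteger encrypt(BigInteger m) {
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), rnd); }
        while (r.signum() == 0 || r.compareTo(n) >= 0
               || !r.gcd(n).equals(BigInteger.ONE));
        return g.modPow(m, n2).multiply(r.modPow(n, n2)).mod(n2);
    }

    // m = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n.
    BigInteger decrypt(BigInteger c) {
        BigInteger u = c.modPow(lambda, n2);
        return u.subtract(BigInteger.ONE).divide(n).multiply(mu).mod(n);
    }

    // Multiplying ciphertexts adds the plaintexts: a server can combine
    // encrypted values without ever seeing them.
    BigInteger addCiphertexts(BigInteger c1, BigInteger c2) {
        return c1.multiply(c2).mod(n2);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(64);
        BigInteger m1 = BigInteger.valueOf(120), m2 = BigInteger.valueOf(17);
        BigInteger sum = ph.decrypt(ph.addCiphertexts(ph.encrypt(m1), ph.encrypt(m2)));
        System.out.println(sum); // 137
    }
}
```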
Although the system has produced satisfactory results, we consider this only a first step towards providing internet and cloud users with a private CBIR system, and it opens avenues for further research and development.
8.2 Future Scope
This application is designed to span three different domains, but it does not draw on the detailed research of any one of them, so there is ample scope to improve it from each domain's perspective. The application can be evolved in the Private Information Retrieval, Content Based Image Retrieval and oblivious transfer (i.e. security) domains.
Different homomorphic encryption techniques can be used, the similarity measurement can be changed to improve accuracy, and different feature extraction techniques such as Tamura features and shape features can be used alongside the color histogram to increase the accuracy of similarity measurement.
This application has large scope in the cloud and internet domains, because servers across every domain are adopting cloud technology and want to offer their customers cutting-edge imaging applications that are secure, efficient, and deliver high throughput.
Appendix A
Publication Status
Title:   Private Image Information Retrieval using Map/Reduce on Cloud
Journal: International Journal of Advances in Management, Technology and Engineering Sciences [ISSN 2249-7455]
Status:  Published