Private Content Based Image Information Retrieval using Map-Reduce
A Dissertation
submitted in fulfillment for the award of
the degree
Master of Technology,
Computer Engineering
Submitted by
Arpit D. Dongaonkar
MIS No: 121122006
Under the guidance of
Prof. Sunil B. Mane
College of Engineering, Pune
DEPARTMENT OF COMPUTER ENGINEERING AND INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE-5
June, 2013
DEPARTMENT OF COMPUTER ENGINEERING
AND
INFORMATION TECHNOLOGY,
COLLEGE OF ENGINEERING, PUNE
CERTIFICATE
This is to certify that the dissertation entitled
“Private Content Based Image Information Retrieval using
Map-Reduce”
has been successfully completed
by
Mr. Arpit Dilip Dongaonkar
MIS No. 121122006
as a fulfillment of End Semester evaluation of
Master of Technology.
SIGNATURE
Prof. Sunil B. Mane
Project Guide
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.

SIGNATURE
Dr. J. V. Aghav
Head
Dept. of Computer Engineering and Information Technology,
College of Engineering Pune,
Shivajinagar, Pune - 5.
Dedicated to
My Mother
Sunita D. Dongaonkar,
My Father
Prof. Dilip R. Dongaonkar
for their love and everything, that they have given me.......
and also to
All family members
for supporting me.......
Acknowledgements
I would like to express my sincere gratitude towards Prof. Sunil B. Mane, my research guide, for his patient guidance, enthusiastic encouragement and useful critiques of this research work. Without his invaluable guidance, this work would never have been successful.
I would also like to thank Vikas Jadhav, my classmate, for continuous and detailed knowledge sharing on Hadoop and related topics, from installation to operation. I would also like to thank all my M.Tech friends for their unconditional support and help.
Last but not least, my sincere thanks to all my teachers who directly or indirectly helped me to learn new technologies and complete my post-graduation successfully.
Arpit Dilip Dongaonkar
College of Engineering, Pune
Abstract
Today, the role of information technology is changing, and this change is driving the development of real-life software applications: information technology has entered every part of our day-to-day life, from traffic signals and power grids to everyday financial transactions. But this technology comes at a cost. As it touches every layer of life and community, there is a need to make it available to every layer of the community.
Cloud computing is an emerging technology that reduces this cost while providing an efficient information technology platform and service. The cloud reduces the load on local disks, operating systems and, eventually, on CPU processing. Cloud providers offer infrastructure, platforms, databases etc. on which users can install an operating system, install required software and store their data. But this service comes at a price: cloud providers charge on a pay-per-hour basis, so any application deployed on the cloud must be fast, efficient and secure.
As the Internet expands, the amount of data created over it grows exponentially, and a large part of this data consists of images. Today, huge quantities of varied image data are produced through digital cameras, mobile phones, photo-editing software, etc. Many of these images are private to a particular user, and such digital image data should be secured so that no one but the owner can access it.
Some domains use content-based image retrieval applications. These applications should be fast enough to carry out user functionality efficiently, and secure enough to protect user data. If an application is to be deployed on the cloud, it should also be supported by the cloud's structure. One technology that is supported by the cloud and performs distributed parallel computing is "Map Reduce".
I propose a system that allows a user to upload images and to search for and retrieve some of his personal images based on one particular query image, securely and efficiently, from his continuously expanding image database. The system incorporates an encryption technique for security and the Map-Reduce technique for efficient upload, search and retrieval over a large dataset of user images.
The proposed solution is a system to securely upload, search and query images in massive image storage using a content-based image retrieval scheme built on the Map-Reduce technique, together with an evaluation comparing the proposed solution with existing solutions.
Keywords - Information technology, Cloud computing, secured personal data.
Contents

List of Tables
List of Figures

1 INTRODUCTION
1.1 Private Information Retrieval
1.2 Image Retrieval
1.2.1 Content Based Image Retrieval
1.3 Similarity Measurement
1.3.1 Euclidean Distance
1.3.2 Mahalanobis Distance
1.3.3 Chord Distance
1.4 Hadoop
1.4.1 MapReduce
1.4.2 HDFS
1.4.3 HBase
1.5 Oblivious Technique

2 Literature Survey
2.1 PIRMAP [1]
2.1.1 PIR
2.1.2 MapReduce in PIRMAP
2.1.3 Working of PIRMAP
2.1.4 Future Scope/Issues
2.2 Map/Reduce in CBIR Application [2]
2.2.1 System Model
2.2.2 Future Scope/Issues
2.3 An Oblivious Image Retrieval Protocol [3]
2.3.1 System Model
2.3.2 Future Scope/Issues
2.4 Distributed Image Retrieval System Based on MapReduce [4]
2.4.1 System Model
2.4.2 Future Scope/Issues
2.5 Summary

3 Motivation
3.1 Problem Definition
3.2 Scope of Research
3.3 Objectives
3.4 System Requirement Specifications

4 System Design
4.1 Complete Architecture
4.1.1 Upload Images
4.1.2 Image Retrieval

5 Implementation of System
5.1 Log In Screen
5.2 Create Account
5.3 User Operation Screen
5.3.1 Changes in Hadoop
5.3.2 Upload Process
5.3.3 Image Retrieval

6 Experiments, Testing and Result Analysis
6.1 Cluster Configuration
6.2 Experiments and Testing
6.3 Result Analysis
6.4 Error Check in Application
6.5 Comparison of Different Systems

7 System Output
7.1 Input Images

8 Conclusions and Future Scope
8.1 Conclusion
8.2 Future Scope

A Publication Status
List of Tables

6.1 Cluster Configuration
6.2 Second Cluster Configuration
List of Figures

1.1 Map Reduce Process
4.1 System Architecture
4.2 Upload Process
4.3 Image Retrieval Process
5.1 LogIn Screen
5.2 Create Account Screen
5.3 Account Creation Screen - Validation
5.4 User Operations Screen
5.5 Upload Screen
5.6 Uploading Screen
5.7 Process behind Upload
5.8 Search Screen
5.9 Searching Screen
5.10 Process behind Search Screen
6.1 One Node Cluster Difference at Upload MapReduce
6.2 One Node Cluster Difference at Retrieval MapReduce
6.3 Three Node Cluster Difference at Upload MapReduce
6.4 Three Node Cluster Difference at Retrieval MapReduce
6.5 Performance of Upload MapReduce at Cluster
6.6 Performance of Retrieval MapReduce at Cluster
6.7 Scale Up between Different Cluster Nodes
6.8 Scale Up between Different Cluster Nodes
6.9 Error Checking in Email
6.10 Error Check at Upload Images
6.11 Error Check at Search Images
6.12 Error Check at Search Images
7.1 Input Set of Images Uploaded
7.2 Input Search Image
7.3 Output of System
Chapter 1
INTRODUCTION
1.1 Private Information Retrieval
Today, most data on the Internet resides in database servers connected to it. When a user queries a database, he/she should receive only the requested information. However, many curious users query databases, intentionally or unintentionally, to obtain the private information of other users. A lot of research is devoted to protecting databases from such curious users. Private information retrieval is a research area that addresses security against this type of intrusion.
A private information retrieval (PIR) protocol allows a user to retrieve an item from a
server in possession of a database without revealing which item he/she is retrieving.
Over time, different improvements to PIR have been proposed that make it faster, more efficient and more secure. Techniques such as oblivious transfer have also been implemented in PIR; oblivious transfer restricts the user from learning anything about other database items [3]. PIR also maintains the privacy of the queries made to the database.
1.2 Image Retrieval
An image retrieval system is designed to browse, search and retrieve images from a large database of digital images. Most traditional and common methods of image retrieval add metadata such as captions, keywords or descriptions to the images, so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, a large amount of research has been done on automatic image annotation [2]. Annotated images are then searched based on the images' metadata.
Image meta search: search for images based on associated metadata such as keywords, text, etc.
Content-based image retrieval (CBIR): the application of computer vision to image retrieval. CBIR aims to avoid the use of textual descriptions and instead retrieves images based on similarities in their content (textures, colors, shapes etc.) to a user-supplied query image or user-specified image features.
1.2.1 Content Based Image Retrieval
Image retrieval has been an active research area since the 1970s. In the beginning, research concentrated on text-based search only. It was a new framework at that time, which used the names of image files as the search criterion. In this framework, images first had to be annotated, and were then retrieved through a database management system [5]. This framework had two limitations: first, the size of image data that can fit into a database, and second, the extensive manual annotation work. There was a need for retrieval research that would not be limited by manual entry and large data sizes. This led to the content-based image retrieval technique.
Content-based image retrieval (CBIR) is also known as query by image content (QBIC) [6]. CBIR is a technique in which the content of an image is used as the matching criterion instead of the image's metadata such as keywords, tags or any name associated with the image. This provides a much closer match than text-based image retrieval [7], [8], [9], [10].
The term 'content' in this context might refer to colors, shapes, textures, or any other information that can be derived from the image itself [6], [8], [10]. Image content can be categorized into visual and semantic content. Visual content can be very general or domain specific. General visual content includes color, texture, shape, spatial relationships, etc. Domain-specific visual content, such as human faces, is application dependent and may involve domain knowledge. Semantic content is obtained either from textual annotation or by complex inference procedures based on visual content [8].
COLOR: Image retrieval based on color actually means retrieval based on color descriptors. The most commonly used color descriptors are the color histogram, color coherence vector, color correlogram and color moments [9]. A color histogram identifies the proportion of pixels within an image holding specific values, which can be used to find the similarity between two images using similarity distance measures [8], [11]. It captures color proportions by region and is independent of image size, format or orientation [6]. Color moments involve calculating the first-order moment (mean), the second (variance) and the third (skewness) of an image [8].
TEXTURE: The texture of an image consists of the visual patterns the image possesses and how they are spatially defined. Textures are represented by texels, which are grouped into a number of sets depending on how many textures are detected in the image. These sets define not only the texture but also where in the image the texture is located. Statistical methods such as co-occurrence matrices can be used to quantitatively measure the arrangement of intensities in a region [8], [12].
SHAPE: Shape here does not mean the shape of an image, but the shape of a particular region or object within it. Segmentation and edge detection are prominent techniques used in shape detection [6].
Several components, such as color intensity, entropy and the image mean, are calculated from image content and are useful in creating a feature vector. The feature vector of every image is calculated and stored in the database. When a user wants to retrieve a set of images, he queries the system with an image. The feature vector of the query image is matched against the vectors of the stored images, and images whose vectors are similar to the query vector are retrieved. The user then selects the required image from the displayed set.
1.3 Similarity Measurement
A similarity measurement is selected to determine how similar two vectors are. The problem reduces to computing the discrepancy between two vectors x, y ∈ R^d. Three distance measurements, the Euclidean, Mahalanobis and chord distances, are reviewed below [11].
1.3.1 Euclidean Distance
The Euclidean distance between x, y ∈ R^d is computed by [11]

\delta(x, y) = \|x - y\|_2 = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}
1.3.2 Mahalanobis Distance
The Mahalanobis distance between two vectors x and y with respect to the training patterns x_i is computed by [11]

\delta(x, y) = \sqrt{(x - y)^t S^{-1} (x - y)}

where the mean vector u and the sample covariance matrix S from the sample {x_i | 1 ≤ i ≤ n} of size n are computed by [11]

S = \frac{1}{n} \sum_{i=1}^{n} (x_i - u)(x_i - u)^t \quad \text{with} \quad u = \frac{1}{n} \sum_{i=1}^{n} x_i
1.3.3 Chord Distance
The chord distance between two vectors x and y measures the distance between the projections of x and y onto the unit sphere, and can be computed by [11]

\delta_3(x, y) = \left\| \frac{x}{r} - \frac{y}{s} \right\|_2, \quad \text{where } r = \|x\|_2 \text{ and } s = \|y\|_2
1.4 Hadoop
Apache Hadoop is a collection of open-source software projects for reliable, scalable, distributed computing. The Hadoop software libraries provide a framework that allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale from a single machine up to thousands of machines, each offering local computation and storage [13].
Some modules of Hadoop are listed below [13]:
1. Hadoop Common: The common utilities that support the other Hadoop modules.
2. Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
3. Hadoop YARN : A framework for job scheduling and cluster resource management.
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data
sets.
5. HBase: A scalable, distributed database that supports structured data storage for
large tables.
1.4.1 MapReduce
MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of commodity computers (nodes). The computation performed on the nodes is independent of their physical location: the nodes may all be on the same network and use similar hardware (called a cluster), or they may be spread across geographically and administratively distributed systems and use more heterogeneous hardware (called a grid). It is also independent of the location of the data, which can be stored on any node of the cluster.

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: map and reduce.
The map function, written by the programmer, takes an input key/value pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all values associated with the same key and passes them to the reduce function.
The reduce function, also written by the programmer, accepts an intermediate key together with the values generated for it by the map function. It merges these values to form a possibly smaller set of values, typically just zero or one output value per key, which forms the actual output of the user program.
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → list(k3, v3)
Computational processing can occur on data stored either in a file system (unstruc-
tured/HDFS) or in a database (structured/HBase).
Map: The master node takes the input, divides it into smaller sub-problems, and dis-
tributes them to worker nodes. A worker node may do this again in turn, leading to a
multi-level tree structure. The worker node processes the smaller problem, and passes
the answer back to its master node [14].
Reduce: The master node then collects the answers to all the sub-problems and com-
bines them in some way to form the output – the answer to the problem it was originally
trying to solve [14].
MapReduce allows the map and reduce operations to be distributed. Provided each mapping operation is independent of the others, all maps can be performed in parallel, though in practice parallelism is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase, provided all outputs of the map operation that share the same key are presented to the same reducer at the same time, or the reduction function is associative.
Figure 1.1: Map Reduce Process.
Example [15]
The map function emits each word plus an associated count of occurrences (just 1 in this simple example). The reduce function sums together all counts emitted for a particular word.

map(String key_file, String value_line):
  // key_file: text file name, value_line: text file contents
  for each word m in value_line:
    EmitIntermediate(m, "1");

reduce(String key_file, Iterator value_line):
  // key_file: a word from the text file, value_line: a list of counts for that word
  int total_count = 0;
  for each count t in value_line:
    total_count += ParseInt(t);
  Emit(AsString(total_count));
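The word-count job above can be simulated end to end in a few lines of Python. The mapper, shuffle and reducer here are ordinary in-process functions (all names hypothetical); a real Hadoop job would implement them through the MapReduce API and run them on distributed nodes.

```python
from collections import defaultdict

def map_word(key_file, value_line):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word, 1) for word in value_line.split()]

def reduce_word(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(splits):
    """Drive the job: map every split, shuffle by key, reduce each group."""
    groups = defaultdict(list)
    for name, text in splits.items():          # map phase
        for word, count in map_word(name, text):
            groups[word].append(count)         # shuffle: group values by key
    return dict(reduce_word(w, c) for w, c in groups.items())

counts = run_job({"doc1.txt": "map reduce map", "doc2.txt": "reduce hadoop"})
# counts == {"map": 2, "reduce": 2, "hadoop": 1}
```

Because each call to `map_word` touches only its own split and each call to `reduce_word` touches only one key's group, both phases can be distributed across nodes exactly as described above.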
1.4.2 HDFS
The Hadoop Distributed File System (HDFS) [16] is another module of the Hadoop project. It is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and works on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets [16].
1.4.3 HBase
HBase works on top of HDFS. It is a distributed, column-oriented database built on HDFS [17], and is the Hadoop application to use when real-time read/write random access to very large datasets is required.
HBase scales linearly simply by adding nodes. It is not relational and does not support SQL, but given the right problem space it can do what an RDBMS cannot: host very large, sparsely populated tables on clusters made from commodity hardware [17].
1.5 Oblivious Technique
The oblivious retrieval technique is designed to achieve both user and database privacy. It is built on homomorphic encryption, which allows addition and multiplication operations to be performed on encrypted data without revealing the underlying plaintext. In this application, homomorphic encryption is applied to the image feature data.

Today, many website firms outsource their data to external storage servers. In this case, it is necessary to maintain the privacy of the user with respect to the external server, and vice versa. The oblivious technique is one attempt to provide this kind of security: the privacy of the user with respect to the server, and of the server with respect to the user, is maintained through homomorphic encryption.

Homomorphic encryption is an asymmetric cryptographic technique that works on a public/private key pair. It allows specific types of computation to be carried out on ciphertext, producing an encrypted result which, when decrypted, matches the result of the same operations performed on the plaintext.
For instance, one person could add two encrypted numbers and another person could then decrypt the result, without either of them being able to learn the values of the individual numbers.
Using the user's public key, the feature vector is stored in encrypted form in the database. When the user issues a query, it reaches the database in encrypted form, and images are searched and retrieved on the basis of similar encrypted feature vectors. The encrypted image data is sent back to the user, who decrypts it with his private key.

The cryptographic technique used in this system is the Paillier cryptosystem. In the Paillier cryptosystem, the homomorphic property is supported as follows.
Homomorphic addition of plaintexts: the product of two ciphertexts decrypts to the sum of their corresponding plaintexts,

D(E(m1, r1) * E(m2, r2) mod n^2) = (m1 + m2) mod n
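A toy Paillier round trip illustrating this property is sketched below, using tiny fixed primes (p = 61, q = 53) and the common generator choice g = n + 1. This is illustration only; real deployments use keys of 2048 bits or more.

```python
import math
import random

def paillier_keygen(p=61, q=53):
    """Toy Paillier key generation; p and q must be primes with
    gcd(pq, (p-1)(q-1)) = 1."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)       # Carmichael function lambda(n)
    g = n + 1                          # standard simple choice of generator
    mu = pow(lam, -1, n)               # with g = n + 1, mu = lambda^-1 mod n
    return (n, g), (lam, mu)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:         # blinding factor r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)             # x = 1 + m*lam*n (mod n^2)
    return ((x - 1) // n * mu) % n     # L(x) * mu mod n recovers m

pub, priv = paillier_keygen()
c1, c2 = encrypt(pub, 37), encrypt(pub, 5)
# Homomorphic addition: multiplying ciphertexts adds the plaintexts.
assert decrypt(pub, priv, (c1 * c2) % (pub[0] ** 2)) == 42
```

The server can thus combine encrypted feature values (for example, performing the homomorphic subtractions used by the oblivious protocol) without ever seeing the plaintext features.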
Homomorphic Encryption Algorithms:
1. Simple RSA Algorithm
2. ElGamal Cryptosystem
3. Paillier Cryptosystem
Chapter 2
Literature Survey
2.1 PIRMAP [1]
Private Information Retrieval (PIR) allows bits to be retrieved from a database in a way that hides the user's access pattern from the server. This paper presents PIRMAP, a practical, highly efficient protocol for PIR in MapReduce, a widely supported cloud computing API. PIRMAP focuses especially on the retrieval of large files from the cloud, where it achieves optimal communication complexity (O(l) for retrieval of an l-bit file) with query times significantly faster than previous schemes.
2.1.1 PIR
When a user connects to a cloud, there is a series of interactions between the user and the server. The user connects to the cloud to retrieve private information stored there, which may take the form of different files.
Suppose the user wants to retrieve x of his n files, chosen freely, without the server learning which files were chosen. The user must therefore issue a query such that he receives the required files from the cloud while the transaction remains private.
2.1.2 MapReduce in PIRMAP
This information retrieval technique uses MapReduce to parallelize the process, which reduces the computation time needed to retrieve a file from the cloud and makes the system easy to use.

The first phase is called the "Map" phase. MapReduce automatically splits the input computation equally among the available nodes in the cloud data center, and each node then runs a function called map on its respective piece (called an InputSplit). It is important to note that the splitting actually occurs when the data is uploaded into the cloud. This means that each "mapper" node has local access to its InputSplit as soon as computation is started, avoiding a lengthy copying and distribution period. The map function runs a user-defined computation on each InputSplit and outputs (emits) a number of key-value pairs that go into the next phase.

The second phase, "Reduce", takes as input all of the key-value pairs emitted by the mappers and sends them to "reducer" nodes in the data center. Specifically, each reducer node receives a single key, along with the sequence of values emitted by the mappers that share that key. The reducers then take each set and combine it in some way, emitting a single value for each key [1].
2.1.3 Working of PIRMAP
PIRMAP is an extension of the PIR protocol of Kushilevitz and Ostrovsky, targeting the retrieval of large files in a parallelization-aggregation computation framework such as MapReduce. PIRMAP can be used with any additively homomorphic encryption scheme; an overview follows.

Upload: In the following, we assume that the cloud user has already uploaded his files into the cloud using the interface provided by the cloud provider.

Query: In keeping with standard PIR notation, the data set holds n files, each of which is l bits in length. There is also an additional parameter k, the block size of the chosen cipher. For ease of presentation, we consider the case where all files are the same length; PIRMAP can easily be extended to accommodate variable-length files by padding, or by prepending each file with a few bytes that specify its length.
2.1.4 Future Scope/Issues
This implementation is specifically for files containing textual data; it does not support multimedia data such as image, video or audio files.
2.2 Map/Reduce in CBIR Application [2]
Initially, image retrieval depended mainly on text-based retrieval. This approach is widely used; the mainstream search engines Google, Baidu, Yahoo, etc. mainly used it to search for images. In this technique, the name of the image file is compared in order to retrieve the image from the database.
It has drawbacks, however: researchers must manually mark all images with text, and this mark text cannot objectively and accurately describe the visual information in the images.
After the 1990s, content-based image retrieval (hereafter CBIR) emerged. Unlike TBIR, it extracts visual features from images automatically and then retrieves images by those visual features. Since CBIR is intuitive and efficient, and can be widely applied in information retrieval, medical diagnosis, trademark and intellectual property protection, crime prevention and other areas, it has very high applied value.
2.2.1 System Model
Selecting an Algorithm
The technique used in this paper is color feature-based image retrieval. This mainly involves the following algorithms:
color histogram, hue histogram, color moments, color entropy, etc. [15]
Since the color histogram algorithm is widely used, and its feature extraction and similarity matching are comparatively easy, it is chosen as the target color algorithm.
Color Histogram
The color histogram of an image gives extensive information about its structure: the number of pixels, their colors in RGB format, etc. It is therefore easy to calculate the mean, entropy and median of an image, which can be used as image features in feature extraction. These features can be compared with the queried image's features to retrieve all similar images from the database.
Feature Calculation and Similarity Matching
The feature vector of each uploaded image is stored in the database. This feature vector is matched with the feature vector of an input image: both feature vectors are used to calculate a similarity coefficient, the Pearson correlation coefficient. MapReduce is used in the similarity matching phase only, to provide fast results.
The feature vector of the input image is calculated at query time and matched against the images in the image library; the images that match are retrieved.
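A minimal sketch of this matching step, assuming the feature vectors are simple color histograms (the names and data below are hypothetical): because the Pearson coefficient is scale-invariant, two histograms with the same proportions score a perfect 1.0.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two feature vectors; 1.0 means
    the vectors have identical shape (up to scale and offset)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Hypothetical 4-bin color histograms; the query has the same proportions
# as "sunset", so it correlates perfectly with it despite the smaller scale.
library = {"sunset": [8, 2, 1, 1], "forest": [1, 1, 8, 2]}
query = [4, 1, 0.5, 0.5]
scores = {name: pearson(hist, query) for name, hist in library.items()}
best = max(scores, key=scores.get)  # "sunset"
```

In the paper's system, only this scoring step runs inside MapReduce: each mapper scores its share of the library against the query vector, and the reducer collects the top matches.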
2.2.2 Future scope/Issues
The MapReduce technique is used only at the similarity matching stage of the system. It could also be used efficiently in the feature extraction process, by splitting an image into parts at the map stage and then combining them at the reduce stage.
2.3 An Oblivious Image Retrieval Protocol [3]
Today, many website providers hold large collections of images. For example, google.com and facebook.com have large image databases to which each user uploads his private data. This data should be protected from external and internal threats, which may include theft of the image data, modification of the data, deletion of an individual's private data, or its misuse for some particular purpose.

As the size of this data increases day by day, companies find it hard to manage and secure such large volumes. Their data center servers are becoming overloaded, and to reduce the strain on their storage servers they are looking at another option: external storage servers.

These external storage servers are maintained and managed by other companies; that is, the data is outsourced to reduce the overhead on internal servers. In this case, it becomes very important to protect and secure customer data on the external, outsourced database servers. The Oblivious Image Retrieval Protocol addresses outsourced image databases: it is a technique to securely query an image database for the required image data, retrieve the matched images from the database, and securely transfer the retrieved data to the user.
2.3.1 System Model
The system assumes that the feature vectors of all images are already stored, in encrypted
form, in the database, and that all query operations are done on encrypted data. The
protocol has two main parts: a privacy-preserving querying mechanism and oblivious
transfer of the decryption keys.
The privacy-preserving protocol of the paper works as follows. First, to query the image
set, the user generates an encryption of the query feature vector set using a homomorphic
public-key encryption technique and sends it to the database server. The query feature
vector is distorted by the user with a constant random vector to prevent any statistical
inference by the database server. Second, the database server takes the encrypted query
feature vector and performs a homomorphic subtraction with a random feature vector,
and subtracts the same random feature vector from the database image features. Before
sending the results back to the user, the database server permutes the subtracted feature
vectors so that the user cannot learn the relative indexing structure of the database
images; it includes pseudo-identifiers for the permuted images to allow the user to identify
his choices. Third, upon receiving the server response, the user removes the distorting
random constant from the server response and calculates the Euclidean norm of the
numerical difference between the query image feature vector and the database image
feature vectors [3].
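The three blinding steps above can be sketched with plain integers standing in for Paillier-style ciphertexts; all values and method names here are illustrative, not from the paper's implementation, and the real protocol performs the same arithmetic homomorphically on vectors of encrypted features.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ObliviousSketch {
    // User side: distort the query feature with a constant random value c.
    static int distort(int feature, int c) { return feature + c; }

    // User side: remove the distortion c from a value returned by the server.
    static int unblind(int serverValue, int c) { return serverValue - c; }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int queryFeature = 37;
        int c = rnd.nextInt(100);
        int distortedQuery = distort(queryFeature, c);   // sent to the server

        // Server side: subtract the same random r from the distorted query and
        // from every database feature, then permute the database results.
        int r = rnd.nextInt(100);
        int serverQuery = distortedQuery - r;            // returned to the user
        int[] dbFeatures = {35, 80, 12};
        List<Integer> masked = new ArrayList<>();
        for (int f : dbFeatures) masked.add(f - r);
        Collections.shuffle(masked, rnd);                // hides database ordering

        // User side: recovered equals queryFeature - r, so each difference
        // |recovered - m| equals |queryFeature - f| for some database feature f.
        int recovered = unblind(serverQuery, c);
        for (int m : masked) {
            System.out.println("distance component: " + Math.abs(recovered - m));
        }
    }
}
```

The point of the sketch is the algebra: the two random masks c and r cancel out of every difference, so the user learns distances to database features without the server ever seeing the raw query, and the shuffle hides which database entry produced which distance.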
2.3.2 Future Scope/Issues
The oblivious image retrieval technique can be implemented using the map reduce
technique; map reduce can be applied at the content based image retrieval stage.
2.4 Distributed Image Retrieval System Based on
MapReduce [4]
A Distributed Image Retrieval System (DIRS) is a system in which images are retrieved
in a content based way, and retrieval over massive image data storage is carried out in
parallel using the MapReduce distributed computing model.
2.4.1 System Model
In this system, images are uploaded to HDFS in one map reduce process, which extracts
image features using the LIRE library and stores the images and their feature vectors in
an HBase table.
Images are also retrieved in parallel. In the feature matching process, the CBIR system
calculates the similarity between the sample image and the target images, and then returns
those images matching the sample image most closely.
Before the MapReduce job is started, the sample image is added to the Distributed Cache
so that every map task can access it. Each map task reads the image from the Distributed
Cache, extracts its visual features, and then compares the feature vectors with those of
the target images from HBase. All matched images are returned to the user.
2.4.2 Future Scope/Issues
The distributed image retrieval technique needs a security mechanism to protect the data.
2.5 Summary
After the literature survey, I found that each paper leaves some future scope. I want to
combine all three future scopes into one application that will upload, search and retrieve
image files from a distributed database for a particular user, and will do so in a secure
manner, eventually maintaining user and data privacy. This will be carried out using
map reduce technology, which will optimize the retrieval time.
Chapter 3
Motivation
With the explosive growth of digital media data, there is a huge demand for new tools
and systems that enable the average user to search, access, process, manage, author and
share digital media content more efficiently and more effectively. This shift has moved
people's interest from pure work toward work combined with entertainment.
Examples:
Business phones to smartphones,
Watching movies/T.V.,
Listening to music,
Social media: connect and share.
There are many types of multimedia content present today. A multimedia information
system can store and retrieve text, 2D grey-scale and color images, 1D time series,
digitized voice or music, and video, and there is a need for an efficient, robust, secure
solution to retrieve this kind of multimedia information.
We consider image retrieval over audio or video retrieval because image retrieval systems
today have many applications. Image retrieval techniques have undergone several changes
and have become more and more advanced over time. In the early years of image retrieval,
images were searched on a text basis, with each image annotated with particular text.
This technique was troublesome because of its complexity. Today, content based image
retrieval provides an extensive search facility over a database, retrieving all images whose
content is similar to that of the queried image.
CBIR technology has been used in several applications such as fingerprint identification,
biodiversity information systems, digital libraries, crime prevention, medicine and
historical research, among others; it covers a wide range of domains.
Today many websites such as google.com and facebook.com hold a large number of images
on their servers. Many users access these websites regularly to upload, search and download
images, so there is a need for a fast and secure technique to upload, search and retrieve
these images on user demand.
3.1 Problem Definition
The purpose of this research work is to create an efficient private content based image
upload, search and retrieval system using map reduce. The system will be built from
easily available APIs with cloud support, so that it can easily be ported to the cloud
for use.
3.2 Scope of Research
The scope of this research is not limited to one particular domain; it spans multiple
domains. The domains in which extensive research can be done are Private Information
Retrieval, Content Based Image Retrieval, and Oblivious Transfer, i.e. security. Research
papers in this area suggest content based image retrieval techniques over cloud computing
as well as secure transfer techniques for images, but no single paper addresses both fast
(cloud-based) and secure retrieval of images from servers using a Private Information
Retrieval technique.
This system can evolve in many ways: by changing the CBIR algorithm, by using different
encryption techniques, or by using different storage methods over HDFS, it can be moved
to a new, more efficient and more secure level.
3.3 Objectives
1. To implement CBIR in Map-Reduce
2. To select and implement homomorphic encryption technique for secure retrieval
3. To evaluate proposed solution
3.4 System Requirement Specifications
Hardware:
At minimum, one desktop system with an Intel or AMD processor and 2 GB of RAM.
Newer hardware will provide better results: a cluster of 5 nodes with high-frequency Intel
Core series processors (i3, i5, i7) and 4 GB of RAM will definitely perform better than a
5-node cluster with older, lower-frequency processors and less RAM. The cluster should
have nodes of identical configuration; if nodes differ in RAM capacity or processor
frequency, the cluster will not perform up to capacity.
Software:
1. Operating System: Any open source Linux operating system.
2. Coding Language: Hadoop map reduce technology with the Java language (JDK)
3. IDE: Eclipse environment
Chapter 4
System Design
4.1 Complete Architecture
Figure 4.1 shows the complete architecture of the system.
Figure 4.1: System Architecture
It provides a web based interface to receive the user's request. After receiving a request,
the system performs the corresponding operation on the images, whether upload or
retrieval. The core of the system is divided into two phases: one uploads images, the
other searches for and retrieves similar images. Both modules are deployed on a Hadoop
cluster, which follows the MapReduce paradigm.
The upload process reads input data from HDFS, processes it using map and reduce
functions, and then writes the results back to HDFS. Similarly, the image retrieval process
uses map and reduce functions to retrieve images similar to the queried image. These
images are delivered to the user on the local file system.
4.1.1 Upload Images
Upload Images is an independent process through which a user uploads his image or photo
information to the system's database. Uploading is carried out with Map/Reduce
techniques, which parallelize the process. Upload consists of three sub-processes: first the
splitting of an image, second feature extraction, and third encryption and storage. The
upload process is parallelized per image with Map Reduce. When a user uploads his
private photos or images to the database, he might want to upload one image or many
images at a single point of time. If he wants to upload many images at once, he only
provides the path of the folder where the images are stored, and the map stage parallelizes
the upload instead of uploading one image at a time. The architecture of the upload
process is shown in figure 4.2.
For better understanding, the upload process is explained in three distinct phases: image
splitting, image feature extraction, and encryption.
Phase 1: An image to be uploaded is first split into 16 small images. This step disperses
the data across nodes so that an attacker cannot trace the data from one location.
Map/Reduce works on its distributed file system, the Hadoop Distributed File System
(HDFS) [18], [19], [13], [16], built specifically for parallelization. All image files are stored
in this file system. Information about all split images of every image is stored in an
HBase [17] table; HBase works on top of HDFS. The table in HBase stores the path of
each image along with its other features.

Figure 4.2: Upload Process
Phase 2: These images are passed to the image processing part. In this phase, the color
histogram of each image is calculated. The histogram provides feature values such as
entropy, intensity, mean, median, etc. These values are used for the similarity
calculation [11] between the queried image and the images from HDFS.
Phase 3: The calculated features are given to the homomorphic encryption system, which
stores the vector in the HBase table in encrypted form. The homomorphic encryption
system allows additive and multiplicative operations to be carried out on encrypted data
without changing the original data; i.e., multiplying all 16 encrypted feature vectors of a
particular image yields the sum of that image's feature vectors in plaintext form.
The advantage of this appears at retrieval time. When the user queries the system, the
encrypted feature vector of the queried image is used against the combined feature vector
of all split images of each particular image.
The algorithm used for this encryption is the Paillier cryptosystem. It uses the user's
private and public key pair. The feature vector of an image is encrypted with the user's
public key and entered into HDFS. The user's public key is available to the database
server, whereas the private and public keys are both held by the user. The encrypted
feature vectors are stored in files on HDFS; each split image has a corresponding feature
vector, stored in the same row along with the encrypted image name.
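A minimal, textbook-style sketch of the Paillier operations described above is given below (with small 64-bit primes for illustration only, not production key sizes); it demonstrates the property used here, namely that multiplying two ciphertexts yields an encryption of the sum of the plaintexts.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Minimal Paillier sketch (textbook parameters, not production-ready).
public class PaillierDemo {
    final BigInteger n, nsquare, g, lambda, mu;

    PaillierDemo(int bits) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        nsquare = n.multiply(n);
        g = n.add(BigInteger.ONE);                       // g = n + 1 is a common choice
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = lambda.modInverse(n);                       // valid for g = n + 1
    }

    BigInteger encrypt(BigInteger m) {
        SecureRandom rnd = new SecureRandom();
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), rnd); }
        while (r.compareTo(n) >= 0 || !r.gcd(n).equals(BigInteger.ONE));
        // c = g^m * r^n mod n^2
        return g.modPow(m, nsquare).multiply(r.modPow(n, nsquare)).mod(nsquare);
    }

    BigInteger decrypt(BigInteger c) {
        // m = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n
        BigInteger u = c.modPow(lambda, nsquare);
        return u.subtract(BigInteger.ONE).divide(n).multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(64);
        BigInteger a = BigInteger.valueOf(120);          // e.g. mean red value of one split
        BigInteger b = BigInteger.valueOf(95);           // mean red value of another split
        // Multiplying ciphertexts adds the plaintexts: the property the system relies on.
        BigInteger product = ph.encrypt(a).multiply(ph.encrypt(b)).mod(ph.nsquare);
        System.out.println(ph.decrypt(product));         // 215
    }
}
```

In the system, the same multiplication is applied to all 16 encrypted split-image vectors, yielding the encrypted sum that the retrieval map stage emits.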
4.1.2 Image Retrieval
Figure 4.3: Image Retrieval Process
This phase of the system provides the user a GUI to search and retrieve images. Here the
user uploads one image for which he wants to find similar images. The processes of the
upload phase are also carried out on this image: the feature vector of each split is
calculated from the color histogram, and all feature vectors of the split images are
encrypted with the public key of the user. These encrypted feature vectors of the queried
image are sent to the server. The procedure of searching, comparing and retrieving images
is also one Map Reduce process. In this process, when the user uploads an image, the
encrypted feature vectors of the queried image, i.e. the feature vectors of all its split image
files, go to the Map process along with all image feature vectors in the HBase table.
Map Stage: The map process multiplies all feature vectors of a particular image and emits
the total feature vector of that image, in parallel for all images present in the system. One
mapper is created per image, emitting the total feature vector after combining the 16
split vectors.
Reduce Stage: The vectors of all images are collected in the reduce step, and each vector
is compared with the feature vector of the queried image. Comparison is done on the
basis of similarity measures, calculated from standard formulas such as Euclidean distance
or the Pearson correlation coefficient [11]. In the proposed protocol, the Euclidean
distance between the color histograms of the images is calculated.
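The Euclidean distance applied in the reduce stage is the standard formula; the following small sketch uses hypothetical RGB-mean vectors for the query and one stored image.

```java
public class EuclideanDistance {
    // Euclidean distance between two feature vectors (e.g. color-histogram means):
    // the square root of the sum of squared per-component differences.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] queryMeans  = {120.0, 95.5, 60.2}; // hypothetical RGB means of the query
        double[] storedMeans = {118.0, 97.5, 59.2}; // hypothetical RGB means of a stored image
        System.out.println(distance(queryMeans, storedMeans)); // 3.0
    }
}
```

A smaller distance means a closer match, so the reducer can rank database images by this value.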
Chapter 5
Implementation of System
A graphical user interface (GUI) is the most important part of any application: it allows
the user to interact with the system. The success of an application depends on how well
it is designed and how simple it is to use. Although this application was designed bottom
up, i.e. the main Map-Reduce code first and the GUI last, I will introduce the GUI first,
because when any user starts the application, he or she interacts with the GUI only.
The complete application, including the GUI, was created in Eclipse [20], an open source
integrated development environment (IDE) that helps develop applications easily. This
IDE comes with integrated J2EE, or Java EE [21], the Java Platform, Enterprise Edition,
which is used to build enterprise web applications. Java EE includes several API
specifications, such as JDBC, Enterprise JavaBeans, Connectors, servlets, JavaServer
Pages (JSPs) [12] and several web service technologies. This allows developers to create
enterprise applications that are portable and scalable, and that integrate with legacy
technologies [21].
I created the GUI with JSPs, which enable web developers and designers to rapidly
develop and easily maintain information-rich, dynamic web pages. Java code can be
integrated with HTML in JSPs. JSP technology separates the user interface from content
generation, enabling designers to change the overall page layout without altering the
underlying dynamic content [22]. In this application, the JSPs contain all the GUI code
and the Java code contains all the business logic of the application. All map reduce
processes can run only through Java code, as there are specific commands that launch a
Map Reduce process.
The application is designed so that, if any process required for a MapReduce program to
run is not present, it automatically restarts all processes in the system and makes sure
they are up and running. This is done only for the local deployment of the Hadoop
cluster, because when the server is restarted all data is formatted. In an actual industry
deployment this case will not arise, because all processes will always be running.
5.1 Log In Screen
As the title suggests, this application is private content based image information retrieval,
so every user has a user id and password with which he or she logs in to his or her account
to upload or retrieve images. The Log In screen lets the user maintain privacy: databases
are often outsourced by companies, in which case the data of many users can be present
on the same server/cluster. This technique ensures that a user can see only his own data
and not that of others.
Figure 5.1: LogIn Screen
5.2 Create Account
The create account form appears after clicking the "Create Account" link from sign up.
To access the application, a user has to create an account with the web site; it is much
like a sign-up form.
Figure 5.2: Create Account Screen
The user account is created from the user's email id. Validation is done on the email id
field: if the user enters a wrong email id, the form asks for a correct one again. After a
valid email id, the user account is created along with security keys, and the user is notified
of the User ID to log in with. This User ID creates his account, i.e. a folder in HDFS.
Login information, i.e. user ids and passwords, is stored in an HBase table, which is also
part of Hadoop. HBase is a scalable database that allows tables to be created and data
stored in them, and it can expand at run time.
Figure 5.3: Account Creation Screen - Validation
After successful creation of the account, the user gets a message that the account has been
created and is ready to use. The user then goes to the LogIn screen to log in.
5.3 User Operation Screen
After logging in, the user sees a screen with three links: one to log out, one to upload
images, and one to retrieve images similar to one particular image from the database.
Figure 5.4: User Operations Screen
5.3.1 Changes in Hadoop
The conventional setup of Hadoop does not allow a process to write output files to an
existing output path. A user of this system will not upload files only once; he will use
the system any time, anywhere, so the system must upload a user's images to one
particular location. Whenever the upload map reduce process runs, all encrypted feature
vectors must go to the same folder belonging to that user, rather than a newly allocated
folder each time. To remove this limitation, I modified the Hadoop code so that the
output of every map reduce process goes into the user-specific folder and not to any
other location.
Modified Files: FileOutputFormat.java, MultipleOutputFormat.java,
MultipleTextOutputFormat.java
5.3.2 Upload Process
When the user selects the upload images link, he is directed to a page where he can specify
the folder path of the images. All images present in the folder are first uploaded to HDFS
and then used in the upload Map Reduce. At each upload, a Map-Reduce process runs:
it splits an image into 16 small images and calculates the feature vector of each small
image, i.e. the color moment of each split image.
This feature vector is a color moment of an image calculated from its color histogram. A
color histogram identifies the proportion of pixels within an image holding specific values,
which can be used to find the similarity between two images using similarity distance
measures [8], [11]. It identifies color proportion by region and is independent of image
size, format or orientation [6]. Color moments comprise the first order moment (mean),
the second (variance) and the third (skewness) of an image [8]. All moments are encrypted
with the public key of the user and then stored in a file. Whenever the user uploads
images to the database, they go into the same folder.
Hadoop does not normally support using the same output folder in different Map Reduce
processes, so I modified Hadoop so that in each upload Map-Reduce the files go into the
same folder.
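The three color moments can be computed per channel as in the following sketch (skewness taken here as the cube root of the third central moment); the sample red-channel values are hypothetical.

```java
public class ColorMoments {
    // First three color moments of one channel: mean, standard deviation, skewness.
    static double[] moments(double[] channel) {
        int n = channel.length;
        double mean = 0;
        for (double v : channel) mean += v;
        mean /= n;
        double m2 = 0, m3 = 0;
        for (double v : channel) {
            double d = v - mean;
            m2 += d * d;       // accumulates the second central moment
            m3 += d * d * d;   // accumulates the third central moment
        }
        m2 /= n;
        m3 /= n;
        double std = Math.sqrt(m2);
        double skew = Math.cbrt(m3);   // cube root of the third central moment
        return new double[]{mean, std, skew};
    }

    public static void main(String[] args) {
        double[] redChannel = {10, 20, 30, 40}; // hypothetical red values of one split image
        double[] m = moments(redChannel);
        System.out.println(m[0] + " " + m[1] + " " + m[2]);
    }
}
```

The same computation is repeated for the green and blue channels, giving the nine-value moment vector per split image.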
Figure 5.5: Upload Screen
A Map Reduce process takes some time to complete; without a proper message, the user
might think the images are already uploaded and shut down the process. So the user gets
a message showing that the files are being uploaded, and while the map reduce job
executes he sees the following screen until the process completes.
Figure 5.6: Uploading Screen
There is also a chance of file name collisions: files whose names match files already present
in HDFS are renamed at upload time. First, all of the user's file names are fetched from
HDFS and collected into one string; the name of the image about to be uploaded is
matched against this string, and if the string contains the image name, the file is renamed.
The program then checks again whether the generated name is present in the string, and
if so, the file is renamed again.
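The renaming loop described above can be sketched as follows, with a plain `Set` standing in for the list of file names fetched from HDFS; the method name `uniqueName` and the suffix scheme are mine, not from the application code.

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueName {
    // Keep appending a numeric suffix until the candidate no longer clashes
    // with names already stored (in the system, names already in HDFS).
    static String uniqueName(String name, Set<String> existing) {
        String candidate = name;
        int suffix = 1;
        while (existing.contains(candidate)) {
            int dot = name.lastIndexOf('.');
            String base = (dot < 0) ? name : name.substring(0, dot);
            String ext = (dot < 0) ? "" : name.substring(dot);
            candidate = base + "_" + suffix + ext;
            suffix++;
        }
        return candidate;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>();
        stored.add("beach.jpg");
        stored.add("beach_1.jpg");
        System.out.println(uniqueName("beach.jpg", stored)); // beach_2.jpg
    }
}
```

The re-check after renaming matches the text above: a generated name that itself collides is renamed again before the upload proceeds.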
Map Reduce Algorithm
The map reduce technique is used to create the feature vectors of images. The map task
splits the input image file into small chunks of fixed size, calculates each chunk's feature
vector, encrypts it using the homomorphic encryption technique and stores it in the
database. The reducer gets sorted input from the mapper, combines all split encrypted
feature vectors, and emits the image id and the combined list.
Algorithm 5.1 Upload Map Reduce Process Algorithm

Mappers emit encrypted feature vectors keyed by the name/feature key, the execution
framework groups vectors by name along with the name/feature key, and the reducers
write names and encrypted feature values to files.

imgs ← new BufferedImage[]
random ← new BigInteger
n, g, r, nsquare ← new BigInteger
dMeanR, dMeanG, dMeanB ← new double

1: procedure printPixelARGB(int pixel)
2:     dMeanR ← dMeanR + red pixel value
3:     dMeanG ← dMeanG + green pixel value
4:     dMeanB ← dMeanB + blue pixel value

1: procedure Encryption(int meanVal)
2:     return BigInteger encrypted(meanVal)

1: class Mapper
2:     procedure Map(Imgid iId, Img im)
3:         m0, m1, m2 ← new BigInteger
4:         imgs ← splitImages
5:         for all img i ∈ imgs[count] do
6:             for all pixel p ∈ img i do
7:                 printPixelARGB(p)
8:             m0 ← Encryption(dMeanR)
9:             m1 ← Encryption(dMeanG)
10:            m2 ← Encryption(dMeanB)
11:            Emit(Imgid iId, encrypted vals ⟨iId i, m0, m1, m2⟩)

1: class Reducer
2:     procedure Reduce(Imgid iId, encrypted vals [⟨iId 0, m0, m1, m2⟩, ⟨iId 1, m0, m1, m2⟩, ...])
3:         iIdL ← new List
4:         for all encrypted vals ⟨iId i, m⟩ ∈ encrypted vals do
5:             Append(iIdL, ⟨iId i, m⟩)
6:         Emit(Imgid iId, iIdL)
The following figure shows the working of the Map Reduce process in Eclipse, i.e. behind the screen.
Figure 5.7: Process behind Upload
5.3.3 Image Retrieval
When the user clicks the search images link, he is directed to a page where he can specify
the file path of an image, on the basis of which images are to be retrieved from the
database. The comparison is done between the features of the images stored in HDFS and
the features of the image in the search process. In the upload process, feature vectors are
calculated and stored in HDFS; these feature vectors are used for the comparison.
Figure 5.8: Search Screen
When the user selects an image, it passes through several steps similar to those of the
upload process, i.e. image feature calculation and encryption. The image is first split into
16 parts, as in the upload phase; this keeps the loss of pixels the same in both cases. The
features are encrypted with the same public key of the user, so at the similarity matching
phase the features match exactly.
At the upload stage, the features of the split images are stored in HDFS, so at similarity
measurement time they must be combined back into one feature vector. Here the Paillier
cryptosystem comes into the picture: Paillier allows addition and multiplication
operations to be performed on encrypted data without changing the original data.
The similarity distance is calculated between the stored image feature vectors and the
feature vector of the image being searched for. A similarity distance is a mathematical
formula that gives the relation between images; different measures can be used for the
calculation [8], [11]. Euclidean distance measurement is used here for the similarity
between images. The distance is calculated between the means and indicates how similar
two images are. This comparison is done in a map reduce job to make it parallel.
While the map reduce process is running, the user sees the following screen.
Figure 5.9: Searching Screen
The background processes can be seen in the following screen.
Figure 5.10: Processes behind Search Screen
Map Reduce Algorithm
Algorithm 5.2 Image Retrieval Map Reduce Process Algorithm

Mappers emit encrypted feature values keyed by opName featureId, and the execution
framework groups values by that key. The combiner receives the encrypted feature vector
of the queried image, calculates the difference between the combined values from the
datastore and the queried image's values, and emits it. The reducers sum the received
differences and write names and the square roots of the values to files.

upload img m1 ← new BigInteger
upload img m2 ← new BigInteger
upload img m3 ← new BigInteger

1: class Mapper
2:     procedure Map(LongWritable key, Text value)
3:         op name ← new String
4:         m1, m2, m3 ← new BigInteger
5:         line ← line from file
6:         arr tokens ← tokenizer(line)
7:         while arr tokens has tokens do
8:             m1 ← arr tokens.nextToken
9:             m2 ← arr tokens.nextToken
10:            m3 ← arr tokens.nextToken
11:            Emit(opName m1, encrypted vals m1)
12:            Emit(opName m2, encrypted vals m2)
13:            Emit(opName m3, encrypted vals m3)

1: class Combiner
2:     procedure Configure(JobConf conf)
3:         upload img m1 ← conf.mean of upload1
4:         upload img m2 ← conf.mean of upload2
5:         upload img m3 ← conf.mean of upload3
6:     procedure Reduce(Text key, values)
7:         subtraction of features ← new Integer
8:         sum ← 0
9:         for all va ∈ values do
10:            sum ← sum + va
11:        if key contains m1 then
12:            subtraction of features ← sum − upload img m1
13:        if key contains m2 then
14:            subtraction of features ← sum − upload img m2
15:        if key contains m3 then
16:            subtraction of features ← sum − upload img m3
17:        Emit(opName, subtracted encrypted vals subtraction of features)

1: class Reducer
2:     procedure Reduce(Text key, values)
3:         final root ← new Double
4:         sum ← 0
5:         for all va ∈ values do
6:             sum ← sum + va
7:         final root ← sqrt(sum)
8:         Emit(opName, subtracted encrypted vals final root)
Chapter 6
Experiments, Testing and Result Analysis
6.1 Cluster Configuration
The application is web based, so a web server is needed to run it. Apache Tomcat is an
open source web server and servlet container developed by the Apache Software
Foundation (ASF). Tomcat implements the Java Servlet and JavaServer Pages (JSP)
specifications from Sun Microsystems, and provides a "pure Java" HTTP web server
environment for Java code to run [23], [24].
Initially, Hadoop was installed on a single system, i.e. a single node cluster, which was
used mainly for application development. After development was completed, the single
node cluster was gradually increased to two, three, four and then five nodes.
In a single node cluster, the master node and slave node are the same, i.e. only one node,
whereas a multi node cluster has one master and several slaves (more master nodes are
possible). Each node of the cluster has Hadoop's Map Reduce installed, but HBase is
limited to the master only and is used to store users' login credentials.
The configuration of each node and the role it plays in the cluster is shown in Table 6.1.
The master node runs all the major processes: the Namenode, Datanode, Jobtracker,
Tasktracker and secondary namenode processes, and the HMaster process of HBase.
Slave nodes run only the Datanode and Tasktracker processes. One more small cluster
of 3 nodes was created just to check performance.
Nodes | Description         | Role                         | Memory (GB) | CPU                   | CPU Freq. (GHz) | Disk (GB) | OS
------|---------------------|------------------------------|-------------|-----------------------|-----------------|-----------|-------------
1     | Laptop              | Single Node Cluster          | 4           | Intel Core i3-370M    | 2.4             | 40        | Ubuntu 12.04
1     | Desktop Workstation | 1st Node of Cluster (Master) | 4           | Dual Core AMD Opteron | 1.8             | 48        | Ubuntu 12.04
2     | Desktop Workstation | 2nd Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 2.6             | 50        | Ubuntu 12.04
3     | Desktop Workstation | 3rd Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 2.0             | 60        | Ubuntu 12.04
4     | Desktop Workstation | 4th Node of Cluster (Slave)  | 4           | Dual Core AMD Opteron | 1.8             | 150       | Ubuntu 12.04
5     | Desktop Workstation | 5th Node of Cluster (Slave)  | 4           | Dual Core AMD Opteron | 2.0             | 50        | Ubuntu 12.04
6     | Desktop Workstation | 6th Node of Cluster (Slave)  | 2           | Dual Core AMD Opteron | 1.8             | 150       | Ubuntu 12.04

Table 6.1: Cluster Configuration
Nodes | Description | Role               | Memory (GB) | CPU     | CPU Freq. (GHz) | Disk (GB) | OS
------|-------------|--------------------|-------------|---------|-----------------|-----------|-------------
3     | Desktop     | 1 Master, 2 Slaves | 4           | Core i5 | 3.10            |           | Ubuntu 12.04

Table 6.2: Second Cluster Configuration
6.2 Experiments and Testing
Experiments require a database, in this case an image database. Initial experiments were
done on static data: when the application's main processes were developed, i.e. Map
Reduce for upload and retrieval, they were tested on an image data set of 10 to 15 files.
First, the feature vectors of these images were created through Java code and uploaded
to HDFS, and that data set was used in the processes. Then code was written to
automatically upload files to HDFS and perform all operations on the encrypted image
files directly.
Experiments were carried out at the upload and retrieval stages with a repository of
images, stored in data sets of sizes 10, 20, ... up to 100, then 200, 300, ... 1000, and
2000, ... 5000. Many of the images were repeated; all had the same file format (.jpg) and
the same size, 640 by 480 pixels.
Experiments were repeated with a set of 25 images of different formats, such as .jpg, .gif,
.bmp and .png, and of different sizes. These processes take time to complete, and these
timings are used to measure the performance of the system, which is calculated at both
the upload and retrieval stages.
6.3 Result Analysis
At the start of testing, experiments were done on two machines, both having the same
single node cluster setup of Hadoop but different processors, speeds, RAM, etc. Readings
of the upload and retrieval processes were taken from both machines for different sets of
images. The observations show that a map reduce process depends on the processor
version, its speed/frequency, and the RAM present in the system; the following results
highlight this fact. The results in the figures compare the performance of the map reduce
upload and retrieval processes on one node clusters installed on a laptop and on a desktop,
whose configurations are given in the table above. Figure 6.1 shows the total time taken
to upload x images to the system, where the horizontal axis is the number of images x
and the vertical axis is time in milliseconds.
Figure 6.1: One Node Cluster Difference at Upload MapReduce
Figure 6.2 shows the same experiment for the retrieval image map reduce process, with
the same pattern of observations as the upload process. Retrieval takes less time than
upload, since the upload process involves more computation.
Figure 6.2: One Node Cluster Difference at Retrieval MapReduce
Similarly, the difference between the two three-node clusters was also measured, as seen
in the following figures; the configuration of the second cluster is shown in Table 6.2.
Comparing the two three-node clusters, we can clearly see that processor version, speed
and memory availability directly affect the Map Reduce process. Figure 6.3 shows the
change in the upload Map Reduce.
Figure 6.3: Three Node Cluster Difference at Upload MapReduce
Figure 6.4 shows the change in the retrieval Map Reduce.
Figure 6.4: Three Node Cluster Difference at Retrieval MapReduce
After taking results on a one node cluster, the cluster size was gradually increased to
two, three, four, five and six nodes. Results were taken at each cluster size and are
displayed in Figure 6.5.
Figure 6.5: Performance of Upload MapReduce at Cluster
Figure 6.6 shows the same experiment for the retrieval MapReduce process.
Figure 6.6: Performance of Retrieval MapReduce at Cluster
Readings were taken by running the processes on gradually increasing cluster sizes. As before, retrieval takes less time than upload because the upload process involves more computation than retrieval.
As the figures show, the readings at cluster size 3 are lower than at cluster size 4, although the 4-node readings should have been lower. This happened because of the addition of a bad node: the node that was added had only 2 GB of RAM, a slower processor, and slow LAN connectivity, so the readings increased instead of decreasing. All other readings decrease gradually as the cluster size is increased.
Figures 6.7 and 6.8 show the scale-up of the time spent by each process. These readings are the differences between the times measured on consecutive cluster sizes, again in milliseconds. A negative value marks the bad-node entry, i.e. the increase in cluster size from 3 to 4.
Figure 6.7: Scale Up between Different Cluster Nodes
Figure 6.8: Scale Up between Different Cluster Nodes
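The scale-up values plotted in figures 6.7 and 6.8 are simply the differences between job times on consecutive cluster sizes, with a negative value flagging the bad-node step. A minimal sketch of that calculation follows; the timing values used here are illustrative, not the measured readings:

```java
public class ScaleUp {
    // Difference between consecutive readings; times[i] is the
    // job time in milliseconds on a cluster of i+1 nodes.
    // A positive entry means the larger cluster was faster,
    // a negative entry means it was slower (the bad-node case).
    static long[] scaleUp(long[] times) {
        long[] diff = new long[times.length - 1];
        for (int i = 1; i < times.length; i++) {
            diff[i - 1] = times[i - 1] - times[i];
        }
        return diff;
    }

    public static void main(String[] args) {
        // Illustrative times for 1..6 nodes; the 4-node value is worse
        // than the 3-node one, mimicking the bad node in the experiment.
        long[] times = {9000, 7000, 5500, 6000, 5000, 4500};
        for (long d : scaleUp(times)) System.out.println(d);
    }
}
```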
6.4 Error Check in Application
The following screens show the different error-checking schemes used throughout the application life cycle.
Figure 6.9: Error Checking in Email
This error check is performed on the upload screen, when the user types a path to a folder that does not exist.
Figure 6.10: Error Check at Upload Images
This error check is performed on the search screen, when the image file path typed by the user, on the basis of which the search is to be performed, does not exist.
Figure 6.11: Error check at Search Images
6.5 Comparison of Different Systems
The following table compares different commercially available image retrieval systems; some are applications and some are user libraries. The comparison shows the differences between the available systems and my system on some major criteria.
Figure 6.12: Comparison of Different Systems
Chapter 7
System Output
7.1 Input Images
The following figure shows an example of images uploaded by a system user. It contains images of different formats and variable sizes, with formats such as .jpg, .png, .gif and .bmp. The system does not support upload and retrieval of images with the .tiff extension.
The following is the set of images uploaded to the system.
Figure 7.1: Input set of Images Uploaded
The image shown next is the query image on the basis of which similar images are retrieved.
Figure 7.2: Input Search Image
The following is the output of the system. Depending upon the mean value of the image, all similar images are retrieved from the database: the system outputs all images whose mean values lie close to the mean of the input image, with a variation of five from the input mean allowed.
Figure 7.3: Output of System
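The mean-based matching described above can be sketched as follows. The pixel arrays and method names here are illustrative assumptions, as is treating the stated variation of five as a simple absolute tolerance on the mean; this is a sketch of the idea, not the system's actual code:

```java
public class MeanSimilarity {
    static final double TOLERANCE = 5.0; // allowed variation from the query mean

    // Mean of pixel intensity values (0-255).
    static double mean(int[] pixels) {
        long sum = 0;
        for (int p : pixels) sum += p;
        return (double) sum / pixels.length;
    }

    // A stored image matches when its mean lies within TOLERANCE of the query mean.
    static boolean isSimilar(double queryMean, double storedMean) {
        return Math.abs(queryMean - storedMean) <= TOLERANCE;
    }

    public static void main(String[] args) {
        double query = mean(new int[]{100, 110, 120});                        // mean 110.0
        System.out.println(isSimilar(query, mean(new int[]{104, 114, 124}))); // true  (mean 114.0)
        System.out.println(isSimilar(query, mean(new int[]{10, 20, 30})));    // false (mean 20.0)
    }
}
```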
Chapter 8
Conclusions and Future Scope
8.1 Conclusion
We experimented with and evaluated the proposed system on three aspects: its similarity retrieval, its execution time complexity, and its security.
On the similarity measurement front, the application successfully uploads, searches and retrieves similar images with a precision of nearly 89 to 94% for image data of the same format and the same dimensions. Varying the dimensions reduces the similarity precision by about 20%, but the application is not affected by format changes and delivers the same high similarity.
For execution time complexity, the application was tested on desktop systems with different configurations. Systems with newer processors and more memory deliver higher performance, improving time efficiency by nearly 50%. The throughput of the system increased with the use of recent multi-core processors running at high frequency, and main memory also played a crucial role in increasing performance.
The Paillier cryptosystem used in this application is highly secure. It is used to achieve user privacy and database privacy with respect to each other. The system achieves high security because the retrieval process is carried out entirely on encrypted data, secured with the user's random key, so it does not lag on the security front. It could be made more secure still by using different keys provided by secure key management servers.
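The property that makes retrieval over encrypted data possible is Paillier's additive homomorphism: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The following sketch illustrates this with the textbook scheme (g = n + 1) at a toy key size; the thesis does not specify its key generation or parameters, so none of this should be read as the application's actual configuration:

```java
import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierDemo {
    final BigInteger n, n2, lambda, mu;
    final BigInteger g; // standard choice g = n + 1
    final SecureRandom rnd = new SecureRandom();

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits, rnd);
        BigInteger q = BigInteger.probablePrime(bits, rnd);
        n = p.multiply(q);
        n2 = n.multiply(n);
        g = n.add(BigInteger.ONE);
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.divide(p1.gcd(q1)).multiply(q1); // lcm(p-1, q-1)
        mu = lambda.modInverse(n);                   // valid because g = n + 1
    }

    // c = g^m * r^n mod n^2, with random r coprime to n.
    BigInteger encrypt(BigInteger m) {
        BigInteger r;
        do { r = new BigInteger(n.bitLength(), rnd); }
        while (r.signum() == 0 || r.compareTo(n) >= 0
               || !r.gcd(n).equals(BigInteger.ONE));
        return g.modPow(m, n2).multiply(r.modPow(n, n2)).mod(n2);
    }

    // m = L(c^lambda mod n^2) * mu mod n, where L(u) = (u - 1) / n.
    BigInteger decrypt(BigInteger c) {
        BigInteger u = c.modPow(lambda, n2);
        return u.subtract(BigInteger.ONE).divide(n).multiply(mu).mod(n);
    }

    // Multiplying ciphertexts adds the plaintexts: a server can combine
    // encrypted values without ever seeing them.
    BigInteger addCiphertexts(BigInteger c1, BigInteger c2) {
        return c1.multiply(c2).mod(n2);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(64);
        BigInteger m1 = BigInteger.valueOf(120), m2 = BigInteger.valueOf(17);
        BigInteger sum = ph.decrypt(ph.addCiphertexts(ph.encrypt(m1), ph.encrypt(m2)));
        System.out.println(sum); // 137
    }
}
```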
Although the system has produced satisfactory results, we consider this only a first step towards providing internet and cloud users with a private CBIR system, and it opens avenues for further research and development.
8.2 Future Scope
This application is designed to span three different domains, but it does not draw on the detailed research of any one of them, so there is ample scope to improve it from each domain's perspective. The application can be evolved in the Private Information Retrieval, Content Based Image Retrieval and oblivious transfer (i.e. security) domains.
Different homomorphic encryption techniques can be used, the similarity measurement can be changed to improve accuracy, and different feature extraction techniques such as Tamura features and shape features can be used alongside the color histogram to increase the accuracy of similarity measurement.
This application has large scope in the cloud and internet domains, because servers across every domain are adopting cloud technology and want to offer their customers cutting-edge imaging applications that are secure, efficient, and deliver high throughput.
Appendix A
Publication Status
Title:   Private Image Information Retrieval using Map/Reduce on Cloud
Journal: International Journal of Advances in Management, Technology and Engineering Sciences [ISSN 2249-7455]
Status:  Published