
M-Sanit: Evaluation of Misusability Measure Based Big Data Sanitization

1 Y. SOWMYA, 2 Dr. M. NAGARATNA, 3 Dr. C. SHOBA BINDHU

1 Dept. of Computer Science and Engineering, Research Scholar, JNTUA, Anantapuramu, India

2 Dept. of Computer Science and Engineering, Associate Professor, JNTUH, Hyderabad, India

3 Dept. of Computer Science and Engineering, Professor, JNTUA, Anantapuramu, India

E-Mail: [email protected];

[email protected];

[email protected]

Abstract

Big data, supported by distributed computing technologies, frameworks and cloud computing, has the wherewithal to add significant value to enterprises. Because of the exponential growth of data, machine learning techniques have become indispensable for discovering useful patterns in it. Most existing data mining techniques focus on extracting hidden trends from databases. However, there is a risk that the data will be misused. The risk is greater in the era of cloud computing, because data owners outsource data to the public cloud and do not maintain a local copy; the data is too voluminous to be accommodated on local machines. As the cloud is untrusted from the user's point of view, focusing only on garnering information from such data is inadequate unless there is a mechanism to withstand inference attacks or misuse of data. In our previous work, a framework named M-Sanit was proposed for misusability measure based big data sanitization using Hadoop's MapReduce programming paradigm. In this paper we evaluate the M-Sanit prototype. The evaluation is required to standardize the prototype by experimenting with different levels of sanitization based on the misusability measure and deriving thresholds for each level that can ensure optimal sanitization of big data with an appropriate level of privacy.

Index Terms – Big data, misusability measure, sanitization, M-Sanit

International Journal of Pure and Applied Mathematics, Special Issue, Volume 119 No. 18 2018, 1859-1870. ISSN: 1314-3395 (on-line version). url: http://www.acadpubl.eu/hub/


1 Introduction

Data is an asset to an organization because it enables data-driven decision making. In every domain, business partners need access to data to perform their duties, and limiting that access in the interest of privacy may handicap them. This is one side of the coin, which supports giving sufficient data to partners. The other side of the coin is the privacy risk posed by malicious insiders. It is therefore essential to have a mechanism in place for detecting data misuse and data leakage. However, detecting malicious insiders is a very challenging task. According to the Cyber Security Watch survey [31], 26% of cyber security attacks in a year are made by malicious insiders. Around 43% of the victims of cyber attacks reported that insider attacks are the most damaging. Leakage of sensitive data was at 15% and theft of sensitive data was at 16%. These statistics reveal the severity of the security and privacy issues caused by malicious insiders.

To mitigate misuse of data in the context of big data, in our prior work [29] we proposed algorithms for parallelizing k-Anonymity for privacy preserving big data mining using the MapReduce framework, with the aim of protecting big data from inference attacks. We then proposed a framework known as M-Sanit [32] for effective sanitization of big data using the MapReduce paradigm. This framework uses an extended misusability score measure, based on the work of [10], to determine the level of sanitization required. M-Sanit applies our algorithm named Misusability Measure-based Big Data Sanitization (MMBDS) to reduce the probability of misuse. However, although the algorithm is capable of applying an appropriate level of sanitization based on the misusability measure, it was not formally evaluated.

The focus of this paper is to evaluate the M-Sanit framework and draw conclusions on the level of sanitization needed to protect big data from leakage and abuse. As insiders have legitimate rights to access data, it is crucial to have such a mechanism to prevent malicious activity by them. Existing privacy preserving data mining techniques such as k-Anonymity [11], l-Diversity [33] and t-Closeness [24] provide privacy to data, but they need to be adapted to the MapReduce programming approach for big data anonymization. Towards this end we studied k-Anonymity by redefining it for MapReduce [29]. However, that work does not use a misusability measure to determine the level of anonymization. To exploit the extended misusability measure [32], we proposed M-Sanit for misusability measure based big data sanitization. The framework thus serves the dual purpose of mining big data and protecting it from misuse. Our contributions in this paper are as follows.

We proposed a methodology for evaluating M-Sanit [32] for misusability based big data sanitization. It focuses in particular on the level of sanitization needed for the misusability probability detected on the given data.

We evaluated the MMBDS algorithm experimentally in order to generalize thresholds for different levels of sanitization in response to the value of the misusability measure.

We built a prototype application to demonstrate the parallelization of k-Anonymity, the computation of the misusability measure on big data, and the utility of the M-Sanit framework with the MapReduce programming paradigm in a distributed environment.

The remainder of the paper is structured as follows. Section 2 reviews related literature on privacy preserving data mining techniques, privacy issues and prevention measures in the context of big data. Section 3 presents our methodology for evaluating the M-Sanit framework. Section 4 presents the results of the evaluation. Section 5 concludes the paper and provides directions for future research.

2 Related Works

This section reviews the literature on privacy preserving data mining techniques, privacy in big data and misusability of data. Liu et al. [1] proposed a perturbation-based method for privacy preserving data mining in a distributed environment. Similar research is reported in [4]. Bonomi et al. [2] proposed an information-theoretic approach to sequential data sanitization. Xu and Jiang [3] proposed a framework for categorizing and applying privacy preserving techniques in big data mining. Matatov et al. [5] proposed a feature-set partitioning approach for privacy preserving data mining. Cheng and Kumar [6] studied sanitization of log files for protecting the data from privacy attacks. Li et al. [7] studied privacy preserving outsourcing of data; towards this end, they sanitized and minimized databases in the context of outsourcing software tests.

Hong et al. [8] applied differential privacy with boosted utility to the sanitization of collaborative search logs, so that the search logs can be distributed to stakeholders without concern about internal attacks on privacy. Zhang et al. [9] proposed a proximity-aware local-recoding anonymization methodology using the MapReduce programming paradigm. Harel et al. [10] proposed a misusability measure known as M-score; its extension, the extended misusability measure, was proposed in our previous work [32]. Chiu and Tsai [11] proposed a data privacy preservation method based on k-anonymity clustering. Chakravorty et al. [12] considered privacy preserving data analytics for smart homes. Various data privacy issues associated with big data are discussed by Smith et al. [13]. Big data issues and challenges are explored in [17], and similar work is reported in [18].

Design principles for efficient knowledge discovery from big data are proposed in [14]. Data mining for big data [15] and the future of enterprise computing [16] explore the respective ideas on big data processing. Risk control for big data, in terms of assessing risk and taking corrective measures, is explored in [19]. Other important studies in the literature cover recognizing big data value [20], scalable k-anonymization [21], protection of big data privacy [22], tools for privacy preserving big data processing [23], anonymity techniques such as k-anonymity, l-diversity and t-closeness [24], privacy preserving techniques for pervasive systems [25], privacy approaches for data pertaining to the Internet of Things (IoT) [26], decentralization of privacy preserving policies [27], and t-closeness along with differential privacy [30]. In this paper, we evaluate privacy preservation of big data based on the computed misusability measure. Sanitization chosen in this way serves the dual purpose of achieving privacy and utility.

3 Methodology To Evaluate M-Sanit

The M-Sanit framework is meant to minimize the risk of data leakage or misuse by exploiting the probability of misusability of big data. The MMBDS algorithm presented in this section is defined in our earlier work [32] and determines the level of sanitization needed based on the value of the misusability measure. More details on M-Sanit and the MMBDS algorithm, including the rrs (1), rdf (2), frs (3) and ms (4) equations pertaining to the extended misusability score, can be found in [32]. The main purpose of this paper is to evaluate the algorithm within the M-Sanit framework, which takes a dataset (big data) as input and produces a sanitized dataset. The sanitization depends on first determining the level of sanitization needed. This is done using the misuse probability (0.0 to 1.0) provided by the extended misusability measure explored in [32]. The measure takes big data as input and computes the probability of misuse of the data. Based on this probability, an appropriate level of sanitization is applied for privacy preserving mining of big data. Which level of sanitization is appropriate for a given misuse probability is the hypothesis tested in this paper.
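The step from a misuse probability to a sanitization level can be pictured with a minimal Python sketch. It assumes three levels, and the cut-off values in `LEVEL_THRESHOLDS` are purely hypothetical placeholders: the paper does not fix them, since deriving suitable thresholds is the purpose of the evaluation. The function name `sanitization_level` is likewise illustrative and is not part of M-Sanit.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the misuse probability (assumption, not from the paper).
LEVEL_THRESHOLDS = (0.33, 0.66)


def sanitization_level(misuse_probability, thresholds=LEVEL_THRESHOLDS):
    """Return level 1, 2 or 3 for a misuse probability in [0.0, 1.0]."""
    if not 0.0 <= misuse_probability <= 1.0:
        raise ValueError("misuse probability must lie in [0.0, 1.0]")
    # bisect_right counts how many thresholds are <= the probability,
    # so a higher probability maps to a higher (stricter) level.
    return bisect_right(thresholds, misuse_probability) + 1


if __name__ == "__main__":
    for p in (0.10, 0.50, 0.90):
        print(p, "-> level", sanitization_level(p))
```

Keeping the thresholds in one tuple makes it easy to replace the placeholder values with cut-offs chosen by domain experts.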



Algorithm 1: MMBDS algorithm [32]

This algorithm computes the misusability score from big data and indicates the level of sanitization needed. However, simply stating level 1, level 2 or level 3 is vague. Therefore, we evaluate how sanitization is applied and what level of it is really needed to protect big data from privacy attacks by malicious insiders. Proactive misusability reduction is the main purpose of M-Sanit: data is modified, or sanitized, so as to reduce its misusability level. At the same time, the sanitized data should remain useful for performing mining tasks. It is therefore important to consider both reducing the misusability probability and preserving the utility of the data for partners who need it to perform their tasks meaningfully. The risk of exposing data is measured first, using our extended misusability measure. The risk is then reduced by sanitization. Finally, it is important to know the utility of data mining algorithms when sanitization is employed.

The evaluation uses a classification algorithm, the Naive Bayes classifier, with the MMBDS algorithm in place. Different levels of sanitization are evaluated using our parallelized k-Anonymity [29]. The evaluation measures considered are precision, recall and Root Mean Square Error (RMSE) against different levels of anonymity. Precision and recall are based on the confusion matrix provided in Table 1.
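To make the k-Anonymity levels concrete, the following Python sketch checks whether a dataset split into partitions satisfies k-anonymity, using a map step (count records per quasi-identifier combination in each partition) and a reduce step (merge the partial counts). It is only an in-process illustration of the MapReduce pattern, not the parallelized algorithm of [29] and not Hadoop code; it checks the property and performs no generalization. The function names and the record fields `age_band`, `zip_prefix` and `diagnosis` are hypothetical.

```python
from collections import Counter
from functools import reduce


def map_partition(records, quasi_identifiers):
    """Map step: count records per quasi-identifier combination in one partition."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def reduce_counts(left, right):
    """Reduce step: merge the partial counts of two partitions."""
    return left + right


def is_k_anonymous(partitions, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times overall."""
    partial = (map_partition(p, quasi_identifiers) for p in partitions)
    totals = reduce(reduce_counts, partial, Counter())
    return all(count >= k for count in totals.values())


# Hypothetical toy data, already generalized into bands and prefixes.
PARTITIONS = [
    [{"age_band": "20-29", "zip_prefix": "515", "diagnosis": "flu"},
     {"age_band": "20-29", "zip_prefix": "515", "diagnosis": "asthma"}],
    [{"age_band": "30-39", "zip_prefix": "500", "diagnosis": "flu"},
     {"age_band": "30-39", "zip_prefix": "500", "diagnosis": "cold"},
     {"age_band": "20-29", "zip_prefix": "515", "diagnosis": "cold"}],
]
QUASI_IDENTIFIERS = ("age_band", "zip_prefix")

if __name__ == "__main__":
    print(is_k_anonymous(PARTITIONS, QUASI_IDENTIFIERS, 2))
    print(is_k_anonymous(PARTITIONS, QUASI_IDENTIFIERS, 3))
```

In the toy data the smallest equivalence class has two records, so the check passes for k=2 and fails for k=3. A larger k forces coarser generalization, which is the source of the utility loss measured in the experiments.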

Table 1: Confusion matrix used for evaluation

                                                Ground Truth (correct       Ground Truth (incorrect
                                                classification decision)    classification decision)
Algorithm (correct classification decision)     True Positive (TP)          False Positive (FP)
Algorithm (incorrect classification decision)   False Negative (FN)         True Negative (TN)

$$\text{Precision} = \frac{TP}{TP + FP} \times 100 \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \times 100 \tag{2}$$

Precision is the ratio of instances correctly classified as positive (TP) to all instances that the algorithm classified as positive (TP + FP). Recall is the ratio of instances correctly classified as positive (TP) to all instances that are actually positive in the dataset (TP + FN). RMSE is the square root of the mean squared difference between predicted and actual values.
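The three measures can be sketched in a few lines of Python. The sketch uses the standard definitions: precision as TP/(TP+FP), recall as TP/(TP+FN), both as percentages, and RMSE as the square root of the mean squared difference between predicted and actual values. The function names and the sample counts are illustrative assumptions, and the paper does not state which predicted values its RMSE is computed over.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP) and recall = TP/(TP+FN), both as percentages."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean square error between two equally long numeric sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
    print(precision_recall(tp=90, fp=10, fn=30))
    # One of four predictions is off by 1.0.
    print(rmse([1.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 1.0]))
```

With the hypothetical counts, precision is 90% and recall is 75%, which shows that the two measures differ only in whether false positives or false negatives enter the denominator.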

4 Experimental Results

Experiments were carried out with the Mushroom dataset from the UCI machine learning repository using the M-Sanit framework, and the results were evaluated.

Table 2: Classification accuracy and RMSE against k value

k-Anonymity Level    Classification Accuracy    RMSE
No anonymization     0.95456                    0.1756
K=5                  0.95023                    0.1751
K=10                 0.94605                    0.1795
K=20                 0.9362                     0.1821
K=25                 0.934505                   0.1873
K=30                 0.923827                   0.1885
K=40                 0.914120                   0.1913
K=45                 0.907924                   0.1935
K=50                 0.907021                   0.1941

Table 2 shows that classification accuracy and RMSE change with the k-Anonymity level. As K increases, accuracy decreases and RMSE generally increases. RMSE is highest at K=50 (0.1941) and lowest at K=5 (0.1751), marginally below the value without anonymization (0.1756). Accuracy is highest with no anonymization and lowest at K=50.
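The trade-off in Table 2 can be turned into a simple threshold rule: pick the largest K whose accuracy drop, relative to the unsanitized baseline, stays within a tolerance. The Python sketch below uses the accuracy values from Table 2; the tolerance `max_loss`, the key 0 for "no anonymization" and the function names are assumptions made for illustration, not values or procedures prescribed by the paper.

```python
# Classification accuracy from Table 2; key 0 stands for "no anonymization".
ACCURACY_BY_K = {
    0: 0.95456, 5: 0.95023, 10: 0.94605, 20: 0.9362, 25: 0.934505,
    30: 0.923827, 40: 0.914120, 45: 0.907924, 50: 0.907021,
}


def accuracy_loss(k, table=ACCURACY_BY_K):
    """Absolute accuracy drop at level k relative to the unsanitized baseline."""
    return table[0] - table[k]


def largest_k_within(max_loss, table=ACCURACY_BY_K):
    """Largest k whose accuracy drop stays within max_loss, or None if there is none."""
    eligible = [k for k in table if k > 0 and accuracy_loss(k, table) <= max_loss]
    return max(eligible) if eligible else None


if __name__ == "__main__":
    # Hypothetical tolerance: accept at most 0.02 absolute loss in accuracy.
    print(largest_k_within(0.02))
```

With a hypothetical tolerance of 0.02 the rule selects K=20, because the drop at K=25 (0.020055) just exceeds the tolerance. A domain expert would choose the tolerance according to how much utility the mining task can afford to lose.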

The accuracy figure plots classification accuracy against the level of anonymity. As K increases, accuracy decreases. The highest accuracy, 0.95456, is obtained with no anonymization, and the lowest, 0.907021, at K=50.

[Figure: Classification Accuracy (%) versus Level of Anonymity]

The RMSE figure plots RMSE against the level of anonymity. As K increases, RMSE increases overall. The highest RMSE, 0.1941, is observed at K=50. The value with no anonymization is 0.1756, and the lowest value in Table 2, 0.1751, occurs at K=5.

Table 3: Precision and recall measures

k-Anonymity Level    Precision    Recall
No anonymization     0.961        0.957
K=5                  0.957        0.953
K=10                 0.950        0.946
K=20                 0.940        0.936
K=25                 0.937        0.933
K=30                 0.925        0.923
K=40                 0.915        0.913
K=45                 0.909        0.908
K=50                 0.908        0.906

Table 3 shows that precision and recall change with the k-Anonymity level. As K increases, both precision and recall decrease. Each is highest with no anonymization and lowest at K=50.

[Figure: RMSE versus Level of Anonymity]

The precision and recall figure plots both measures against the level of anonymity. As K increases, precision and recall decrease. Precision is highest with no anonymization (0.961) and lowest at K=50 (0.908). Recall is highest with no anonymization (0.957) and lowest at K=50 (0.906).

5 Conclusions And Future Work

In this paper we proposed a methodology to evaluate the functionality of M-Sanit, a framework we proposed in [32] for misusability based sanitization of big data. Since the level of sanitization is based on the misusability measure, which gives the probability of big data misuse, this paper focused on evaluating the work using precision, recall, classification accuracy and RMSE. The Naive Bayes classifier was used along with the parallelized k-Anonymity proposed in our previous work [29] to evaluate M-Sanit. The results reveal that it is important to understand the utility of big data after anonymization as well, and that domain experts need to set thresholds based on the usability of the sanitized data.

[Figure: Precision / Recall versus Level of Anonymity (series: Precision, Recall)]

References

[1] Kun Liu, Hillol Kargupta and Jessica Ryan, Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining, IEEE Transactions on Knowledge and Data Engineering, 18 (1), (2006), 92-106.

[2] Luca Bonomi, Liyue Fan and Hongxia Jin, An Information-Theoretic Approach to Individual Sequential Data Sanitization, (2016), 337-346.

[3] Lei Xu and Chunxiao Jiang, A Framework for Categorizing and Applying Privacy-Preservation Techniques in Big Data Mining, (2016), 54-62.

[4] Li Liu, Murat Kantarcioglu and Bhavani Thuraisingham, The Applicability of the Perturbation Based Privacy Preserving Data Mining for Real-World Data, Elsevier, (2008), 5-21.

[5] Nissim Matatov, Lior Rokach and Oded Maimon, Privacy-Preserving Data Mining: A Feature Set Partitioning Approach, Elsevier, (2010), 2696-2720.

[6] Hsin-Jung Cheng and Akhil Kumar, Process Mining on Noisy Logs: Can Log Sanitization Help to Improve Performance, Elsevier, (2015), 1-12.

[7] Boyang Li, Mark Grechanik and Denys Poshyvanyk, Sanitizing and Minimizing Databases for Software Application Test Outsourcing, IEEE International Conference on Software Testing, (2014), 1-10.

[8] Yuan Hong, Jaideep Vaidya, Haibing Lu, Panagiotis Karras and Sanjay Goel, Collaborative Search Log Sanitization: Toward Differential Privacy and Boosted Utility, IEEE Transactions on Dependable and Secure Computing, (2014), 1-16.

[9] Xuyun Zhang, Wanchun Dou, Jian Pei, Surya Nepal, Chi Yang, Chang Liu and Jinjun Chen, Proximity-Aware Local-Recoding Anonymization with MapReduce for Scalable Big Data Privacy Preservation in Cloud, IEEE Transactions on Computers, (2013), 1-14.

[10] Amir Harel, Asaf Shabtai, Lior Rokach and Yuval Elovici, M-Score: A Misuseability Weight Measure, IEEE Transactions on Dependable and Secure Computing, 9 (3), (2012), 1-15.

[11] Chuang-Cheng Chiu and Chieh-Yuan Tsai, A k-Anonymity Clustering Method for Effective Data Privacy Preservation, (2007), 89-99.

[12] Antorweep Chakravorty, Tomasz Wlodarczyk and Chunming Rong, Privacy Preserving Data Analytics for Smart Homes, (2013), 1-5.

[13] Matthew Smith, Christian Szongott, Benjamin Henne and Gabriele von Voigt, Big Data Privacy Issues in Public Social Media, (2012), 1-6.

[14] Edmon Begoli and James Horey, Design Principles for Effective Knowledge Discovery from Big Data, (2012), 1-4.

[15] Xindong Wu, Xingquan Zhu, Gong-Qing Wu and Wei Ding, Data Mining with Big Data, IEEE Transactions on Knowledge and Data Engineering, 26 (1), (2014), 1-11.

[16] Juhnyoung Lee, The Future of Enterprise Computing, (2013), 1-1.

[17] Avita Katal, Mohammad Wazid and R. H. Goudar, Big Data: Issues, Challenges, Tools and Good Practices, (2012), 1-6.



[18] Elisa Bertino, Big Data - Opportunities and Challenges, Panel Position Paper, IEEE, (2013), 479-480.

[19] Duncan Hodges and Sadie Creese, Breaking the Arc: Risk Control for Big Data, IEEE, (2013), 613-621.

[20] Ningyuxin and Liyueling, How We Could Realize Big Data Value, IEEE, (2013), 425-427.

[21] Antorweep Chakravorty, Tomasz Wiktor Wlodarczyk and Chunming Rong, A Scalable k-Anonymization Solution for Preserving Privacy in an Aging-in-Place Welfare Intercloud, IEEE International Conference on Cloud Engineering, (2014), 424-431.

[22] Abid Mehmood, Iynkaran Natgunanathan, Yong Xiang, Guang Hua and Song Guo, Protection of Big Data Privacy, IEEE, 4, (2016), 1821-1834.

[23] Chris Clifton, Murat Kantarcioglu, Xiaodong Lin and Michael Y. Zhu, Tools for Privacy Preserving Distributed Data Mining, 4 (2), (2002), 1-7.

[24] Ninghui Li, Tiancheng Li and Suresh Venkatasubramanian, t-Closeness: Privacy Beyond k-Anonymity and l-Diversity, (2005), 1-10.

[25] Claudio Bettini and Daniele Riboni, Privacy Protection in Pervasive Systems: State of the Art and Technical Challenges, (2014), 1-36.

[26] Mahadev Satyanarayanan, Pieter Simoens, Yu Xiao, Padmanabhan Pillai, Zhuo Chen, Kiryong Ha, Wenlu Hu and Brandon Amos, Edge Analytics in the Internet of Things, (2004), 1-6.

[27] G. Zyskind, O. Nathan and A. Pentland, Decentralizing Privacy: Using Blockchain to Protect Personal Data, (2016), 1-4.

[29] Y. Sowmya and Dr. M. Nagaratna, Parallelizing k-Anonymity Algorithm for Privacy Preserving Knowledge Discovery from Big Data, International Journal of Applied Engineering Research, 11 (2), (2016), 1314-1321.

[30] Josep Domingo-Ferrer and Jordi Soria-Comas, From t-Closeness to Differential Privacy and Vice Versa in Data Anonymization, (2015), 1-20.

[31] Cyber Security Watch Survey, http://www.cert.org/archive/pdf/ecrimesummary10.pdf, 2012.

[32] Y. Sowmya and Dr. M. Nagaratna, M-Sanit: A Framework for Effective Big Data Sanitization Using Map Reduce Programming in Hadoop, International Journal of Applied Engineering Research, 11 (2), (2017), 1314-1321.

[33] A. Machanavajjhala, J. Gehrke, D. Kifer and M. Venkitasubramaniam, l-Diversity: Privacy Beyond k-Anonymity, in Proc. 22nd Intnl. Conf. Data Engg. (ICDE), (2006), 1-24.

