
On Improving the Performance of Data Leak Prevention using White-list Approach

An-Trung B. Nguyen (1)
Faculty of Information Technology, University of Science Hochiminh City, Vietnam
(84)976205232
[email protected]

Trung H. Dinh (2)
Faculty of Information Technology, University of Science Hochiminh City, Vietnam
(84)988913415
[email protected]

Dung T. Tran (3)
Faculty of Information Technology, University of Science Hochiminh City, Vietnam
(84)932408618
[email protected]

ABSTRACT

Fang Hao et al. [1] proposed a Data Leakage Prevention (DLP) model that combines a 128-bit CRC (Cyclic Redundancy Check) with the Bloom filter [2]. In this paper, we improve on their work by combining other hash functions with the Bloom filter. The experimental results show that our approach significantly improves the system's throughput. In our experiments, we used a 9.3 GB set of more than 121,000 HTML files.

Keywords

Data leak prevention, Bloom filter, fingerprint, hash function.

1. INTRODUCTION

In recent years, the trend of storing data on the cloud has been growing rapidly because of the convenience of this kind of service. Looking beyond the significant advantages, there is an important problem that users and service providers have to consider: the safety of the data. In the cloud environment, malicious users are always ready to hijack or steal sensitive data for a wide variety of purposes. As a result, cloud service providers are always looking for new technologies that make such activity more difficult or impossible.

Figure 1. The number of incidents over time (datalossdb.org)

As shown in Fig. 1, taken from DatalossDB [3], there were over 115 data leakage incidents each month during 2013. Some of these incidents affected millions of users. Such leakages happened despite protection mechanisms such as firewalls and IPS/IDS devices. It seems that current protection tools have failed to prevent zero-day attacks, application-level compromises, bugs, and misconfigurations. Moreover, most incidents come from the outside. This demands new approaches to protect data in cloud storage.

Figure 2. Incident source

As mentioned earlier, the DLP model proposed by Fang Hao et al. [1] prevents data leakage by building a fingerprint database sitting at the border of the network: all network traffic has to be inspected, and if sensitive data is detected, the traffic is stopped to minimize the number of leaked bytes. Fang Hao et al. [1] used a 128-bit CRC to create the fingerprints. This approach is simple but not efficient for the system's performance, since building the database requires system resources and every check for a data leak needs to re-run the CRC. We propose an alternative approach which improves the process of creating the fingerprints while still keeping the fingerprint collision rate low.

Checking every single byte of outgoing files is impractical in terms of memory usage and computational cost, so the outgoing files should be checked at some sample points, say t_1 < t_2 < t_3 < ... < t_k. These checking points are determined at the beginning. Based on them, we create a fingerprint for every segment of every file in the database and store the fingerprints in a Bloom filter [2]. When a user tries to download a file, the DLP divides the file into parts according to the checking points, calculates their fingerprints, and compares them with the ones stored in the Bloom filter. If any mismatch occurs, the file is not on the white list, and the transmission is immediately terminated.

The initial step is one of the most important steps in our proposal. Since the Bloom filter is a probabilistic data structure, the size of the filter is directly related to the false positive probability. On the other hand, the collision probability of the hash functions is also an important factor for performance. We will show how this is so in the experiment section.

In this work, we conduct experiments with five hash functions from two popular groups: cryptographic and non-cryptographic hash functions. We use the system's throughput and the percentage of leaked files to evaluate the hash functions.

This paper has 6 sections. Section 1, the Introduction, sets up the purpose of the study, determines the standards used to measure the results, and touches on similar studies. We summarize related work in Section 2. In Section 3 we present the Bloom filter, and in Section 4 the hash functions. We present the implementation and experimental results in Section 5. Our conclusions, with discussion, are presented in Section 6.

2. RELATED WORK

MyDLP [4] and IronPort [5] use a similar approach, a network-based prevention model. They apply the following two techniques:

- Use a black-list to detect data leaks.
- Use keywords or regular expressions to detect the same.

For instance, WebDLP [6,7] and IronPort [5] use similarity-based content classifiers to identify content, then compare it to the confidential black-list contents. MyDLP [4] also uses a black-list combined with keywords, regular expressions, and full-file hashes to prevent data leaks. These approaches are not foolproof, in that an attacker can encrypt or transform content to evade the black-list. False positives can also be high, because legitimate documents may contain some of the same keywords.

Conversely, Fang Hao et al. [1] proposed a different approach: extracting fingerprints from each document in a white-list and building a fingerprint database stored at the network boundary. For each data transmission connection, they inspect the outgoing data, calculate fingerprints of the data, and check them against the database. If the fingerprint at any checkpoint of the transmission does not match the database, the transmission is terminated. To reduce memory overhead, Fang Hao et al. [1] proposed an algorithm that optimizes the checkpoints' locations while bounding the worst-case data leakage length.

Figure 3. Adding an element to a Bloom filter

Building on the work of Fang Hao et al. [1], we improve the processing speed of the fingerprint creation step. Our approach is faster but has the same accuracy. We used several different hash functions to generate the fingerprints of documents. Our purpose was to find a hash function with high speed and a low collision rate; in other words, to create fingerprints faster, yet still ensure their accuracy.

3. BLOOM FILTER

A Bloom filter is a data structure designed to tell you, rapidly and with low memory overhead, whether an item is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: if it tells us that an element is not in the set, the element is definitely not in the set; but if it tells us that an element is in the set, the element may or may not actually be there, with a false positive probability that depends on the number of bits of the Bloom filter.

In general, to add an item to the set, we use some hash functions to calculate the hash values of the item. These hash values are then used as bit indexes into the Bloom filter.

More precisely, the idea is to allocate an array v of m bits, initially all set to 0, and then choose k independent hash functions, {h_1, h_2, ..., h_k}, each with range {1, ..., m}. For each element a ∈ A, the bits at positions h_1(a), h_2(a), ..., h_k(a) in v are set to 1. A particular bit may be set to 1 multiple times. Given a query for b, to check membership we check the bits at positions h_1(b), h_2(b), ..., h_k(b). If any of them is 0, then b is certainly not in the set A. Otherwise we conjecture that b is in the set, although there is a certain probability that we are wrong. This is called a "false positive". The parameters k and m should be chosen so that the probability of a false positive (and hence a false hit) is acceptable.

Figure 4. Checking membership of an item
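To make the add and membership operations concrete, the following minimal Python sketch implements the scheme described above. It is an illustration only, not the implementation used in our experiments; in particular, using k salted instances of MD5 from Python's hashlib as the k independent hash functions is an assumption made for brevity.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter: an m-bit array and k salted hash functions."""

        def __init__(self, m, k):
            self.m = m                              # number of bits in the filter
            self.k = k                              # number of hash functions
            self.bits = bytearray((m + 7) // 8)     # all bits initially 0

        def _indexes(self, item):
            # Derive k bit indexes by hashing the item with k different salts.
            for salt in range(self.k):
                digest = hashlib.md5(item + bytes([salt])).digest()
                yield int.from_bytes(digest, "big") % self.m

        def add(self, item):
            for i in self._indexes(item):
                self.bits[i // 8] |= 1 << (i % 8)

        def might_contain(self, item):
            # False means "definitely not in the set"; True means "probably in the set".
            return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))

For example, after bf = BloomFilter(m=8 * 1024 * 1024, k=7) and bf.add(b"segment of a white-listed file"), the call bf.might_contain(b"segment of a white-listed file") returns True, while an unknown segment returns False except with the false positive probability analyzed below.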

The salient feature of Bloom filters is that there is a clear trade-off between m and the probability of a false positive. According to [8,9], after inserting n keys into a table of size m, the probability that a particular bit is still 0 is:

p = (1 - 1/m)^(kn) ≈ e^(-kn/m)    (1)

Hence, the probability of a false positive in this situation is:

(1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k    (2)

The right-hand side of (2) is minimized when k = (m/n) ln 2, which implies a false positive probability of:

(1/2)^k ≈ (0.6185)^(m/n)    (3)

In fact k must be an integer, and in practice we might choose a value less than the optimal one to reduce the number of hash computations. To obtain the expected false positive rate, we have to consider the trade-off between the Bloom filter's size and the false positive probability. In other words, a system with larger physical memory can achieve a lower false positive rate.
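As a worked example of this trade-off (our own illustration, with n matching the file count of our data set and p matching the roughly 0.5 percent false positive rate observed in Section 5): solving (1)-(3) for a target false positive probability p gives m = -n ln p / (ln 2)^2 and k = (m/n) ln 2. A small Python helper makes the computation explicit:

    import math

    def bloom_parameters(n, p):
        """Size a Bloom filter for n keys and target false positive probability p.

        Uses the standard results m = -n*ln(p)/(ln 2)^2 and k = (m/n)*ln 2,
        which follow from minimizing equation (2) at k = (m/n)*ln 2.
        """
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))  # bits in the filter
        k = round((m / n) * math.log(2))                      # number of hash functions
        return m, k

    m, k = bloom_parameters(n=121_000, p=0.005)
    print(m, k)   # -> about 1.33 million bits (roughly 163 kB) and k = 8
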

In our approach, the index into the Bloom filter is calculated from the hash value of the item, which is the segment between two checking points of a document. A hash collision occurs when two different items give the same hash value. In this paper, we use experiments to find the best hash function, one that gives both higher speed and a lower collision rate. The next section analyzes the collision behavior of hash functions.

4. HASH FUNCTION

4.1 Collision probabilities

Suppose a hash function generates one of N possible hash codes uniformly at random, and pick one value. There are N-1 remaining hash codes (out of N possible) that are distinct from it. Therefore, the probability of randomly generating two distinct hash codes out of N possible codes is (N-1)/N. Consequently, the probability of randomly generating three distinct hash codes is ((N-1)/N) × ((N-2)/N). Since generating a random hash code is an independent event, the probability of generating k distinct hash codes is the product of the probabilities of generating each single hash code.

In general, the probability of generating k distinct hash codes is:

P(k) = ((N-1)/N) × ((N-2)/N) × ... × ((N-k+1)/N) = N! / ((N-k)! × N^k)    (4)

This would be computationally difficult to evaluate for a large k. Fortunately, equation (4) can be approximated by:

P(k) ≈ e^(-k(k-1)/(2N))    (5)

The probability that at least one collision occurs among k hashes is therefore 1 - P(k). Figure 5 illustrates this collision probability when using 32-bit hash codes. It is worth noting that a collision occurs with probability 1/2 when the number of hashes (x-axis) is around 77,000. We also note that the graph takes the same S-curved shape for any value of N. Since the hash code will be the index into the Bloom filter, the hash function must have two properties: it must be fast and it must produce few collisions.
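The 77,000 figure follows directly from approximation (5): setting e^(-k(k-1)/(2N)) = 1/2 and solving for k gives k ≈ sqrt(2N ln 2), which for N = 2^32 is about 77,163. A few lines of Python (our own sketch) confirm this:

    import math

    N = 2 ** 32                         # number of possible 32-bit hash codes

    # Solve exp(-k*(k-1)/(2N)) = 1/2 for k: k is approximately sqrt(2*N*ln 2).
    k_half = math.sqrt(2 * N * math.log(2))
    print(round(k_half))                # -> 77163

    # Collision probability 1 - P(k) from approximation (5).
    def p_collision(k, N=N):
        return 1.0 - math.exp(-k * (k - 1) / (2 * N))

    print(p_collision(77_163))          # -> about 0.5, matching Figure 5
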

4.2 Hash function selection

The fingerprint has to be long enough to make the probability of two different items having the same fingerprint infinitesimally small. For practical applications, a 128- or 256-bit fingerprint would be sufficient. In our experiments, we examine two types of hashing: cryptographic and non-cryptographic.

Figure 5. Collision probability of a 32-bit hash function

5. IMPLEMENTATION AND EXPERIMENTS

The implementation of our approach consists of three steps:

- Preprocessing
- Filter construction
- On-line checking

5.1 Preprocessing

We first examine the existing file set and generate the file size distribution. Then we run the dynamic programming algorithm described in [1] to compute the optimal filter placement strategy, including filter locations, number of hash functions, and filter sizes. There is usually a straightforward mapping between the expected false positive probability and the total amount of memory needed for the Bloom filter.

5.2 Filter construction

In this step, we construct the actual Bloom filter based on the optimal filter parameters computed in the previous step. In general, a Bloom filter needs multiple hash functions.


These can be obtained by appending different random numbers to the end of the data and computing a hash value for each. We then take each hash value mod m, where m is the number of bits in the filter, to map it to a bit index in the Bloom filter. We repeat this process for all files.
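As a sketch of the construction step (our own illustration: it reuses the BloomFilter class from Section 3, and the checkpoint offsets shown are hypothetical placeholders for the values produced by the preprocessing step), the filter is populated with the fingerprint of each file's prefix up to every checkpoint, plus the whole file for the final end-of-flow check:

    # Assumes the BloomFilter sketch from Section 3 and checkpoint offsets
    # t_1 < t_2 < ... computed by the preprocessing step.
    checkpoints = [1_000, 10_000, 100_000]    # hypothetical filter locations (bytes)

    def build_filter(file_paths, bf):
        for path in file_paths:
            with open(path, "rb") as f:
                data = f.read()
            for t in checkpoints:
                if t <= len(data):
                    bf.add(data[:t])          # fingerprint of the prefix up to t
            bf.add(data)                      # whole file, for the final check
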

5.3 On-line checking

Once the filters are constructed, the system is ready to launch. At this step, data traffic is processed and the hash checksums are continuously computed. The Bloom filter check is then done at each pre-determined location. A data flow is allowed to continue if the Bloom filter gives a hit. Otherwise the flow is dropped. To minimize the leakage of small files, we mandate a check at the end of the data flow, before releasing the last packet. In this way, small files that are contained in a single packet can always be checked.
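The checking loop can be sketched as follows (our own pseudocode-level illustration; the representation of packets as (is_last, payload) pairs and the buffering of the flow are simplifying assumptions, not the paper's wire-level implementation):

    def inspect_flow(packets, bf, checkpoints):
        """Stream packets, checking the accumulated prefix at each checkpoint.

        Returns True if the flow passed every Bloom filter check, or False as
        soon as a miss occurs (the flow should then be dropped).
        """
        seen = b""
        pending = sorted(checkpoints)
        for is_last, payload in packets:       # (is_last, bytes) pairs
            seen += payload
            while pending and pending[0] <= len(seen):
                t = pending.pop(0)
                if not bf.might_contain(seen[:t]):
                    return False               # mismatch: terminate transmission
            if is_last and not bf.might_contain(seen):
                return False                   # mandatory check before the last packet
        return True
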

5.4 Experiment scenarios

We used 9.3 GB of data to evaluate our algorithm. To collect the data, we used the wget command on Linux to download files from the Internet. In this paper, we crawled the following web pages:

- www.tools.ietf.org
- www.tuoitre.vn
- www.bbc.co.uk
- www.nld.com.vn

These web pages are news and scientific pages. Downloaded files were classified by file type and file size. To automatically download a whole site, we used the following command:

wget -x --convert-links -r <address of website>

In the experiments with the Bloom filter, data files were processed following the three implementation steps described in the previous sections. At the last step, to simulate data leakage, we randomly selected 1000 files, then randomly selected one byte within each file as the "bad byte" and changed its value. For each file, we computed the incremental hash value up to each Bloom filter location and checked whether the fingerprint was present at that location. The bad byte was detected when the check failed. The difference between the detection location and the location of the "bad byte" is the number of bytes that are leaked. This is called the detection lag.
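For reproducibility, this simulation can be expressed in a few lines (a sketch under our assumptions: bf is a filter built over the original, unmodified files as in Section 5.2, and the checkpoint list is the same one used there):

    import random

    def simulate_leak(data, bf, checkpoints):
        """Flip one random byte and return the detection lag in bytes, or None."""
        corrupted = bytearray(data)
        bad = random.randrange(len(corrupted))
        corrupted[bad] ^= 0xFF                           # change the "bad byte"
        points = sorted(checkpoints) + [len(corrupted)]  # include final end-of-flow check
        for t in points:
            if t > bad and not bf.might_contain(bytes(corrupted[:t])):
                return t - bad                           # bytes leaked before detection
        return None                                      # undetected: a leaked file

A file counts as "leaked" when every check happens to hit in the filter despite the corruption, which is exactly the false positive event analyzed in Section 3.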

5.5 Experimental results

First, we used five different hash functions to calculate all the fingerprints for the same HTML file set. According to the experiments, cryptographic hash functions such as SHA1 [10] and MD5 [11] took more time than non-cryptographic ones such as Murmur3 [12], JenkinsHash [13], FNVHash [14] and CityHash [15]. The results also show that the number of leaked files was around 4-5 out of 1000 randomly selected files, which means the false positive probability is approximately 0.5 percent.

Fig. 6 compares the hash functions' speeds when hashing all of the data files. The y-axis is the total time needed to create all the fingerprints. The results show that CityHash is the fastest hash function: it is faster than CRC128 by approximately 200 seconds, while CRC128, SHA1 and MD5 each took around 1500 seconds, longer than CityHash. JenkinsHash was as fast as FNVHash.

According to the above results, the hash functions in the non-cryptographic group are faster than those in the cryptographic group. Although CRC-128 is also a non-cryptographic hash function, it is slower than SHA1 and MD5, which are cryptographic. Moreover, CRC-128 and non-cryptographic hash functions such as JenkinsHash, FNVHash and CityHash have approximately the same collision probability. As a result, we can use CityHash, FNVHash or JenkinsHash instead of CRC-128 to speed up our system.

Figure 6. Speed of hash functions

Figure 7. Leaked files

Fig. 7 shows the number of leaked files when we checked 1000 random files. The number of leaked files varies from 4 to 7. The Bloom filter approach works much better than the Fingerprint Comparison Approach (FCA) [1] because it saves memory and CPU processing. In our experiments, when the fingerprints for all the files were created, the Bloom filter required just 5.12 MB for the given expected detection lag of 1000 bytes.


In comparison with the FCA, the Bloom filter needed 50 times less memory. This is because the FCA has to store each fingerprint in full instead of just setting bits as the Bloom filter does. The Bloom filter also does not need to compare whole fingerprints when checking an incoming string. As a result, this improves the throughput when users download documents from our database.

Fig. 8 shows the average detection lag when we apply different hash functions to create the fingerprints. The results show that the detection lags are not much different. This means that using CityHash could improve the system's throughput while keeping a comparable leaking rate.

Figure 8. Average detection lag

6. CONCLUSION AND DISCUSSION

In this paper, we applied non-cryptographic hash functions to create fingerprints for documents, which are used to detect and prevent data leaks. Our experimental results show that the proposed method improves the system's throughput while keeping the same level of data leakage as the approach in [1].

In our experiments, we used a single-core CPU with limited RAM (4 GB) because of a shortage of resources. In a practical cloud environment, the results could be far better, since the available RAM would be much larger. The system could take advantage of this to reduce the false positive rate. We expect a significant improvement if this approach is deployed on a multi-core CPU system.

7. REFERENCES

[1] Fang Hao, M. Kodialam, T. V. Lakshman, and K. P. N. Puttaswamy, "Protecting cloud data using dynamic inline fingerprint checks," in Proceedings of IEEE INFOCOM 2013, pp. 2877-2885, April 2013.

[2] Burton H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7):422-426, July 1970.

[3] Data Loss DB, http://www.datalossdb.org/statistics

[4] MyDLP, data leak prevention, http://www.mydlp.com

[5] Cisco, "Cisco IronPort data loss prevention," http://www.ironport.com/kr/technology/ironport dlp overview.html

[6] S. Yoshihama, T. Mishina, and T. Matsumoto, "Web-based data leakage prevention," in Proceedings of IWSEC, 2010.

[7] WebDLP, http://www.websense.com/content/home.aspx

[8] Sergey Butakov, "Using Bloom filters in data leak protection applications," in DART@AI*IA, volume 1109 of CEUR Workshop Proceedings, pp. 13-24, CEUR-WS.org, 2013.

[9] Fang Hao, Murali Kodialam, and T. V. Lakshman, "Building high accuracy Bloom filters using partitioned hashing," in Proceedings of ACM SIGMETRICS '07, New York, USA, 2007.

[10] SHA-1 algorithm, http://tools.ietf.org/html/rfc3174

[11] MD5 algorithm, http://www.ietf.org/rfc/rfc1321.txt

[12] Murmur3, https://code.google.com/p/smhasher/

[13] Jenkins hash, http://burtleburtle.net/bob/hash/doobs.html

[14] FNVHash algorithm, https://tools.ietf.org/html/draft-eastlake-fnv-07

[15] CityHash algorithm, https://code.google.com/p/cityhash/
