Secure Data Outsourcing

Outline
- Motivation
- Background
- Research issues
- Summary

Motivation
- Cost of maintaining/mining large data
  - 4-5 times the cost of data acquisition
  - DBAs are paid well
- More and more data service providers
  - Low cost: cloud computing
  - Maintain one database for one user → multiple users
  - Examples: Alentus.com, Datapipe.com, Discountasp.net, …
- Concerns about data security and privacy
  - Untrusted service provider

Untrusted service provider
- Lazy: incentives to perform less
- Curious: incentives to acquire information
- Malicious:
  - Denial of service
  - Incorrect results
  - Possibly compromised

Challenges
- Data confidentiality
  - Data need to be encrypted (?)
  - Utility of protected data?
    - Query utility
    - Mining utility
- Access pattern privacy
- Integrity
  - Data integrity
  - Query integrity: correct, complete, fresh

Why is it hard for query services?
- Arbitrary expressivity
  - SQL statements
  - Often restricted to certain types of queries for simplicity (e.g., range query, kNN query)
- Cost
  - Communication
  - Computation (server side vs. client side)

Why is it hard for mining services?
- Many data mining models
- Different utilities to preserve
- No one-size-fits-all solution

Data confidentiality
- Bucketization method (crypto-index)
- Order preserving encryption
- Perturbations

Bucketization method: Hacigumus (SIGMOD02)
Main steps
- Partition sensitive attributes
  - Order preserving: supports comparison
  - Random: query rewriting becomes hard
- Build index on the partitions
- Rewrite queries to target partitions
  - 'john doe' → 105
  - SELECT * FROM T' WHERE name = 105
- Execute queries and return results
- Prune/post-process results on client

Trade-off between confidentiality and overhead
- Larger partitions → increased privacy, but also increased overhead
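The bucketization steps can be sketched roughly as follows. This is a minimal illustration, not the scheme from the cited paper: the bucket boundaries, the salary column, and the helper names are all made up for the example, and `seal`/`unseal` are placeholders standing in for a real cipher.

```python
import bisect
import json

# Hypothetical order-preserving partition of a salary attribute:
# bucket i covers [BOUNDS[i], BOUNDS[i + 1]).
BOUNDS = [0, 30000, 60000, 90000, 120000]

def bucket_id(value):
    """Map a plaintext value to its bucket index (the crypto-index)."""
    return bisect.bisect_right(BOUNDS, value) - 1

def seal(row):
    # Placeholder only: a real deployment would encrypt the row here.
    return json.dumps(row).encode()

def unseal(blob):
    return tuple(json.loads(blob.decode()))

def outsource(rows):
    """Client side: the server stores only (bucket id, opaque blob)."""
    return [(bucket_id(row[1]), seal(row)) for row in rows]

def rewrite_range(lo, hi):
    """Rewrite lo <= salary <= hi into the bucket ids that may hold matches."""
    return set(range(bucket_id(lo), bucket_id(hi) + 1))

def server_select(table, buckets):
    """Server side: sees bucket ids only, returns every blob in those buckets."""
    return [blob for b, blob in table if b in buckets]

def client_query(table, lo, hi):
    candidates = server_select(table, rewrite_range(lo, hi))
    rows = [unseal(blob) for blob in candidates]
    # Post-processing on the client prunes the false positives.
    return [r for r in rows if lo <= r[1] <= hi], len(candidates)

rows = [("ann", 25000), ("bob", 45000), ("cat", 58000), ("dan", 95000)]
table = outsource(rows)
result, fetched = client_query(table, 40000, 50000)
```

Here the server returns two candidate rows for a query that matches one, which is the overhead side of the trade-off: wider buckets hide more but return more junk.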

Order preserving encryption: Agrawal2004, Boldyreva2009
- The data set is securely transformed so that the order is preserved but the distribution and domain are changed
- Benefits: indexing/searching on OPE encrypted data
- Weakness: once the original distribution is known, OPE is broken
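The weakness can be illustrated with a toy example. The mapping below is not a real OPE scheme, only a random strictly increasing table, and the attacker is idealised: it is assumed to know the exact multiset of plaintext values, whereas a realistic attacker would know only an approximate distribution and recover approximate values.

```python
import random

def toy_ope_table(domain, rng):
    """Toy order-preserving map: random positive gaps keep the order."""
    cipher, table = 0, {}
    for p in domain:
        cipher += rng.randint(1, 1000)
        table[p] = cipher
    return table

rng = random.Random(7)
ope = toy_ope_table(range(0, 101), rng)

# Data owner: encrypts a column of ages.
ages = [rng.randint(18, 90) for _ in range(1000)]
encrypted = [ope[a] for a in ages]

# Attacker: sorts the ciphertexts and aligns them by rank with the
# known plaintext values; no key material is needed.
known_sorted = sorted(ages)
guess = dict(zip(sorted(encrypted), known_sorted))
recovered = [guess[c] for c in encrypted]
```

Because the order is preserved, rank matching alone is enough to invert the mapping in this idealised setting.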

Not attribute-wise order preserving
- Order preserving encryption (OPE, Agrawal et al. 2004) is not resilient to distribution-based attacks

[Figure: the known distribution of the original Xi is matched against the distribution of the OPE-transformed Xi' by bucket-based estimation]

Data perturbation
Definition
1. Randomly change the original data
2. The attacker cannot effectively recover the original data
3. The desired properties are preserved
Techniques
- Single dimension: noise addition
- Multidimensional
  - Geometric perturbation
  - Random projection
  - RASP (random space perturbation)

Noise addition: Y = X + R
- X: original data column, R: random noise (distribution published), Y: published data
- Applications in data mining
  - Reconstructing the column distribution (Rakesh Agrawal, SIGMOD 2000)
  - Applied to privacy-preserving decision tree and naïve Bayes classifiers
- Attacks
  - Spectral filtering (Kargupta ICDM 2004)
  - PCA reconstruction (Huang SIGMOD 2005)
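A small sketch of why publishing the noise distribution keeps aggregate utility. It recovers only the mean and variance of X from Y, a much simpler task than the full distribution reconstruction in the cited work. The Gaussian data, the noise level `SIGMA_R`, and the sample size are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(42)
SIGMA_R = 20.0   # standard deviation of the noise, published with the data
N = 20000

x = [rng.gauss(50.0, 10.0) for _ in range(N)]    # private column X
r = [rng.gauss(0.0, SIGMA_R) for _ in range(N)]  # zero-mean noise R
y = [xi + ri for xi, ri in zip(x, r)]            # published column Y = X + R

# With zero-mean noise, E[Y] = E[X].
est_mean = statistics.fmean(y)
# With R independent of X, Var(Y) = Var(X) + Var(R).
est_var = statistics.pvariance(y) - SIGMA_R ** 2
```

Individual values of Y are far from X (the noise is twice as wide as the data), yet the column statistics are still estimable, which is what the mining applications rely on.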

Multiplicative perturbations
- Geometric data perturbation for outsourced data mining
- Random projection
- RASP perturbation for query services (range query, kNN query)

Perturbation-based framework

Mining service

Geometric data perturbation: Y = RX + T + D
- R: secret rotation matrix (preserves Euclidean distances)
- T: secret random translation matrix
- D: secret random noise matrix
- Distances are approximately preserved (because of D)
- Resilient to most attacks on rotation perturbation
- Applications
  - Outsourced privacy-preserving data mining, applicable to many classification and clustering algorithms
- Attacks
  - Population-based attacks (when the covariance matrix is revealed)
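A two-dimensional sketch of Y = RX + T + D, applied one point at a time. The rotation angle, translation, noise level, and sample points are arbitrary choices for illustration. It shows that distances are preserved exactly without D and approximately with it.

```python
import math
import random

def perturb(points, theta, t, noise_sd, rng):
    """Apply y = R x + t + d to 2-D points, where R rotates by theta."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for x0, x1 in points:
        d0 = rng.gauss(0.0, noise_sd)   # per-point noise (one column of D)
        d1 = rng.gauss(0.0, noise_sd)
        out.append((c * x0 - s * x1 + t[0] + d0,
                    s * x0 + c * x1 + t[1] + d1))
    return out

rng = random.Random(1)
pts = [(1.0, 2.0), (4.0, 6.0), (-3.0, 0.5)]
theta = rng.uniform(0.0, 2.0 * math.pi)                # secret rotation
t = (rng.uniform(-10.0, 10.0), rng.uniform(-10.0, 10.0))  # secret translation

exact = perturb(pts, theta, t, 0.0, rng)    # rotation + translation only
noisy = perturb(pts, theta, t, 0.05, rng)   # with the noise component D

d_orig = math.dist(pts[0], pts[1])
d_exact = math.dist(exact[0], exact[1])
d_noisy = math.dist(noisy[0], noisy[1])
```

Distance-based learners (kNN, k-means, SVM with distance-based kernels) therefore behave almost the same on the perturbed data as on the original.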

Random projection: Y = AX + D
- A: random projection matrix, e.g., entries drawn from N(0,1)
- Distances are approximately preserved
- Applications
  - Many classification and clustering algorithms
  - Worse accuracy than geometric perturbation
  - Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)
- Attacks
  - Possibly more resilient than the other two perturbation methods
  - But utility (distance) is not well preserved
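A rough sketch of the distance behaviour of Y = AX. Two things are assumptions of the example rather than part of the slide: the entries of A are scaled by 1/sqrt(m) so that distances are preserved in expectation, and the noise term D is left out. The dimensions are arbitrary.

```python
import math
import random

def random_projection_matrix(m, d, rng):
    """m x d matrix with N(0,1) entries, scaled by 1/sqrt(m) (an assumption)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]

def project(a, x):
    """Matrix-vector product a @ x in plain Python."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

rng = random.Random(3)
d, m = 500, 200                      # original and projected dimensions
a = random_projection_matrix(m, d, rng)

x1 = [rng.random() for _ in range(d)]
x2 = [rng.random() for _ in range(d)]

# Ratio of projected distance to original distance; close to 1 but not equal.
ratio = math.dist(project(a, x1), project(a, x2)) / math.dist(x1, x2)
```

The ratio fluctuates around 1 from pair to pair, which matches the point that utility is preserved only approximately.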

RASP perturbation
k-dimensional numeric data, n records, represented as a k x n matrix; x: a record

(1) Extend x to k+2 dimensions
- The (k+1)-th dimension is always 1: the homogeneous dimension
- The (k+2)-th dimension v is a random real number

(2) Encryption: y = A (x^T, 1, v)^T
- A is a (k+2) x (k+2) invertible real-valued matrix, with at least two non-zero values in each row and all non-zero values in the last column
- A is shared by all records

Properties
- Not an OPE
- Preserves convexity of the dataset
  - A convex dataset in R^k → another convex dataset in R^{k+2}
- Good for range queries
  - Each range query in R^k → a hyperplane-based (half-space) query in R^{k+2}

RASP properties: convexity preserving
- The queried range (hypercube) is convex
- RASP transforms the range to another convex set (a polyhedron)
- Hyperplane: w^T x = a
- Half space: w^T x <= a
- The intersection of convex sets is also convex

[Figure: illustration of convexity preserving, original space vs. encrypted space]

Secure query transformation: a naïve solution
- Based on the convexity preserving property
- Problems:
  (1) A^{-1} can be probed
  (2) If a is known, the whole dimension i is breached

Secure query transformation: enhanced solution
- X_{k+2} is always positive
- (X_i - a) <= 0 ⇔ (X_i - a) X_{k+2} <= 0
- Correspondingly, in the encrypted space: y^T Θ y <= 0, where Θ is a (k+2) x (k+2) query matrix
- Problems addressed:
  (1) A^{-1} cannot be derived from Θ
  (2) (X_i - a) X_{k+2} <= 0 contains the random component X_{k+2}, which protects the condition (X_i - a) <= 0
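The enhanced condition can be checked on a tiny worked example with k = 1. Everything concrete here is an assumption for illustration: the matrix `A`, the records, and in particular the way the query matrix is built, as (A^{-1})^T u w^T A^{-1} with u^T z = X_i - a and w^T z = X_{k+2}. That construction satisfies the identity, but it is not confirmed to be the authors' exact one. Exact fractions are used so the sign test has no rounding error.

```python
from fractions import Fraction

def inverse(m):
    """Gauss-Jordan inverse over exact fractions."""
    n = len(m)
    aug = [[Fraction(e) for e in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [e / pv for e in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [e - f * p for e, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Secret key: invertible, >= 2 non-zeros per row, last column all non-zero.
A = [[2, 1, 3], [1, 0, 2], [0, 1, 1]]
A_INV = inverse(A)

def encrypt(x, v):
    """y = A (x, 1, v)^T for a one-dimensional record x and positive random v."""
    z = [Fraction(x), Fraction(1), Fraction(v)]
    return [sum(a * b for a, b in zip(row, z)) for row in A]

def query_matrix(bound):
    """Client side: matrix theta with y^T theta y = (x - bound) * v."""
    u = [Fraction(1), Fraction(-bound), Fraction(0)]   # u . z = x - bound
    w = [Fraction(0), Fraction(0), Fraction(1)]        # w . z = v
    p = [sum(A_INV[i][j] * u[i] for i in range(3)) for j in range(3)]
    q = [sum(A_INV[i][j] * w[i] for i in range(3)) for j in range(3)]
    return [[p[i] * q[j] for j in range(3)] for i in range(3)]

def server_test(theta, y):
    """Server side: evaluates the quadratic form without seeing x, v, or A."""
    return sum(y[i] * theta[i][j] * y[j] for i in range(3) for j in range(3)) <= 0

records = [(3, 2), (7, 5), (10, 1)]          # (x, v) pairs, v > 0
encrypted = [encrypt(x, v) for x, v in records]
theta = query_matrix(7)                      # encodes the condition x <= 7
matches = [server_test(theta, y) for y in encrypted]
```

The server's answers agree with the plaintext condition x <= 7 for each record, although it handles only the encrypted vectors and the quadratic form.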

Efficient two-stage query processing
- A multidimensional tree index is built on the encrypted data (in the transformed space) on the server
- Stage 1 (querying the bounding box): the client calculates the large bounding box in the transformed space; the server uses the index to find the initial results
- Stage 2 (filtering out the junk records): filter the initial results with the conditions y^T Θ_i y <= 0 for i = 1…2k

[Figure: the query range in the original space and its bounding box in the transformed space]

Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits into memory. Otherwise, use a linear scan with stage 2 filtering.

RASP-based data mining
- Preserving range queries → linear classifiers
- Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)

Access pattern privacy
- On database queries
  - The problem is the same as PIR
  - Attackers may use the access pattern to breach data confidentiality
- Each of the previous approaches should handle this problem!

PIR is impractical
- Solutions based on private information retrieval (PIR)
- PIR is still impractical

For the bucketization approach
- Based on the architecture of Hacigumus (SIGMOD02)
- Hore VLDB04
  - For range queries
  - Privacy concern: reveals the distribution of values in each bucket
  - "Diffusion": split buckets and combine parts of different buckets
  - Trade-off: the server now needs to return noisier results → larger result size

For OPE
- Use queries to find out the distributions, then break the encryption

For RASP
- Secure query transformation
- Attacks on transformed queries

Oblivious RAM
- Access pattern: read/write data items
- Setting:
  - Client has a small secure memory
  - Server has large insecure storage and is semi-honest
  - Data items are encrypted
  - Client cannot hide the accessed locations
- An active area

Existing Approaches
Inside a level
- Some real blocks: useful data
- Some dummy blocks: random data
- Randomly permuted: only the client knows the permutation

[Figure: one level, with real and dummy blocks interleaved in permuted order]

Existing Approaches
Reading
- Read a block from each level
- One real block; the remaining are dummy blocks

[Figure: the client reads one block per level from the server; one is real, the rest are dummy]
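The read pattern can be sketched with a toy model. This is a heavy simplification, not a real ORAM construction: blocks are tagged in the clear instead of encrypted, the client keeps a full position map (`where`), and the write-back and reshuffle that follow each read are omitted. All names are hypothetical.

```python
import random

class ToyLevelStore:
    """Toy hierarchical store: each level mixes real and dummy blocks."""

    def __init__(self, levels, rng):
        self.slots = []     # server side: per level, a permuted list of blocks
        self.where = []     # client side: per level, {key: slot index}
        self.dummies = []   # client side: per level, unused dummy slot indexes
        self.trace = []     # what the server observes: (level, slot) pairs
        for real in levels:
            blocks = [("real", k, v) for k, v in real.items()]
            blocks += [("dummy", None, None)] * (len(real) + 1)
            rng.shuffle(blocks)   # only the client knows this permutation
            self.slots.append(blocks)
            self.where.append(
                {b[1]: i for i, b in enumerate(blocks) if b[0] == "real"})
            self.dummies.append(
                [i for i, b in enumerate(blocks) if b[0] == "dummy"])

    def read(self, key):
        """Touch exactly one slot per level, real at most once."""
        value, found = None, False
        for lvl, blocks in enumerate(self.slots):
            if not found and key in self.where[lvl]:
                idx = self.where[lvl][key]
                value, found = blocks[idx][2], True
            else:
                idx = self.dummies[lvl].pop()   # a fresh dummy instead
            self.trace.append((lvl, idx))
        return value

rng = random.Random(0)
store = ToyLevelStore(
    [{"a": 1}, {"b": 2, "c": 3}, {"d": 4, "e": 5, "f": 6, "g": 7}], rng)
v1 = store.read("c")
v2 = store.read("g")
levels_touched = [lvl for lvl, _ in store.trace]
```

Whatever key is requested, the server sees one access in every level, so the level that held the real block is not revealed by the access count.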

Existing Approaches
Writing
- Shuffle consecutively filled levels
- Write into the next unfilled level
- Clear the source levels

[Figure: server levels before and after a write, with the client shuffling the blocks; writes trigger continuous shuffling]

The Problem with Existing Approaches

Integrity guarantee
- Merkle hash tree
  - Parent hash: H(H(x1) + H(x2)), where + is string concatenation
  - Can be stored with tree-like structures: index, XML
- Hash chains
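A minimal sketch of the Merkle hash tree, using SHA-256 as H and assuming the number of items is a power of two. The function names are illustrative. It builds the tree with H(H(x1) + H(x2)) and verifies one item against the root using only the sibling hashes on its path.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_levels(items):
    """All tree levels, leaves first; assumes len(items) is a power of two."""
    level = [h(x) for x in items]
    levels = [level]
    while len(level) > 1:
        # Parent = H(left + right), with + as byte-string concatenation.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof(levels, index):
    """Sibling hashes from the leaf at `index` up to (not including) the root."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path

def verify(item, path, root):
    """Recompute the root from the item and its sibling path."""
    digest = h(item)
    for sibling, sibling_is_left in path:
        digest = h(sibling + digest) if sibling_is_left else h(digest + sibling)
    return digest == root

items = [b"x1", b"x2", b"x3", b"x4"]
levels = merkle_levels(items)
root = levels[-1][0]        # the owner signs or publishes only this value
ok = verify(b"x3", proof(levels, 2), root)
tampered = verify(b"x9", proof(levels, 2), root)
```

The client needs only the trusted root and a logarithmic number of hashes to check that a returned item belongs to the outsourced data.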

Query correctness with Merkle trees, by Devanbu et al.
- Using a Merkle tree

- Example: 5 <= q <= 10, GLB(q) = 4, LUB(q) = 11

- Operations: selections, projections, equijoins, set operations
- Issues
  - Works only on data with verification objects
  - Query expressiveness
  - Expensive

Related work
- Pang et al. (ICDE04, SIGMOD05): using the ElGamal function
- Sion (VLDB05): challenge token
- F. Li (SIGMOD06): freshness

Secure keyword search
- Simple information retrieval
  - For a keyword, find the documents containing the keyword
- What if the documents are encrypted word by word, and the keyword is also encrypted?

Secure keyword search: Song 2000
- Seed is random, different for each Wi
- Key idea: Li and Ri are self-verifiable
- Advantage of XOR

How to set K?
Setting of ki
- ki = Fk'(Wi), where k' is secret
- User publishes W and k = Fk'(W)
- Server checks whether Ci ⊕ W == <Li, Fk(Li)>
- It reveals nothing if Ci is not the ciphertext for W
- Li is random for different Wi, so the server cannot find any information from Li
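The check the server performs can be sketched as follows. The example assumes Ci = Wi XOR <Li, Fki(Li)>, uses truncated HMAC-SHA256 as a stand-in for the pseudorandom function F, and picks 16-byte word blocks; none of these choices come from the slides. It covers search only: decryption by the client is not modelled.

```python
import hashlib
import hmac
import os

BLOCK, HALF = 16, 8   # bytes per word block, and per half <L, R>

def prf(key: bytes, msg: bytes, n: int = HALF) -> bytes:
    """HMAC-SHA256 truncated to n bytes, standing in for F."""
    return hmac.new(key, msg, hashlib.sha256).digest()[:n]

def pad(word: str) -> bytes:
    return word.encode().ljust(BLOCK, b"\0")   # assumes words fit in BLOCK bytes

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))

def encrypt_doc(words, k_prime):
    out = []
    for word in words:
        wi = pad(word)
        ki = prf(k_prime, wi, 32)      # ki = Fk'(Wi)
        li = os.urandom(HALF)          # fresh random seed Li per position
        out.append(xor(wi, li + prf(ki, li)))   # Ci = Wi xor <Li, Fki(Li)>
    return out

def trapdoor(word, k_prime):
    """What the user releases for one search: W and k = Fk'(W)."""
    w = pad(word)
    return w, prf(k_prime, w, 32)

def server_search(ciphertext, w, k):
    hits = []
    for i, ci in enumerate(ciphertext):
        t = xor(ci, w)                 # equals <Li, Ri> only if Ci encrypts W
        if hmac.compare_digest(t[HALF:], prf(k, t[:HALF])):
            hits.append(i)
    return hits

k_prime = os.urandom(32)
doc = ["alice", "pays", "bob", "alice"]
cipher = encrypt_doc(doc, k_prime)
hits = server_search(cipher, *trapdoor("alice", k_prime))
misses = server_search(cipher, *trapdoor("carol", k_prime))
```

The two occurrences of the same word encrypt to different blocks because each position has its own seed, yet both are found by the self-verifying check.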

Hidden search
- In previous schemes, W is revealed
  - Weakness: each search has to release k for W
  - Easy to collect information
- Solution: encrypt Wi with a private key, then XOR with <Li, Fk(Li)>

Recent developments
- Reza 2006: "Searchable symmetric encryption: improved definitions and efficient constructions"
- Completely solved this problem, with a solution that achieves indistinguishability under chosen ciphertext attack (IND-CCA)

Trusted hardware

Possible benefits

Discussion
- Data confidentiality/access pattern
  - Strict cryptographic definitions (keyword search), or
  - Relaxed definitions (perturbation, bucketization, OPE, etc.)
- It is very difficult to formulate and prove the security of non-traditional approaches
- Do we need to reformulate the security model? And how?
