secure data outsourcing
DESCRIPTION
Secure Data Outsourcing. Outline. Motivation Background Research issues Summary. Motivation. Cost of maintaining/mining large data 4-5 times of the cost of data acquisition DBAs are paid well More and more data service providers Low cost – cloud computing - PowerPoint PPT PresentationTRANSCRIPT
Secure Data Outsourcing
Outline Motivation Background Research issues Summary
Motivation Cost of maintaining/mining large data
4-5 times of the cost of data acquisition DBAs are paid well
More and more data service providers Low cost – cloud computing
Maintain one database for one user multiple users Examples:
Alentus.com Datapipe.com Discountasp.net …
Concerns about data security and privacy Untrusted service provider
Un-trusted service provider Lazy: incentives to perform less Curious: incentives to acquire
information Malicious:
Denial of service Incorrect results Possibly compromised
Challenges Data confidentiality
Data need to be encrypted (?) Utility of protected data?
Query utility Mining utility
Access pattern privacy Integrity
Data integrity Query integrity
Correct Complete Fresh
Why is it hard for query services? Arbitrary expressivity
SQL statements Often, restricted for certain type of query
for simplicity (e.g. range query, knn query)
Cost Communication Computation (server side vs client side)
Why it is hard for mining services? Many data mining models
Different utilities to preserve No one-size-for-all solutions
Data confidentiality Bucketization method (crypto-index) Order preserving encryption Perturbations
Bucketization method Hacigumus (SIGMOD02)
Main steps Partition sensitive attributes
Order preserving: supports comparison Random: query rewriting becomes hard
Build index on the partitions Rewrite queries to target partitions
‘john doe’ 105 Select * from T’ where name=105
Execute queries and return results Prune/post-process results on client
Trade off between confidentiality and overhead Larger partition increased privacy
increased overheads
Order preserving encryption Agrawal2004, Boldyreva2009 The set of data is securely
transformed so that the order is preserved but the distribution and domain are changed
Benefits: indexing/searching on OPE encrypted data
Weakness: once the original distribution is known, OPE is broken
Not attribute-wise order preserving Order preserving encryption (OPE, Agrawal et al
2004) is not resilient to distribution-based attacks
Original Xi distribution is known Transformed Xi’ distribution
OPE
Bucket basedEstimation
Data perturbation Definition
1. randomly change the original data2. the attacker cannot effectively recover
the original data 3. the desired properties are preserved
Techniques Single dimension: noise addition Multidimensional
Geometric perturbation Random projection RASP random space perturbation
Noise addition Y = X+ R
X: original data column, R: random noise (distribution published), Y: published data
Applications in data mining Reconstructing column distribution
Rakesh Agrawal SIGMOD 2000 Applied to privacy-preserving decision tree, naïve
bayes classifier
Attacks Spectral filtering (Kargupta ICDM 2004) PCA reconstruction (Huang SIGMOD2005)
Multiplicative perturbations Geometric data perturbation for
outsourced data mining Random Projection RASP perturbation for query services
(range query, kNN query).
Perturbation-based framework
Mining service
Geometric data perturbation Y=RX+T+D
R: secret rotation matrix (preserve Euclidean distances) T: secret random translation matrix, D: secret random
noise matrix Distances are approximately preserved (D) Resilient to most attacks to rotation perturbation
Applications Outsourced privacy preserving data mining, applicable
for many classification and clustering algorithms
Attacks Population based attacks (when covariance matrix is
revealed)
Random Projection Y=AX+D
A: random projection, e.g., entries from N(0,1)
Distances are approximately preserved Applications
Many classification and clustering algorithms Worse accuracy than geometric perturbation
Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record)
Attacks Possibly more resilient than other two
perturbation methods But utility (distance) is not well preserved
RASP perturbationk-dimensional numeric data, n records, represented as a k x n matrix, x: a record
(1) Extend x to k+2 dimensions - (K+1) th dimension is always 1 – homogeneous dimension- (K+2) th dimension v is a real random number drawn from
(2) Encryption
- A is a (k+2)x(k+2) invertible real value matrix, with at least two non-zero values for each row and the last column of A has all non-zero values
- A is shared by all records
Properties Not an OPE Preserves convexity of the dataset
Convex dataset in Rk another convex dataset in Rk+2.
Good for range query Each range query in Rk
hyperplane based query range query in Rk+2 .
RASP properties Convexity preserving
Queried range (hypercube) is convex RASP transforms the range to another convex (polyhedron)
wTx=a
half space: wTx<=a
The intersection of convex sets is also convex.
illustration of convexity preserving
Original space Encrypted space
Secure query transformation A naïve solution
Based on the convexity preserving property
Problems: (1) A-1 can be probed (2) is . . If a is known, the whole dimension i is breached.
Secure query transformation Enhanced solution
Xk+2 is always positive
(Xi-a) 0 (Xi-a)Xk+2 0 Correspondingly, in the encrypted space
yTy 0,
Problems addressed: (1) A-1 cannot be derived from (2) (Xi-a)Xk+2 0 contains the random component Xk+2 that protects the condition (Xi-a) 0
Efficient two-stage query processing
illustrated
Original space Transformed space
Stage1:Querying this boundingbox
A multidimensional tree index is been built on the encrypted data (in the transformed space) in the server.
Stage2:Filter out the junk records
Stage 1: The client calculates the large bounding box;The server uses the index to find the results.Stage 2: filter the initial results with the conditions
yTiy 0 for 1…2k
Note: the two-stage strategy works, if the output of stage 1 is significantly smaller than the original database and can be fit into the memory.
Otherwise, use linear scan with stage 2 filtering.
RASP-based data mining Preserving range query linear
classifier Use the boosting framework to get
strong classifiers (PerturBoost, in ICDM 2013)
Access pattern privacy On database queries
Problem is the same as PIR Attackers may use the access pattern to
breach data confidentiality
Each of previous approaches should handle this problem!
PIR is impractical Solutions based on private
Information retrieval (PIR) PIR is still impractical
For Bucktization approach Based on the architecture of
Hacigumus (SIGMOD02) Hore VLDB04
For range query Privacy concern: reveal the distribution
of value in each bucket “Diffusion”: split buckets and combine
parts of different buckets Trade off: now the server needs to return
more noisy results larger size
For OPE Use queries to find out the
distributions, then break the encryption
For RASP Secure query transformation Attacks to transformed queries
Oblivious RAM Access pattern: read/write data items Setting:
Client has a small secure memory Server has large insecure storage, semi-
honest Data items are encrypted Client cannot hide the accessed locations
An active area
Existing Approaches
Inside a level Some real blocks
Useful data Some dummy blocks
Random data Randomly permuted
Only the client knows the permutation
Dummy Block
Real Block
Real Block
Dummy Block
Real Block
Dummy Block
Dummy Block
Real Block
Existing Approaches
Reading Read a block from
each level One real block. Remaining are
dummy blocks
ClientServer
realdummydummydummydummy
dummy
Existing Approaches
Writing Shuffle
consecutively filled levels.
Write into next unfilled level.
Clear the source levels
Server (before) Server (after)Client
shuffleblocks
Continuous Shuffling
…
To write:
The Problem with Existing Approaches
Integrity guarantee Merkle hash tree
H(H(x1)+H(x2)) , + is string concatenation
Can be stored with tree like structure : index, xml
Hash chains
Query correctness with merkleby Devanbu et. al.
Using merkle tree
Example:5<=q<=10
LUB(q) = 4GLB(q) = 11
Operations: Selections, projections, equijoins, set ops
Issues Works only on data with verification objects Query expressiveness Expensive
Related work Pang et. al (ICDE04, SIGMOD05), using ElGamal
function Sion VLDB05: challenge token F.Li SIGMOD06: freshness
Secure keyword search Simple information retrieval
For a keyword, find the documents containing the keyword
What if the documents are encrypted word by word
and if the keyword is also encrypted
Secure keyword search Song 2000
•Seed is random, different for each Wi•Key idea: Li and Ri are self-verifiable •Advantage of XOR
How to set K?
Setting of ki Ki = Fk’(Wi), k’ is secret User publishes W and k = Fk’(W) Server checks CiW
whether <Li, Fk(Li)> == CiW It reveals nothing if Ci is not the ciphertext
for W. And Li is random for different Wi – server
cannot find any information from Li.
Hidden search In previous schemes, W is revealed
Weakness: each search will have to release k for W Easy to collect information
Solution: encrypt Wi with an private key, then xor with <Li, Fk(Li)>
Recent developments Reza 2006
“Searchable symmetric encryption: improved definitions and efficient constructions”
Completely solved this problem, with a solution indistinguishability under chosen ciphertext attack (IND-CCA)
Trusted hardware
Possible benefits
Discussion Data confidentiality/access pattern
Restrict cryptographic definition (keyword search) or
Relaxed definition (perturbation, bucketization, OPE, etc.)
It is very difficult to formulate and prove the security of non-traditional approaches Do we need to reformulate the security
model? and how?