Post on 21-Aug-2014
- 1 Random Data Perturbation Techniques and Privacy Preserving Data Mining. Gunjan Gupta (Authors: H. Kargupta, S. Datta, Q. Wang & K. Sivakumar). April 26, 2005
- 2 Privacy & Good Service: Often Conflicting Goals. Privacy: Customer: I don't want you to share my personal information with anyone. Business: I don't want to share my data with a competitor. Quantity, Cost & Quality of Service: Customer: I want you to provide me good-quality service at lower cost. Paradox: lower cost often comes from being able to use or share sensitive data, which can be used or misused: Provide better service by predicting consumer needs better, or sell information to marketers. Optimize load sharing between competing utilities, or preempt competition. A doctor saving a patient by knowing the patient's history, or insurance companies declining coverage to individuals with preexisting conditions.
- 3 Central Question: Can we use privacy-sensitive data to optimize the cost and quality of a service without compromising any privacy?
- 4 Short Answer: No!
- 5 Long Answer: Maybe compromise a small amount of privacy (low cost increase) to improve quality and cost of service (high cost savings) substantially.
- 6 Why are anonymous exact records not so secure? Example: medical insurance premium estimation based on patient history. Predictive fields are often generic: age, sex, disease history, first two digits of zip code (not allowed in Germany), no. of kids, etc. Specifics such as record id (key), name, and address are omitted. This can easily be broken by matching non-secure records with the secure anonymous records. Non-secure sources: Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753. Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!" Anonymous privacy-preserving records: anonymous medical record 1: Female, 43, 3 kids, 78---, married; anonymous medical record 2: Female, 43, 2 kids, 78---, single. Attacker (internal human + automated hacker) combines the sources into a broken exact record: Susan Calvin, 43, 3 kids, Address, 78733, now labeled with medical records!
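The matching step on this slide can be sketched as a join on the generic fields. This is a minimal illustration only: the dictionary layout, the field names, and the `link` helper are assumptions made for the example, and the values are the ones shown on the slide.

```python
# Hypothetical public profile assembled from the yellow pages and a personal website.
public_profile = {"name": "Susan Calvin", "sex": "F", "age": 43, "kids": 3,
                  "zip_prefix": "78", "marital": "married"}

# "Anonymous" medical records: name, address, and key removed, generic fields kept.
anonymous_records = [
    {"record": 1, "sex": "F", "age": 43, "kids": 3, "zip_prefix": "78", "marital": "married"},
    {"record": 2, "sex": "F", "age": 43, "kids": 2, "zip_prefix": "78", "marital": "single"},
]

# Fields shared by both sources (assumed set, chosen for this example).
QUASI_IDENTIFIERS = ("sex", "age", "kids", "zip_prefix", "marital")


def link(profile, records, keys=QUASI_IDENTIFIERS):
    """Return the ids of anonymous records that agree with the profile on every key."""
    return [r["record"] for r in records if all(r[k] == profile[k] for k in keys)]


matches = link(public_profile, anonymous_records)
print(public_profile["name"], "matches anonymous record(s):", matches)
```

When exactly one record survives the join, the anonymous record is effectively labeled with a name, which is the failure the slide describes.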
- 7 Two approaches to privacy preservation. Distributed: Suitable for multi-party platforms. Share sub-models. Unsupervised: Ensemble Clustering, Privacy Preserving Clustering, etc. Supervised: Meta-learners, Fourier Spectrum Decision Trees, Collective Hierarchical Clustering, and so on. Secure-communication based: secure sum, secure scalar product. Random Data Perturbation (our focus): Perturb data by small amounts to protect the privacy of individual records. Preserve the intrinsic distributions necessary for modeling.
- 8 Recovering approximately correct anonymous features also breaks privacy. Somewhat inexactly recovered anonymous record values might also be sufficient. Non-secure sources: Yellow pages: Susan Calvin, 121 Norwood Cr., Austin, TX-78753. Personal website: "Hi, I am Susan, and here are pictures of me, my husband, and my 3 wonderful kids from my 43rd birthday party!" Denoised privacy-preserving records: anonymous medical record 1: Female, 44.5, 3.2 kids, 78---, married; anonymous medical record 2: Female, 42.2, 2.1 kids, 78---, single. Attacker (internal human + automated hacker) combines the sources into a broken exact record: Susan Calvin, 43, 3 kids, Address, 78733, now labeled with medical records!
- 9 Anonymous records (with or without small perturbations) are not secure: not a recently noticed phenomenon. 1979, Denning & Denning, "The Tracker: A Threat to Statistical Database Security": Show why anonymous records are not secure. Show an example of recovering the exact salary of a professor from anonymous records. Present a general algorithm for an Individual Tracker. Give a formal probabilistic model and a set of conditions that make a dataset support such a tracker. 1984, Traub & Yemini, "The Statistical Security of a Statistical Database": No free lunch: perturbations cause irrecoverable loss in model accuracy. However, the holy grail of random perturbation remains: we can try to find a perturbation algorithm that gives the best trade-off between loss of privacy and model accuracy.
- 10 Recovering perturbed distributions: Earlier work. Reconstructing the original distribution from perturbed ones. Setup: n samples $U_1, U_2, \dots, U_n$; n noise values $V_1, V_2, \dots, V_n$, all taken from a public (known) distribution V. Visible noisy data: $W_1 = U_1 + V_1$, $W_2 = U_2 + V_2$, ... Assumption: such noise allows you to recover the distribution of $U_1, U_2, \dots, U_n$, but not the individual records. Two well-known methods and definitions: Agrawal & Srikant, interval based: Privacy(X) at confidence 0.95 = $x_2 - x_1$, the width of the interval $[x_1, x_2]$ in which X lies with 95% confidence. Agrawal & Aggarwal, distributional: Privacy(X) = $2^{h(X)}$, where $h(X)$ is the differential entropy of X. [Figure: density plots f(x), with the interval endpoints $x_1$ and $x_2$ marked.]
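To make the two privacy definitions concrete, the sketch below evaluates both for uniform and Gaussian noise. The function names are hypothetical, and the entropy is assumed to be measured in bits so that the distributional measure is $2^{h(X)}$; the closed forms for the entropy of uniform and Gaussian densities are standard.

```python
import math
from statistics import NormalDist


def interval_privacy_uniform(alpha, confidence=0.95):
    """Width of an interval holding `confidence` of the mass of Uniform[-alpha, alpha]."""
    return confidence * 2 * alpha


def interval_privacy_gaussian(sigma, confidence=0.95):
    """Width of the central interval holding `confidence` of the mass of N(0, sigma^2)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma


def entropy_privacy_uniform(alpha):
    """2^h for Uniform[-alpha, alpha]; h = log2(2 * alpha) bits."""
    return 2 ** math.log2(2 * alpha)


def entropy_privacy_gaussian(sigma):
    """2^h for N(0, sigma^2); h = 0.5 * log2(2 * pi * e * sigma^2) bits."""
    return 2 ** (0.5 * math.log2(2 * math.pi * math.e * sigma ** 2))


for name, interval, entropy in [
    ("uniform, alpha=1", interval_privacy_uniform(1.0), entropy_privacy_uniform(1.0)),
    ("gaussian, sigma=1", interval_privacy_gaussian(1.0), entropy_privacy_gaussian(1.0)),
]:
    print(f"{name}: interval-based = {interval:.3f}, 2^h = {entropy:.3f}")
```

For uniform noise the distributional measure equals the width of the support, which is one way to read $2^{h(X)}$ as an "effective width" of a density.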
- 11 Interval Based Method: Agrawal & Srikant in more detail. n samples $U_1, U_2, \dots, U_n$; n noise values $V_1, V_2, \dots, V_n$, all taken from a public (known) distribution V. Visible noisy data: $W_i = U_i + V_i$, i.e. $W_1, W_2, W_3, \dots$ Given the noise density $f_V$, Bayes' rule gives the posterior cumulative distribution function of u in terms of the visible w, the known $f_V$, and the unknown desired density $f_U$. Differentiating w.r.t. u gives an important recursive definition. Notation issue (in paper): $f'$ simply means an approximation of the true f, not the derivative of f!
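The two expressions this slide refers to did not survive in the transcript. As the method is usually stated, and written here in the slide's notation ($w_i$ observed, $f_V$ known, $f_U$ unknown), they take roughly the following form. Treat this as a reconstruction under that assumption, not as a copy of the slide.

```latex
% Posterior CDF estimate, averaged over the n observed values w_i
F'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{\int_{-\infty}^{a} f_V(w_i - z)\, f_U(z)\, dz}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% Differentiating with respect to a gives the density estimate
f'_U(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - a)\, f_U(a)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U(z)\, dz}

% f_U appears on both sides, so the current estimate is substituted on the right
f_U^{(J+1)}(a) = \frac{1}{n} \sum_{i=1}^{n}
  \frac{f_V(w_i - a)\, f_U^{(J)}(a)}
       {\int_{-\infty}^{\infty} f_V(w_i - z)\, f_U^{(J)}(z)\, dz}
```

The last line is what makes the definition recursive: the unknown density on the right-hand side is replaced by the estimate from the previous step.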
- 12 Interval Based Method: Agrawal & Srikant in more detail. Seed with a uniform distribution for J = 0. Algorithm in practice: integration is replaced with summation over i.i.d. samples, and the integral over z becomes a sum over discrete z intervals for speed. Each pass maps the estimate at step J to the estimate at step J+1. Stop when $f_U^{(J+1)} - f_U^{(J)}$ becomes small. Converges to a local minimum? An initialization other than uniform might give a different result; not explored by the authors. For large enough samples, hope to get close to the true distribution.
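The iteration described on this slide can be sketched in plain Python as follows. It is a minimal sketch, not the authors' code: the function name, the grid of interval midpoints, the L1 stopping rule, and the bimodal test data are all assumptions chosen for illustration.

```python
import math
import random


def reconstruct_density(w, noise_pdf, grid, max_iter=100, tol=1e-4):
    """Iteratively estimate the density of U on equally spaced grid cells from W = U + V."""
    k = len(grid)
    width = grid[1] - grid[0]
    f = [1.0 / (k * width)] * k                      # step J = 0: uniform seed
    # f_V(w_i - z) for every sample and grid midpoint; fixed across iterations.
    like = [[noise_pdf(wi - z) for z in grid] for wi in w]
    for _ in range(max_iter):
        new = [0.0] * k
        for row in like:
            denom = sum(l * fz for l, fz in zip(row, f))   # discrete stand-in for the integral
            if denom > 0.0:
                for j in range(k):
                    new[j] += row[j] * f[j] / denom        # posterior mass of cell j
        total = sum(new) * width
        new = [v / total for v in new]                     # rescale so the density integrates to 1
        change = sum(abs(a - b) for a, b in zip(new, f)) * width
        f = new
        if change < tol:                                   # stop when step J+1 is close to step J
            break
    return f


def gaussian_pdf(x, sigma=1.0):
    return math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


random.seed(0)
# Hypothetical original data: two clusters, around 2 and around 8.
u = [random.uniform(1, 3) if random.random() < 0.5 else random.uniform(7, 9)
     for _ in range(1000)]
w = [ui + random.gauss(0.0, 1.0) for ui in u]              # visible perturbed data
grid = [0.25 + 0.5 * j for j in range(20)]                 # midpoints of cells covering [0, 10]
f_hat = reconstruct_density(w, gaussian_pdf, grid)


def mass(lo, hi):
    """Estimated probability that U falls in [lo, hi)."""
    return sum(fz * 0.5 for z, fz in zip(grid, f_hat) if lo <= z < hi)


print("mass near 2:", round(mass(1, 3), 3), "near 5:", round(mass(4, 6), 3),
      "near 8:", round(mass(7, 9), 3))
```

Only the distribution is estimated here; no individual `u` value is recovered, which is exactly the assumption the later slides challenge.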
- 13 Interval Based Method: Good Results for a variety of noises
- 14 Revisiting an Essential Assumption in Random Perturbation. Assumption: such noise allows you to recover the distribution of $U_1, U_2, \dots, U_n$, but not the individual records. The authors of this paper challenge this assumption. They claim the added randomness can be mostly visual and not real: many simple forms of random perturbation are breakable.
- 15 Exploit predictable properties of random data to design a filter that breaks the perturbation encryption? [Figure: two panels, "Spiral data" and "Random data"; for the random data, all eigenvalues are close to 1!]
- 16 Spectral Filtering: Main idea: use the eigenvalue properties of the noise to filter the U+V data. [Figure: decomposition of the eigenvalues of the noise and the original data; recovered data.]
- 17 Decomposing eigenvalues: separating data from noise. Let U and V be the m x n data and noise matrices, and $U_P = U + V$ the perturbed matrix. Covariance matrix of $U_P$: $U_P^T U_P = (U+V)^T (U+V) = U^T U + V^T U + U^T V + V^T V$. Since signal and noise are uncorrelated in random perturbation, for a large no. of observations $V^T U \approx 0$ and $U^T V \approx 0$, therefore $U_P^T U_P \approx U^T U + V^T V$. Since the above 3 matrices are covariance matrices, they are symmetric and positive semi-definite, so we can perform an eigendecomposition of each: $U^T U = Q_U \Lambda_U Q_U^T$, $V^T V = Q_V \Lambda_V Q_V^T$, $U_P^T U_P = Q_P \Lambda_P Q_P^T$, where each Q is orthogonal and each $\Lambda$ is diagonal.
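A quick numerical check of the "cross terms vanish" step, using NumPy. The data model (two latent factors mixed into four attributes), the matrix sizes, and the division by m to turn the products into sample covariances are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20000, 4                                   # observations x attributes (illustrative)

# Hypothetical zero-mean "true" data: two latent factors mixed into n attributes.
U = rng.normal(size=(m, 2)) @ rng.normal(size=(2, n))
V = rng.normal(0.0, 0.5, size=(m, n))             # independent additive noise
Up = U + V                                        # perturbed data

cov_p = Up.T @ Up / m
cov_u = U.T @ U / m
cov_v = V.T @ V / m
cross = (U.T @ V + V.T @ U) / m                   # the two terms the slide drops

max_cross = np.abs(cross).max()
print("largest cross-term entry:", max_cross)
print("largest entry of U^T U / m:", np.abs(cov_u).max())
```

The cross terms shrink roughly like one over the square root of m, so with many observations the perturbed covariance is close to the sum of the data and noise covariances.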
- 18 With a bunch of algebra and theorems from matrix perturbation theory, the authors obtain a result in the limit (lots of data) that gives us the following algorithm: 1. Find a large no. of eigenvalues of the perturbed data P. 2. Separate all eigenvalues inside $[\lambda_{min}, \lambda_{max}]$ and save their indices $I_V$. 3. Take the remaining eigenvalue indices, which belong to the perturbed (but not pure-noise) eigenvalues coming from the true data U; save their indices $I_U$. 4. Break the perturbed eigenvector matrix $Q_P$ into $A_U = Q_P(I_U)$ and $A_V = Q_P(I_V)$. 5. Estimate the true data as the projection of the perturbed data onto the subspace spanned by $A_U$. Solution for finding the bounds: Wigner's law, which describes the distribution of eigenvalues for normal random matrices: the eigenvalues of the noise component V stay within a thin range given by $\lambda_{min}$ and $\lambda_{max}$ with high probability (example on the next slide). This allows us to compute $\lambda_{min}$ and $\lambda_{max}$.
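The five steps above can be sketched with NumPy as follows. Several things here are assumptions rather than content from the slide: the function name `spectral_filter`, the projection written as $U_P A_U A_U^T$, the normalization by m, and the noise bounds $\sigma^2 (1 \pm 1/\sqrt{Q})^2$ with $Q = m/n$ (a Marchenko-Pastur-style form; the slide names Wigner's law but gives no formula). The test data is synthetic.

```python
import numpy as np


def spectral_filter(Up, noise_var):
    """Estimate U from Up = U + V, where V is i.i.d. zero-mean noise of known variance."""
    m, n = Up.shape
    # Step 1: eigenvalues and eigenvectors of the perturbed covariance.
    eigvals, Qp = np.linalg.eigh(Up.T @ Up / m)    # columns of Qp are eigenvectors
    # Assumed theoretical range for pure-noise eigenvalues.
    Q = m / n
    lam_min = noise_var * (1 - 1 / np.sqrt(Q)) ** 2
    lam_max = noise_var * (1 + 1 / np.sqrt(Q)) ** 2
    # Step 2: eigenvalues inside [lam_min, lam_max] are attributed to noise.
    I_v = (eigvals >= lam_min) & (eigvals <= lam_max)
    # Step 3: the remaining eigenvalues are attributed to the true data.
    I_u = ~I_v
    # Step 4: keep the eigenvectors that belong to the data eigenvalues.
    A_u = Qp[:, I_u]
    # Step 5: project the perturbed data onto the span of A_u.
    return Up @ A_u @ A_u.T, int(I_u.sum())


rng = np.random.default_rng(0)
m, n = 5000, 10
# Hypothetical true data living in a 2-dimensional subspace of the 10 attributes.
U = rng.normal(size=(m, 2)) @ (2.0 * rng.normal(size=(2, n)))
V = rng.normal(0.0, 1.0, size=(m, n))              # public noise model: N(0, 1)

U_hat, k = spectral_filter(U + V, noise_var=1.0)
err_raw = np.mean(V ** 2)                          # error of the perturbed data itself
err_filtered = np.mean((U_hat - U) ** 2)           # error after spectral filtering
print("data eigenvalues kept:", k)
print("mean squared error before:", err_raw, "after:", err_filtered)
```

The filtered estimate is closer to the individual records than the published perturbed data, which is the sense in which the perturbation is "broken". The gain depends on the true data occupying only a few dominant eigen-directions.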
- 19 Exploit predictable properties of random data to design a filter that breaks the perturbation encryption? [Figure: two panels, "Spiral data" and "Random data"; for the random data, all eigenvalues are close to 1!]
- 20 Results: Quality of eigenvalue recovery. Only the real eigenvalues got captured, because of the nice automatic thresholding!
- 21 Results: Comparison with Agrawal & Srikant's distribution reconstruction. [Figure: two panels, "Agrawal & Srikant (no breaking of encryption)" and "Agrawal & Srikant (estimated from broken encryption)".]
- 22 Discussion. An amazing amount of experimental results and comparisons is presented by the authors in the journal version. Extension to a situation where the form of the perturbing distribution is known but its exact first-, second-, or higher-order statistics are not: discussed but not presented. Comparison of performance with other obvious noise-reduction techniques from the signal processing community: Moving Averages and Wiener