A PRIMER ON DATA MASKING TECHNIQUES FOR NUMERICAL DATA
Krish Muralidhar
Gatton College of Business & Economics
My Co-author
I would first like to acknowledge that most of my work in this area is with my co-author Dr. Rathindra Sarathy at Oklahoma State University
Introduction
Data masking deals with techniques that can be used in situations where data sets consisting of sensitive (confidential) information are “masked”. The masked data retains its usefulness without compromising privacy and/or confidentiality. The masked data can be analyzed, shared, or disseminated without risk of disclosure.
Objectives of Data Masking
Minimize risk of disclosure resulting from providing access to the data
Maximize the analytical usefulness of the data
What this talk is not about …
We are talking about protecting data that is made available to users, shared with others, or disseminated to the general public
We are not dealing with unauthorized access to the data
Encryption is not a solution
We cannot perform analysis on encrypted data
There are a few exceptions
To perform analysis on the data, it must be decrypted
Decrypted data offers no protection
What this talk is not about …
Since we have the data set, we know the characteristics of the data set. We are trying to create a new data set that essentially contains the same characteristics as the original data set. We are not trying to discern the characteristics of the original data set using the information in the masked data. In other words, I will not be talking about Agrawal and Srikant, or about the “distributed data” situation.
In addition, since most of you are probably familiar with the CS literature on this area, I will focus on the literature in the “statistical disclosure limitation” area
Purpose of Dissemination
It is assumed that the data will be used primarily for analysis at the aggregate level using statistical or other analytical techniques
The data will not be accurate at the individual record level
Aggregate versus Micro Data
The organization that owns the data could potentially release aggregate information about the characteristics of the data set
The users can still perform some types of analyses using the aggregate data, but this limits their ability to perform ad hoc analysis
Releasing the microdata provides the users with the flexibility to perform any type of analysis
In this talk, we assume that the intent is to release microdata
The Data
Typically, the data is historical and consists of
Categorical variables (or attributes)
Numerical variables: discrete variables and continuous variables
In cases where identity is not to be revealed, key identification variables will be removed from the data set (de-identified)
De-identification does not necessarily prevent Re-identification
The common misconception is that, in order to prevent disclosure, all that is required is to remove “key identifiers”. However, even if the “key identifiers” are removed, in many cases it would be easy to identify an individual using external data sources
Latanya Sweeney’s work on k-anonymity
The availability of numerical data makes it easy to re-identify records through record linkage
Data Masking
Since de-identification alone does not prevent disclosure, it is necessary to “mask” the original data so that an intruder, even using external sources of data, cannot
Identify a particular released record as belonging to a particular individual
Estimate the value of a confidential variable for a particular record accurately
Who is an intruder?
Every user is potentially an intruder
Since microdata is released, we cannot prevent the user from performing any type of analysis on the released data
Must account for disclosure risk from any and all types of analyses
Worst case scenario
The focus of our research
Data masking techniques are used to mask all types of data (categorical, discrete numerical, and continuous numerical)
The focus of our research, and of this talk, is data masking for continuous numerical data
The Data Release Process
Identify the data set to be released and the sensitive variables in the data
Release all aggregate information regarding the data
Characteristics of individual variables
Relationship measures
Any other relevant information
Release non-sensitive data
Since my focus is on numerical microdata, I will assume that all categorical and discrete data are either released unmasked or are masked prior to release
Release masked numerical microdata
Characteristics of a good masking technique
Minimize disclosure risk (or maximize security)
Minimize information loss (or maximize data utility)
Other characteristics
Must be easy to use
The user must be able to analyze the masked data exactly as he/she would the original data
Must be easy to implement
Disclosure Risk
Dalenius defines disclosure as having occurred if, using the released data, an intruder is able to identify an individual or estimate the value of a confidential variable with a greater level of accuracy than was possible prior to such data release
Minimum Disclosure Risk
A data masking technique minimizes disclosure risk, IFF, the release of the masked microdata does not allow an intruder to gain additional information about an individual record over and above what was already available (from the release of aggregate information, the non-confidential variables, and the masked categorical variables)
Does not mean that the disclosure risk from the entire data release process is minimum; only that the disclosure risk from releasing microdata is minimized
Can be achieved in practice
Practical Measure of Disclosure Risk
Identity disclosure: re-identification rate
Value disclosure: variability in the confidential attribute explained by the masked data
Minimum Information Loss
Information loss is minimized IFF, for any arbitrary analysis (or query), the response from the masked data is exactly the same as that from the original data
Impossible to achieve in practice
Since an arbitrary analysis may involve a single record, the only way to achieve this objective is to release unmasked data
Information Loss … continued
In practice, we attempt to minimize information loss by maintaining the characteristics of the masked data to be the same as that of the original data
From a statistical perspective, we attempt to maintain the masked data to be “similar to” the original data so that responses to analyses using the masked data will be approximately the same as those using the original data
Maintain the distribution of the masked data to be the same as the original data
Some Practical Measures of Information Loss
Ability to maintain
The marginal distribution
Relationships between variables (linear, monotonic, non-monotonic)
Simple Masking Approaches
Noise addition
Micro-aggregation
Data swapping
Other similar approaches
Any approach in which the masked value yij (the masked value for the jth variable of the ith record) is generated as a function of xi.
An Illustrative Example
A data set consisting of 2 categorical variables, 1 discrete, and 3 (confidential) numerical variables
50000 records
Relationships
Relationships are not necessarily linear
Measured by both product moment and rank order correlation
Simple Noise Addition
The most rudimentary method of data masking. Add random noise to every confidential value of the form yi = xi + ei
Typically e ~ Normal(0, d*Var(Xi))
The selection of d specifies the level of noise. Large d indicates higher level of masking
The variance is changed resulting in biased estimates
Many variations exist
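As a rough illustration of the scheme above, the following Python sketch adds normal noise with variance d·Var(X) to a single variable. The function name `add_noise`, the simulated lognormal data, and the choice d = 0.5 are assumptions made for illustration; they do not come from the data set used in the talk.

```python
import numpy as np

def add_noise(x, d, rng):
    """Simple noise addition: y = x + e with e ~ Normal(0, d * Var(x))."""
    x = np.asarray(x, dtype=float)
    noise_sd = np.sqrt(d * x.var(ddof=1))
    return x + rng.normal(0.0, noise_sd, size=x.shape)

# Hypothetical, strictly positive confidential variable.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=12.0, sigma=0.5, size=50_000)
y = add_noise(x, d=0.5, rng=rng)

print(np.var(y, ddof=1) / np.var(x, ddof=1))  # roughly 1 + d: variance is inflated
print(np.corrcoef(x, y)[0, 1])                # roughly 1 / sqrt(1 + d)
print((y < 0).sum())                          # count of negative masked values, if any
```

Because the noise is independent of X, the masked variance is inflated by about a factor of 1 + d, which is the source of the bias mentioned above.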
Problems with Noise Addition
The addition of noise results in an increase in variance
This can be addressed easily, but there are other issues that cannot be, such as
The marginal distribution is modified
All relationships are attenuated
Looks good …
Everything looks good
Bias is small
Relationships seem to be maintained
So what is the problem?
The problem is security
Since very little noise is added, there is very little protection afforded to the records
High Disclosure Risk
The correlation between the original and masked values is of the order of 0.99. The masked values themselves are excellent predictors of the original value. Little or no “masking” is involved.
Disclosure Risk versus Information Loss
Adding very little noise (10% of the variance of the individual variable) results in low information loss, but also results in high disclosure risk
In order to decrease disclosure risk, it would be necessary to increase the noise (say 50% of the variance), but that would result in higher information loss
Results for Noise Addition (Noise level = 50%)
At first glance, it does not seem too bad, but on closer observation, we notice that there are lots of negative values that did not exist in the original data
Negative values can be addressed
Mortgage Balance versus Asset balance (Noise Added)
Correlation
There is a considerable difference between the original and masked data. The correlations are considerably lower.
Marginal Distribution of Home Value
The marginal distribution is completely modified
This is an unavoidable consequence of any noise “addition” procedure
Summary
In summary, noise addition is a rudimentary procedure that is easy to implement and easy to explain. There is always a trade-off between disclosure risk and information loss. If the disclosure risk is low (high) then the corresponding information loss is high (low).
Unfortunately, this is an inherent characteristic of all noise based methods of the form Y = f(X,e) whether the noise is additive or multiplicative or some other form
Sufficiency based Noise Addition
Recently, we have developed a new technique that is similar to noise addition, but maintains the mean vector and covariance matrix of the masked data to be the same as the original data
Offers the same characteristics as noise addition, but assures that results for traditional statistical analyses using the masked data will be the same as the original data
Sufficiency Based Noise Addition Model:
yi = γ + αxi + βsi + εi
The only parameter that must be selected is the “proximity parameter” α. All other parameters are dictated by the selection of this parameter
The Proximity Parameter
The parameter α (0 ≤ α ≤ 1) dictates the strength of the relationship between X and Y.
When α = 1, Y = X.
When α = 0, the perturbed variable is generated independent of X (the GADP model to be discussed later)
We provide the ability to specify α to achieve any degree of proximity between these two extremes
Other Model Parameters
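γ = (1 – α)x̄ – βs̄, where x̄ and s̄ are the means of X and S
β = (1 – α)(σXS/σ²SS)
ε ~ Normal(0, (1 – α²)(σ²XX – (σXS)²/σ²SS)), the noise variance required for Var(Y) = Var(X)
ε can also be generated from other distributions
ε is orthogonal to X and S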
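The model and its parameters can be sketched in Python for the simplest case of one confidential variable X and one non-confidential variable S. This is a minimal sketch, not the authors' implementation. The function name `sufficiency_noise` is hypothetical. The noise variance (1 – α²)(σ²XX – (σXS)²/σ²SS) is the value that makes Var(Y) equal Var(X) under the model, which is how the variance term is read here. Exact orthogonality of ε is obtained by taking least-squares residuals of random draws on (1, X, S), which is one possible construction among several.

```python
import numpy as np

def sufficiency_noise(x, s, alpha, rng):
    """y = gamma + alpha*x + beta*s + eps, preserving the sample mean of x,
    the sample variance of x, and the sample covariance of x with s."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    n = len(x)
    var_x = np.var(x, ddof=1)
    var_s = np.var(s, ddof=1)
    cov_xs = np.cov(x, s, ddof=1)[0, 1]

    beta = (1 - alpha) * cov_xs / var_s
    gamma = (1 - alpha) * x.mean() - beta * s.mean()
    eps_var = (1 - alpha**2) * (var_x - cov_xs**2 / var_s)

    # Draw raw noise, then remove its projection on (1, x, s) so that the
    # noise has mean zero and is exactly orthogonal to x and s in-sample.
    raw = rng.standard_normal(n)
    design = np.column_stack([np.ones(n), x, s])
    coef, *_ = np.linalg.lstsq(design, raw, rcond=None)
    resid = raw - design @ coef
    eps = resid * np.sqrt(eps_var / np.var(resid, ddof=1))

    return gamma + alpha * x + beta * s + eps

# Hypothetical data: s is non-confidential, x is confidential.
rng = np.random.default_rng(0)
s = rng.normal(50.0, 10.0, size=5_000)
x = 20.0 + 1.5 * s + rng.normal(0.0, 8.0, size=5_000)
y = sufficiency_noise(x, s, alpha=0.5, rng=rng)
print(x.mean(), y.mean())
print(np.cov(x, s)[0, 1], np.cov(y, s)[0, 1])
```

With α = 1 the noise variance is zero and the sketch returns X itself; with α = 0 the masked values depend on X only through the estimated moments.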
Note that …
In order to maintain sufficient statistics, it is NECESSARY that the model for generating the perturbed values MUST be specified in this manner
Disclosure Risk
There is a direct correspondence between the proximity parameter α and the level of noise added in the simple noise addition approach. This procedure will result in incremental disclosure risk except when α = 0
The level of noise added is approximately equal to (1 – α²)
Information Loss
Information loss characteristics of the sufficiency based approach are exactly the same as those of the simple noise addition approach with one major difference. Results of statistical analyses for which the mean vector and covariance matrix are sufficient statistics will be exactly the same using the masked data as they are using the original data.
Simple versus Sufficiency Based Noise Addition
If noise addition will be used to mask the data, we should always use sufficiency based noise addition (and never simple noise addition). It provides all the same characteristics of simple noise addition with one major advantage: for many traditional statistical analyses, it provides the guarantee that the masked data will yield the same results as the original data.
Microaggregation
Replace the values of the variables for a set of k records in close proximity with the average value of the k records
Many different methods of determining close proximity
Univariate microaggregation where each variable is aggregated individually
Multivariate microaggregation where the values of all the confidential variables for a given set of records are aggregated
Results in variance reduction and attenuation of covariance
All relationships are modified … some correlations are higher, others are lower
Poor security even for relatively large k
Consistent with the idea of “k anonymity” since at least k records in the data set will have the same values
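Univariate microaggregation can be sketched as follows. This is one simple fixed-size variant, assumed here for illustration: sort the variable, form consecutive groups of k records, and replace each value by its group mean. The function name `microaggregate` and the rule that leftover records join the last group are assumptions; published methods choose the groups in more refined ways.

```python
import numpy as np

def microaggregate(x, k):
    """Replace each value by the mean of its group of k nearest-in-rank records."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")
    n_groups = max(n // k, 1)
    # Group label for each sorted position; leftovers join the last group,
    # so every group holds at least k records (when n >= k).
    group = np.minimum(np.arange(n) // k, n_groups - 1)
    sums = np.bincount(group, weights=x[order], minlength=n_groups)
    counts = np.bincount(group, minlength=n_groups)
    masked = np.empty(n)
    masked[order] = (sums / counts)[group]
    return masked

# Hypothetical data.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=10.0, sigma=1.0, size=10_000)
for k in (5, 100):
    y = microaggregate(x, k)
    print(k, y.var() / x.var(), np.corrcoef(x, y)[0, 1])
```

The mean is preserved exactly, while the variance shrinks because within-group variation is removed.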
Univariate MA (k = 5) Example
Good information loss characteristics but poor disclosure risk characteristics
Univariate MA (k = 100) Example
Worse information loss characteristics but better disclosure risk characteristics
Bill Winkler at the Census Bureau has shown that the risk of identity disclosure is very high even with large k
Rank Based Data Swapping
Swap values of variables within a specified proximity
When the swapped values are in close proximity, it results in low information loss but high disclosure risk and vice versa
The proximity is usually specified by the rank of the record
The advantage of data swapping is that it does not change (or perturb) the values; the original values are used
The marginal distribution of the masked data is exactly the same as the original
Unfortunately, it results in high information loss and offers poor disclosure risk characteristics
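A minimal sketch of rank based swapping for a single variable is given below. The pairing rule used here (walk through the ranks and swap each not-yet-swapped record with a random not-yet-swapped partner at most `window` ranks above it) is an assumption chosen for brevity; the function name `rank_swap` and the `window` parameter are hypothetical.

```python
import numpy as np

def rank_swap(x, window, rng):
    """Swap each value with a partner whose rank differs by at most `window`."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")
    sorted_vals = x[order].copy()
    swapped = np.zeros(n, dtype=bool)
    for i in range(n):
        if swapped[i]:
            continue
        hi = min(i + window, n - 1)
        # Ranks above i, within the window, that have not been swapped yet.
        candidates = np.flatnonzero(~swapped[i + 1:hi + 1]) + i + 1
        if candidates.size == 0:
            continue  # no partner available; this value stays in place
        j = rng.choice(candidates)
        sorted_vals[i], sorted_vals[j] = sorted_vals[j], sorted_vals[i]
        swapped[i] = swapped[j] = True
    masked = np.empty(n)
    masked[order] = sorted_vals
    return masked

# Hypothetical data: a narrow window versus a wide window.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=11.0, sigma=0.8, size=5_000)
for window in (10, 500):
    y = rank_swap(x, window, rng)
    print(window, np.corrcoef(x, y)[0, 1])
```

The masked values are a permutation of the original values, so the marginal distribution is unchanged; only the assignment of values to records moves.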
Data Swapping (Rank Proximity = 0.2% or the closest 100 records)
Information loss is low
Unfortunately disclosure risk is very high
The correlation between original and masked net asset value is 0.999
Data Swapping (Rank Proximity = 10% or the closest 5000 records)
Now information loss is very high, but disclosure risk is better
The problem with these approaches
There is an inherent problem with all approaches that generate the perturbed value as a function of the original value … Y = f(X,e)
These include all noise addition approaches, data swapping, microaggregation, and any variation of these approaches
Using Dalenius’ definition of disclosure risk, all these techniques result in disclosure
If we attempt to improve disclosure risk, it will adversely affect information loss (and vice versa)
What we need …
Is a method that will ensure that the release of the masked data does not result in any additional disclosure, but provides characteristics for the masked data that closely resemble the original data
From a statistical perspective, at least theoretically, there is a relatively easy solution
Conditional Distribution Approach
Data set consisting of a set of non-confidential variables S and confidential variables X
Identify the joint distribution f(S,X)
Compute the conditional distribution f(X|S)
Generate the masked values yi using f(X|S = si)
When S is null, simply generate a new data set with the same characteristics as f(X)
Then the joint distribution of (S and Y) is the same as that of (S and X): f(S,Y) = f(S,X)
Little or no information loss since the joint distribution of the original and masked data are the same
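The steps above can be sketched for the one case where they are easy to carry out: a single S and a single X whose joint distribution is assumed to be bivariate normal with known parameters. That assumption, the function name `cda_bivariate_normal`, and all parameter values are illustrative only; real data rarely offer a known f(S,X).

```python
import numpy as np

def cda_bivariate_normal(s, mu_s, mu_x, sd_s, sd_x, rho, rng):
    """Draw y_i from f(X | S = s_i) when (S, X) is bivariate normal."""
    s = np.asarray(s, dtype=float)
    cond_mean = mu_x + rho * (sd_x / sd_s) * (s - mu_s)
    cond_sd = sd_x * np.sqrt(1.0 - rho**2)
    # The original x values are never used: y depends only on s and f(S, X).
    return rng.normal(cond_mean, cond_sd)

# Hypothetical parameters of f(S, X).
rng = np.random.default_rng(0)
s = rng.normal(40.0, 12.0, size=100_000)
y = cda_bivariate_normal(s, mu_s=40.0, mu_x=250.0, sd_s=12.0,
                         sd_x=60.0, rho=0.6, rng=rng)
print(y.mean(), y.std(), np.corrcoef(s, y)[0, 1])
```

Since Y is generated from S and the distribution alone, the pair (S, Y) follows the same joint distribution as (S, X), while Y carries no record-level information about X beyond what S already provides.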
Disclosure Risk of CDA
When the masked data is generated using CDA, it can be verified that f(X|Y,S,A) = f(X|S,A)
Releasing the masked microdata Y does not provide any new information to the intruder over and above the non-confidential variables S and A (the aggregate information regarding the joint distribution of S and X)
CDA is the answer … but
The CDA approach results in very low information loss and minimizes disclosure risk and represents a complete solution to the data masking problem
Unfortunately, in practice
Identifying f(S,X) may be very difficult
Deriving f(X|S) may be very difficult
Generating yi using f(X|S) may be very difficult
In practice, it is unlikely that we can use the conditional distribution approach
Model Based Approaches
Model based approaches for data masking essentially attempt to model the data set by using an assumed f*(S,X) for the joint distribution of (S and X), derive f*(X|S), and generate the masked values from this distribution
The masked data f(S,Y) will have the joint distribution f*(S,X) rather than the true joint distribution f(S,X)
If the data is generated using f*(X|S) then the masking procedure minimizes disclosure risk since f(X|Y,S,A) = f(X|S,A)
Disclosure risk example
Assume that we have one non-confidential variable S and one confidential variable X
Y = (a × S) + e (where e is the noise term)
We will always get better prediction if we attempt to predict X using S rather than Y (since Y is noisier than S)
Since we have access to both S and Y, and since S would always provide more information about X than Y, an intelligent intruder will always prefer to use S to predict X than Y
More importantly, since Y is a function of S and random noise, once S is used to predict X, including Y will not improve your predictive ability
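The argument in this example can be checked numerically. The sketch below simulates hypothetical data (the coefficient 2.0, the noise scales, and the helper `r_squared` are assumptions) and compares how well an intruder predicts X from S alone, from Y alone, and from S and Y together.

```python
import numpy as np

def r_squared(predictors, target):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    return 1.0 - resid.var() / target.var()

rng = np.random.default_rng(0)
n = 50_000
s = rng.normal(size=n)                  # non-confidential variable
x = 2.0 * s + rng.normal(size=n)        # confidential variable
y = 2.0 * s + rng.normal(size=n)        # masked: a*S plus fresh noise, x not used

r_s = r_squared(s, x)
r_y = r_squared(y, x)
r_sy = r_squared(np.column_stack([s, y]), x)
print(r_s, r_y, r_sy)
```

In this simulation S predicts X better than Y does, and adding Y to S changes the fit only by the negligible amount that any unrelated extra regressor would contribute in-sample.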
Model Based Masking Methods
Methods that we have developed and I will be talking about
General additive data perturbation
Copula based perturbation
Data shuffling
Other Methods
PRAM
Multiple imputation
Skew t perturbation
General Additive Data Perturbation (GADP)
A linear model based approach. Can maintain the mean vector and covariance matrix of the masked data to be exactly the same as the original data
The same as sufficiency based noise addition with proximity parameter = 0
Ensures that the results of all traditional, parametric statistical analyses using the masked data are exactly the same as that using the original data
Ensure that the release of the masked microdata results in no incremental disclosure
Procedure
From original data estimate the linear regression model X = β0 + β1S + ε. Let b0 and b1 represent the estimates of β0 and β1 and let Σee represent estimate of the covariance of the noise term ε.
Generate a set of noise terms e with mean vector 0 and covariance matrix (exactly equal to) Σee and also orthogonal to both X and S. Distribution of e is immaterial although typically MV normal.
Generate yi = b0 + b1Si + ei (i = 1, 2, …, N)
The mean vector and covariance matrix of (S,Y) is exactly the same as (S,X)
In the original GADP, these measures were maintained only asymptotically. Burridge (2003) suggested the methodology for maintaining these exactly. We modified this further to ensure minimum disclosure risk (Muralidhar and Sarathy 2005).
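The three steps of the procedure can be sketched in Python for a single confidential variable X. This is a hedged illustration, not the published algorithm: the function name `gadp` is hypothetical, and the noise is made exactly orthogonal to X and S by taking least-squares residuals of random normal draws, which is one convenient way to meet the stated conditions.

```python
import numpy as np

def gadp(x, s, rng):
    """Masked y = b0 + b1*s + e with the sample mean of y and the sample
    covariance matrix of (s, y) equal to those of (s, x)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Step 1: estimate the regression of x on s.
    design = np.column_stack([np.ones(n), s])
    b, *_ = np.linalg.lstsq(design, x, rcond=None)
    fitted = design @ b
    resid_var = np.var(x - fitted, ddof=1)
    # Step 2: noise with mean zero, variance resid_var, orthogonal to s and x.
    raw = rng.standard_normal(n)
    full = np.column_stack([design, x])
    c, *_ = np.linalg.lstsq(full, raw, rcond=None)
    e = raw - full @ c
    e *= np.sqrt(resid_var / np.var(e, ddof=1))
    # Step 3: masked values.
    return fitted + e

# Hypothetical skewed, strictly positive confidential variable.
rng = np.random.default_rng(0)
s = rng.normal(size=20_000)
x = np.exp(1.0 + 0.5 * s + 0.5 * rng.normal(size=20_000))
y = gadp(x, s, rng)
print(np.cov(np.vstack([s, x])))
print(np.cov(np.vstack([s, y])))
print((x < 0).sum(), (y < 0).sum())
```

Because e is orthogonal to both S and X, a linear predictor of X that already uses S gains nothing from adding Y. Comparing the counts of negative values in x and y on skewed data shows the kind of marginal distortion a linear model can introduce.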
Minimum Disclosure Risk
GADP results in minimizing disclosure risk. We can show that an intruder would get the “best estimate” of the confidential values using just the non-confidential variables. The masked variables provide no additional information.
Even if you …
Had say 90% of the entire data set, you would not be able to predict the value of the confidential variables for the remaining 10% with any greater accuracy than you would using only the non-confidential data
Had 100% of all confidential variables except one AND 90% of the values for the last confidential variable, you would not be able to predict the confidential value of remaining records with any greater accuracy than you would using only the non-confidential variables.
(Lack of) Information Loss
By maintaining the mean vector and covariance matrix of the two data sets to be exactly the same, for any statistical analysis for which the mean vector and covariance matrix are sufficient statistics, we ensure that the parameter estimates using the masked data will be exactly the same as the original data
But …
Unfortunately, the marginal distribution of the original data set is altered significantly. In most situations, the marginal distribution of the masked variable bears little or no relationship to the original variable
The data also could have negative values when the original variable had only positive values
Marginal Distribution of Home Value
The change in the marginal distribution means that other analyses pertaining to the distribution of the confidential variables are not maintained
Residual analysis from regression would be very different
Negative values that did not exist in the original data
Non-Linear Relationships
Since a linear model is used, any non-linear relationships that may have been present in the data are modified (linearized)
GADP … Useful … But …
GADP is useful in a limited context. If the confidential variables do not exhibit significant deviations from normality, then GADP would represent a good solution to the problem
In other cases, GADP represents a limited solution to the specific users who will use the data mainly for traditional statistical analysis
Improving GADP
We would like the masking procedure to provide some additional benefits (while still minimizing disclosure risk)
Maintain the marginal distribution
Maintain non-linear relationships
To do this, we need to move beyond linear models
Multiplicative models are not very useful since, in essence, they are just variations of the linear model
Copula Based GADP
In statistics, copulas have traditionally been used to model the joint distribution of a set of variables with arbitrary marginal distributions and specified dependence characteristics
Copulas provide the ability to maintain the marginal, non-normal distribution of the original attributes to be the same after masking and to preserve certain types of dependence between the attributes
Characteristics of the C-GADP
C-GADP minimizes disclosure risk
C-GADP provides the following information loss characteristics
The marginal distribution of the confidential variables is maintained
All monotonic relationships are preserved
Rank order correlation
Product moment correlation
Non-monotonic relationships will be modified
An Important Extension
Consider a situation where we have a confidential variable X and a set of non-confidential variables S. If we assume that the MV Copula is appropriate for modeling the data, then the perturbed data Y can be viewed as an independent realization from f(X|S). The marginal of Y is simply a different realization from the same marginal as X. This being the case, reverse map the original values of X in place of the masked values Y. Now the “values” of Y are the same as those of X, but they have been “shuffled”.
Data Shuffling (US Patent 7200757)
In the above, we use the multivariate normal copula to generate YP.
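A simplified, rank-only sketch of the idea for one X and one S is given below. It is an illustration under stated assumptions, not the patented procedure: the rank order correlation r is converted to a normal-copula correlation with ρ = 2 sin(πr/6), S is replaced by normal scores, a perturbed normal score is drawn conditional on S, and the original X values are then assigned by the rank of that perturbed score. The names `ranks` and `shuffle_sketch` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def ranks(v):
    """Ranks 0..n-1 (ties broken by position)."""
    return np.argsort(np.argsort(v, kind="stable"), kind="stable")

def shuffle_sketch(x, s, rng):
    """Reassign the original x values to records using a normal copula on ranks."""
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    n = len(x)
    rank_s = ranks(s)
    # Rank order correlation of (x, s), converted to the copula correlation.
    r = np.corrcoef(ranks(x), rank_s)[0, 1]
    rho = 2.0 * np.sin(np.pi * r / 6.0)
    # Normal scores of s.
    inv_cdf = NormalDist().inv_cdf
    s_star = np.array([inv_cdf((k + 1) / (n + 1)) for k in rank_s])
    # Perturbed normal score drawn conditional on s only (x is not used).
    y_star = rho * s_star + rng.normal(0.0, np.sqrt(1.0 - rho**2), size=n)
    # Reverse map: the record with the j-th smallest y_star gets the
    # j-th smallest original x value.
    return np.sort(x)[ranks(y_star)]

# Hypothetical data with a monotonic but non-linear relationship.
rng = np.random.default_rng(0)
s = rng.normal(size=5_000)
x = np.exp(0.8 * s + 0.6 * rng.normal(size=5_000))
y = shuffle_sketch(x, s, rng)
print(np.corrcoef(ranks(x), ranks(s))[0, 1], np.corrcoef(ranks(y), ranks(s))[0, 1])
```

The output is a permutation of the original values, so the marginal distribution is identical, and the rank order correlation with S is reproduced approximately.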
Characteristics of Data Shuffling
Offers all the same benefits as C-GADP
Minimum disclosure risk
Information loss
Maintains the marginal distribution
Maintains all monotonic relationships
Additional benefits
There is no “modification” of the values. The original values are used
The marginal distribution of the masked data is exactly the same as the original data
Implementation can be performed using only the ranks
A small example
Some shuffled values are far apart, others are closer
Impossible to predict original position after the fact, which assures low disclosure risk
Rank order correlation pre and post masking are very close. Improves with the size of the data set
X is less correlated with Y and more correlated with S
Advantages of Data Shuffling
Data shuffling is a hybrid (perturbation and swapping), non-parametric (can be implemented only with rank information) technique for data masking that minimizes disclosure risk and offers the lowest level of information loss among existing methods of data masking
Will not maintain non-monotonic relationships
Does not preserve tail dependence
Can be overcome by using t-copula instead of normal copula
Practically Viable
Data shuffling can be implemented easily even for relatively large data sets. We are in the process of developing two versions of software based on data shuffling
Java based for large applications
Excel based for smaller applications
Future Research
Investigate other methods for modeling the joint distribution of the variables to reduce information loss further. Other copula functions? Some other approach?
Investigate non-statistical approaches for producing a masked data set that closely resembles the original data (while minimizing disclosure risk)
Masking methods for discrete numerical data
Some Important References
Dalenius, T., “Towards a methodology for statistical disclosure control,” Statistisk Tidskrift, 5, 429–444, 1977.
Fuller, W. A., “Masking procedures for microdata disclosure limitation,” Journal of Official Statistics, 9, 383–406, 1993.
Rubin, D. B., “Discussion of statistical disclosure limitation,” Journal of Official Statistics, 9, 461–468, 1993.
Moore, R. A., “Controlled data swapping for masking public use microdata sets,” Research report series no. RR96/04, U.S. Census Bureau, Statistical Research Division, Washington, D.C., 1996.
Burridge, J., “Information preserving statistical obfuscation,” Statistics and Computing, 13, 321–327, 2003.
Domingo-Ferrer, J. and J.M. Mateo-Sanz, “Practical data-oriented microaggregation for statistical disclosure control,” IEEE Transactions on Knowledge and Data Engineering, 14, 189-201, 2002.
Our Publications Relating to Data Masking
Muralidhar, K. and R. Sarathy, "Generating Sufficiency-based Non-Synthetic Perturbed Data," Transactions on Data Privacy, 1(1), 17-33, 2008.
Muralidhar, K. and R. Sarathy, "Data Shuffling- A New Masking Approach for Numerical Data," Management Science, 52(5), 658-670, 2006.
Muralidhar, K. and R. Sarathy, “A Comparison of Multiple Imputation and Data Perturbation for Masking Numerical Variables,” Journal of Official Statistics, 22(3), 507-524, 2006.
Muralidhar, K. and R. Sarathy, " A Theoretical Basis for Perturbation Methods," Statistics and Computing, 13(4), 329-335, 2003.
Sarathy, R., K. Muralidhar, and R. Parsa, "Perturbing Non-Normal Confidential Attributes: The Copula Approach," Management Science, 48(12), 1613-1627, 2002.
Muralidhar, K., R. Parsa, and R. Sarathy, "A General Additive Data Perturbation Method for Database Security," Management Science, 45(10), 1399-1415, 1999.
Muralidhar, K., D. Batra, and P. Kirs, “Accessibility, Security, and Accuracy in Statistical Databases: The Case for the Multiplicative Fixed Data Perturbation Approach,” Management Science, 41(9), 1549-1564,1995.
Other Related Research
Assessing disclosure risk
Muralidhar, K. and R. Sarathy, "Security of Random Data Perturbation Methods," ACM Transactions on Database Systems, 24(4), 487-493, 1999.
Sarathy, R. and K. Muralidhar, "The Security of Confidential Numerical Data in Databases," Information Systems Research, 13(4), 389-403, 2002.
Li, H., K. Muralidhar, and R. Sarathy, “Assessment of Disclosure Risk when using Confidentiality via Camouflage,” Operations Research, 55(6), 1178-1182, 2007.
Framework for evaluating masking techniques
Muralidhar, K. and R. Sarathy, “A Theoretical Comparison of Data Masking Techniques for Numerical Microdata,” to be presented at the 3rd IAB Workshop on Confidentiality and Disclosure - SDC for Microdata, Nuremberg, Germany, 2008
Web Site URL
You can find many of our papers and presentations at our web site:
http://gatton.uky.edu/faculty/muralidhar/maskingpapers/
I will be happy to share any papers or presentations that are not available on the web site.
Conclusion
There are a host of techniques that are available for masking numerical data. These techniques have a long history in the statistical disclosure limitation literature. There is considerable overlap between the data masking research in the statistical disclosure limitation research community and the privacy preserving data mining research in the CS community. Unfortunately, there seems to be only a limited cooperation between the researchers in the two fields. I believe that each field can make a significant contribution to the other. I hope that this presentation contributes to enhancing the discussion between CS and SDL researchers … at least at UK.