Data Perturbation: An Inference Control Method for Database Security
Dissertation Defense
Bob Nielson
Oct 23, 2009
I. Introduction
• Most security concerns can be handled with the grant command.
• Others require a view approach.
• But what happens if we wish to disclose partial information about a table field, but not the individual records?
I. Introduction – The Problem
• The problem is to allow statistical analysis of the data while still protecting individual records.
• Example: Given a database of cancer patients, allow a researcher to learn the cancer rate, but not that patient X has cancer.
I. Introduction – The Problem
Name    Dept  Sex  Salary
Bob     CS    M     30,000
Fred    CS    M    100,000
Mary    CS    F     50,000
Tim     IT    M     50,000
Tom     IT    M     60,000
Martha  IT    F     70,000
Ken     IT    M     50,000
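To see why aggregate access alone does not protect individuals, here is a minimal sketch using the table above. The function name `average_salary` and the in-memory list are illustrative assumptions, not part of the original work.

```python
# Toy copy of the employee table from the slide.
employees = [
    ("Bob", "CS", "M", 30_000),
    ("Fred", "CS", "M", 100_000),
    ("Mary", "CS", "F", 50_000),
    ("Tim", "IT", "M", 50_000),
    ("Tom", "IT", "M", 60_000),
    ("Martha", "IT", "F", 70_000),
    ("Ken", "IT", "M", 50_000),
]


def average_salary(rows, dept, sex):
    """Aggregate query: mean salary of the rows matching dept and sex."""
    matches = [salary for _, d, s, salary in rows if d == dept and s == sex]
    return sum(matches) / len(matches)


# Only one CS employee is female, so this "average" is one person's exact salary.
leaked = average_salary(employees, "CS", "F")
print(leaked)  # 50000.0
```

The query never names Mary, yet its answer is her salary because the query set contains a single record.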
II. Related Work
• Suppression
• Anonymization
• Partitioning
• Data Logging
• Conceptual
• Hybrid
• Perturbation
II. Related Work – Suppression
• Each query must access at least n records.
• Only n queries are allowed per day.
• There are known methods to get around these protections.
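One well-known way around a minimum query-set size is a difference attack: two permitted queries whose results differ by exactly one record. The sketch below assumes a hypothetical threshold of 2 records and a hypothetical `total_salary` query function; it reuses the example employee table.

```python
employees = [
    ("Bob", "CS", "M", 30_000),
    ("Fred", "CS", "M", 100_000),
    ("Mary", "CS", "F", 50_000),
    ("Tim", "IT", "M", 50_000),
    ("Tom", "IT", "M", 60_000),
    ("Martha", "IT", "F", 70_000),
    ("Ken", "IT", "M", 50_000),
]

MIN_QUERY_SET_SIZE = 2  # hypothetical value of n for this illustration


def total_salary(rows, predicate):
    """SUM query that is refused when it touches too few records."""
    matches = [row for row in rows if predicate(row)]
    if len(matches) < MIN_QUERY_SET_SIZE:
        raise PermissionError("query set too small")
    return sum(row[3] for row in matches)


# The direct query on the single female CS employee is refused.
try:
    total_salary(employees, lambda r: r[1] == "CS" and r[2] == "F")
    direct_query_blocked = False
except PermissionError:
    direct_query_blocked = True

# Two permitted queries differ by exactly that one record.
all_cs = total_salary(employees, lambda r: r[1] == "CS")
cs_men = total_salary(employees, lambda r: r[1] == "CS" and r[2] == "M")
marys_salary = all_cs - cs_men
print(direct_query_blocked, marys_salary)  # True 50000
```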
II. Related Work – Anonymization
• Replace the identifying fields with special characters.
• This method can still be compromised: the remaining fields, such as Dept and Sex, may still single out one record.
II. Related Work – Anonymization
Name  Dept  Sex  Salary
*     CS    M     30,000
*     CS    M    100,000
*     CS    F     50,000
*     IT    M     50,000
*     IT    M     60,000
*     IT    F     70,000
*     IT    M     50,000
II. Related Work – Partitioning
• All queries must access more than one band of records.
II. Related Work – Partitioning
Name    Dept  Sex  Salary
Bob     CS    M     30,000
Fred    CS    M    100,000
Mary    CS    F     50,000
Tim     IT    M     50,000
Tom     IT    M     60,000
Martha  IT    F     70,000
Ken     IT    M     50,000
II. Related Work – Logging
• A log of every query run is kept.
• Before a query is allowed, all possible inferences are checked. If the query would release an individual record, it is not permitted.
• Soon no queries are allowed.
II. Related Work – Conceptual
• Design the database so that no confidential information is stored.
II. Related Work – Hybrid
• Try using a combination of several of these methods.
II. Related Work – Perturbation
• Output Perturbation
• Data Perturbation
• Liew Perturbation
• Nielson Perturbation
• Note: Perturbation means deliberately altering values.
II. Related Work – Output Perturbation
• Output perturbation works by changing the output of the query, not the physical data.
II. Related Work – Output Perturbation
#    Data   Output Perturbed
1    101
2    92
3    103
4    91
5    100
6    81
7    122
8    113
9    103
10   94
x̄    100    85
s    11.6
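A minimal sketch of output perturbation on the ten values above: the stored data stays untouched and only the answer to the query is distorted. The noise model (a uniform random factor within ±20% of the true mean) and the name `perturbed_mean` are assumptions for illustration; the slide does not say how its perturbed answer of 85 was produced.

```python
import random
import statistics

data = [101, 92, 103, 91, 100, 81, 122, 113, 103, 94]


def perturbed_mean(values, rng, max_relative_noise=0.20):
    """Answer a mean query with noise added to the result only."""
    true_mean = statistics.mean(values)
    factor = rng.uniform(1 - max_relative_noise, 1 + max_relative_noise)
    return true_mean * factor


rng = random.Random(0)
snapshot = list(data)
answer = perturbed_mean(data, rng)

print(statistics.mean(data))             # true mean: 100
print(round(statistics.stdev(data), 1))  # sample standard deviation: 11.6
print(data == snapshot)                  # True: the stored values are unchanged
```

Because fresh noise is drawn on every call, repeating the same query gives different answers, and averaging many repeats moves back toward the true mean.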
II. Related Work – Data Perturbation
• Data perturbation works by changing the physical data.
• Two common methods:
1. Add a random value to each value.
2. Multiply each value by a random value.
II. Related Work – Data Perturbation
#    Data   20% Perturbed Data   Error
6    81     96                   15
4    91     95                   4
2    92     79                   13
10   94     89                   5
5    100    114                  14
1    101    105                  4
3    103    120                  17
9    103    106                  3
8    113    129                  16
7    122    120                  2
x̄    100    105.3                9.3
s    11.6   15.7                 6.1
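The two common methods can be sketched as follows. The uniform noise distributions, the function names, and the random seed are assumptions for illustration; the 20% bound mirrors the table above, but the output will not reproduce the table's values.

```python
import random

data = [101, 92, 103, 91, 100, 81, 122, 113, 103, 94]


def perturb_additive(values, max_shift, rng):
    """Method 1: add a random value in [-max_shift, +max_shift] to each value."""
    return [v + rng.uniform(-max_shift, max_shift) for v in values]


def perturb_multiplicative(values, max_fraction, rng):
    """Method 2: multiply each value by a random factor within +/- max_fraction of 1."""
    return [v * rng.uniform(1 - max_fraction, 1 + max_fraction) for v in values]


rng = random.Random(42)
shifted = perturb_additive(data, 20, rng)
scaled = perturb_multiplicative(data, 0.20, rng)

# Absolute error per record, as in the "Error" column of the table.
errors = [abs(p - v) for p, v in zip(scaled, data)]
mean_error = sum(errors) / len(errors)
print(round(mean_error, 1))
```

Unlike output perturbation, the perturbed values are what gets stored, so repeating a query always returns the same answer.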
II. Related Work – Data Perturbation
[Figure: "Uniform Random Distribution". Fitness (0 to 120) plotted against query size (0 to 1,200).]
II. Related Work – Liew Perturbation
• Liew perturbation steps:
1. Calculate the average, standard deviation, and count of the data.
2. Generate a new data set with the same average, standard deviation, and count.
3. Sort both data sets in ascending order.
4. Replace each original value with the generated value of the same rank.
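The four steps can be sketched as below. Two choices here are assumptions rather than part of the slides: the new values are drawn from a normal distribution, and they are rescaled so the sample mean and standard deviation match the original exactly. The name `liew_perturb` is also illustrative.

```python
import random
import statistics


def liew_perturb(values, rng):
    """Rank-preserving replacement with generated data of equal mean and stdev."""
    n = len(values)
    # Step 1: summary statistics of the original data.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)

    # Step 2: generate n values, then rescale to match mean and stdev exactly.
    draws = [rng.gauss(0, 1) for _ in range(n)]
    d_mean = statistics.mean(draws)
    d_stdev = statistics.stdev(draws)
    generated = [mean + stdev * (d - d_mean) / d_stdev for d in draws]

    # Step 3: sort both data sets in ascending order.
    generated.sort()
    order = sorted(range(n), key=lambda i: values[i])

    # Step 4: the k-th smallest original value gets the k-th smallest new value.
    perturbed = [0.0] * n
    for rank, index in enumerate(order):
        perturbed[index] = generated[rank]
    return perturbed


data = [101, 92, 103, 91, 100, 81, 122, 113, 103, 94]
perturbed = liew_perturb(data, random.Random(3))
print(round(statistics.mean(perturbed), 1), round(statistics.stdev(perturbed), 1))
```

Matching by rank keeps each record close to its original position in the distribution, which is why the per-record errors stay small.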
II. Related Work – Liew Perturbation
#    Data   Liew Perturbed Data   Error
6    81     87                    6
4    91     93                    2
2    92     97                    5
10   94     99                    5
5    100    99                    1
1    101    101                   0
3    103    105                   2
9    103    115                   12
8    113    118                   5
7    122    121                   1
x̄    100    103.5                 3.9
s    11.6   11.1                  3.5
II. Related Work – Liew Perturbation
[Figure: "Liew Perturbation". Fitness (0 to 120) plotted against query size (0 to 1,200).]
III. Hypothesis and Proof
• Prove:
• H1: Nielson perturbation is better than No Perturbation
• H2: Nielson perturbation is better than data perturbation (20%)
• H3: Nielson perturbation is better than Liew perturbation (20%)
III. Hypothesis and Proof
• Disprove:
• H1: Nielson perturbation is not better than No Perturbation
• H2: Nielson perturbation is not better than data perturbation (20%)
• H3: Nielson perturbation is not better than Liew perturbation (20%)
IV. Methodology
• What is Nielson Perturbation?
• Calculating the absolute error . . .
• Finding optimal values for Nielson perturbation . . .
• Experimental design . . .
• Conducting the experiment . . .
IV. Methodology – Nielson Perturbation
• Nielson Perturbation is a form of data perturbation.
• For the first gamma records in the data set, each value is multiplied by a random value between alpha and beta.
• The resulting value is randomly negated.
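The sketch below follows one literal reading of the three bullets above and is not the author's implementation. It assumes the random factor is uniform between the smaller and larger of alpha and beta, that gamma is truncated to a whole record count, and that negation happens with probability one half. The function name and the parameter values in the usage lines are purely illustrative.

```python
import random


def nielson_perturb(values, alpha, beta, gamma, rng):
    """Perturb the first `gamma` records; leave the rest untouched."""
    low, high = sorted((alpha, beta))
    cutoff = int(gamma)  # assumption: gamma is truncated to a record count
    perturbed = []
    for index, value in enumerate(values):
        if index < cutoff:
            factor = rng.uniform(low, high)
            if rng.random() < 0.5:  # assumption: negate with probability 1/2
                factor = -factor
            perturbed.append(value * factor)
        else:
            perturbed.append(value)
    return perturbed


data = [101, 92, 103, 91, 100, 81, 122, 113, 103, 94]
# Illustrative parameters only: factor range 1.2 to 2.0, first 3 records perturbed.
result = nielson_perturb(data, alpha=2.0, beta=1.2, gamma=3, rng=random.Random(7))
print(result[3:] == data[3:])  # True: records past the cutoff are unchanged
```

Under this reading, a query that touches only a few of the perturbed records can be far from the true answer, while a query over many records is dominated by the unperturbed ones.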
IV. Methodology – Nielson Perturbation
#    Data   Nielson Perturbed Data   Error
6    81     72                       9
4    91     81                       10
2    92     81                       11
10   94     76                       18
5    100    116                      16
1    101    86                       15
3    103    83                       20
9    103    89                       14
8    113    98                       15
7    122    106                      16
x̄    100    88.8                     14.4
s    11.6   13.8                     3.5
IV. Methodology – Alpha/Beta/Gamma
• What are the best values?
• An evolutionary algorithm was deployed.
• The results after several days of computation were:
1. Alpha = 2.09
2. Beta = 1.18
3. Gamma = 66.87
IV. Methodology – Evolutionary Results
Alpha  Beta  Gamma   Fitness
1.50   1.11  335.38  46.30
1.75   0.75  269.71  46.49
1.33   1.27  233.69  46.41
1.40   1.33  382.70  46.43
2.09   1.18  66.87   46.20
1.34   1.24  154.18  46.55
1.33   1.16  60.31   47.26
1.48   0.93  193.00  46.90
1.63   1.09  105.35  46.70
1.29   0.97  106.76  46.92
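As a quick check on the table, the row with the smallest fitness can be picked out directly, since smaller fitness scores are better. This only re-reads the ten reported results; it does not re-run the evolutionary search.

```python
# (alpha, beta, gamma, fitness) rows copied from the table above.
runs = [
    (1.50, 1.11, 335.38, 46.30),
    (1.75, 0.75, 269.71, 46.49),
    (1.33, 1.27, 233.69, 46.41),
    (1.40, 1.33, 382.70, 46.43),
    (2.09, 1.18, 66.87, 46.20),
    (1.34, 1.24, 154.18, 46.55),
    (1.33, 1.16, 60.31, 47.26),
    (1.48, 0.93, 193.00, 46.90),
    (1.63, 1.09, 105.35, 46.70),
    (1.29, 0.97, 106.76, 46.92),
]

# Smaller fitness is better, so take the minimum over the last column.
best = min(runs, key=lambda run: run[3])
print(best)  # (2.09, 1.18, 66.87, 46.2)
```

The minimum matches the reported Alpha = 2.09, Beta = 1.18, Gamma = 66.87.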
IV. Methodology – Nielson Perturbation
[Figure: "Nielson Perturbation". Fitness (-250 to 150) plotted against query size (0 to 1,200).]
IV. Methodology – Nielson Perturbation
[Figure: "Nielson - First Part". Fitness (-250 to 150) plotted against query size (0 to 120).]
IV. Methodology – The Method
• Calculate the average error of each method.
• Use the central limit theorem:
The distribution of sample averages approaches a normal distribution as the sample size grows.
IV. Methodology – The Method
• Use a t-test to determine whether two sample means are statistically different from each other at the 95% confidence level.
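A minimal sketch of such a comparison, using only the Python standard library. The slides do not say which t-test variant was used; this sketch assumes Welch's unequal-variance form and, because the samples are large, compares the statistic with the normal critical value (about 1.96) instead of an exact t quantile. The two samples are synthetic placeholders, not results from the study.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


def differs_at_95(sample_a, sample_b):
    """Two-sided test at the 0.05 level, normal approximation for large samples."""
    t, _ = welch_t(sample_a, sample_b)
    critical = statistics.NormalDist().inv_cdf(0.975)
    return abs(t) > critical


# Synthetic average-fitness samples for two hypothetical methods.
rng = random.Random(0)
method_a = [rng.gauss(46, 1) for _ in range(1000)]
method_b = [rng.gauss(50, 1) for _ in range(1000)]
print(differs_at_95(method_a, method_b))
```

For small samples the exact t distribution with the returned degrees of freedom should be used instead of the normal approximation.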
IV. Methodology – Monte Carlo Simulation
• Randomly generate 100,000 databases and execute hundreds of queries.
• I will use arrays to test the accuracy. Speed is of major importance here.
• Whether the data is held in arrays or in a database does not matter for calculating the accuracy of query outputs.
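A scaled-down sketch of the array-based simulation. Everything specific here is an assumption chosen to keep the example fast: 1,000-record databases with values uniform in 50 to 150, a 20% multiplicative perturbation as the method under test, average queries over randomly chosen records, and 200 trials per query size instead of 100,000 databases.

```python
import random


def mean_query_error(db_size, query_size, trials, rng, max_fraction=0.20):
    """Average absolute error of an AVG query over `trials` random databases."""
    total = 0.0
    for _ in range(trials):
        # One random "database", held as a plain list.
        original = [rng.uniform(50, 150) for _ in range(db_size)]
        perturbed = [
            v * rng.uniform(1 - max_fraction, 1 + max_fraction) for v in original
        ]
        # One random query: the average over `query_size` distinct records.
        chosen = rng.sample(range(db_size), query_size)
        true_avg = sum(original[i] for i in chosen) / query_size
        seen_avg = sum(perturbed[i] for i in chosen) / query_size
        total += abs(true_avg - seen_avg)
    return total / trials


rng = random.Random(1)
small_query_error = mean_query_error(1000, 5, 200, rng)
large_query_error = mean_query_error(1000, 500, 200, rng)
print(round(small_query_error, 2), round(large_query_error, 2))
```

With independent noise per record, the error of an average shrinks as the query covers more records, so the small-query error comes out larger than the large-query error.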
IV. Methodology – Calculating the average error
• For small query sizes the error should be large, so that individual records stay protected.
• For large query sizes the error should be small, so that averages and totals stay useful.
IV. Methodology – The Fitness Function
e = |x - x'|
If q < n/2
    fitness = 100 - e
Else
    fitness = e
• Smaller fitness scores are better.
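The fitness rule above translates directly into a small function. The slides do not define the symbols, so this sketch assumes x is the true query result, x' the result on perturbed data, q the query size, and n the number of records in the database.

```python
def fitness(true_value, perturbed_value, query_size, db_size):
    """Score one query; smaller is better.

    Small queries (fewer than half the records) are rewarded for a large
    error, large queries for a small error.
    """
    error = abs(true_value - perturbed_value)
    if query_size < db_size / 2:
        return 100 - error
    return error


# A small query whose answer is off by 15 scores 85.
print(fitness(100, 85, query_size=2, db_size=10))   # 85
# A large query with the same error scores 15.
print(fitness(100, 85, query_size=8, db_size=10))   # 15
# A small query with an error above 100 scores below zero.
print(fitness(100, 350, query_size=2, db_size=10))  # -150
```

The third case shows how fitness can go negative, as on the Nielson perturbation plots.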
V. Results and Conclusions
V. Results and Conclusions – Significance
• There is a real need for partial disclosure of a field in a table.
• My method ensures a higher degree of security.
• My method still allows for release of averages and totals.
VI. Further Studies
• Transformation Times
• On-the-fly perturbation