op#mizing)u#lity:)) beststrategy)for)answering)queries · today’s)class) •...
TRANSCRIPT
Op#mizing u#lity: Best Strategy for Answering Queries
CompSci 590.03 Instructor: Ashwin Machanavajjhala
1 Lecture 14 : 590.03 Fall 13
Recap: Laplace Mechanism Thm: If sensi(vity of the query is S, then adding Laplace noise with
parameter λ guarantees ε-‐differen#al privacy, when
λ = S/ε
Sensi(vity: Smallest number s.t. for any d, d’ differing in one entry, || q(d) – q(d’) || ≤ S(q)
Histogram query: Sensi#vity = 2 • Variance / error on each entry = 2λ2 = 2x4/ε2
2 Lecture 14 : 590.03 Fall 13
Example 1: Enforcing Constraints • Database of values {x1, x2, …, xk}
• Query Set: – Value of x1 η1 = x1 + δ1 – Value of x2 η2 = x2 + δ2 – Value of x1 + x2 η3 = x1 + x2 + δ3
• But we know that -‐ η1 and η2 should sum up to η3 -‐ η1 <= η3 AND η2 <= η3
Lecture 13: 590.03 Fall 13 3
Example 2: Query Strategy • Query Set:
– Value of x1 – Value of x2 – Value of x1 + x2
• Strategy 1: Answer all queries – Sensi#vity = 2; Error in each query is 8/ε2.
• Strategy 2: Answer query 1 and query 2. Query 3 = query 1 + query 2
– Sensi#vity = 1 – Error in query 1 and query 2: 2/ε2. – Error in query 3: 4/ε2.
Lecture 13: 590.03 Fall 13 4
Today’s class • How to answer a workload of queries with the least error?
• Constrained inference: – Ensure the query answers are consistent with each other – Order constraints – Sum constraints
• Query Strategy: – A workload W may be best answered by answering a different query set A,
and then compu#ng W from A – Hierarchical, Wavelet and Matrix Mechanism for linear queries.
Lecture 13: 590.03 Fall 13 5
Outline • Strategies for answering a query workload
• All Range queries on 1D domain – Hierarchical Mechanism [Hay et al VLDB 10]
– Wavelet Mechanism [Xiao et al ICDE 09]
• General Query Workloads – Matrix Mechanism [Li et al PODS 10]
Lecture 14 : 590.03 Fall 13 6
Note • The following solu#on ideas are useful whenever
– You want to answer a set of correlated queries. – Queries are based on noisy measurements. – Each measurement (x1 or x1+x2) has similar variance.
Lecture 14 : 590.03 Fall 13 7
Query Strategy
Lecture 13: 590.03 Fall 13 8
I
Private Data
W A
Differen(al Privacy
A(I) A(I) W(I) ~ ~
Original Query Workload
Strategy Query Workload
Noisy Strategy Answers
Noisy Workload Answers
When do you do this? • Query workload W has a high sensi#vity
– Removing or changing a tuple affect a large number of the queries
• Query strategy A has low sensi#vity
• Each query in W can be answered using a small number of query answers in A – This ensures that the noise in answering queries in W is not too high
Lecture 14 : 590.03 Fall 13 9
Example Workload: All Range Queries • Given a set of values {v1, v2, …, vn} • Let xi = number of tuples with value v1. • Range query: q(j,k) = xj + … + xk Q: Suppose we want to answer all range queries?
Lecture 14 : 590.03 Fall 13 10
All Range Queries • Given a set of values {x1, x2, …, xn}
• Range query: q(j,k) = xj + … + xk Q: Suppose we want to answer all range queries?
Strategy 1: Answer all range queries using Laplace mechanism
• Sensi#vity = O(n2) • O(n4/ε2) total error across all range queries. • May reduce using constrained op#miza#on …
Lecture 13: 590.03 Fall 13 11
All Range Queries • Given a set of values {x1, x2, …, xn}
• Range query: q(j,k) = xj + … + xk Q: Suppose we want to answer all range queries?
Strategy 2: Answer all xi queries using Laplace mechanism Answer range queries using noisy xi values.
• O(1/ε2) error for each xi. • Error(q(1,n)) = O(n/ε2) • Total error on all range queries : O(n3/ε2)
Lecture 13: 590.03 Fall 13 12
Hierarchical Mechanism for Range Queries
Strategy 3: Answer sufficient sta?s?cs using Laplace mechanism Answer range queries using noisy sufficient sta#s#cs.
Lecture 13: 590.03 Fall 13 13
x1 x2 x3 x4 x5 x6 x7 x8
x12 x34 x56 x78
x1234 x5678
x1-‐8
[Hay et al VLDB 2010]
Hierarchical Mechanism for Range Queries
• Sensi#vity: log n • q(2,6) = x2 + x34 + x56 Error = 2 x 3log2n/ε2
Lecture 13: 590.03 Fall 13 14
x1 x2 x3 x4 x5 x6 x7 x8
x12 x34 x56 x78
x1234 x5678
x1-‐8
Hierarchical Mechanism for Range Queries
• Every range query can be answered by summing at most log n different noisy answers
• Maximum error on any range query = O(log3n / ε2) • Total error on all range queries = O(n2 log3n / ε2)
Lecture 13: 590.03 Fall 13 15
x1 x2 x3 x4 x5 x6 x7 x8
x12 x34 x56 x78
x1234 x5678
x1-‐8
Summary of Hierarchical Mechanism • Answering all range queries on a 1D axribute
– Op#mal mechanism for range queries is also an op#mal mechanism for compu#ng the CDF.
• Error depends on the query strategy – Noisily answer all range queries: Total error = O(n4/ε2)
– Noisily answer each count, and use them to answer range queries: Total error = O(n3/ε2)
– Hierarchical Mechanism Answer counts on a binary tree noisily: Total error = O(n2 log3n / ε2)
Lecture 14 : 590.03 Fall 13 16
Outline • All Range queries on 1D domain
– Hierarchical Mechanism [Hay et al VLDB 10]
– Wavelet Mechanism [Xiao et al ICDE 09]
• General Query Workloads – Matrix Mechanism [Li et al PODS 10]
Lecture 14 : 590.03 Fall 13 17
Wavelet Mechanism
Lecture 14 : 590.03 Fall 13 18
x1 x2 x3 x4 x5 xn
C2 C3 Cm
…
… C1
Step 1: Compute Wavelet coefficients
C2+η2 C3+η3 Cm+ηm … C1+η1
Step 2: Add noise to coefficients
y1 y2 y3 y4 y5 yn … Step 3: Reconstruct
original counts
Haar Wavelet
Lecture 14 : 590.03 Fall 13 19
Haar Wavelet
Lecture 14 : 590.03 Fall 13 20
For an internal node, Let a = average of leaves in
ley subtree Let b = average of leaves in
right subtree
Haar Wavelet Reconstruc#on
Lecture 14 : 590.03 Fall 13 21
Sum of coefficients on root to leaf path
• + if xi is in the ley subtree of coefficient
• -‐ if xi is in right subtree
Haar Wavelet : Range Queries
Lecture 14 : 590.03 Fall 13 22
Range Query: number of tuples in a range S = [a,b]
Let α(c) be the number of values in the
ley subtree of c that are in S Let β(c) be the number of values in the
right subtree of c that are in S
Haar Wavelet : Range Queries
Lecture 14 : 590.03 Fall 13 23
α(c) – β(c) = 0 when no leaves under c
are contained in S α(c) – β(c) = 0 when all leaves under c
are contained in S Only need to consider those coefficients
with par#al overlap with the range.
Adding noise to wavelet coefficients • Associate each coefficient with a weight • level( c ) = height of c in the tree.
• Generalized sensi#vity (ρ)
Lecture 14 : 590.03 Fall 13 25
Adding noise to wavelet coefficients Theorem: Adding noise to a coefficient c from Laplace(λ/W(c))
guarantees (2ρ/λ)-‐differen#al privacy. Proof:
Lecture 14 : 590.03 Fall 13 26
Generalized Sensi#vity of Wavelet Mechanism
Proof: • Any coefficient changes by 1/m, where m is the number of values
in its subtree. • m = 1/W(c) • Only c0 and the coefficients in one root to leaf path change if
some xi changes by 1.
Lecture 14 : 590.03 Fall 13 27
Error in answering range queries
• Range query depends on at most O(log n) coefficients.
• Error in each coefficient is at most O(log2n/ε2)
• Error in a range query is O(log3n/ε2)
Lecture 14 : 590.03 Fall 13 28
Summary of Wavelet Mechanism • Query Strategy: use wavelet coefficients
• Can be computed in linear #me
• Noise in each range query: O(log3n/ε2)
Lecture 14 : 590.03 Fall 13 29
Outline • All Range queries on 1D domain
– Hierarchical Mechanism [Hay et al VLDB 10]
– Wavelet Mechanism [Xiao et al ICDE 09]
• General Query Workloads – Matrix Mechanism [Li et al PODS 10]
Lecture 14 : 590.03 Fall 13 30
Linear Queries • A set of linear queries can be represented by a matrix • X = [x1, x2, x3, x4] is a vector
represen#ng the counts of 4 values • H4 X represents the following 7 queries
– x1+x2+x3+x4 – x1+x2 – x3+x4 – x1 – x2 – x3 – x4
Lecture 14 : 590.03 Fall 13 31
Query Matrices
Lecture 14 : 590.03 Fall 13 32
Iden(ty Binary Index Haar Wavelet
Sensi#vity of a Query Matrix • How many queries are affected by a change in a single count?
Lecture 14 : 590.03 Fall 13 33
Sensi(vity = 1 Sensi(vity = 3 Sensi(vity = 3
Laplace Mechanism
Lecture 14 : 590.03 Fall 13 34
Sensi(vity
Noise Vector of Laplace(1)
Matrix Mechanism
Lecture 14 : 590.03 Fall 13 35
Original Data Noisy
Representa(on
Reconstructed Data
Final query answer
Reconstruc#on
Lecture 14 : 590.03 Fall 13 36
Matrix Mechanism
Lecture 14 : 590.03 Fall 13 37
Error analysis
Lecture 14 : 590.03 Fall 13 38
Extreme strategies • Strategy A = In
– Noisily answer each xi – Answer queries using noisy counts
• Strategy A = W – Add noise to all the query answers
Lecture 14 : 590.03 Fall 13 39
Good when each query hits a few
values.
Good when sensi#vity is small (or each value hits a small # queries)
Finding the Op#mal Strategy • Find A that minimizes TotalErrorA(W)
– Reduces to solving a semi-‐definite program with rank constraints – O(n6) running #me.
• See paper for approxima#ons and an interes#ng discussion on geometry.
Lecture 14 : 590.03 Fall 13 40
Summary • A linear query workload and strategy can be modeled using
matrices
• Previous techniques to find a bexer strategy to answer a batch of queries is subsumed by the matrix mechanism
• General mechanism to answer queries.
• Noise depends on the sensi#vity of the strategy and AtA-‐1
Lecture 14 : 590.03 Fall 13 41
Next Class • Sparse Vector Technique
– Answering a workload of “sparse” queries
Lecture 14 : 590.03 Fall 13 42
References M. Hay, V. Rastogi, G. Miklau, D. Suciu, “Boos#ng the Accuracy of Differen#ally Private
Histograms Through Consistency”, PVLDB 2010 X. Xiao, G. Wang, J. Gehrke, “Differen#al Privacy via Wavelet Transform”, ICDE 2009 C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor, “Op#mizing Linear Queries under
Differen#al Privacy”, PODS 2010
Lecture 14 : 590.03 Fall 13 43