differential privacy on linked data: theory and implementation

Differential Privacy on Linked Data: Theory and ImplementationYotam Aron

Table of Contents

• Introduction• Differential Privacy for Linked Data• SPIM implementation• Evaluation

Contributions

• Theory on how to apply differential privacy to linked data.• Experimental implementation of differential

privacy on linked data.• Overall privacy module for SPARQL queries.

Introduction

Overview: Why Privacy Risk?

• Statistical data can leak privacy.• Mosaic Theory: Different data sources harmful

when combined.• Examples:• Netflix Prize Data set• GIC Medical Data set• AOL Data logs

• Linked data has added ontologies and meta-data, making it even more vulnerable.

http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-447/paper7.pdf

http://ftp.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-447/paper7.pdf

Current Solutions

• Accountability:• Privacy Ontologies• Privacy Policies and Laws

• Problems:• Requires agreement among parties.• Does not actually prevent breaches, just a deterent.• Heterogeneous

http://ceur-ws.org/Vol-813/ldow2011-paper01.pdf

Current Solutions (Cont’d)

• Anonymization• Delete “private” data• K – anonymity (Strong Privacy Guarantee)

• Problems• Deletion provides no strong guarantees• Must be carried out for every data set• What data should be anonymized?• High computational cost (k-anonimity is np-hard)

http://spdp.dti.unimi.it/papers/k-Anonymity.pdf

http://research.microsoft.com/pubs/77537/k-anonymity-jopt.pdf





Differential Privacy

• Definition for relational databases (from PINQ paper):

A randomized function K gives Ɛ-differential privacy if for all data sets and differing on at most one record, and all ,

Differential Privacy

• What does this mean?• Adversaries get roughly same results from and ,

meaning a single individual’s data will not greatly affect their knowledge acquired from each data set.

How Achieved?• Add noise to result.• Simplest: Add Laplace noise

Laplace Noise Parameters• Mean = 0 (so don’t add bias)• Variance = , where is defined, for a record j, as• • Theorem: For query Q result R, the output R + Laplace(0, ) is

differentially private.

Other Benefit of Laplace Noise• A set of queries each with sensitivity will have an overall

sensitivity of • Implementation-wise, can allocate an “budget” Ɛ for a client

and for each query client specifies to use.

Benefits of Differential Privacy• Strong Privacy Guarantee• Mechanism-Based, so don’t have to mess with data.• Independent of data set’s structure.• Works well with for statistical analysis algorithms.

Problems with Differential Privacy• Potentially poor performance• Complexity (especially for non-linear functions)• Noise

• Only works with statistical data (though this has fixes)• How to calculate sensitivity of arbitrary query?

Differential Privacy for Linked Data

Differential Privacy and Linked Data• Want same privacy guarantees for linked data without, but no

“records.”• What should be “unit of difference”?• One triple• All URIs related to person’s URI• All links going out from person’s URI


“records.”• What should be “unit of difference”?

•One triple• All URIs related to person’s URI• All links going out from person’s URI


“records.”• What should be “unit of difference”?• One triple

•All URIs related to person’s URI• All links going out from person’s URI


“records.”• What should be “unit of difference”?• One triple• All URIs related to person’s URI

•All links going out from person’s URI

“Records” for Linked Data• Reduce links in graph to attributes • Idea: • Identify individual contributions from a single individual to total

answer.• Find contribution that affects answer most.

“Records” for Linked Data• Reduce links in graph to attributes, makes it a record.

P1 P2Knows

Person Knows

P1 P2

“Records” for Linked Data• Repeated attributes and null values allowed

P1 P2Knows

P3 P4

Loves

Knows

Knows

“Records” for Linked Data• Repeated attributes and null values allowed (not good RDBMS form

but makes definitions easier)

Person Knows Knows Loves

P1 P2 Null P4

P3 P2 P4 Null

Query Sensitivity in Practice• Need to find triples that “belong” to a person.• Idea:• Identify individual contributions from a single individual to total

answer.• Find contribution that affects answer most.

• Done using sorting and limiting functions in SPARQL

Example• COUNT of places

visited

P1

P2

MA

S2

S3

State of Residence

S1

Visited

Example• COUNT of places

visited

P1

P2

MA

S2

S3

State of Residence

S1

Visited

Answer: Sensitivity of 2

Using SPARQL• Query:

(COUNT(?s) as ?num_places_visited) WHERE{?p :visited ?s }

Using SPARQL• Sensitivity Calculation Query (Ideally):

SELECT ?p (COUNT(ABS(?s)) as ?num_places_visited) WHERE{

?p :visited ?s;?p foaf:name ?n }

GROUP BY ?p ORDER BY ?num_places_visited LIMIT 1

In reality…• LIMIT, ORDER BY, GROUP BY doesn’t work together in 4store…• For now: Don’t use LIMIT and get top answers manually.• I.e. Simulate using these in python

• Would like to keep it on sparql-side ideally so there is less transmitted data (e.g. on large data sets)

(Side rant) 4store limitations• Many operations not supported in unison• E.g. cannot always filter and use “order by” for some reason• Severely limits the types of queries I could use to test.• May be desirable to work with a different triplestore that is

more up-to-date (ARQ). • Didn’t because wanted to keep code in python.• Also had already written all code for 4store

Problems with this Approach• Need to identify “people” in graph.• Assume, for example, that URI with a foaf:name is a person and

use its triples in privacy calculations.• Imposes some constraints on linked data format for this to work.• For future work, maybe there’s a way to automatically identify

private data, maybe by using ontologies.• Complexity is tied to speed of performing query over large

data set.

…and on the Plus Side• Model for sensitivity calculation can be expanded to arbitrary

statistical functions.• e.g. dot products, distance functions, etc.

• Relatively simple to implement using SPARQL 1.1

Differential Privacy Protocol

Differential Privacy ModuleClient SPARQL

Endpoint

Scenario: Client wishes to make standard SPARQL 1.1 statistical query. Client has Ɛ “budget” of overall accuracy for all queries.



Endpoint

Step 1: Query and epsilon value sent to the endpoint and intercepted by the enforcement module.

Query, Ɛ > 0



Endpoint

Step 2: The sensitivity of the query is calculated using a re-written, related query.

Sens Query



Endpoint

Step 3: Actual query sent.

Query



Endpoint

Step 4: Result with Laplace noise sent over.

Result and Noise

Design of Privacy System

SPARQL Privacy Insurance Module• i.e. SPIM• Use authentication, AIR, and differential privacy in one system.• Authentication to manage Ɛ-budgets.• AIR to control flow of information and non-statistical data.• Differential privacy for statistics.

• Goal: Provide a module that can integrate into SPARQL 1.1 endpoints and provide privacy.

Design

Triplestore

User DataPrivacy Policies

SPIM Main Process AIR Reasoner

Differential Privacy Module

HTTP ServerOpenID Authentication

HTTP Server and Authentication

• HTTP Server: Django server that handles http requests.

• OpenID Authentication: Django module.

HTTP Server

OpenID Authentication

SPIM Main Process• Controls flow of information. • First checks user’s budget, then

uses AIR, then performs final differentially-private query.

SPIM Main Process

AIR Reasoner• Performs access control by

translating SPARQL queries to n3 and checking against policies.

• Can potentially perform more complicated operations (e.g. check user credentials)

Privacy Policies

AIR Reasoner

Differential Privacy• Works as discussed in previous

slides.

• Contains users and their Ɛ-values.


User Data

Evaluation

Evaluation• Three things to evaluate:• Correctness of operation• Correctness of differential privacy• Runtime

• Used a anonymized clinical database as the test data and added fake names, social security numbers, and addresses.

Correctness of Operation• Can the system do what we want?• Authentication provides access control• AIR restricts information and types of queries• Differential privacy gives strong privacy guarantees.

• Can we do better?

Use Case Used in Thesis• Clinical database data protection• HIPAA: Federal protection of private information fields, such as

name and social security number, for patients.• 3 users• Alice: Works in CDC, needs unhindered access• Bob: Researcher that needs access to private fields (e.g.

addresses)• Charlie: Amateur researcher to whom HIPAA should apply

• Assumptions:• Django is secure enough to handle “clever attacks”• Users do not collude, so can allocate individual epsilon values.

Use Case Solution Overview• What should happen:• Dynamically apply different AIR policies at runtime.• Give different epsilon-budgets.

• How allocated:• Alice: No AIR Policy, no noise.• Bob: Give access to addresses but hide all other private

information fields.• Epsilon budget: E1

• Charlie: Hide all private information fields in accordance with HIPAA• Epsilon budget: E2

Use Case Solution Overview• Alice: No AIR Policy• Bob: Give access to addresses but hide all other private

information fields.• Epsilon budget: E1

• Charlie: Hide all private information fields in accordance with HIPAA• Epsilon budget: E2

Example: A Clinical Database• Client Accesses triplestore via

HTTP server. • OpenID Authentication verifies

user has access to data. Finds epsilon value,

HTTP Server

OpenID Authentication

Example: A Clinical Database• AIR reasoner checks incoming

queries for HIPAA violations.• Privacy policies contain HIPAA

rules.

Privacy Policies

AIR Reasoner

Example: A Clinical Database• Differential Privacy applied to

statistical queries.• Statistical result + noise

returned to client.


Correctness of Differential Privacy• Need to test how much noise is added.• Too much noise = poor results.• Too little noise = no guarantee.

• Test: Run queries and look at sensitivity calculated vs. actual sensitivity.

How to test sensitivity?• Ideally:• Test noise calculation is correct• Test that noise makes data still useful (e.g. by applying machine

learning algorithms).• Fort his project, just tested former• Machine learning APIs not as prevalent for linked data.• What results to compare to?

Test suite• 10 queries for each operation (COUNT, SUM, AVG, MIN, MAX)• 10 different WHERE CLAUSES• Test:• Sensitivity calculated from original query• Remove each personal URI using “MINUS” keyword and see

which removal is most sensitive

Example for Sens Test• Query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>PREFIX foaf: <http://xmlns.com/foaf/0.1#>PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>

SELECT (SUM(?o) as ?aggr) WHERE{ ?s foaf:name ?n. ?s mimic:event ?e. ?e mimic:m1 "Insulin". ?e mimic:v1 ?o. FILTER(isNumeric(?o))}

Example for Sens Test• Sensitivity query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>PREFIX foaf: <http://xmlns.com/foaf/0.1#>PREFIX mimic: <http://air.csail.mit.edu/spim_ontologies/mimicOntology#>

SELECT (SUM(?o) as ?aggr) WHERE{ ?s foaf:name ?n. ?s mimic:event ?e. ?e mimic:m1 "Insulin". ?e mimic:v1 ?o. FILTER(isNumeric(?o)) MINUS {?s foaf:name "%s"} } % (name)

Results Query 6 - Error

Runtime• Queries were also tested for runtime.• Bigger WHERE clauses• More keywords • Extra overhead of doing the calculations.

Results Query 6 - Runtime

Interpretation• Sensitivity calculation time on-par with query time• Might not be good for big data• Find ways to reduce sensitivity calculation time?

• AVG does not do so well…• Approximation yields too much noise vs. trying all possibilities• Runs ~4x slower than simple querying• Solution 1: Look at all data manually (large data transfer)• Solution 2: Can we use NOISY_SUM / NOISY_COUNT instead?

Conclusion

Contributions• Theory on how to apply differential privacy to linked data.• Experimental implementation of differential privacy.• Verification that it is applied correctly.

• Overall privacy module for SPARQL queries.• Limited but a good start

• Other:• Updated sparql to n3 translation to Sparql version 1.1• Expanded upon IARPA project to create policies against statistical

queries.

Shortcomings and Future Work• Triplestores need some structure for this to work• Personal information must be explicitly defined in triples.• Is there a way to automatically detect what triples would

constitute private information? • Complexity• Lots of noise for sparse data.• Can divide data into disjoint sets to reduce noise like PINQ does • Use localized sensitivity measures?

• Third party software problems• Would this work better using a different Triplestore

implementation?

Other work• Other implementations:• PINQ • Airavat• PDDP

• Some of the Theoretical Work Out There• Differential privacy paper• Exponential Mechanism• Noise Calculation• Differential Privacy and Machine Learning

http://research.microsoft.com/pubs/80218/sigmod115-mcsherry.pdf

http://www.cs.utexas.edu/~shmat/shmat_nsdi10.pdf

https://www.usenix.org/conference/nsdi12/towards-statistical-queries-over-distributed-private-user-data

http://research.microsoft.com/pubs/64346/dwork.pdf

http://research.microsoft.com/pubs/65075/mdviadp.pdf

http://people.csail.mit.edu/asmith/PS/sensitivity-tcc-final.pdf

http://www.stat.cmu.edu/~jiashun/Research/Area/KDD04.pdf

Appendix: Results Q1, Q2

Q2 Error Query_Time Sens_Calc_TimeCOUNT 0 0.015823126 0.011798859SUM 0 0.010298967 0.01198101AVG 868.8379 0.010334969 0.04432416MAX 0 0.010645866 0.012124062MIN 0 0.010524988 0.012120962

Appendix: Results Q3, Q4Q3 Error Query_Time Sens_Calc_TimeCOUNT 0 0.007927895 0.00800705SUM 0 0.007529974 0.007997036AVG 375.8253 0.00763011 0.030416012MAX 0 0.007451057 0.008117914MIN 0 0.007512093 0.008100986


Appendix: Results Q7, Q8Q7 Error Query_Time Sens_Calc_TimeCOUNT 0 0.006100178 0.004678965SUM 0 0.004260063 0.004747868AVG 0 0.004283905 0.017117977MAX 0 0.004103184 0.004703999MIN 0 0.004188061 0.004717112

Q8 Error Query_Time Sens_Calc_TimeCOUNT 0 0.002182961 0.002643108SUM 0 0.002092123 0.002592087AVG 0 0.002075911 0.002662182MAX 0 0.00207901 0.002576113MIN 0 0.002048969 0.002597094

differential privacy on linked data: theory and implementation

Documents

data setwhat data

statistical data

linked datawant

privacy risk

linked datareduce links

linked datarepeated

overall privacy module

persons uriall links