
Managing Distributed Queries under Personalized AnonymityConstraints

Axel Michel 1,2, Benjamin Nguyen 1,2 and Philippe Pucheral 2

1 SDS Team at LIFO, INSA-CVL, Boulevard Lahitolle, Bourges, France
2 PETRUS, INRIA Saclay, Palaiseau, France
{axel.michel, benjamin.nguyen}@insa-cvl.fr, [email protected]

Keywords: Data Privacy and Security, Big Data, Distributed Query Processing, Secure Hardware

Abstract: The benefit of performing Big Data computations over individuals' microdata is manifold, in the medical, energy or transportation fields to cite only a few, and this interest is growing with the emergence of smart disclosure initiatives around the world. However, these computations often expose microdata to privacy leakages, explaining the reluctance of individuals to participate in studies despite the privacy guarantees promised by statistical institutes. This paper proposes a novel approach to push personalized privacy guarantees into the processing of database queries so that individuals can disclose different amounts of information (i.e., data at different levels of accuracy) depending on their own perception of the risk. Moreover, we propose a decentralized computing infrastructure based on secure hardware enforcing these personalized privacy guarantees all along the query execution process. A performance analysis conducted on a real platform shows the effectiveness of the approach.

1 INTRODUCTION

In many scientific fields, ranging from medicine to sociology, computing statistics on (often personal) private and sensitive information is central to the discipline's methodology. With the advent of the Web, and the massive databases that compose it, statistics and machine learning have become "data science": their goal is to turn large volumes of information linked to a specific individual, called microdata, into knowledge. Big Data computation over microdata is of obvious use to the community: medical data is used to improve the knowledge of diseases and find cures; energy consumption is monitored in smart grids to optimize energy production and resource management. In these applications, real knowledge emerges from the analysis of aggregated microdata, not from the microdata itself[1].

Smart disclosure initiatives, pushed by legislators (e.g., the EU General Data Protection Regulation (European Union, 2016)) and industry-led consortiums (e.g., blue button and green button in the US[2], Midata[3]

[1] We thus do not consider applications such as targeted advertising, which seek to characterize users at the individual level.

[2] https://www.healthit.gov/patients-families/
[3] https://www.gov.uk/government/news/

in the UK, MesInfos[4] in France), hold the promise of a deluge of microdata of great interest for analysts. Indeed, smart disclosure enables individuals to retrieve their personal data from the companies or administrations that collected them. Current regulations carefully restrict the uses of this data to protect individuals' privacy. However, once the data is anonymized, its processing is far less restricted. This is good news, since in most cases, these operations (i.e., global database queries) can provide results of tunable quality when run on anonymized data.

Unfortunately, the way microdata is anonymized and processed today is far from satisfactory. Let us consider how a national statistical study is managed, e.g., computing the average salary per geographic region. Such a study is usually divided into 3 phases: (1) the statistical institute (assumed to be a trusted third party) broadcasts a query to collect raw microdata, along with anonymity guarantees (i.e., a privacy parameter like k in the k-anonymity or ε in the differential privacy sanitization models), to all users; (2) each user consenting to participate transmits her microdata to the institute; (3) the institute computes the aggregate query, while respecting the announced anonymity constraint.

[4] http://mesinfos.fing.org/


This approach has two important drawbacks:

1. The anonymity guarantee is defined by the querier (i.e., the statistical institute), and applies uniformly to all participants. If the querier decides to provide little privacy protection (e.g., a small k in the k-anonymity model), it is likely that many users will not want to participate in the query. On the contrary, if the querier decides to provide a high level of privacy protection, many users will be willing to participate, but the quality of the results will drop. Indeed, higher privacy protection is always obtained to the detriment of the quality, and thus the utility, of the sanitized data.

2. The querier is assumed to be trusted. Although this could be a realistic assumption in the case of a national statistics institute, this means it is impossible to outsource the computation of the query. Moreover, microdata centralization exacerbates the risk of privacy leakage due to piracy (the recent Yahoo and Apple hack attacks are emblematic of the weakness of cyber defenses[5]), scrutinization and opaque business practices. This erodes individuals' trust in central servers, thereby reducing the proportion of citizens consenting to participate in such studies, some of them unfortunately of great societal interest.

The objective of this paper is to tackle these two issues by reestablishing user empowerment, a principle called for by all recent legislations protecting the management of personal data (European Union, 2016). Roughly speaking, user empowerment means that the individual must keep control of her data and of its disclosure in any situation. More precisely, this paper makes the following contributions:

• proposing a query paradigm incorporating personalized privacy guarantees, so that each user can trade her participation in the query for a privacy protection matching her personal perception of the risk,

• providing a secure decentralized computing framework guaranteeing that the individual keeps her data in her hands and that the query issuer never gets cleartext raw microdata and sees only a sanitized aggregated query result matching all personalized privacy guarantees,

• conducting a performance evaluation on a real dataset demonstrating the effectiveness and scalability of the approach.

[5] Yahoo 'state' hackers stole data from 500 million users - BBC News. www.bbc.co.uk/news/world-us-canada-37447016

The rest of the paper is organized as follows. Section 2 presents related works and background material, allowing us to precisely state the problem addressed. Section 3 details the core of our contribution. Section 4 shows that the overhead incurred by our algorithm, compared to a traditional query processing technique, remains largely tractable. Finally, Section 5 concludes.

2 STATE OF THE ART AND PROBLEM STATEMENT

2.1 Related Works on Privacy-Preserving Data Publishing

Anonymization has been a hot topic in data publication since the 1960's for all statistical institutions wanting to publish aggregate data. The objective of most of these data publishing techniques is to provide security against an attacker who mounts deanonymization attacks, linking some sensitive information (such as a salary or a medical diagnosis) to a specific individual. Particular attention was drawn to the problem by Sweeney with the introduction of the k-anonymity model (Sweeney, 2002) that we consider in this paper. k-anonymity is a partition-based approach to anonymization, meaning that the original dataset, composed of individuals' microdata, is partitioned, through generalization or suppression of values, into groups of records sharing similar values.

The partition-based approach splits the attributes of the dataset in two categories: a quasi-identifier and some sensitive data. A quasi-identifier (denoted QID) is a set of attributes for which some records may exhibit a combination of unique values in the dataset, and consequently be identifying for the corresponding individuals (e.g., ZipCode, BirthDate, Gender). The sensitive part (denoted SD) encompasses the attribute(s) whose association with individuals must be made ambiguous (e.g., Disease).

Partition-based approaches essentially apply a controlled degradation to the association between individuals (represented in the dataset by their quasi-identifier(s)) and their sensitive attribute(s). The initial dataset is deterministically partitioned into groups of records (classes), where quasi-identifier and sensitive values satisfy a chosen partition-based privacy model. The original k-anonymity model (Sweeney, 2002) requires each class to contain at least k indistinguishable records, so that each sensitive value is associated with at least k records. Many other


Figure 1: Trusted Cells reference architecture.

models have been introduced since 2006, such as ℓ-diversity (Machanavajjhala et al., 2006) or t-closeness (Li et al., 2010). Each model further constrains the distribution of sensitive data within each class, tackling different adversarial assumptions. For example, the ℓ-diversity principle requires that the set of sensitive data associated to each equivalence class be linked to ℓ different sensitive values. t-closeness requires each class to have a similar distribution of sensitive values. To illustrate this, Table 1 shows a 3-anonymous and 2-diverse version of a dataset. This means that, for each tuple, at least two others have the same quasi-identifier (i.e., 3-anonymity) and, for each group of tuples with the same quasi-identifier, there are at least two distinct sensitive values (i.e., 2-diversity). It is important to note that the higher k and ℓ, the better the privacy protection, but the lower the precision (or quality) of the query.

Table 1: 3-anonymous and 2-diverse table.

    Quasi-identifier    Sensitive
    ZIP      Age        Condition
    -----    ----       ---------------
    112**    > 25       Cancer
    112**    > 25       Cancer
    112**    > 25       Heart Disease
    1125*    *          Heart Disease
    1125*    *          Viral Infection
    1125*    *          Cancer
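To make these two conditions concrete, here is a minimal Python sketch (our own illustration, not part of any cited system) that checks whether a released table satisfies k-anonymity and ℓ-diversity; it validates the table above.

    from collections import defaultdict

    def is_k_anonymous_and_l_diverse(rows, qid_cols, sd_col, k, l):
        """rows: list of dicts; qid_cols: quasi-identifier attributes;
        sd_col: sensitive attribute."""
        classes = defaultdict(list)
        for r in rows:
            classes[tuple(r[c] for c in qid_cols)].append(r[sd_col])
        # every class must have >= k records and >= l distinct sensitive values
        return all(len(sd) >= k and len(set(sd)) >= l
                   for sd in classes.values())

    # The released table of Table 1 is 3-anonymous and 2-diverse:
    table = [
        {"ZIP": "112**", "Age": "> 25", "Condition": "Cancer"},
        {"ZIP": "112**", "Age": "> 25", "Condition": "Cancer"},
        {"ZIP": "112**", "Age": "> 25", "Condition": "Heart Disease"},
        {"ZIP": "1125*", "Age": "*",    "Condition": "Heart Disease"},
        {"ZIP": "1125*", "Age": "*",    "Condition": "Viral Infection"},
        {"ZIP": "1125*", "Age": "*",    "Condition": "Cancer"},
    ]
    assert is_k_anonymous_and_l_diverse(table, ["ZIP", "Age"], "Condition", k=3, l=2)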

A different concept is differential privacy, introduced by Dwork in (Dwork, 2006). Differential privacy is more adapted to interactive query answering. Its advantage is to provide formal guarantees regardless of the knowledge of the adversary. However, differential privacy limits the type of computation which can be made on the data. Moreover, fixing the privacy parameter ε is a cumbersome and unintuitive task, out of reach of ordinary individuals.

Another approach is to make an agreement between the user and the querier. The concept of sticky policies presented by Trabelsi, Neven, Raggett et al. (Trabelsi et al., 2011) consists in attaching to the user's data a policy stating authorizations (i.e., what the querier can do) and obligations (i.e., what the querier must do).

2.2 Reference Computing Architecture

Concurrently with smart disclosure initiatives, the Personal Information Management System (PIMS) paradigm has been conceptualized (Abiteboul et al., 2015), and is emerging in the commercial sphere (e.g., Cozy Cloud, OwnCloud, SeaFile). PIMS holds the promise of a Privacy-by-Design storage and computing platform where each individual can gather her complete digital environment in one place and share it with applications and other users under her control. The Trusted Cells architecture presented in (Anciaux et al., 2013), and pictured in Figure 1, precisely answers the PIMS requirements by preventing data leaks during computations on personal data. Hence, we consider Trusted Cells as a reference computing architecture in this paper.

Trusted Cells is a decentralized architecture by nature, managing computations on microdata through the collaboration of two parties. The first party is a (potentially large) set of personal Trusted Data Servers (TDSs) allowing each individual to manage her data with tangible elements of trust. Indeed, TDSs incorporate tamper-resistant hardware (e.g., smartcard, secure chip, secure USB token) securing the data and code against attackers and users' misusages. Despite the diversity of existing tamper-resistant devices, a


TDS can be abstracted by (1) a Trusted Execution Environment and (2) a (potentially untrusted but cryptographically protected) mass storage area where the personal data resides. The important assumption is that the TDS code is executed by the secure device hosting it and thus cannot be tampered with, even by the TDS holder herself.

By construction, secure hardware exhibits limited storage and computing resources, and TDSs inherit these restrictions. Moreover, they are not necessarily always connected, since their owners can disconnect them at will. A second party, called hereafter Supporting Server Infrastructure (SSI), is thus required to manage the communications between TDSs, run the distributed query protocol and store the intermediate results produced by this protocol. Because SSI is implemented on regular server(s), e.g., in the Cloud, it exhibits the same low level of trustworthiness.

The resulting computing architecture is said to be asymmetric in the sense that it is composed of a very large number of low-power, weakly connected but highly secure TDSs and of a powerful, highly available but untrusted SSI.

2.3 Reference Query Processing Protocol

By avoiding delegating the storage of personal data to untrusted cloud providers, Trusted Cells is key to achieving user empowerment. Each individual keeps her data in her hands and can control its disclosure. However, the decentralized nature of the Trusted Cells architecture must not hinder global computations and queries, impeding the development of services of great interest for the community. SQL/AA (SQL Asymmetric Architecture) is a protocol to execute standard SQL queries on the Trusted Cells architecture (To et al., 2014; To et al., 2016). It has been precisely designed to tackle this issue, that is, executing global queries on a set of TDSs without recentralizing microdata and without leaking any information.

The protocol, illustrated in Figure 2, works as follows. Once an SQL query is issued by a querier (e.g., a statistical institute), it is computed in three phases: first, the collection phase, where the querier broadcasts the query to all TDSs, TDSs decide to participate or not in the computation (they send dummy tuples in the latter case to hide their denial of participation), evaluate the WHERE clause, and each TDS returns its own encrypted data to the SSI. Second, the aggregation phase, where the SSI forms partitions of encrypted tuples, sends them back to TDSs, and each TDS participating in this phase decrypts the input partition, removes dummy tuples and computes the aggregation function (e.g., AVG, COUNT). Finally, the filtering phase, where TDSs produce the final result by evaluating the HAVING clause and send the result to the querier. Note that the TDSs participating in each phase can be different. Indeed, TDSs contributing to the collection phase act as data producers, while TDSs participating in the aggregation and filtering phases act as trusted computing nodes. The tamper resistance of TDSs is key in this protocol, since a given TDS belonging to individual i1 is likely to decrypt and aggregate tuples issued by the TDSs of other individuals i2, . . . , in. Finally, note that the aggregation phase is recursive and runs until all tuples belonging to a same group have actually been aggregated. We refer the interested reader to (To et al., 2014; To et al., 2016) for a more detailed presentation of the SQL/AA protocol.
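As a rough illustration of this three-phase flow, the following sequential Python sketch mimics the logical steps for a query like SELECT city, AVG(salary) ... GROUP BY city; all names are ours, and encryption, dummy tuples, parallelism and the recursive partitioning are deliberately elided.

    # Minimal sequential sketch of SQL/AA's three phases (our own simplification).
    # Real TDSs run in parallel and exchange only encrypted partitions via the SSI.

    def collection_phase(tds_list, predicate):
        # each TDS evaluates the WHERE clause locally and returns one tuple
        return [tds for tds in tds_list if predicate(tds)]

    def aggregation_phase(tuples):
        # TDSs aggregate partitions; here a single pass computing SUM/COUNT per group
        groups = {}
        for t in tuples:
            s, c = groups.get(t["city"], (0, 0))
            groups[t["city"]] = (s + t["salary"], c + 1)
        return groups

    def filtering_phase(groups, having):
        # a TDS evaluates the HAVING clause and finalizes AVG = SUM / COUNT
        return {g: s / c for g, (s, c) in groups.items() if having(s, c)}

    tdss = [{"city": "Bourges", "salary": 1600}, {"city": "Bourges", "salary": 1400}]
    result = filtering_phase(aggregation_phase(collection_phase(tdss, lambda t: True)),
                             having=lambda s, c: c >= 1)
    print(result)  # {'Bourges': 1500.0}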

2.4 Problem Statement

In order to protect the privacy of users, queries must respect a certain degree of anonymity. Our primary objective is to push personalized privacy guarantees into the processing of regular statistical queries, so that individuals can disclose different amounts of information (i.e., data at different levels of accuracy) depending on their own perception of the risk. To the best of our knowledge, no existing work has addressed this issue. For the sake of simplicity, we consider SQL as the reference language to express statistical/aggregate queries because of its widespread usage. Similarly, we consider personalized privacy guarantees derived from the k-anonymity and ℓ-diversity models because (1) they are the most used in practice, (2) they are recommended by the European Union (European Union, 2014) and (3) they can be easily understood by individuals[6]. The next step in our research agenda is to extend our approach to other query languages and privacy guarantees, but this ambitious goal exceeds the scope and expectation of this paper.

Hence, the problem addressed in this paper is to propose a (SQL) query paradigm incorporating personalized (k-anonymity and ℓ-diversity) privacy guarantees and enforcing these individual guarantees all along the query processing without any possible leakage.

[6] The EU Article 29 Working Group mentions these characteristics as strong incentives to make these models effectively used in practice or tested by several European countries (e.g., the Dutch and French statistical institutes).


Figure 2: SQL/AA Protocol (message flow between the SSI and the TDSs across the collection, aggregation and filtering phases).

3 PERSONALIZED ANONYMITY GUARANTEES IN SQL

3.1 Modeling Anonymisation Using SQL

We make the assumption that each individual owns a local database hosted in her personal TDS and that these local databases conform to a common schema which can easily be queried in SQL. For example, power meter data (resp. GPS traces, healthcare records, etc.) can be stored in one (or several) table(s) whose schema is defined by the national distribution company (resp. an insurance company consortium, the Ministry of Health, etc.). Based on this assumption, the querier (i.e., the statistical institute) can issue regular SQL queries of the form shown in Figure 4.

For the sake of simplicity, we do not consider joins between data stored in different TDSs, but internal joins which can be executed locally by each TDS are supported. We refer to (To et al., 2014; To et al., 2016) for a deeper discussion of this aspect, which is not central to our work in this paper.

Anonymity Guarantees are defined by the querier, and correspond to the k and ℓ values that will be achieved by the end of the process, for each group produced. They correspond to the commitment of the querier towards any query participant. Different k and ℓ values can be associated to different granularities of grouping. In the example pictured in Figure 3, the querier commits to provide k ≥ 5 and ℓ ≥ 3 at a (City, Street) grouping granularity and k ≥ 10 and ℓ ≥ 3 at a (City) grouping granularity.

Anonymity Constraints are defined by the users, and correspond to the values they are willing to accept in order to participate in the query. Back to the example of Figure 3, Alice's privacy policy stipulates a minimal anonymization of k ≥ 5 and ℓ ≥ 3 when attribute Salary is queried.

According to the anonymity guarantees and constraints, the query computing protocol is as follows. The querier broadcasts to all potential participants the query to be computed along with metadata encoding the associated anonymity guarantees. The TDS of each participant compares these guarantees with the individual's anonymity constraints. This principle shares some similarities with P3P[7], with the matching between anonymity guarantees and constraints securely performed by the TDS. If the guarantees exceed the individual's constraints, the TDS participates in the query by providing real data at the finest grouping granularity. Otherwise, if the TDS finds a grouping granularity with anonymity guarantees matching her constraints, it participates, but provides a version of the data degraded to that coarser level of granularity (looking at Figure 3, answering the GROUP BY city, street clause is not acceptable for Bob, but answering just with city is). Finally, if no match can be found, the TDS produces fake data (called dummy tuples in the protocol) to hide its denial of participation. Fake data is required to prevent the querier from inferring information about the individual's privacy policy or about her membership to the WHERE clause of the query.

[7] https://www.w3.org/P3P/


Figure 3: Example of collection phase with anonymity constraints (querier Q1: SELECT city, street, AVG(salary) ... GROUP BY city, street, with guarantees k=5, l=3 at (city, street) granularity and k=10, l=3 at (city) granularity; Alice's policy requires k≥5, l≥2 on salary and she answers at fine granularity; Bob requires k≥6, l≥3 and answers at city granularity only; Charlie requires k≥5, l≥4 and sends a dummy tuple).

SELECT <Aggregate function(s)>
FROM <Table(s)>
WHERE <condition(s)>
GROUP BY <grouping attribute(s)>
HAVING <grouping condition(s)>

Figure 4: Regular SQL query form.

Figure 3 illustrates this behavior. By comparing the querier's anonymity guarantees with their respective constraints, the TDSs of Alice, Bob and Charlie respectively participate with fine granularity values (Alice), with coarse granularity values (Bob), and with dummy tuples (Charlie).
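The matching performed inside each TDS can be sketched as follows (Python, with a simplified, hypothetical encoding of the guarantees of Figure 3): the TDS scans the grouping granularities from finest to coarsest and picks the first one whose announced (k, ℓ) dominates its own constraint.

    def choose_granularity(guarantees, constraint):
        """guarantees: list of (grouping, k, l) from finest to coarsest.
        constraint: (k_min, l_min) required by the individual.
        Returns the finest acceptable grouping, or None (-> dummy tuple)."""
        k_min, l_min = constraint
        for grouping, k, l in guarantees:
            if k >= k_min and l >= l_min:
                return grouping
        return None

    Q1 = [(("city", "street"), 5, 3), (("city",), 10, 3)]
    print(choose_granularity(Q1, (5, 2)))  # Alice:   ('city', 'street') -> fine granularity
    print(choose_granularity(Q1, (6, 3)))  # Bob:     ('city',)          -> coarse granularity
    print(choose_granularity(Q1, (5, 4)))  # Charlie: None               -> dummy tuple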

The ODRL working group[8] is looking at issues similar to the expression of privacy policies. However, this paper does not discuss how users can express their privacy policy in a standard way.

3.2 The kiSQL/AA protocol

We now describe our new protocol, called kiSQL/AA to reflect that it takes into account the different ki values of the i different individuals. kiSQL/AA is an extension of the SQL/AA protocol (To et al., 2014; To et al., 2016) in which the enforcement of the anonymity guarantees has been pushed into the collection, aggregation and filtering phases.

Collection phase: After TDSs download the query, they compare the anonymity guarantees announced by the querier with their own anonymity constraints. As discussed above (see Section 3.1),

[8] https://www.w3.org/community/odrl/

TDSs send real data at the finest grouping granularity compliant with their anonymity constraints, or send a dummy tuple if no anonymity constraint can be satisfied.

Aggregation phase: To ensure that the anonymization guarantees can be verified at the filtering phase, the clauses COUNT(*) and COUNT(DISTINCT A) are computed in addition to the aggregation asked by the querier. COUNT(*) will be used to check that the k-anonymity guarantee is met, while COUNT(DISTINCT A) will be used to check the ℓ-diversity guarantee on the attribute (or group of attributes) A on which the aggregate function applies (e.g., salary in our example)[9]. If tuples with varying grouping granularities enter this phase, they are aggregated separately, i.e., one group per grouping granularity.
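A minimal sketch of such an augmented partial aggregate (Python, our own naming): each group carries its running SUM, COUNT(*) and set of distinct values of A, so that partial aggregates from different partitions can be merged losslessly and checked later.

    def partial_aggregate(tuples):
        # per-group state: (SUM(salary), COUNT(*), set of DISTINCT salary values)
        groups = {}
        for t in tuples:
            s, c, d = groups.get(t["group"], (0, 0, set()))
            groups[t["group"]] = (s + t["salary"], c + 1, d | {t["salary"]})
        return groups

    def merge_partials(g1, g2):
        # partial aggregates from two partitions merge without loss
        out = dict(g1)
        for key, (s, c, d) in g2.items():
            s0, c0, d0 = out.get(key, (0, 0, set()))
            out[key] = (s0 + s, c0 + c, d0 | d)
        return out

    # At filtering time: COUNT(*) is c and COUNT(DISTINCT salary) is len(d),
    # so the checks are c >= k and len(d) >= l.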

Filtering phase: Besides the HAVING predicates which can be formulated by the querier, the HAVING clause is used to check the anonymity guarantees. Typically, k-anonymity sums up to checking COUNT(*) ≥ k, while ℓ-diversity is checked by COUNT(DISTINCT A) ≥ ℓ. If these guarantees are not met for some tuples, they are not immediately discarded. Instead, the protocol tries to merge them with a group of coarser granularity encompassing them. Let us consider the example of Table 2(a). The tuple (Bourges, Bv. Lahitolle, 1600) is merged with the tuple (Bourges, ******, 1400) to form the tuple (Bourges, ******, 1442.86).

[9] Since COUNT(DISTINCT A) is a holistic function, it can be computed during the aggregation phase either naively, by keeping each distinct value in a list, or using a cardinality estimation algorithm such as HyperLogLog (Flajolet et al., 2007).


Table 2: Filtering phase.

(a) Example of a post-aggregation-phase result

    city        street         AVG(salary)  COUNT(*)  COUNT(DISTINCT salary)
    Le Chesnay  Dom. Voluceau  1500         6         4
    Le Chesnay  ******         1700         9         6
    Bourges     Bv. Lahitolle  1600         3         3
    Bourges     ******         1400         11        7

(b) Privacy guarantees of the query

    Attributes     k    l
    city, street   5    3
    city           10   3

(c) Data sent to the querier

    city        street         AVG(salary)
    Le Chesnay  Dom. Voluceau  1500
    Bourges     ******         1442.86

Merges stop when all guarantees are met. If, despite merges, the guarantees cannot be met, the corresponding tuples are removed from the result. Hence, the querier will receive every piece of data which satisfies the guarantees, and only those, as shown in Table 2(c).
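A small worked check of this merge, assuming each group carries its AVG and COUNT(*) as above: the coarser group's average is recomputed as a count-weighted average (distinct-value sets would be unioned similarly).

    def merge_avg(avg1, count1, avg2, count2):
        # weighted average: equivalent to (SUM1 + SUM2) / (COUNT1 + COUNT2)
        return (avg1 * count1 + avg2 * count2) / (count1 + count2)

    # (Bourges, Bv. Lahitolle, AVG=1600, COUNT=3) merged into
    # (Bourges, ******, AVG=1400, COUNT=11):
    print(round(merge_avg(1600, 3, 1400, 11), 2))  # 1442.86, as in Table 2(c)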

How to generalize. To reach the same k and ℓ values on the groups, the grouping attributes can be generalized in different orders, impacting the quality of the result for the querier. For instance, if the GROUP BY clause involves two attributes Address and Age, would it be better to generalize the tuples on Address (e.g., replacing <City, Street> by <City>) or on Age (replacing exact values by intervals)? It would be valuable for the querier to give "hints" to the TDSs on how to generalize the data, by indicating which attributes to generalize, and what privacy guarantees will be enforced after each generalization. In the following example, we consider the UCI Adult dataset (Lichman, 2013), we define a GROUP BY query GB on attributes Age, Workclass, Education, Marital_status, Occupation, Race, Gender, Native_country, and we compute the average fnlwgt operation OP = AVG(fnlwgt). MD represents the metadata attached to the query. Each metadata entry indicates which k and ℓ can be guaranteed after a given generalization operation. Depending on the attribute type, generalizing an attribute may correspond to climbing up a generalization hierarchy (for categorical attributes such as Workclass or Race) or replacing a value by an interval of greater width (for numeric values such as Age or Education). The del operation means that the attribute is simply removed. The ordering of the metadata in MD translates the querier's requirements.

GB = Age, Workclass, Education, Marital_status, Occupation, Race, Gender, Native_country;
OP = AVG(fnlwgt);
MD = : k=5 l=3,
     age->20: k=6 l=3,
     workclass->up: k=8 l=4,
     education->5: k=9 l=4,
     marital_status->up: k=9 l=4,
     occupation->up: k=10 l=4,
     race->up: k=11 l=5,
     gender->del: k=14 l=6,
     native_country->del: k=15 l=7,
     age->40: k=17 l=8;

Figure 5: kiSQL/AA query example.

4 EXPERIMENTAL EVALUATION

We have implemented the kiSQL/AA protocol on open-source hardware equivalent to that used in (To et al., 2014; To et al., 2016). The goal of this experimental section is to show that there is very little overhead when taking into account personalized anonymity constraints, compared to the performance measured on the original implementation of To et al. Our implementation is tested using the classical Adult dataset of UCI-ML (Lichman, 2013).

4.1 Implementation of the kiSQL/AA Protocol

The implementation of kiSQL/AA builds upon the secure aggregation protocol of SQL/AA (To et al., 2014; To et al., 2016), recalled in Section 2.3. The secure aggregation is based on non-deterministic encryption, such as AES-CBC, to encrypt tuples. Each TDS encrypts its data with the same secret key but with a different initialization vector, such that two tuples with the same value have two different encrypted values. This protection ensures that the SSI cannot infer personal information from the distribution of identical encrypted values.
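As an illustration of this scheme (a sketch using the Python cryptography package, not the actual TDS firmware, whose key management and padding details may differ), encrypting the same tuple twice under AES-CBC with fresh random IVs yields different ciphertexts:

    import os
    from cryptography.hazmat.primitives import padding
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    def encrypt_tuple(key: bytes, tuple_bytes: bytes) -> bytes:
        iv = os.urandom(16)                    # fresh IV per tuple -> non-deterministic
        padder = padding.PKCS7(128).padder()   # CBC needs block-aligned input
        padded = padder.update(tuple_bytes) + padder.finalize()
        enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
        return iv + enc.update(padded) + enc.finalize()

    key = os.urandom(16)
    # Same plaintext, two different ciphertexts: the SSI cannot link equal values.
    assert encrypt_tuple(key, b"salary=1500") != encrypt_tuple(key, b"salary=1500")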


To make sure that aggregations are entirely computed, the SSI uses a divide-and-conquer partitioning to make the TDSs compute partial aggregations on partitions of data. Then the SSI merges the partial aggregations to compute the last aggregation. The aggregation phase is illustrated in Figure 6.

Figure 6: Aggregation phase with four partitions (Agg(P1) and Agg(P2) merge into Agg(P1,2), Agg(P3) and Agg(P4) into Agg(P3,4), then into the final aggregation).

In kiSQL/AA, the aggregation phase has been kept unchanged from the SQL/AA system, since our contribution concerns the collection phase and the filtering phase. The algorithm implementing the collection phase of kiSQL/AA is given below (see Algorithm 1). As in most works related to data anonymization, we make the simplifying assumption that each individual is represented by a single tuple. Hence, the algorithm always returns a single tuple. This tuple is a dummy if the privacy constraints or the WHERE clause cannot be satisfied.

Algorithm 1 Collection Phase.

procedure COLLECTION_PHASE(Query Q)
    t ← getTuple(Q)
    p ← getConstraints(Q)
    g ← getGuarantees(Q)
    i ← 0
    if verifyWhere(t, Q) then
        while g_i < p do                   ▷ guarantee at granularity i below the constraint
            if not canGeneralize(t) then
                t ← makeDummy(Q)           ▷ no acceptable granularity left
                break
            else
                t ← nextGeneralization(t)
                i ← i + 1
            end if
        end while
    else
        t ← makeDummy(Q)
    end if
    return Encrypt(t)
end procedure

The filtering phase algorithm is given by Algorithm 2.

First, the algorithm sorts the tuples resulting from the aggregation phase by generalization level, producing multiple sets of tuples of the same generalization level. The function verifyHaving checks whether COUNT(*) and COUNT(DISTINCT) match the anonymization guarantees expected at this generalization level.

Algorithm 2 Filtering Phase.

procedure FILTERING_PHASE(Query Q, TupleSet T)
    sortByGeneralizationLevel(T)
    g ← getGuarantees(Q)
    for i from 0 to MaxGeneralizationLevel(Q) do
        for t ∈ T_i do
            t ← decrypt(t)
            if verifyHaving(t, Q) then
                result.addTuple(t)
            else if canGeneralize(t) then
                t ← nextGeneralization(t)
                T_{i+1}.addTuple(t)        ▷ merged into the coarser-level set
            end if
        end for
    end for
    return result
end procedure

If so, the tuple is added to the result. Otherwise, it is further generalized and merged with the set of the next (higher) generalization level. At the end, every tuple which cannot reach the adequate privacy constraints, despite achieving maximum generalization, is not included in the result.

4.2 Experimental Platform

Figure 7: TDS characteristics (MCU with USB and Bluetooth, µSD card holding the encrypted data, smartcard holding the secrets, fingerprint reader).

The performance of the kiSQL/AA protocol presented above has been measured on the tamper-resistant open-source hardware platform shown in Figure 7. The TDS hardware platform consists of a 32-bit ARM Cortex-M4 microcontroller with a maximum frequency of 168 MHz, 1 MB of internal NOR flash memory and 196 KB of RAM, itself connected to a µSD card containing all the personal data in an encrypted form and to a secure element (open smartcard) containing cryptographic secrets and algorithms. The TDS can communicate either through


USB (our case in this study) or Bluetooth. Finally, the TDS embeds a relational DBMS engine, named PlugDB[10], running in the microcontroller. PlugDB is capable of executing SQL statements over the local personal data stored in the TDS. In our context, it is mainly used to implement the WHERE clause during the collection phase of kiSQL/AA.

Our performance tests use the Adult dataset from the UCI machine learning repository (Lichman, 2013). This dataset is an extraction from the American census bureau database. We modified the dataset in the same way as (Iyengar, 2002; Bayardo and Agrawal, 2005). We kept eight attributes to perform the GROUP BY clause, namely age, workclass, education, marital status, occupation, race, native country and gender. Since our work is based on GROUP BY queries, we also kept the fnlwgt (i.e., final weight) attribute to perform an AVG on it. The final weight is a computed attribute giving similar values for people with similar demographic characteristics. We also removed each tuple with a missing value. At the end, we kept 30162 tuples. Attributes age and education are treated as numeric values and the others as categorical values. Since TDSs have limited resources, categorical values are represented by bit vectors. For instance, the categorical attribute workclass is represented by an 8-bit value and its generalization is performed by taking the upper node in the generalization tree given in Figure 8. The native country attribute is the largest and requires 49 bits to be represented.

Figure 8: Generalization tree of the workclass attribute. Leaves: Private 0x80, Self-emp-not-inc 0x40, Self-emp-inc 0x20, Federal-gov 0x10, Locale-gov 0x08, State-gov 0x04, Without-pay 0x02, Never-worked 0x01. Internal nodes: Self-emp 0x60, Gov-emp 0x1c, Without-pay 0x03, Emp 0xfc, * 0xff.
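With this encoding, subsumption tests and generalization steps reduce to bitwise operations; the sketch below (Python; the PARENT map and function names are our own reading of Figure 8) illustrates the principle.

    # Masks from Figure 8: a node's mask is the OR of its children's masks.
    PRIVATE, SELF_NOT_INC, SELF_INC = 0x80, 0x40, 0x20
    FEDERAL_GOV, LOCAL_GOV, STATE_GOV = 0x10, 0x08, 0x04
    WITHOUT_PAY, NEVER_WORKED = 0x02, 0x01
    SELF_EMP = SELF_NOT_INC | SELF_INC             # 0x60
    GOV_EMP = FEDERAL_GOV | LOCAL_GOV | STATE_GOV  # 0x1c
    NO_PAY = WITHOUT_PAY | NEVER_WORKED            # 0x03
    EMP = PRIVATE | SELF_EMP | GOV_EMP             # 0xfc
    ANY = EMP | NO_PAY                             # 0xff (the * root)

    # Our own flattening of the tree of Figure 8 into a parent map.
    PARENT = {PRIVATE: EMP, SELF_NOT_INC: SELF_EMP, SELF_INC: SELF_EMP,
              FEDERAL_GOV: GOV_EMP, LOCAL_GOV: GOV_EMP, STATE_GOV: GOV_EMP,
              WITHOUT_PAY: NO_PAY, NEVER_WORKED: NO_PAY,
              SELF_EMP: EMP, GOV_EMP: EMP, EMP: ANY, NO_PAY: ANY}

    def subsumes(node: int, value: int) -> bool:
        # value belongs to the subtree of node iff node's mask covers all its bits
        return value & node == value

    def next_generalization(value: int) -> int:
        # climb one level towards the root '*'
        return PARENT.get(value, ANY)

    assert subsumes(GOV_EMP, STATE_GOV)
    assert next_generalization(FEDERAL_GOV) == GOV_EMP
    assert next_generalization(next_generalization(PRIVATE)) == ANY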

4.3 Performance Measurements

kiSQL/AA being an extension of the SQL/AA protocol, this section focuses on the evaluation of the overhead incurred by the introduction of anonymization guarantees in the query protocol. It thus sums up to a direct comparison between kiSQL/AA and the genuine SQL/AA. To make the performance evaluation more complete, we first recall from (To et al., 2016) the comparison between SQL/AA itself and other state-of-the-art methods to securely compute aggregate SQL queries. This comparison is pictured in Figure 9.

[10] https://project.inria.fr/plugdb/en/

The Paillier curve shows the performance of computing the aggregation on a secure centralized server using homomorphic encryption, as presented in (Ge and Zdonik, 2007). The DES curve also uses a centralized server, with a DES encryption scheme (data are decrypted at computation time). Finally, the SC curves correspond to the SQL/AA computation with various numbers of groups G (i.e., defined by the GROUP BY clause). This figure shows the strength of the massively parallel computation of TDSs when G is small, and its limits when G becomes very large. We next compare the overhead introduced by our contribution to SQL/AA.

Figure 9: Performance measurements of SQL/AA and state of the art (query time TQ, in seconds, as a function of the number of tuples Nt, in millions, for SC with G = 1, 100, 1000, 10000, Plaintext, Paillier and DES).

Categorical vs. numeric values. We ran a query with one hundred generalization levels, using first categorical, then numerical values. Execution time was exactly the same, demonstrating that the cost of generalizing categorical or numerical values is identical.

Collection and Filtering Phases. Figures 10 and 11 show the overhead introduced by our approach on the collection and filtering phases, respectively. The time corresponds to processing every possible generalization of the query presented in Figure 5, which generates the maximal overhead (i.e., the worst case) for our approach. The SQL/AA bar corresponds to the execution cost of the SQL/AA protocol inside the TDS, the data transfer bar corresponds to the total time spent sending the query and retrieving the tuple (approximately 200 bytes at an experimentally measured data transfer rate of 7.9 Mbits/sec), the TDS platform bar corresponds to the internal cost of communicating between the infrastructure and the TDS (data transfer excluded), and the Privacy bar corresponds to the overhead introduced by the kiSQL/AA approach. All times are indicated in milliseconds. Values are averaged over the whole dataset (i.e., 30K tuples).

Collection Phase Analysis. The overhead of the collection phase resides in deciding how to generalize the tuple in order to comply with the local privacy requirements and the global query privacy constraints. Figure 10 shows that our protocol introduces a 0.25 ms overhead for a total average execution time of 3.5 ms,


thus under 10%, which we consider a very reasonable cost.

Figure 10: Collection Phase Execution Time Breakdown (data transfer, SQL/AA, Privacy, TDS platform; times in ms).

Filtering Phase Analysis. Figure 11 shows the breakdown of the filtering phase execution time. The filtering phase takes place once the TDSs have computed all the aggregations and generalizations. The limited resources of the TDSs are bypassed by the SQL/AA system with the help of the (distributed) aggregation phase. Since every group is represented by one tuple, the TDS which computes the filtering phase receives a reduced number of tuples (called G). To et al. have shown that the SQL/AA protocol converges if it is possible for a given TDS to compute G groups during the aggregation phase. As this is the number of tuples that will be processed during the filtering phase, we know that if G is under the threshold allowing its computation via the distributed aggregation phase, then it will be possible to compute the filtering phase with our improved protocol. Once again, measurements show that the overhead introduced by kiSQL/AA is of only 4% compared to the overall cost of this phase: 0.42 ms for a total cost of 9.8 ms.

Figure 11: Filtering Phase Execution Time Breakdown (data transfer, SQL/AA, Privacy, TDS platform; times in ms).

5 CONCLUSION

In this paper, we presented a novel approach to define and enforce personalized anonymity constraints on SQL GROUP BY queries. To the best of our knowledge, this is the first approach targeting this issue. To this end, we extended the SQL/AA protocol and implemented our solution on secure hardware tokens (TDSs). Our experiments show that our approach is clearly usable, with an overhead of a few percent on the total execution time compared with the genuine SQL/AA protocol.

Our current work involves investigating the quality of the anonymization produced, in the presence of different anonymization constraints for each individual. We firmly believe that introducing personalized anonymity constraints in database queries and providing a secure decentralized query processing framework to execute them gives substance to the user empowerment principle called for today by all legislations regulating the use of personal data.

REFERENCES

Abiteboul, S., André, B., and Kaplan, D. (2015). Managing your digital life. Commun. ACM, 58(5):32–35.

Anciaux, N., Bonnet, P., Bouganim, L., Nguyen, B., Popa, I. S., and Pucheral, P. (2013). Trusted cells: A sea change for personal data services. In CIDR 2013, Sixth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 6-9, 2013, Online Proceedings.

Bayardo, R. J. and Agrawal, R. (2005). Data privacy through optimal k-anonymization. In Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 217–228, Washington, DC, USA. IEEE Computer Society.

Dwork, C. (2006). Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer Berlin / Heidelberg.

European Union (2014). Article 29 Data Protection Working Party: Opinion 05/2014 on Anonymisation Techniques.

European Union (2016). Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119/59.

Flajolet, P., Fusy, É., Gandouet, O., and Meunier, F. (2007). HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AOFA'07).


Ge, T. and Zdonik, S. (2007). Answering aggregation queries in a secure system model. In Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, pages 519–530. VLDB Endowment.

Iyengar, V. S. (2002). Transforming data to satisfy privacy constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 279–288, New York, NY, USA. ACM.

Li, N., Li, T., and Venkatasubramanian, S. (2010). Closeness: A new privacy measure for data publishing. IEEE Trans. Knowl. Data Eng., 22(7):943–956.

Lichman, M. (2013). UCI machine learning repository.

Machanavajjhala, A., Gehrke, J., Kifer, D., and Venkitasubramaniam, M. (2006). l-diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3-8 April 2006, Atlanta, GA, USA, page 24.

Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5):557–570.

To, Q., Nguyen, B., and Pucheral, P. (2014). SQL/AA: executing SQL on an asymmetric architecture. PVLDB, 7(13):1625–1628.

To, Q.-C., Nguyen, B., and Pucheral, P. (2016). Private and scalable execution of SQL aggregates on a secure decentralized architecture. ACM Trans. Database Syst., 41(3):16:1–16:43.

Trabelsi, S., Neven, G., Raggett, D., Ardagna, C., et al. (2011). Report on design and implementation. Technical report, PrimeLife Deliverable.