ESTEEM: Quality- and Privacy-Aware Data Integration Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano Dipartimento di Informatica e Sistemistica Università di Roma “La Sapienza”


Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia

Outline

Privacy-aware integration
– Privacy risk assessment
– Private record linkage

Quality-aware integration
– Flexible and fully automatic record linkage

Summary

New!!!

New!!!


Linkage of Anonymous Data

T1
PrivateID | SSN | DOB      | ZIP   | Health_Problem
a         |     | 11/20/67 | 00198 | Shortness of breath
b         |     | 02/07/81 | 00159 | Headache
c         |     | 02/07/81 | 00156 | Obesity
d         |     | 08/07/76 | 00198 | Shortness of breath

T2
PrivateID | SSN | DOB      | ZIP   | Employment       | Marital Status
1         | A   | 11/20/67 | 00198 | Researcher       | Married
5         | E   | 08/07/76 | 00114 | Private Employee | Married
3         | C   | 02/07/81 | 00156 | Public Employee  | Widow

QUASI-IDENTIFIER: DOB, ZIP
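The linkage risk illustrated by T1 and T2 can be sketched in a few lines of Python. This is only an illustration: the dictionary keys and the `label` field are hypothetical names, and the join on (DOB, ZIP) assumes those two columns are the quasi-identifier shared by the two tables.

```python
# Rows transcribed from the slide; field names are illustrative assumptions.
T1 = [
    {"label": "a", "dob": "11/20/67", "zip": "00198", "health_problem": "Shortness of breath"},
    {"label": "b", "dob": "02/07/81", "zip": "00159", "health_problem": "Headache"},
    {"label": "c", "dob": "02/07/81", "zip": "00156", "health_problem": "Obesity"},
    {"label": "d", "dob": "08/07/76", "zip": "00198", "health_problem": "Shortness of breath"},
]
T2 = [
    {"id": "1", "ssn": "A", "dob": "11/20/67", "zip": "00198", "employment": "Researcher"},
    {"id": "5", "ssn": "E", "dob": "08/07/76", "zip": "00114", "employment": "Private Employee"},
    {"id": "3", "ssn": "C", "dob": "02/07/81", "zip": "00156", "employment": "Public Employee"},
]

QUASI_IDENTIFIER = ("dob", "zip")


def link_on_quasi_identifier(left, right, keys=QUASI_IDENTIFIER):
    """Pair every row of `left` with the rows of `right` that share the key values."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    pairs = []
    for row in left:
        for match in index.get(tuple(row[k] for k in keys), []):
            pairs.append((row, match))
    return pairs


linked = link_on_quasi_identifier(T1, T2)
for anonymous, identified in linked:
    # The health problem of an "anonymous" row is now tied to an SSN.
    print(anonymous["label"], identified["ssn"], anonymous["health_problem"])
```

Two of the four T1 rows are re-identified even though T1 carries no explicit identifier, which is the motivation for assessing privacy risk before disclosure.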


Our Proposal

A framework for assessing privacy risk that takes into account both facets of privacy (identification and sensitivity)
– based on statistical decision theory

Definition and analysis of:
– disclosure policies, modelled by disclosure rules
– several privacy risk functions

Estimated risk as an upper bound of the true risk, and related complexity analysis

Algorithm for finding the disclosure rule that minimizes the privacy risk


The Formal Framework

Original relation:
A   | B   | C
x11 | x12 | x13
x21 | x22 | x23
x31 | x32 | x33

Disclosed relation:
A   | B   | C
x11 |     |
x21 |     |
x31 |     | x33

Disclosure rule δ

Loss function l(δ, θ), with θ representing the attacker's knowledge

Risk R(δ, θ) = f(l(δ, θ))
• identification
• sensitivity
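A minimal sketch of how a disclosure rule, a loss and a risk could fit together is given below. The slides do not spell out the loss or the function f, so everything here is an assumption made for illustration: the sensitivity weights, the loss (exposed sensitivity divided by the number of records in the attacker's knowledge that are consistent with the disclosed record) and the risk (average loss over the table).

```python
from statistics import mean

ATTRIBUTES = ("A", "B", "C")

# Hypothetical sensitivity weights: how harmful it is to reveal each attribute.
SENSITIVITY = {"A": 1.0, "B": 2.0, "C": 5.0}


def disclose(record, visible):
    """Apply a disclosure rule to one record: suppress attributes not in `visible`."""
    return {a: (record[a] if a in visible else None) for a in ATTRIBUTES}


def is_consistent(disclosed, candidate):
    """True if `candidate` agrees with every attribute left visible."""
    return all(v is None or candidate[a] == v for a, v in disclosed.items())


def loss(disclosed, theta):
    """Illustrative loss: exposed sensitivity divided by the number of records
    in the attacker's knowledge `theta` consistent with the disclosed record."""
    candidates = sum(is_consistent(disclosed, c) for c in theta)
    if candidates == 0:
        return 0.0  # the attacker cannot tie the record to anyone
    exposed = sum(SENSITIVITY[a] for a, v in disclosed.items() if v is not None)
    return exposed / candidates


def risk(table, rule, theta):
    """Illustrative risk: average loss over the table (one possible choice of f)."""
    return mean(loss(disclose(r, visible), theta) for r, visible in zip(table, rule))


table = [
    {"A": "x11", "B": "x12", "C": "x13"},
    {"A": "x21", "B": "x22", "C": "x23"},
    {"A": "x31", "B": "x32", "C": "x33"},
]
theta = table  # assumption: the attacker knows the whole relation
partial_rule = [{"A"}, {"A"}, {"A", "C"}]  # the suppression pattern shown on the slide
full_rule = [set(ATTRIBUTES)] * len(table)

print(risk(table, partial_rule, theta))
print(risk(table, full_rule, theta))
```

Under these assumptions, suppressing cells lowers the risk relative to full disclosure, and searching over candidate rules for the minimum is the optimization problem the proposal refers to.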


K-anonymity

K-anonymity is simply a special case of our framework in which:
1. θ_true = relation T, a stricter assumption on the attacker's knowledge. We proved that, under some assumptions, the true risk can be bounded by our more general risk.
2. The loss assigned to disclosed attributes is a constant, which is questionable: it does not depend on the type of the disclosed attributes (an HIV result incurs the same loss as the last doctor visit).
3. The set of disclosure rules is underspecified; it can be specified in several ways.

Our framework highlights some questionable hypotheses of k-anonymity!
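For reference, the standard k-anonymity condition (every combination of quasi-identifier values occurs in at least k rows) can be checked as in the sketch below. The sample rows and the `generalize` step are hypothetical and serve only to show the check; they are not taken from the talk.

```python
from collections import Counter


def k_anonymity_level(table, quasi_identifier):
    """Largest k for which the (non-empty) table is k-anonymous."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return min(groups.values())


def generalize(row):
    """Hypothetical generalization: suppress DOB and keep four ZIP digits."""
    return {**row, "dob": "*", "zip": row["zip"][:4] + "*"}


raw = [
    {"dob": "11/20/67", "zip": "00198", "health_problem": "Shortness of breath"},
    {"dob": "02/07/81", "zip": "00159", "health_problem": "Headache"},
    {"dob": "02/07/81", "zip": "00156", "health_problem": "Obesity"},
    {"dob": "08/07/76", "zip": "00198", "health_problem": "Shortness of breath"},
]
released = [generalize(row) for row in raw]

print(k_anonymity_level(raw, ("dob", "zip")))
print(k_anonymity_level(released, ("dob", "zip")))
```

Note that the check only counts rows per group: it treats every disclosed attribute alike, which is exactly the constant-loss assumption questioned in point 2.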


Private Record Linkage

Let P and Q be two peers owning the relations RP(A1,…,An) and RQ(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between RP and RQ such that, at the end of the process:
– P knows only a set PMatch, consisting of the records in RP that match records in RQ
– similarly, Q knows only the set QMatch

In particular, no information is revealed to P and Q about records that do not match each other.

Published at SIGMOD 07


Key Ideas and Solutions (1)

Cannot just encrypt data and then compute distances among them
– by definition, encryption functions do not preserve distances

Let's work on numbers instead of records!

Mapping of records into a vector space, with record matching performed in that space
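The point about distances can be seen with a small experiment. The sketch below uses SHA-256 as a stand-in for a one-way protection of the values (an assumption made for illustration; it is a hash, not an encryption scheme, and not the method of the talk) and compares edit distances before and after the transformation.

```python
import hashlib


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cs != ct),  # substitution or match
            ))
        previous = current
    return previous[-1]


def digest(value):
    """Hex digest used here as a stand-in for a protected value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


clean, typo = "Scannapieco", "Scanapieco"
print(edit_distance(clean, typo))                  # one missing letter
print(edit_distance(digest(clean), digest(typo)))  # the digests look unrelated
```

Two names that differ by a single typo end up with digests that are far apart, so approximate matching on the protected values is no longer possible.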


Key Ideas and Solutions (2)

Third-party-based protocol in which:
– The two parties build together the embedding space, using a method (SparseMap) with "secure" features
– Each of the two parties embeds its own dataset and sends it to the third party
– The third party W performs the intersection and sends the result back to the parties

Mapping of records into a vector space, with record matching performed in that space
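The three protocol steps can be sketched as follows. This is a simplification under stated assumptions: a plain Lipschitz-style embedding (distances to shared reference strings) stands in for SparseMap, which is more elaborate; the reference strings are random and agreed through a shared seed; and all function names and the threshold are hypothetical.

```python
import math
import random


def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (cs != ct)))
        previous = current
    return previous[-1]


def make_reference_strings(count, length, seed):
    """Step 1: both parties derive the same reference strings from a shared seed."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(count)]


def embed(record, references):
    """Step 2: a record becomes the vector of its distances to the references."""
    return [edit_distance(record, r) for r in references]


def third_party_match(vectors_p, vectors_q, threshold):
    """Step 3: W sees only vectors and pairs those that are close enough."""
    return [(i, j)
            for i, vp in enumerate(vectors_p)
            for j, vq in enumerate(vectors_q)
            if math.dist(vp, vq) <= threshold]


references = make_reference_strings(count=8, length=10, seed=7)
records_p = ["monica scannapieco", "carola aiello"]
records_q = ["diego milano", "monica scanapieco"]  # second entry contains a typo

vectors_p = [embed(r, references) for r in records_p]
vectors_q = [embed(r, references) for r in records_q]
matches = third_party_match(vectors_p, vectors_q, threshold=3.0)
print(matches)
```

Because edit distance satisfies the triangle inequality, each coordinate of two embedded records differs by at most their edit distance, so records that are close in the original space stay close in the embedded one. Distant records may occasionally also fall within the threshold, which is why effectiveness has to be evaluated against matching in the original space.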


Key Ideas and Solutions (3)

Th1: Given the two relations RP(D1,…,Ds) and RQ(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurances:
– RecMatch is not disclosed to W
– RP - RecMatch is not disclosed to Q
– RQ - RecMatch is not disclosed to P
– DBSize is disclosed to W and bounded by P and Q


Schema Matching Features

Th2: Given the schemas RP and RQ, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
– AttrMatch is not disclosed to W
– AttrMatch is not disclosed to P and Q
– AttrMatchSize is not disclosed to P and Q


How good are we?

Time: better than record linkage without privacy preservation

Effectiveness: comparable with respect to recall and precision

[Figure: precision and recall vs. threshold. Series: Recall Original, Precision Original, Recall Embedded, Precision Embedded]

[Figure: total execution time (secs) vs. dataset size. Series: Our Method, Record Matching in the Original Space]


Flexible and Automatic RL

P2P systems are loosely coupled, dynamic, open

Manual phases of record linkage can be problematic:
– Time-consuming vs. dynamic, open systems
– Synchronous interactions vs. loosely coupled systems

Need for flexible and automatic RL


Background: Record Linkage Techniques

Search Space Reduction:
– Sorted Neighborhood Method
– Blocking
– Hierarchical grouping
– …

Decision Rules:
– Probabilistic: Fellegi & Sunter
– Empirical
– Knowledge-based

Comparison Functions:
– Edit distance
– Smith-Waterman
– Q-grams
– Jaro string comparator
– Soundex code
– TF-IDF
– …
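Two of the techniques listed above can be sketched briefly: a sliding-window version of the Sorted Neighborhood Method for search space reduction, and a q-gram comparison function. Both are simplified variants chosen for illustration (set-based q-grams with Jaccard similarity, a single sorting key); the sample records and parameter values are assumptions.

```python
def qgrams(value, q=2):
    """Set of q-grams of a string, padded so that the ends also contribute."""
    padded = "#" * (q - 1) + value + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def qgram_similarity(s, t, q=2):
    """Jaccard similarity between the q-gram sets of two strings."""
    a, b = qgrams(s, q), qgrams(t, q)
    return len(a & b) / len(a | b) if a | b else 1.0


def sorted_neighborhood_pairs(records, key, window=3):
    """Sort by `key`, then compare each record only with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for position, i in enumerate(order):
        for j in order[position + 1:position + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


records = ["smith john", "smyth john", "rossi mario", "bianchi anna", "rossi maria"]
candidate_pairs = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)

for i, j in sorted(candidate_pairs):
    print(records[i], "|", records[j], "|", qgram_similarity(records[i], records[j]))
```

With a window of 2 only neighbouring records in sort order are compared, so 4 candidate pairs are scored instead of all 10; a decision rule (for instance an empirical threshold on the similarity) would then classify each pair.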


Key Idea

Record linkage is a complex process and should be decomposed as much as possible into its constituent phases

For each phase, the most appropriate technique should be chosen depending on application and data requirements

The goal is to dynamically build ad hoc record linkage workflows

RELAIS: a toolkit serving this purpose
– developed at Istat
– UNIROMA contribution on data profiling (covered a couple of slides later)


RELAIS Toolkit

RELAIS takes as input:

Application Constraints:
• Admissible error rates
• Privacy issues
• Cost
• …

Database Features:
• Size
• Quality
• Domain features
• …

and produces a Record Linkage Workflow


RL Workflows

Phases and available techniques:
– Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
– Search Space Reduction: Blocking, SNM
– Comparison Function: Edit Distance, Jaro, Equality
– Decision Model: Probabilistic, Empirical

[Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled from a subset of these techniques (Normalization, UpperLowerCase, Blocking, SNM, Jaro, Equality, Empirical, Probabilistic)]
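The idea of assembling a workflow from interchangeable techniques, one per phase, can be sketched as below. This is not the RELAIS API: the `Workflow` class, the function names and the particular combination of techniques are hypothetical, chosen only to show how the four phases plug together.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Workflow:
    """One technique per phase; any phase can be swapped independently."""
    preprocessing: List[Callable[[str], str]]
    search_space_reduction: Callable[[List[str], List[str]], Iterable[Tuple[int, int]]]
    comparison: Callable[[str, str], float]
    decision: Callable[[float], bool]

    def run(self, left, right):
        def prepare(value):
            for step in self.preprocessing:
                value = step(value)
            return value

        left = [prepare(v) for v in left]
        right = [prepare(v) for v in right]
        return [(i, j)
                for i, j in self.search_space_reduction(left, right)
                if self.decision(self.comparison(left[i], right[j]))]


def normalize(value):
    """Normalization: trim and collapse whitespace."""
    return " ".join(value.split())


def block_on_first_letter(left, right):
    """Blocking: only records sharing the first character are compared."""
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if a[:1] == b[:1]:
                yield i, j


def equality(a, b):
    """Equality comparison function: 1.0 for identical strings, else 0.0."""
    return 1.0 if a == b else 0.0


workflow = Workflow(
    preprocessing=[normalize, str.lower],
    search_space_reduction=block_on_first_letter,
    comparison=equality,
    decision=lambda score: score >= 0.5,  # empirical threshold
)

left = ["  Monica   SCANNAPIECO", "Diego Milano"]
right = ["diego milano", "monica scannapieco ", "Carola Aiello"]
links = workflow.run(left, right)
print(links)
```

Replacing `equality` with a Jaro or edit-distance function, or `block_on_first_letter` with a sorted-neighborhood generator, yields a different workflow without touching the other phases.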


Automating Some Phases

Data profiling for choosing matching keys

Automatic extraction of:
– Completeness
– Consistency
– Identification power

Ongoing
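A possible way to profile attributes when choosing matching keys is sketched below. The slides name the metrics but do not define them, so the definitions here are assumptions: completeness as the fraction of non-missing values, identification power as the fraction of distinct values among the non-missing ones, and their product as a ranking score. Consistency is left out because it depends on domain rules not given in the talk.

```python
MISSING = (None, "")


def completeness(records, attribute):
    """Fraction of records with a non-missing value for the attribute."""
    return sum(r.get(attribute) not in MISSING for r in records) / len(records)


def identification_power(records, attribute):
    """Fraction of distinct values among the non-missing ones (assumed definition)."""
    values = [r[attribute] for r in records if r.get(attribute) not in MISSING]
    return len(set(values)) / len(values) if values else 0.0


def rank_matching_keys(records, attributes):
    """Order attributes by completeness times identification power, best first."""
    def score(attribute):
        return completeness(records, attribute) * identification_power(records, attribute)
    return sorted(attributes, key=score, reverse=True)


# Hypothetical sample data.
records = [
    {"surname": "Rossi", "gender": "F", "zip": "00198"},
    {"surname": "Bianchi", "gender": "M", "zip": "00159"},
    {"surname": "Verdi", "gender": "F", "zip": None},
    {"surname": "Neri", "gender": "M", "zip": "00156"},
]
ranked = rank_matching_keys(records, ["gender", "zip", "surname"])
print(ranked)
```

On this sample the surname ranks first (complete and fully distinguishing), the ZIP code second (distinguishing but incomplete) and gender last.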


Status of RELAIS

Currently: guided execution of RL workflows, with all phases automatic

Future:
– Definition of RELAIS's architecture as a service-oriented, web-accessible architecture. Formal specification of (i) input/output of services, and (ii) pre/post conditions, by means of Semantic Web Services technologies
– Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al., VLDB 2005] or semi-automatic [Bouguettaya et al., VLDBJ 2003] service composition techniques


Implementation View

PQ-RELAIS

Q-RELAIS:
• Data source profiling (quality metadata)
• Quality-based trust evaluation
• Automatic and flexible RL

P-RELAIS:
• Privacy risk assessment
• Private RL