ESTEEM: Quality- and Privacy-Aware Data Integration
Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano
Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”
Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia
Outline
Privacy-aware integration
– Privacy risk assessment
– Private record linkage
Quality-aware integration
– Flexible and fully automatic record linkage
Summary
New!!!
New!!!
Linkage of Anonymous Data

T1

| PrivateID | SSN | DOB | ZIP | Health_Problem |
|---|---|---|---|---|
| a | | 11/20/67 | 00198 | Shortness of breath |
| b | | 02/07/81 | 00159 | Headache |
| c | | 02/07/81 | 00156 | Obesity |
| d | | 08/07/76 | 00198 | Shortness of breath |

T2

| PrivateID | SSN | DOB | ZIP | Employment | Marital Status |
|---|---|---|---|---|---|
| 1 | A | 11/20/67 | 00198 | Researcher | Married |
| 5 | E | 08/07/76 | 00114 | Private Employee | Married |
| 3 | C | 02/07/81 | 00156 | Public Employee | Widow |

QUASI-IDENTIFIER
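A minimal sketch of how such a linkage could work, assuming that the quasi-identifier on the slide is the pair (DOB, ZIP) shared by T1 and T2. The function name `link_on_quasi_identifier` and the dictionary layout are illustrative choices, not part of the original material.

```python
# Toy data copied from tables T1 and T2 on the slide.
# Only DOB and ZIP (the assumed quasi-identifier) are used for the join.
t1 = [
    {"dob": "11/20/67", "zip": "00198", "health_problem": "Shortness of breath"},
    {"dob": "02/07/81", "zip": "00159", "health_problem": "Headache"},
    {"dob": "02/07/81", "zip": "00156", "health_problem": "Obesity"},
    {"dob": "08/07/76", "zip": "00198", "health_problem": "Shortness of breath"},
]
t2 = [
    {"ssn": "A", "dob": "11/20/67", "zip": "00198", "employment": "Researcher"},
    {"ssn": "E", "dob": "08/07/76", "zip": "00114", "employment": "Private Employee"},
    {"ssn": "C", "dob": "02/07/81", "zip": "00156", "employment": "Public Employee"},
]


def link_on_quasi_identifier(anonymous, identified, keys=("dob", "zip")):
    """Join the two tables on the quasi-identifier attributes in `keys`."""
    index = {}
    for row in identified:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    linked = []
    for row in anonymous:
        for match in index.get(tuple(row[k] for k in keys), []):
            linked.append((match["ssn"], row["health_problem"]))
    return linked


linked = link_on_quasi_identifier(t1, t2)
for ssn, problem in linked:
    print(ssn, problem)
```

With this toy data, two of the "anonymous" health records become attached to an SSN, even though T1 itself carries no usable identifier.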
Our Proposal
A framework for assessing privacy risk that takes into account both facets of privacy
– based on statistical decision theory
Definition and analysis of:
– disclosure policies modelled by disclosure rules
– several privacy risk functions
Estimated risk as an upper bound of the true risk, with related complexity analysis
Algorithm for finding the disclosure rule that minimizes the privacy risk
The Formal Framework
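Relation before disclosure:

| A | B | C |
|---|---|---|
| x11 | x12 | x13 |
| x21 | x22 | x23 |
| x31 | x32 | x33 |

Relation after applying the disclosure rule δ (undisclosed cells left blank):

| A | B | C |
|---|---|---|
| x11 | | |
| x21 | | |
| x31 | | x33 |

Disclosure Rule δ
Loss function l(δ, θ), with θ representing the attacker’s knowledge
Risk R(δ, θ) = f(l(δ, θ))
• identification
• sensitivity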
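To make the ingredients concrete, here is a toy sketch of a disclosure rule, a loss and a risk. The slide does not give the actual loss or risk functions, so everything here is an assumption made for illustration. The rule is a set of disclosed (row, attribute) cells. The loss is an identification probability multiplied by an assumed sensitivity weight. The risk is the mean loss over records. The attacker's knowledge `theta` is taken to be the full relation.

```python
def disclose(table, rule):
    """Apply a disclosure rule: keep cell (i, attr) only if it is in `rule`."""
    return [
        {attr: value for attr, value in row.items() if (i, attr) in rule}
        for i, row in enumerate(table)
    ]


def loss(disclosed_row, theta, sensitivity):
    """Assumed loss for one record: identification probability times the
    total sensitivity weight of the disclosed attributes."""
    if not disclosed_row:
        return 0.0
    # Records known to the attacker that agree with every disclosed value.
    consistent = [
        known for known in theta
        if all(known.get(a) == v for a, v in disclosed_row.items())
    ]
    if not consistent:
        return 0.0
    identification = 1.0 / len(consistent)
    return identification * sum(sensitivity[a] for a in disclosed_row)


def risk(table, rule, theta, sensitivity):
    """Assumed risk: mean loss over the records of the disclosed table."""
    disclosed = disclose(table, rule)
    return sum(loss(r, theta, sensitivity) for r in disclosed) / len(table)


table = [
    {"A": "x11", "B": "x12", "C": "x13"},
    {"A": "x21", "B": "x22", "C": "x23"},
    {"A": "x31", "B": "x32", "C": "x33"},
]
theta = table                                  # assumption: attacker knows the relation
sensitivity = {"A": 0.2, "B": 0.5, "C": 1.0}   # hypothetical weights

# Cells that survive in the slide's disclosed table: x11, x21, x31, x33.
slide_rule = {(0, "A"), (1, "A"), (2, "A"), (2, "C")}
full_rule = {(i, a) for i in range(3) for a in "ABC"}

print(risk(table, slide_rule, theta, sensitivity))
print(risk(table, full_rule, theta, sensitivity))
```

Under these assumptions, disclosing fewer or less sensitive cells lowers the risk, and a rule that discloses nothing has zero risk.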
K-anonymity
K-anonymity is SIMPLY a special case of our framework in which:
1. θtrue = relation T, a stricter assumption on the attacker’s knowledge. We proved that, under some assumptions, the true risk can be bounded by our “more general” risk
2. the loss is a constant, which is questionable: it is independent of the type of disclosed attributes (an HIV result incurs the same loss as the last doctor visit)
3. the set of disclosure rules is underspecified; we can specify it in several ways
Our framework underlines some questionable hypotheses of k-anonymity!!!
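For reference, the standard k-anonymity property can be checked with a few lines of code: a table is k-anonymous with respect to a quasi-identifier if every combination of quasi-identifier values occurs in at least k rows. The sketch below uses the DOB and ZIP values of T1 from the Linkage of Anonymous Data slide. The helper names `k_of` and `generalize_zip` are hypothetical.

```python
from collections import Counter


def k_of(table, quasi_identifier):
    """Largest k for which `table` is k-anonymous: the size of the smallest
    group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return min(groups.values())


def generalize_zip(table, digits):
    """Generalize ZIP by keeping its first `digits` characters and masking the rest."""
    return [
        {**row, "zip": row["zip"][:digits] + "*" * (len(row["zip"]) - digits)}
        for row in table
    ]


t1 = [
    {"dob": "11/20/67", "zip": "00198"},
    {"dob": "02/07/81", "zip": "00159"},
    {"dob": "02/07/81", "zip": "00156"},
    {"dob": "08/07/76", "zip": "00198"},
]

print(k_of(t1, ("dob", "zip")))                 # every (DOB, ZIP) pair is unique
print(k_of(generalize_zip(t1, 4), ("zip",)))    # ZIP alone, masked to 4 digits
```

Note that the check counts rows only. It does not look at which attributes are disclosed, which is the constant-loss point made in item 2.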
Private Record Linkage
Let P and Q be two peers owning the relations RP(A1,…,An) and RQ(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between RP and RQ such that, at the end of the process:
– P will know only a set PMatch, consisting of the records in RP that match records in RQ. Similarly, Q will know only the set QMatch.
Of particular importance is that no information is revealed to P and Q concerning records that do not match each other
Published at SIGMOD 07
Key Ideas and Solutions (1)
Cannot just encrypt data and then compute distances among them
– by definition, encryption functions do not preserve distances
Let’s work on numbers, instead of records!!!
Mapping of records in a vector space, and record matching performed in such a space
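A small illustration of the first point, using a SHA-256 hash as a stand-in for an encryption function (an assumption: the slide does not name any specific function). Two ZIP codes that differ in one character end up with digests that bear no relation to that closeness.

```python
import hashlib


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ca != cb for ca, cb in zip(a, b))


def digest(value):
    """Hex SHA-256 digest, used here in place of an encryption function."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


zip_p, zip_q = "00156", "00159"          # one character apart
plain_distance = hamming(zip_p, zip_q)
hashed_distance = hamming(digest(zip_p), digest(zip_q))

print(plain_distance, hashed_distance)
```

The plain values are at distance 1, while the digests differ in most of their 64 positions, so a similarity threshold applied after such a transformation is meaningless.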
Key Ideas and Solutions (2)
Third-party based protocol in which:
– The two parties build together the embedding space by using a method (SparseMap) with “secure” features
– Each of the two parties embeds its own dataset and sends it to the third party
– The third party W performs the intersection and sends the result back to the parties
Mapping of records in a vector space, and record matching performed in such a space
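The three steps can be sketched as follows. This is not SparseMap itself: a plain Lipschitz-style embedding (each coordinate is the distance to the nearest string of a shared reference set) stands in for it, and the reference sets, records and function names are all hypothetical. The secure features of the actual protocol are not modelled.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance (dynamic programming, one row at a time)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def embed(record, reference_sets):
    """Coordinate i is the distance from `record` to the closest reference string in set i."""
    return tuple(min(edit_distance(record, ref) for ref in refs)
                 for refs in reference_sets)


def third_party_match(vectors_p, vectors_q, threshold):
    """Run by W, which sees only vectors: return index pairs within `threshold`."""
    return [(i, j)
            for i, vp in enumerate(vectors_p)
            for j, vq in enumerate(vectors_q)
            if math.dist(vp, vq) <= threshold]


# Step 1: P and Q agree on the embedding space (hypothetical reference sets).
reference_sets = [["anna"], ["marco", "luca"], ["giulia verdi"]]
# Step 2: each party embeds its own records and sends only the vectors to W.
records_p = ["maria rossi", "luca bianchi"]
records_q = ["luca bianchi", "mario rossi", "paolo verdi"]
vectors_p = [embed(r, reference_sets) for r in records_p]
vectors_q = [embed(r, reference_sets) for r in records_q]
# Step 3: W matches in the embedded space and returns index pairs.
matches = third_party_match(vectors_p, vectors_q, threshold=0.0)
print(matches)
```

Because edit distance is a metric, each coordinate of two embedded records differs by at most their edit distance, so records that are close in the original space stay close in the embedded one. The converse does not hold in general, which is why matching in the embedded space can produce false positives.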
Key Ideas and Solutions (3)
Th1: Given the two relations RP(D1,…,Ds) and RQ(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the following result is proven: the record matching protocol finds the matched records between the two relations with the following assurances:
– RecMatch is not disclosed to W
– RP − RecMatch is not disclosed to Q
– RQ − RecMatch is not disclosed to P
– DBSize is disclosed to W and bounded by P and Q
Schema Matching Features
Th2: Given the schemas RP and RQ, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
– AttrMatch is not disclosed to W
– AttrMatch is not disclosed to P and Q
– AttrMatchSize is not disclosed to P and Q
How good are we?
Time: better than record linkage without privacy preservation
Effectiveness: comparable with respect to recall and precision
[Figure: precision and recall vs. threshold (0.0 to 1.1); series: Recall Original, Precision Original, Recall Embedded, Precision Embedded]
[Figure: total execution time (secs) vs. dataset size (0 to 25000); series: Our Method, Record Matching in the Original Space]
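For readers unfamiliar with the two effectiveness measures, the sketch below shows how precision and recall can be computed at a given distance threshold. The scored pairs and the ground truth are made-up values, not results from the talk.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Precision and recall of a set of predicted matches against the true matches."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


def matches_at(scored_pairs, threshold):
    """Pairs whose distance does not exceed the threshold are declared matches."""
    return {pair for pair, distance in scored_pairs.items() if distance <= threshold}


# Hypothetical distances between candidate record pairs.
scored_pairs = {
    ("p1", "q1"): 0.0,
    ("p2", "q2"): 0.2,
    ("p3", "q4"): 0.3,
    ("p3", "q3"): 0.6,
}
true_pairs = {("p1", "q1"), ("p2", "q2"), ("p3", "q3")}

for threshold in (0.25, 0.7):
    print(threshold, precision_recall(matches_at(scored_pairs, threshold), true_pairs))
```

Raising the threshold admits more pairs, which tends to increase recall and lower precision. That trade-off is what a threshold sweep like the one in the first figure makes visible.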
Flexible and Automatic RL
P2P systems are loosely coupled, dynamic, open
Manual phases of record linkage can be problematic:
– Time consuming vs. dynamic feature/open
– Synchronous interactions vs. loosely coupled systems
Need for flexible and automatic RL
Background: Record Linkage Techniques
Search Space Reduction:
– Sorted Neighborhood Method
– Blocking
– Hierarchical grouping
– …
Decision Rules:
– Probabilistic: Fellegi & Sunter
– Empirical
– Knowledge-based
Comparison Functions:
– Edit distance
– Smith-Waterman
– Q-grams
– Jaro string comparator
– Soundex code
– TF-IDF
– …
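As one example of search space reduction, here is a minimal sketch of the sorted neighborhood method: records are sorted on a key and only records that fall within a sliding window of the sorted order are compared. The records and the choice of the whole string as sorting key are assumptions made for illustration.

```python
def sorted_neighborhood_pairs(records, key, window):
    """Candidate pairs (as index pairs) from a sliding window over the sorted records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record with the next (window - 1) records in sorted order.
        for j in order[pos + 1: pos + window]:
            pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    "rossi maria",
    "bianchi luca",
    "rosi maria",
    "verdi paolo",
    "bianchi lucia",
]
pairs = sorted_neighborhood_pairs(records, key=lambda r: r, window=2)
print(sorted(pairs))
```

With five records there are ten possible pairs. A window of size 2 keeps only four of them, including the likely duplicates "rossi maria" and "rosi maria".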
Key Idea
Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
For each phase, the most appropriate technique should be chosen depending on application and data requirements
Goal: dynamically build ad-hoc record linkage workflows
RELAIS: toolkit serving such a purpose
– developed at Istat
– UNIROMA contribution on data profiling (wait a couple of slides)
RELAIS Toolkit
Inputs to RELAIS:
Application Constraints:
• Admissible error rates
• Privacy issues
• Cost
• …
Database Features:
• Size
• Quality
• Domain features
• …
Output of RELAIS: Record Linkage Workflow
RL Workflows
Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
Search Space Reduction: Blocking, SNM
Comparison Function: Edit Distance, Jaro, Equality
Decision Model: Probabilistic, Empirical
[Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled from a subset of the techniques above; boxes shown: SNM, Probabilistic, Normalization, UpperLowerCase, Blocking, Jaro, Empirical, Equality]
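The idea of assembling a workflow from interchangeable phases can be sketched as below. This is not the RELAIS API: every function name is hypothetical, and a fixed-threshold rule stands in for the "Empirical" decision model, whose actual definition is not given on the slide.

```python
def normalize(record):
    """Preprocessing: collapse repeated whitespace in every field."""
    return {k: " ".join(v.split()) for k, v in record.items()}


def upper_case(record):
    """Preprocessing: convert every field to upper case."""
    return {k: v.upper() for k, v in record.items()}


def blocking(records_a, records_b, key):
    """Search space reduction: pair only records that share the blocking key."""
    blocks = {}
    for b in records_b:
        blocks.setdefault(b[key], []).append(b)
    return [(a, b) for a in records_a for b in blocks.get(a[key], [])]


def equality(a, b, fields):
    """Comparison function: fraction of fields with identical values."""
    return sum(a[f] == b[f] for f in fields) / len(fields)


def threshold_decision(score, threshold):
    """Decision model (assumed): declare a match when the score reaches the threshold."""
    return score >= threshold


def run_workflow(records_a, records_b, preprocessing, reduce_space, compare, decide):
    """Chain the phases: preprocessing, search space reduction, comparison, decision."""
    for step in preprocessing:
        records_a = [step(r) for r in records_a]
        records_b = [step(r) for r in records_b]
    return [(a, b) for a, b in reduce_space(records_a, records_b)
            if decide(compare(a, b))]


matches = run_workflow(
    [{"name": "maria  rossi", "zip": "00198"}],
    [{"name": "MARIA ROSSI", "zip": "00198"}, {"name": "Luca Bianchi", "zip": "00198"}],
    preprocessing=[normalize, upper_case],
    reduce_space=lambda a, b: blocking(a, b, key="zip"),
    compare=lambda a, b: equality(a, b, fields=("name", "zip")),
    decide=lambda score: threshold_decision(score, threshold=1.0),
)
print(matches)
```

Swapping a phase, for example replacing `blocking` with a sorted neighborhood step, changes one argument and leaves the rest of the workflow untouched.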
Making Automatic Some Phases
Data profiling for choosing matching keys
Automatic extraction of:
– Completeness
– Consistency
– Identification power
Ongoing
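A rough sketch of what such profiling might compute for candidate matching keys. The slide does not define the metrics, so the definitions below are assumptions: completeness is the fraction of non-missing values, and identification power is approximated by the ratio of distinct values to records. Consistency is omitted because it needs domain rules. The sample records are hypothetical.

```python
def completeness(records, attribute):
    """Fraction of records with a non-missing value for `attribute`."""
    present = [r for r in records if r.get(attribute) not in (None, "")]
    return len(present) / len(records)


def identification_power(records, attribute):
    """Assumed proxy: number of distinct non-missing values over number of records."""
    values = {r[attribute] for r in records if r.get(attribute) not in (None, "")}
    return len(values) / len(records)


def rank_matching_keys(records, attributes):
    """Order candidate keys by completeness times identification power (an assumed score)."""
    score = lambda a: completeness(records, a) * identification_power(records, a)
    return sorted(attributes, key=score, reverse=True)


records = [
    {"dob": "11/20/67", "zip": "00198"},
    {"dob": "02/07/81", "zip": "00159"},
    {"dob": "02/07/81", "zip": ""},        # missing ZIP
    {"dob": "08/07/76", "zip": "00198"},
]
print(rank_matching_keys(records, ["zip", "dob"]))
```

On this sample, DOB is both more complete and more discriminating than ZIP, so it ranks first as a matching key.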
Status of RELAIS
Currently: guided execution of RL workflows with all phases automatic
Future:
– Definition of RELAIS's architecture as a service-oriented, web-accessible architecture. Formal specification of (i) input/output of services, and (ii) pre/post conditions by semantic Web Services technologies
– Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al VLDB 2005] or semi-automatic [Bouguettaya et al. VLDBJ 2003] service composition techniques
Implementation View
PQ-RELAIS
Q-RELAIS
• Data source profiling (quality metadata)
• Quality-based trust evaluation
• Automatic and flexible RL
P-RELAIS
• Privacy risk assessment
• Private RL