![Page 1: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The](https://reader033.vdocument.in/reader033/viewer/2022050104/5f42f7df4bdaf337c20f74ee/html5/thumbnails/1.jpg)
Trio: A System for Data, Uncertainty, and Lineage
Jennifer Widom
Outline of Talk
1. The Motivation
2. The Discovery
3. The Vision
4. The Present
5. The Future
The Motivation
• Lots of applications have uncertain data (approximate, incomplete, imprecise, inaccurate, ...)
• Lots of the same applications need to track data lineage
• Neither is supported by conventional Database Management Systems (DBMSs)
Coincidence or Fate?
Applications
Deduplication
• Uncertainty: Match and merge
• Lineage: Source records
Information extraction
• Uncertainty: Extracted labels and values
• Lineage: Original context
Information integration
• Uncertainty: Inconsistent information
• Lineage: Original sources
Applications
Scientific experiments
• Uncertainty: Captured (and derived) data
• Lineage: Layers of views
Sensor data
• Uncertainty: Sensor values, missing readings
• Lineage: Original readings, views
The Discovery
The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Coincidence or Fate?
Lineage and Uncertainty
Lineage...
• Enables simple and consistent representation of uncertain data
• Correlates uncertainty in query results with uncertainty in the input data
• Can make computation over uncertain data more efficient
Applications use lineage to reduce or resolve uncertainty
The Vision
A new kind of DBMS in which:
1. Data
2. Uncertainty
3. Lineage
are all first-class interrelated concepts
Trio
The Trio Trio
1. Data Model: Simplest extension to relational model that’s sufficiently expressive
2. Query Language: Simple extension to SQL with well-defined semantics and intuitive behavior
3. System: A complete open-source DBMS that people want to use
The Present
1. Data Model: Uncertainty-Lineage Databases (ULDBs)
2. Query Language: TriQL
3. System: First prototype built on top of standard DBMS
Running Example: Crime-Solving
Saw(witness,car) // may be uncertain
Owns(owner,car) // may be uncertain
Suspects(person) = π_owner(Saw ⋈ Owns)
Data Model: Uncertainty
An uncertain database represents a set of possible instances
• Amy saw either a Honda or a Toyota
• Jimmy owns a Toyota, a Mazda, or both
• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
• Hank is a suspect with confidence 0.7
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
Our Model for Uncertainty
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
Equivalently, written with a set of values for car:
witness  car
Amy      { Honda, Toyota, Mazda }
Three possible instances
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe): uncertainty about existence
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura)?
Six possible instances
Our Model for Uncertainty
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Saw (witness,car)
(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6?
Six possible instances, each with a probability
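The count of six instances and their probabilities can be checked by brute force. The sketch below is only an illustration: the list-of-pairs encoding and the helper names `choices` and `possible_instances` are assumptions, not part of Trio, and it assumes the two x-tuples are independent and that a '?' tuple is absent with the leftover probability.

```python
from itertools import product

# Each x-tuple is a list of (alternative, confidence) pairs.
# The values come from the slide; the encoding is an assumption.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],  # confidences sum to less than 1, hence the '?'
]

def choices(xtuple):
    """Ways to resolve one x-tuple: pick an alternative, or nothing if it is a maybe-tuple."""
    options = list(xtuple)
    missing = 1.0 - sum(conf for _, conf in xtuple)
    if missing > 1e-9:
        options.append((None, missing))  # the tuple does not exist in this instance
    return options

def possible_instances(relation):
    """Yield (instance, probability) pairs, assuming independent x-tuples."""
    for combo in product(*(choices(x) for x in relation)):
        instance = frozenset(alt for alt, _ in combo if alt is not None)
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield instance, prob

instances = dict(possible_instances(saw))
```

Here `instances` maps each possible instance (a frozenset of tuples) to its probability, and the probabilities add up to 1.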
Models for Uncertainty
• Our model (so far) is not especially new
• We spent some time exploring the space of models for uncertainty [two papers]
• Tension between understandability and expressiveness
– Our model is understandable
– But it is not complete, or even closed under common operations
Closure and Completeness
Completeness: Can represent all sets of possible instances
Closure: Can represent results of operations
Note: Completeness ⇒ Closure
Our Model is Not Closed
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Owns (owner,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda)
(Hank, Honda)
Suspects = π_owner(Saw ⋈ Owns)
Suspects
Jimmy ?
Billy ?
Hank ?
Does not (and cannot) correctly capture the possible instances in the result
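To see why three independent maybe-tuples are the wrong answer, one can evaluate the query in every possible instance of the inputs and compare. This is a brute-force sketch for illustration only; the helper names `instances` and `suspects` are made up here.

```python
from itertools import product

# X-tuples from the slide: each is a list of alternatives (no '?', no confidences).
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
owns = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda")],
    [("Hank", "Honda")],
]

def instances(relation):
    """Enumerate possible instances: one alternative chosen per x-tuple."""
    return [set(combo) for combo in product(*relation)]

def suspects(saw_inst, owns_inst):
    """Projection on owner of (Saw join Owns), over ordinary certain relations."""
    return frozenset(
        owner
        for _, seen in saw_inst
        for owner, owned in owns_inst
        if seen == owned
    )

# Correct result: run the query in every possible instance of the inputs.
true_results = {suspects(s, o) for s in instances(saw) for o in instances(owns)}

# What three independent maybe-tuples Jimmy?, Billy?, Hank? would represent.
names = ["Jimmy", "Billy", "Hank"]
maybe_results = {
    frozenset(n for n, keep in zip(names, mask) if keep)
    for mask in product([False, True], repeat=3)
}
```

The correct result has only three possible instances, {Billy, Hank}, {Jimmy}, and the empty set, while the maybe-tuple representation admits eight, including impossible ones such as {Jimmy, Billy}.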
Lineage to the Rescue
Lineage (provenance): “where data came from”
• Internal lineage
• External lineage
In Trio: A function λ from alternatives to other alternatives (or external sources)
Example with Lineage
ID Saw (witness,car)
11 (Cathy, Honda) ∥ (Cathy, Mazda)
ID Owns (owner,car)
21 (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22 (Billy, Honda)
23 (Hank, Honda)
Suspects = π_owner(Saw ⋈ Owns)
ID Suspects
31 Jimmy ?
32 Billy ?
33 Hank ?
λ(31) = (11,2), (21,2)
λ(32) = (11,1), 22
λ(33) = (11,1), 23
Correctly captures possible instances in the result
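A small sketch of how the λ function restricts the result: a derived tuple is present exactly when every alternative in its lineage was chosen. The dictionaries below are an assumed encoding for illustration (a bare ID such as 22 is written as `(22, 1)`), not Trio's internal representation.

```python
from itertools import product

# Base x-tuples: ID -> alternatives (alternative indexes are 1-based, as on the slide).
base = {
    11: [("Cathy", "Honda"), ("Cathy", "Mazda")],
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda")],
    23: [("Hank", "Honda")],
}
suspects = {31: "Jimmy", 32: "Billy", 33: "Hank"}
# The lineage function from the slide, as sets of (x-tuple ID, alternative index).
lineage = {
    31: {(11, 2), (21, 2)},
    32: {(11, 1), (22, 1)},
    33: {(11, 1), (23, 1)},
}

def result_instances():
    """Possible instances of Suspects implied by the base data plus lineage."""
    ids = sorted(base)
    worlds = set()
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        chosen = set(zip(ids, picks))
        # A suspect exists only if all alternatives in its lineage were chosen.
        worlds.add(frozenset(
            name for sid, name in suspects.items() if lineage[sid] <= chosen
        ))
    return worlds

worlds = result_instances()
```

With lineage attached, the three '?' tuples yield exactly the instances {Billy, Hank}, {Jimmy}, and the empty set.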
Trio Data Model
Uncertainty-Lineage Databases (ULDBs) [recent paper]
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage
ULDBs are closed and complete
ULDB Results
Conjunctive lineage sufficient for most operations
• Negative lineage for difference
• Disjunctive lineage for duplicate-elimination
Minimality of representations
• Data-minimal
• Lineage-minimal
Membership problems
Extraction of a relation from a ULDB
Querying ULDBs
• Simple extension to SQL
• Formal semantics, intuitive meaning
• Ability to query confidences and lineage directly
TriQL
TriQL Example
ID Saw (witness,car)
11 (Cathy, Honda) ∥ (Cathy, Mazda)
ID Owns (owner,car)
21 (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22 (Billy, Honda)
23 (Hank, Honda)
SELECT Owns.owner AS person INTO Suspects
FROM Saw, Owns
WHERE Saw.car = Owns.car
ID Suspects (person)
31 Jimmy ?
32 Billy ?
33 Hank ?
λ(31) = (11,2), (21,2)
λ(32) = (11,1), 22
λ(33) = (11,1), 23
Formal Semantics
Query Q on ULDB D
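[Diagram: D represents the possible instances D1, D2, …, Dn. Running Q on each instance yields Q(D1), Q(D2), …, Q(Dn), and D’ is a representation of those instances. The implementation of Q (operational semantics) goes directly from D to D’, which is D + Result.]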
TriQL: Querying Confidences
Built-in function: conf()
SELECT Owns.owner AS person INTO Suspects
FROM Saw, Owns
WHERE Saw.car = Owns.car
AND conf(Saw) > 0.5 AND conf(Owns) > 0.8
TriQL: Querying Lineage
Built-in join predicate: lineage()
SELECT Saw.witness INTO AccusesHank
FROM Suspects, Saw
WHERE lineage(Suspects,Saw)
AND Suspects.person = 'Hank'
Also lineage*()
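The behaviour of the two predicates can be mimicked on the running example. This is a rough sketch, not TriQL's implementation: `lineage_pred` and `lineage_star` are illustrative names, `lineage*()` is assumed here to mean the transitive closure of lineage, and tuple 41 with its lineage is hypothetical (it does not appear on the slides).

```python
# (x-tuple ID, alternative index) -> (table, tuple); IDs 11 to 33 follow the slides.
alternatives = {
    (11, 1): ("Saw", ("Cathy", "Honda")),
    (11, 2): ("Saw", ("Cathy", "Mazda")),
    (21, 1): ("Owns", ("Jimmy", "Toyota")),
    (21, 2): ("Owns", ("Jimmy", "Mazda")),
    (22, 1): ("Owns", ("Billy", "Honda")),
    (23, 1): ("Owns", ("Hank", "Honda")),
    (31, 1): ("Suspects", ("Jimmy",)),
    (32, 1): ("Suspects", ("Billy",)),
    (33, 1): ("Suspects", ("Hank",)),
}
lineage = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (33, 1): {(11, 1), (23, 1)},
}

def lineage_pred(derived, source):
    """True when `source` appears in the direct lineage of `derived`."""
    return source in lineage.get(derived, set())

def lineage_star(alt):
    """All alternatives reachable through one or more lineage steps."""
    seen, stack = set(), [alt]
    while stack:
        for parent in lineage.get(stack.pop(), set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Witnesses whose Saw tuples are in the lineage of the suspect 'Hank'.
accuses_hank = sorted({
    saw_row[0]
    for s_id, (s_table, s_row) in alternatives.items()
    if s_table == "Suspects" and s_row == ("Hank",)
    for w_id, (w_table, saw_row) in alternatives.items()
    if w_table == "Saw" and lineage_pred(s_id, w_id)
})

# Hypothetical second-level tuple: an AccusesHank result derived from 33 and 11.
lineage[(41, 1)] = {(33, 1), (11, 1)}
ancestors = lineage_star((41, 1))
```

On this data `accuses_hank` contains only Cathy, and the transitive closure for the hypothetical tuple 41 reaches the base Owns tuple 23 as well as 33 and 11.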
Computing Confidences
Previous approach (probabilistic databases):
• Each operator computes confidences during query execution
• Only certain query plans allowed
Our approach
• Use any query plan
• Compute confidences afterwards based on lineage
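As a minimal sketch of computing confidences after the fact, consider the simplest case: conjunctive lineage whose alternatives come from distinct, independent base x-tuples, where the confidence of a derived tuple is the product of the confidences in its lineage. The confidence values below are hypothetical (the slides give none for this example), and more general lineage needs more than a plain product.

```python
from math import prod

# Hypothetical confidences for the base alternatives, keyed by (x-tuple ID, alternative).
base_conf = {
    (11, 1): 0.6, (11, 2): 0.4,  # Cathy saw a Honda / a Mazda
    (21, 1): 0.3, (21, 2): 0.7,  # Jimmy owns a Toyota / a Mazda
    (22, 1): 1.0,                # Billy owns a Honda
    (23, 1): 1.0,                # Hank owns a Honda
}
# Lineage of the Suspects tuples, as on the slides.
lineage = {
    31: {(11, 2), (21, 2)},
    32: {(11, 1), (22, 1)},
    33: {(11, 1), (23, 1)},
}

def confidence(tuple_id):
    """Product of base confidences; valid only under the independence assumption above."""
    return prod(base_conf[alt] for alt in lineage[tuple_id])

confidences = {tid: confidence(tid) for tid in lineage}
```

Because the computation reads only the stored lineage, it does not depend on which query plan produced the Suspects tuples.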
The Trio System
Version 1: Entirely on top of conventional DBMS
Surprisingly easy and complete, reasonably efficient
The Trio System: Version 1
[Architecture diagram]
create trio table T(A,B)
select C into R ...
Trio API
• Result cursors
• Browse tables
• Explore lineage
SQL commands
Relational DBMS, storing:
• T (xid, aid, A, B)
• R (xid, aid, C)
• Lin:R (aid, table, aid)
• Trio Metadata
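The table layout named on this slide (one row per alternative with `xid` and `aid` columns, plus a lineage table per derived relation) can be tried out with SQLite. Everything beyond those column names is an assumption made for illustration: the concrete `aid` values, the name `Lin_Suspects`, and the column names `src_table` and `src_aid` (used because `table` is an SQL keyword). It is a sketch of the idea, not the Trio prototype's actual schema or translation.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Saw  (xid INTEGER, aid INTEGER PRIMARY KEY, witness TEXT, car TEXT);
CREATE TABLE Owns (xid INTEGER, aid INTEGER PRIMARY KEY, owner TEXT, car TEXT);
CREATE TABLE Suspects (xid INTEGER, aid INTEGER PRIMARY KEY, person TEXT);
CREATE TABLE Lin_Suspects (aid INTEGER, src_table TEXT, src_aid INTEGER);
INSERT INTO Saw VALUES (11, 111, 'Cathy', 'Honda'), (11, 112, 'Cathy', 'Mazda');
INSERT INTO Owns VALUES (21, 211, 'Jimmy', 'Toyota'), (21, 212, 'Jimmy', 'Mazda'),
                        (22, 221, 'Billy', 'Honda'), (23, 231, 'Hank', 'Honda');
""")

# Join the encoded alternatives with plain SQL, remembering which alternatives matched.
rows = db.execute("""
    SELECT Owns.owner, Saw.aid, Owns.aid
    FROM Saw JOIN Owns ON Saw.car = Owns.car
    ORDER BY Owns.aid
""").fetchall()

# Store each result as its own x-tuple and record its lineage.
for n, (owner, saw_aid, owns_aid) in enumerate(rows, start=1):
    aid = 310 + n  # hypothetical identifier scheme
    db.execute("INSERT INTO Suspects VALUES (?, ?, ?)", (aid, aid, owner))
    db.executemany(
        "INSERT INTO Lin_Suspects VALUES (?, ?, ?)",
        [(aid, "Saw", saw_aid), (aid, "Owns", owns_aid)],
    )

def lineage_of(person):
    """Return the (source table, source aid) pairs recorded for one suspect."""
    return db.execute("""
        SELECT src_table, src_aid
        FROM Lin_Suspects JOIN Suspects USING (aid)
        WHERE person = ?
        ORDER BY src_table, src_aid
    """, (person,)).fetchall()
```

For example, `lineage_of("Jimmy")` returns the Owns alternative 212 and the Saw alternative 112, which correspond to (21,2) and (11,2) on the earlier slides.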
The Trio System: Version 1
[Architecture diagram]
Command-line client
create trio table T(A,B)
select C into R ...
Trio API
• Result cursors
• Browse tables
• Explore lineage
Relational DBMS, storing:
• T (xid, aid, A, B)
• R (xid, aid, C)
• Lin:R (aid, table, aid)
• Trio Metadata
The Trio System: Version 2
[Architecture diagram]
GUI client
create trio table T(A,B)
select C into R ...
Trio API
• Result cursors
• Browse tables
• Explore lineage
Relational DBMS
Specialized Trio Processing
Specialized Trio Structures
Trio Metadata
Current Topics
Confidence computation
• Minimize lineage traversal; memoization; batch computations
Updates
• Primitive operations; TriQL update statements
Additional query constructs
• “Horizontal” operators; top-k by confidence
System
• Keep up with research; GUI
Future Directions
Theory, Model, Algorithms
• Unlimited opportunities
System
• Storage, indexing, partitioning
• Statistics and query optimization
Long Range
• Continuous uncertainty; incomplete relations
• External lineage; versioning
Search “stanford trio”
[overview paper]
Trio group: Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Jennifer Widom
Special thanks to: Ashok Chandra, Alon Halevy, Jeff Ullman
but don’t forget the lineage…