![Page 1: Uncertainty Lineage Data Bases Very Large Data Bases 19752006](https://reader035.vdocument.in/reader035/viewer/2022062320/56649d055503460f949d9240/html5/thumbnails/1.jpg)
Uncertainty Lineage Data Bases
Very Large Data Bases
1975 2006
ULDBs: Databases with Uncertainty and Lineage
Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom
Stanford InfoLab
Motivation
• Many applications involve data that is uncertain (approximate, probabilistic, inexact, incomplete, imprecise, fuzzy, inaccurate, ...)
• Many of the same applications need to track the lineage of their data
Neither uncertainty nor lineage is supported by conventional DBMSs
Coincidence or Fate?
Sample Applications Needing Uncertainty and Lineage
• Scientific databases
• Sensor databases
• Data cleaning
• Data integration
• Information extraction
Trio Project
Building a new kind of DBMS in which:
1. Data
2. Uncertainty
3. Lineage
are all first-class interrelated concepts
Lineage and Uncertainty
• Lots of independent work in lineage and uncertainty (related work at end of talk)
• Turns out: The connection between uncertainty and lineage goes deeper than just a shared need by several applications
Coincidence or Fate?
Lineage and Uncertainty
• Lineage...
1) Enables simple and consistent representation of uncertain data
2) Correlates uncertainty in query results with uncertainty in the input data
3) Can make computation over uncertain data more efficient
Outline of the Talk
• The ULDB data model
• Querying ULDBs
• ULDB properties
• Membership and extraction operations
• Confidences
• Current, related, and future work
Running Example: Crime Solver
Saw(witness,car) Drives(person,car)
Suspects(person) = πperson(Saw ⋈ Drives)
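As a point of reference, the Suspects query can be sketched over ordinary (certain) relations before any uncertainty is added. This is only an illustrative sketch: the rows in `saw` and `drives` and the function name `suspects` are assumptions made for the example, not data or code from the talk.

```python
# Ordinary (certain) relations as sets of tuples.
# The rows below are made up for illustration.
saw = {("Cathy", "Honda")}                         # Saw(witness, car)
drives = {("Jimmy", "Mazda"), ("Hank", "Honda")}   # Drives(person, car)


def suspects(saw_rows, drives_rows):
    """Project person from the join of Saw and Drives on the car attribute."""
    return {
        person
        for (_witness, seen_car) in saw_rows
        for (person, driven_car) in drives_rows
        if seen_car == driven_car
    }


result = suspects(saw, drives)
print(result)
```

With certain data the result is a single relation; the rest of the talk is about what the result should be when Saw and Drives each describe many possible instances.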
Uncertainty
• An uncertain database represents a set of possible instances. Examples:
– Amy saw either a Honda or a Toyota
– Jimmy drives a Toyota, a Mazda, or both
– Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3
– Hank is a suspect with confidence 0.7
Uncertainty in a ULDB
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
Uncertainty in a ULDB
1. Alternatives: uncertainty about value
2. ‘?’ (Maybe) Annotations
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
=
witness   car
Amy       { Honda, Toyota, Mazda }
Three possible instances
Uncertainty in a ULDB
1. Alternatives
2. ‘?’ (Maybe): uncertainty about presence
3. Confidences
Saw (witness,car)
(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)
(Betty, Acura) ?
Six possible instances
Uncertainty in a ULDB
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences: weighted uncertainty
Saw (witness,car)
(Amy, Honda): 0.5 ∥ (Amy, Toyota): 0.3 ∥ (Amy, Mazda): 0.2
(Betty, Acura): 0.6 ?
Six possible instances, each with a probability
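The six instances and their probabilities can be enumerated directly. The sketch below assumes that distinct tuples are independent and that a tuple's missing confidence mass (here 0.4 for Betty) is the probability that it is absent; the list-of-alternatives encoding and the helper names `choices` and `possible_instances` are illustrative, not Trio's actual representation.

```python
from itertools import product

# Each tuple is a list of (alternative, confidence) pairs.
saw = [
    [(("Amy", "Honda"), 0.5), (("Amy", "Toyota"), 0.3), (("Amy", "Mazda"), 0.2)],
    [(("Betty", "Acura"), 0.6)],
]


def choices(xtuple):
    """Alternatives of one tuple, plus an 'absent' choice if confidences sum below 1."""
    options = list(xtuple)
    absent = 1.0 - sum(conf for _alt, conf in xtuple)
    if absent > 1e-9:
        options.append((None, absent))  # None stands for the '?' case
    return options


def possible_instances(relation):
    """Return (instance, probability) pairs, assuming tuples are independent."""
    result = []
    for combo in product(*(choices(x) for x in relation)):
        rows = frozenset(alt for alt, _conf in combo if alt is not None)
        prob = 1.0
        for _alt, conf in combo:
            prob *= conf
        result.append((rows, prob))
    return result


instances = possible_instances(saw)
for rows, prob in instances:
    print(sorted(rows), round(prob, 2))
```

Three choices for Amy times two for Betty (present or absent) give the six instances, and the probabilities sum to 1.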
Data Models for Uncertainty
• Our model (so far) is not especially new
• We spent some time exploring the space of models for uncertainty [ICDE 2006]
• Tension between understandability and expressiveness
– Our model is understandable
– But it is not complete, or even closed under common operations
Closure and Completeness
• Completeness: Can represent all sets of possible instances
• Closure: Can represent results of operations
• Note: Completeness ⇒ Closure
Model (so far) Not Closed
Saw (witness,car)
(Cathy, Honda) ∥ (Cathy, Mazda)
Drives (person,car)
(Jimmy, Toyota) ∥ (Jimmy, Mazda)
(Billy, Honda) ∥ (Frank, Honda)
(Hank, Honda)
Suspects = πperson(Saw ⋈ Drives)
Suspects
Jimmy ?
Billy ∥ Frank ?
Hank ?
Does not correctly capture possible instances in the result (and CANNOT)
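A brute-force check makes the gap concrete. The sketch below enumerates every possible instance of Saw and Drives, runs the query on each, and compares the true set of results with what the Suspects table above would allow if its three tuples were treated as independent. The encoding and function names are assumptions for illustration only.

```python
from itertools import product

# Tuples with alternatives and no '?': one alternative is chosen per tuple.
saw = [[("Cathy", "Honda"), ("Cathy", "Mazda")]]
drives = [
    [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    [("Billy", "Honda"), ("Frank", "Honda")],
    [("Hank", "Honda")],
]


def worlds(relation):
    """All possible instances of a relation given as a list of alternative lists."""
    return [set(combo) for combo in product(*relation)]


def suspects(saw_rows, drives_rows):
    """Project person from the join of Saw and Drives on car."""
    return frozenset(
        person
        for (_w, seen) in saw_rows
        for (person, car) in drives_rows
        if seen == car
    )


# Ground truth: evaluate the query in every possible instance.
true_results = {suspects(s, d) for s in worlds(saw) for d in worlds(drives)}

# The attempted representation: Jimmy ?, Billy || Frank ?, Hank ?
# (None marks the case where the maybe-tuple is absent).
attempt = [["Jimmy", None], ["Billy", "Frank", None], ["Hank", None]]
attempt_results = {
    frozenset(x for x in combo if x is not None) for combo in product(*attempt)
}

print(len(true_results), len(attempt_results))
print(sorted(map(sorted, attempt_results - true_results)))
```

Only four results are actually possible, while the lineage-free table admits twelve, including {Jimmy, Hank}, which would require Cathy to have seen both a Mazda and a Honda.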
Lineage to the Rescue
• Lineage: “where data came from”
– Internal lineage
– External lineage (not covered in this talk)
• In ULDBs: A function λ from alternatives to sets of alternatives (or external sources)
Example with Lineage
ID   Saw (witness,car)
11   (Cathy, Honda) ∥ (Cathy, Mazda)
ID   Drives (person,car)
21   (Jimmy, Toyota) ∥ (Jimmy, Mazda)
22   (Billy, Honda) ∥ (Frank, Honda)
23   (Hank, Honda)
Suspects = πperson(Saw ⋈ Drives)
ID   Suspects
31   Jimmy ?
32   Billy ∥ Frank ?
33   Hank ?
λ(31) = (11,2),(21,2)
λ(32,1) = (11,1),(22,1); λ(32,2) = (11,1),(22,2)
λ(33) = (11,1), 23
Correctly captures possible instances in the result
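One way to read the lineage function λ is: a derived alternative is present in a possible instance exactly when every alternative in its lineage was chosen. The sketch below applies that reading to the example. It is an interpretation for illustration, not the paper's formal semantics; the dictionary layout is an assumption, and the single-alternative tuples 31, 33, and 23 are written with alternative index 1.

```python
from itertools import product

# Base tuples: id -> alternatives (alternative indices are 1-based, as on the slide).
base = {
    11: [("Cathy", "Honda"), ("Cathy", "Mazda")],
    21: [("Jimmy", "Toyota"), ("Jimmy", "Mazda")],
    22: [("Billy", "Honda"), ("Frank", "Honda")],
    23: [("Hank", "Honda")],
}

# Derived alternatives of Suspects: (tuple id, alt index) -> (value, lineage).
suspects = {
    (31, 1): ("Jimmy", {(11, 2), (21, 2)}),
    (32, 1): ("Billy", {(11, 1), (22, 1)}),
    (32, 2): ("Frank", {(11, 1), (22, 2)}),
    (33, 1): ("Hank", {(11, 1), (23, 1)}),
}


def suspects_instances():
    """Possible instances of Suspects: keep an alternative iff its lineage was chosen."""
    ids = sorted(base)
    results = set()
    for picks in product(*(range(1, len(base[i]) + 1) for i in ids)):
        chosen = set(zip(ids, picks))  # one alternative per base tuple
        results.add(
            frozenset(value for value, lin in suspects.values() if lin <= chosen)
        )
    return results


instances = suspects_instances()
print(sorted(map(sorted, instances)))
```

Under this reading the derived table yields exactly the four results that evaluating the query instance by instance produces, and {Jimmy, Hank} is excluded because it would need both (11,1) and (11,2).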
ULDBs
1. Alternatives
2. ‘?’ (Maybe) Annotations
3. Confidences
4. Lineage
ULDBs are Closed and Complete
Querying ULDBs
• Query Q on ULDB D
[Diagram: D → (possible instances) → D1, D2, …, Dn → (Q on each instance) → Q(D1), Q(D2), …, Q(Dn); D → (implementation of Q) → D’ = D + Result; D’ → (representation of instances) → Q(D1), Q(D2), …, Q(Dn)]
Well-Behaved ULDBs
• If we start with a well-behaved ULDB and perform standard queries, it remains well-behaved
• Intuitively (details in paper):
– Acyclic: No cycles in the lineage
– Deterministic: Non-empty lineages of distinct alternatives are distinct
– Uniform: Alternatives of same tuple are derived from the same set of tuples
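The three intuitive conditions can be turned into simple checks over a lineage map. The sketch below encodes only the informal wording above (the precise definitions are in the paper), and the representation, a dictionary from (tuple id, alternative index) to a set of such pairs, is an assumption made for the example.

```python
def is_acyclic(lineage):
    """No alternative is reachable from itself by following lineage."""
    visiting, done = set(), set()

    def visit(alt):
        if alt in done:
            return True
        if alt in visiting:
            return False  # back edge: a cycle
        visiting.add(alt)
        ok = all(visit(parent) for parent in lineage.get(alt, ()))
        visiting.discard(alt)
        if ok:
            done.add(alt)
        return ok

    return all(visit(alt) for alt in lineage)


def is_deterministic(lineage):
    """Non-empty lineages of distinct alternatives are distinct."""
    nonempty = [frozenset(lin) for lin in lineage.values() if lin]
    return len(nonempty) == len(set(nonempty))


def is_uniform(lineage):
    """Alternatives of the same tuple derive from the same set of tuple ids."""
    sources = {}
    for (tuple_id, _index), lin in lineage.items():
        src = frozenset(parent_id for parent_id, _i in lin)
        if sources.setdefault(tuple_id, src) != src:
            return False
    return True


# Lineage of the Suspects example (single-alternative tuples use index 1).
example = {
    (31, 1): {(11, 2), (21, 2)},
    (32, 1): {(11, 1), (22, 1)},
    (32, 2): {(11, 1), (22, 2)},
    (33, 1): {(11, 1), (23, 1)},
}
# A hypothetical broken lineage with a cycle, for contrast.
cyclic = {(1, 1): {(2, 1)}, (2, 1): {(1, 1)}}

print(is_acyclic(example), is_deterministic(example), is_uniform(example))
print(is_acyclic(cyclic))
```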
ULDB Minimality
• Data-minimality
– Does every alternative appear in some possible instance? (no extraneous alternatives)
– Is every maybe-tuple in R absent from some possible instance? (no extraneous ‘?’s)
• Lineage-minimality
Data-Minimality Examples: Extraneous ‘?’
. . .
10   (Billy, Honda) ∥ (Frank, Honda)
. . .
. . .
20   Billy ∥ Frank ?   (the ‘?’ is extraneous)
. . .
λ(20,1)=(10,1); λ(20,2)=(10,2)
Data-Minimality Examples: Extraneous alternative
(Diane, Mazda) ∥ (Diane, Acura)
(Diane, Mazda) ?
(Diane, Acura) ?
Diane ?   (extraneous)
Data-Minimization
• Extraneous alternative theorem:
– An alternative is extraneous iff it is (possibly transitively) derived from multiple alternatives of the same tuple.
• Extraneous “?” theorem:
– A “?” on tuple t is extraneous iff
1. it is derived from base tuples without “?”
2. t has as many alternatives as the product of the numbers of alternatives in its base tuples
• Minimization algorithm based on the theorems (see paper)
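The extraneous-alternative condition can be tested by walking an alternative's transitive lineage and looking for two alternatives of one tuple. This is a minimal sketch of that test only, not the paper's minimization algorithm; the tuple IDs 40 to 43 are hypothetical labels for a Diane example in which one base tuple has two alternatives, each alternative feeds one derived tuple, and a final "Diane" tuple combines both.

```python
def ancestors(alt, lineage):
    """All alternatives reachable from alt by following lineage transitively."""
    seen = set()
    stack = list(lineage.get(alt, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, ()))
    return seen


def is_extraneous_alternative(alt, lineage):
    """True iff alt derives from two different alternatives of the same tuple."""
    tuple_ids = [tuple_id for tuple_id, _index in ancestors(alt, lineage)]
    return len(tuple_ids) != len(set(tuple_ids))


# Hypothetical IDs: 40 is the base tuple (Diane, Mazda) || (Diane, Acura).
lineage = {
    (41, 1): {(40, 1)},            # (Diane, Mazda) ?
    (42, 1): {(40, 2)},            # (Diane, Acura) ?
    (43, 1): {(41, 1), (42, 1)},   # Diane ?, combining both derived tuples
}

print(is_extraneous_alternative((43, 1), lineage))
print(is_extraneous_alternative((41, 1), lineage))
```

The combined alternative depends on both (40,1) and (40,2), which can never hold in the same possible instance, so it can never appear.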
ULDB Properties and Operations
[Diagram relating the properties Data-minimal and Lineage-minimal to the operations Queries, Data-minimize, Lineage-minimize, Membership, and Extraction]
Membership Questions
• Does a given tuple t appear in some (all) possible instance(s) of R?
– Polynomial algorithms based on Data-minimization
• Is a given table T one of (all of) the possible instances of R?
– NP-Hard
[Diagram: t? and T? asked against R and its possible instances I1, I2, …, In]
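For a small base relation, the membership questions can be answered by brute-force enumeration of possible instances. The sketch below does exactly that and is exponential in the number of tuples; it is not the polynomial algorithm mentioned above. The encoding (a list of alternatives plus a maybe flag) and the function names are assumptions for illustration, and lineage is ignored.

```python
from itertools import product

# Each tuple: (list of alternatives, has_maybe_annotation).
saw = [
    ([("Amy", "Honda"), ("Amy", "Toyota"), ("Amy", "Mazda")], False),
    ([("Betty", "Acura")], True),
]


def possible_instances(relation):
    """Enumerate every possible instance (exponential in the number of tuples)."""
    options = [alts + [None] if maybe else alts for alts, maybe in relation]
    return [
        frozenset(alt for alt in combo if alt is not None)
        for combo in product(*options)
    ]


def in_some_instance(t, relation):
    return any(t in inst for inst in possible_instances(relation))


def in_all_instances(t, relation):
    return all(t in inst for inst in possible_instances(relation))


def is_possible_instance(table, relation):
    return frozenset(table) in possible_instances(relation)


print(in_some_instance(("Amy", "Honda"), saw))
print(in_all_instances(("Amy", "Honda"), saw))
print(is_possible_instance({("Betty", "Acura")}, saw))
```

The last check fails because Amy's tuple has no ‘?’, so every possible instance contains one of her alternatives.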
Extraction
• Extraction algorithm in paper
[Diagram: tables Suspects and Eats shown above Saw and Drives]
Confidences
• Confidences supplied with base data
• Trio computes confidences on query results
– Default probabilistic interpretation
– Can choose to plug in different arithmetic
Saw (witness,car)
(Cathy, Honda): 0.6 ∥ (Cathy, Mazda): 0.4
Drives (person,car)
(Jimmy, Mazda): 0.3 ∥ (Bill, Mazda): 0.6 ?
(Hank, Honda): 1.0
Suspects (Probabilistic)
Jimmy: 0.12 ∥ Bill: 0.24 ?
Hank: 0.6 ?
Suspects (Min)
Jimmy: 0.3 ∥ Bill: 0.4 ?
Hank: 0.6 ?
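The numbers on this slide can be reproduced by combining the confidences of the two alternatives that each suspect derives from, once with a product and once with min. The sketch below is only an illustration of that arithmetic: the dictionaries and the `suspects` helper are assumptions, and it relies on each person having a single derivation in this example.

```python
# Confidences of the base alternatives, keyed by the alternative itself.
saw = {("Cathy", "Honda"): 0.6, ("Cathy", "Mazda"): 0.4}
drives = {("Jimmy", "Mazda"): 0.3, ("Bill", "Mazda"): 0.6, ("Hank", "Honda"): 1.0}


def suspects(combine):
    """Join Saw and Drives on car and combine the two confidences with `combine`."""
    result = {}
    for (_witness, seen), saw_conf in saw.items():
        for (person, car), drives_conf in drives.items():
            if seen == car:
                # Each person has exactly one derivation in this example.
                result[person] = combine(saw_conf, drives_conf)
    return result


probabilistic = suspects(lambda a, b: a * b)  # product of independent events
minimum = suspects(min)                        # the alternative "Min" arithmetic

print({p: round(c, 2) for p, c in probabilistic.items()})
print(minimum)
```

The product gives Jimmy 0.12, Bill 0.24, and Hank 0.6; min gives Jimmy 0.3, Bill 0.4, and Hank 0.6.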
Query Processing with Confidences
• Previous approach (probabilistic databases)
– Each operator computes confidences during query execution
– Only certain query plans allowed
• In ULDBs
– Confidence of alternative A is a function of confidences in its transitive lineage
• Our approach: Decouple data and confidence computation
– Use any query plan for data computation
– Compute confidences on-demand using lineage
– Can give arbitrarily large improvements
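A minimal sketch of on-demand confidence computation, under two loudly stated assumptions: base tuples are independent, and lineage is purely conjunctive (an alternative exists iff all alternatives in its lineage do). Under those assumptions, a confidence is the product over the distinct base alternatives in the transitive lineage. The tuple IDs, the extra alternative (40,1), and the function names are hypothetical; this is not Trio's actual algorithm.

```python
from functools import lru_cache

# Hypothetical IDs: 11 = Cathy's Saw tuple, 21 = Jimmy || Bill, 23 = Hank.
base_conf = {(11, 1): 0.6, (11, 2): 0.4, (21, 1): 0.3, (21, 2): 0.6, (23, 1): 1.0}
lineage = {
    (31, 1): frozenset({(11, 2), (21, 1)}),   # Jimmy
    (31, 2): frozenset({(11, 2), (21, 2)}),   # Bill
    (32, 1): frozenset({(11, 1), (23, 1)}),   # Hank
    (40, 1): frozenset({(31, 1), (11, 2)}),   # hypothetical later result reusing (11,2)
}


@lru_cache(maxsize=None)
def base_alternatives(alt):
    """Distinct base alternatives in the transitive lineage of alt (memoized)."""
    parents = lineage.get(alt)
    if not parents:
        return frozenset({alt})
    return frozenset().union(*(base_alternatives(p) for p in parents))


def confidence(alt):
    """Product of base confidences; 0 if two alternatives of one tuple are needed."""
    bases = base_alternatives(alt)
    tuple_ids = [tuple_id for tuple_id, _index in bases]
    if len(tuple_ids) != len(set(tuple_ids)):
        return 0.0  # mutually exclusive alternatives can never co-occur
    result = 1.0
    for b in bases:
        result *= base_conf[b]
    return result


print(round(confidence((31, 1)), 2))
print(round(confidence((40, 1)), 2))
```

The second call shows why working from lineage matters: multiplying operator by operator would count (11,2) twice and give 0.048, whereas the lineage-based value counts each base alternative once.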
Current Work: Algorithms
• Algorithms: confidence computation, extraneous data, membership questions
– Minimize lineage traversal
– Memoization
– Batch computations
The Trio Trio
• Data Model
– ULDBs (Coming: incomplete relations; continuous uncertainty; correlation uncertainty)
• Query Language: TriQL
– Simple extension to SQL
– Query uncertainty, confidences, and lineage
• System
– Did you see our demo?
– Version 1: Entirely on top of conventional DBMS
– Surprisingly easy and complete, reasonably efficient
Brief Related Work
• Uncertainty
– Modeling: C-tables [IL84], Probabilistic Databases [CP87], using Nested Relations [F90]
– Systems: ProbView [LLRS97], MYSTIQ [BDM+05], ORION [CSP05], Trio [BDHW05]
• Lineage
– DBNotes [CTV05], Data Warehouses [CW03]
but don’t forget the lineage…
Thank You
Search “stanford trio”
(or, http://i.stanford.edu/trio)