trio: a system for data, uncertainty, and...

36
Trio: A System for Data, Uncertainty, and Lineage Jennifer Widom

Upload: others

Post on 13-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

Trio: A System for Data, Uncertainty, and Lineage

Jennifer Widom

Page 2: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

2

Outline of Talk

1. The Motivation

2. The Discovery

3. The Vision

4. The Present

5. The Future

Page 3: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

3

The Motivation

• Lots of applications have uncertain data(approximate, incomplete, imprecise, inaccurate, ...)

• Lots of the same applications need to track data lineage

• Neither is supported by conventional Database Management Systems (DBMSs)

Coincidence or Fate?

Page 4: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

4

Applications

Deduplication• Uncertainty: Match and merge• Lineage: Source records

Information extraction• Uncertainty: Extracted labels and values• Lineage: Original context

Information integration• Uncertainty: Inconsistent information• Lineage: Original sources

Page 5: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

5

Applications

Scientific experiments• Uncertainty: Captured (and derived) data

• Lineage: Layers of views

Sensor data• Uncertainty: Sensor values, missing readings

• Lineage: Original readings, views

Page 6: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

6

The Discovery

The connection between uncertainty and lineage goes deeper than just a shared need by several applications

Coincidence or Fate?

Page 7: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

7

Lineage and Uncertainty

Lineage...• Enables simple and consistent representation of

uncertain data

• Correlates uncertainty in query results with uncertainty in the input data

• Can make computation over uncertain data more efficient

Applications use lineage to reduce or resolve uncertainty

Page 8: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

8

The Vision

A new kind of DBMS in which:1. Data2. Uncertainty3. Lineage

are all first-class interrelated concepts

Trio

Page 9: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

9

The Trio Trio

1. Data ModelSimplest extension to relational model that’s sufficiently expressive

2. Query LanguageSimple extension to SQL with well-defined semantics and intuitive behavior

3. SystemA complete open-source DBMS that people want to use

Page 10: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

10

The Present

1. Data ModelUncertainty-Lineage Databases (ULDBs)

2. Query LanguageTriQL

3. SystemFirst prototype built on top of standard DBMS

Page 11: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

11

Running Example: Crime-Solving

Saw(witness,car) // may be uncertain

Owns(owner,car) // may be uncertain

Suspects(person) = πowner(Saw ⋈ Owns)

Page 12: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

12

Data Model: Uncertainty

An uncertain database represents a set ofpossible instances

• Amy saw either a Honda or a Toyota

• Jimmy owns a Toyota, a Mazda, or both

• Betty saw an Acura with confidence 0.5 or a Toyota with confidence 0.3

• Hank is a suspect with confidence 0.7

Page 13: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

13

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

Page 14: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

14

Our Model for Uncertainty

1. Alternatives: uncertainty about value

2. ‘?’ (Maybe) Annotations

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

witness car

Amy { Honda, Toyota, Mazda }=

Three possible instances

Page 15: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

15

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe): uncertainty about existence

3. Confidences

Saw (witness,car)

(Amy, Honda) ∥ (Amy, Toyota) ∥ (Amy, Mazda)

(Betty, Acura)?

Six possible instances

Page 16: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

16

Our Model for Uncertainty

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences: weighted uncertainty

Saw (witness,car)

(Amy, Honda): 0.5 ∥ (Amy,Toyota): 0.3 ∥ (Amy, Mazda): 0.2

(Betty, Acura): 0.6?

Six possible instances,each with a probability

Page 17: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

17

Models for Uncertainty

• Our model (so far) is not especially new

• We spent some time exploring the space of models for uncertainty [two papers]

• Tension between understandability and expressiveness– Our model is understandable

– But it is not complete, or even closed under common operations

Page 18: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

18

Closure and Completeness

CompletenessCan represent all sets of possible instances

ClosureCan represent results of operations

Note: Completeness ⇒ Closure

Page 19: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

19

Our Model is Not Closed

Saw (witness,car)

(Cathy, Honda) ∥ (Cathy, Mazda)

Owns (owner,car)

(Jimmy, Toyota) ∥ (Jimmy, Mazda)

(Billy, Honda)

(Hank, Honda)

Suspects

Jimmy

Billy

Hank

Suspects = πowner(Saw ⋈ Owns)

???

Does not correctlycapture possibleinstances in theresult

Cannot

Page 20: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

20

to the Rescue

Lineage (provenance): “where data came from”• Internal lineage

• External lineage

In Trio: A function λ from alternatives to other alternatives (or external sources)

Lineage

Page 21: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

21

Example with Lineage

ID Saw (witness,car)

11 (Cathy, Honda) ∥ (Cathy, Mazda)

ID Owns (owner,car)

21 (Jimmy, Toyota) ∥ (Jimmy, Mazda)

22 (Billy, Honda)

23 (Hank, Honda)

ID Suspects

31 Jimmy

32 Billy

33 Hank

???

Suspects = πowner(Saw ⋈ Owns)

λ(31) = (11,2),(21,2)λ(32) = (11,1), 22λ(33) = (11,1), 23

Correctlycaptures possibleinstances in theresult

Page 22: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

22

Trio Data Model

[recent paper]

1. Alternatives

2. ‘?’ (Maybe) Annotations

3. Confidences

4. Lineage

ULDBs are closed and complete

Uncertainty-Lineage Databases (ULDBs)

Page 23: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

23

ULDB Results

Conjunctive lineage sufficient for most operations• Negative lineage for difference

• Disjunctive lineage for duplicate-elimination

Minimality of representations• Data-minimal

• Lineage-minimal

Membership problems

Extraction of a relation from a ULDB

Page 24: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

24

Querying ULDBs

• Simple extension to SQL

• Formal semantics, intuitive meaning

• Ability to query confidences and lineage directly

TriQL

Page 25: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

25

TriQL Example

ID Saw (witness,car)

11 (Cathy, Honda) ∥ (Cathy, Mazda)

ID Owns (owner,car)

21 (Jimmy, Toyota) ∥ (Jimmy, Mazda)

22 (Billy, Honda)

23 (Hank, Honda)

ID person

31 Jimmy

32 Billy

33 Hank

???

SELECT Owns.person INTO SuspectsFROM Saw, OwnsWHERE Saw.car = Owns.car

λ(31) = (11,2),(21,2)λ(32) = (11,1), 22λ(33) = (11,1), 23

Page 26: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

26

Formal Semantics

Query Q on ULDB D

D

D1, D2, …, Dn

possibleinstances

Q on eachinstance

representationof instances

Q(D1), Q(D2), …, Q(Dn)

D’implementation of Q

operational semantics

D + Result

Page 27: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

27

TriQL: Querying Confidences

Built-in function: conf()

SELECT Owns.person INTO SuspectsFROM Saw, OwnsWHERE Saw.car = Owns.carAND conf(Saw) > 0.5 AND conf(Owns) > 0.8

Page 28: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

28

TriQL: Querying Lineage

Built-in join predicate: lineage()

SELECT Saw.witness INTO AccusesHankFROM Suspects, SawWHERE lineage(Suspects,Saw)AND Suspects.person = ‘Hank’

Also lineage*()

Page 29: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

29

Computing Confidences

Previous approach (probabilistic databases):• Each operator computes confidences during query

execution

• Only certain query plans allowed

Our approach• Use any query plan

• Compute confidences afterwards based on lineage

Page 30: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

30

The Trio System

Version 1Entirely on top of conventional DBMS

Surprisingly easy and complete, reasonably efficient

Page 31: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

31

The Trio System: Version 1

Lin:R aid table aid

R xid aid C

Relational DBMS

create trio table T(A,B)

select C into R ...

TrioMetadata

Trio APISQL commands

• Result cursors• Browse tables• Explore lineage

T xid aid A B

Page 32: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

32

The Trio System: Version 1

Relational DBMS

create trio table T(A,B)

select C into R ...

Trio API

• Result cursors• Browse tables• Explore lineage

Command-line client

Lin:R aid table aid

R xid aid CTrio

Metadata

T xid aid A B

Page 33: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

33

The Trio System: Version 2

Relational DBMS

create trio table T(A,B)

select C into R ...

Trio API

• Result cursors• Browse tables• Explore lineage

GUI client

SpecializedTrio Processing

SpecializedTrio Structures Trio

Metadata

Page 34: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

34

Current Topics

Confidence computation• Minimize lineage traversal; memoization; batch

computations

Updates• Primitive operations; TriQL update statements

Additional query constructs• “Horizontal” operators; top-k by confidence

System• Keep up with research; GUI

Page 35: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

35

Future Directions

Theory, Model, Algorithms• Unlimited opportunities

System• Storage, indexing, partitioning• Statistics and query optimization

Long Range• Continuous uncertainty; incomplete relations• External lineage; versioning

Page 36: Trio: A System for Data, Uncertainty, and Lineageforum.stanford.edu/events/2006/2006slides/infolab/widom.pdf · 3. Lineage are all first-class interrelated concepts. Trio . 9. The

Search “stanford trio”[overview paper]

Trio group:Parag Agrawal, Omar Benjelloun, Anish Das Sarma,

Chris Hayworth, Shubha Nabar, Jennifer Widom

Special thanks to:Ashok Chandra, Alon Halevy, Jeff Ullman

but don’t forgetthe lineage…