Rapid, Collaborative Sharing of Dynamic Data
Zachary G. Ives, University of Pennsylvania
with Nicholas Taylor, T. J. Green, Grigoris Karvounarakis, Val Tannen
North Carolina State University, October 6, 2006
Funded by NSF IIS-0477972, IIS-0513778
An Elusive Goal: Building a Web of Structured Data
A longtime goal of the computer science field: creating a "smarter" Web, e.g., Tim Berners-Lee's "Semantic Web" and 15 years of Web data integration.

Envisioned capabilities:
- Link and correlate data from different sources to answer questions that need semantics
- Provide a convenient means of exchanging data with business partners, collaborators, etc.
Why Is this So Hard?
- Semantics is a fuzzy concept: different terminology, units, or ways of representing things (e.g., in real estate, "full + half baths" vs. "bathrooms")
- Equivalences are difficult to determine and specify (e.g., conference paper vs. publication – how do they relate precisely?)
- Linking isn't simply a matter of hyperlinking and counting on a human; instead we need to develop and specify mappings (converters, synonyms)
- Real data is messy, uncertain, inconsistent: typos, uncertain data, non-canonical names, data that doesn't fit into a standard form/schema

But (we believe): the data sharing architecture is the big bottleneck.
Data Sharing, DB-Style:One Instance to Rule Them All?
- Data warehouse/exchange: one schema, one consistent instance
- Data integration / peer data management systems: map heterogeneous data into one or a few virtual schemas; remove any data that's inconsistent [Arenas+]
[Diagram: a data integration system with a mediated schema and source catalog, connected by schema mappings to autonomous data sources, returning query results]
A Common Need: Partial, Peer-to-Peer Collaborative Data Exchange
Sometimes we need to exchange data in a less rigid fashion:
- Cell phone directories with a friend's – with different nicknames
- Citation DBs with different conference abbreviations
- Restaurant reviews and ratings
- Scientific databases, where inconsistency or uncertainty are common

"Peer to peer" in that no one DB is all-encompassing or authoritative:
- Participation is totally voluntary and must not impede local work
- Each must be able to override or supplement data from elsewhere
Target Domain: Data Exchange among Bioinformatics DBs & Biologists
- Bioinformatics groups and biologists want to share data in their databases and warehouses
- Data overlaps – some DBs are specialized, others general (but with data that is less validated)
- Each source is updated and curated locally; updates are published periodically

We are providing mechanisms to:
- Support local queries and edits to data in each DB
- Allow on-demand publishing of updates made to the local DB
- Import others' updates to each local DB despite different schemas
- Accommodate the fact that not all sites agree on the edits! (Not probabilistic – sometimes there is no consensus on the "right" answer!)
[Example peers: CryptoDB, PlasmoDB, EBI]
Challenges
- Multi-"everything": multiple schemas, multiple peers with instances, multiple possibilities for consistent overall instances
- Voluntary participation: a group may publish infrequently, drop off the network, etc. Inconsistency with "the rest of the world" must not prevent the user from doing an operation – unlike cvs or distributed DBs, where consistency with everyone else is always enforced
- Conflicts need to be captured at the right granularity: tuples aren't added independently – they are generally part of transactions, which may have causal dependencies
Collaborative Data Sharing
Philosophy: rather than enforcing a global instance, support many overlapping instances in many schemas. (Conflicts are localized!)

Collaborative Data Sharing System (CDSS):
- Accommodate disagreement with an extended data model; track provenance and support trust policies
- "Reconcile" databases by sharing transactions; detect conflicts via constraints and incompatible updates
- Define update translation mappings to get all data into the target schema, based on schema mappings and provenance

We are implementing the ORCHESTRA CDSS.
A Peer’s Perspective of the CDSS
- The user interacts with a standard database
- The CDSS coordinates with other participants: it ensures availability of published updates and finds a consistent set of trusted updates (reconciliation)
- Updates may first need to be mapped into the target schema
[Diagram: participant PC issues queries and answers against its local RDBMS; the CDSS layer (ORCHESTRA) exchanges updates with peers PE and PP and delivers a consistent, trusted subset of their updates (E, C) in PC's schema]
A CDSS Maps among Sources that Each Publish Updates in Transactions
[Diagram: three peers – CryptoDB (PC, schema RC/GUSv3), PlasmoDB (PP, schema RP/GUSv1), EBI (PE, schema RE/MIAME) – each publishing transactions of inserts (+) and deletes (-), connected by the schema mappings E<->C and CE->P]
Along with Schema Mappings, We Add Prioritized Trust Conditions
[Diagram: the same three peers and mappings, now annotated with prioritized trust conditions – e.g., "Priority 5" under a condition on one mapping, "Priority 3 always" and "Priority 1 always" on the others]
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
Multi-Viewpoint Tables (MVTs): Specialized Conditional Tables + Provenance
GUSv1:Study
  A  B  provenance   viewpoint       trans@time
  a  b  Peer1:tup1   {Peer1, Peer2}  X1@τ1
  c  b  Peer1:tup5   {Peer1}         X1@τ1
  c  d  Peer3:tup8   {Peer3, Peer4}  X2@τ2
Each peer’s instance is the subset of tuples in which the peer’s name appears in the viewpoint set
A reconciling peer's trust conditions assign priorities based on the data, its provenance, and its viewpoint set. Each Datalog-style rule body is paired with a priority:

  Peer2:Study(A,B) :- { (GUSv1:Study(A,B; prv, _, _) & contains(prv, Peer1:*); 5),
                        (GUSv1:Study(A,B; _, vpt, _) & contains(vpt, Peer3); 2) }
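As a concrete reading of such rules, here is a minimal Java sketch in which trust conditions are predicates over MVT tuples paired with priorities, and a tuple's priority is the highest one that matches. All type and method names are ours, not ORCHESTRA's API:

```java
import java.util.*;
import java.util.function.Predicate;

public class TrustDemo {
    // One row of an MVT: data values plus provenance, viewpoint set, txn@time.
    record MvtTuple(String a, String b, String provenance,
                    Set<String> viewpoint, String txnAtTime) {}

    // A trust condition pairs a test over the tuple with a priority.
    record TrustCondition(Predicate<MvtTuple> test, int priority) {}

    // Highest priority among matching conditions; 0 means "untrusted".
    static int priorityOf(MvtTuple t, List<TrustCondition> policy) {
        return policy.stream()
                     .filter(c -> c.test().test(t))
                     .mapToInt(TrustCondition::priority)
                     .max().orElse(0);
    }

    public static void main(String[] args) {
        // Mirrors the slide's rule for Peer2: priority 5 if the provenance
        // comes from Peer1, priority 2 if Peer3 is in the viewpoint set.
        List<TrustCondition> peer2 = List.of(
            new TrustCondition(t -> t.provenance().startsWith("Peer1:"), 5),
            new TrustCondition(t -> t.viewpoint().contains("Peer3"), 2));

        MvtTuple t = new MvtTuple("c", "d", "Peer3:tup8",
                                  Set.of("Peer3", "Peer4"), "X2@τ2");
        System.out.println(priorityOf(t, peer2));   // prints 2
    }
}
```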
Summary of MVTs
- Allow us to have one representation for disagreeing data instances – necessary for expressing constraints among different data sources
- Really, we focus on updates rather than data: relations of deltas (tuple edits), as opposed to the tuples themselves
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
CDSS Reconciliation [Taylor+Ives SIGMOD06]
- Operations are between one participant and "the system": publishing and reconciliation
- A participant applies a consistent subset of updates and may get its own unique instance
[Diagram: a participant publishes new updates from its local instance to the ORCHESTRA system's update log; reconciliation requests return the published updates]
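A small, hypothetical sketch of this publish/reconcile flow; the in-memory list stands in for ORCHESTRA's distributed update store, and none of these type or method names come from the actual system:

```java
import java.util.*;

public class CdssClientSketch {
    record Update(boolean insert, String relation, List<Object> values) {}
    record Transaction(String id, List<String> antecedents, List<Update> updates) {}

    // Stand-in for the shared update store (Pastry + BerkeleyDB in ORCHESTRA).
    static final List<Transaction> store = new ArrayList<>();

    static void publish(List<Transaction> newLocalTxns) {
        store.addAll(newLocalTxns);          // make updates durably available
    }

    static List<Transaction> reconcileRequest(int lastSeen) {
        // Return everything published since this peer last reconciled; the
        // peer then applies a consistent, trusted subset of it locally.
        return store.subList(lastSeen, store.size());
    }

    public static void main(String[] args) {
        publish(List.of(new Transaction("X1", List.of(),
            List.of(new Update(true, "Study", List.of("a", "b"))))));
        System.out.println(reconcileRequest(0));
    }
}
```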
Challenges of Reconciliation
- Updates occur in atomic transactions
- Transactions have causal dependencies (antecedents)
- Peers may participate intermittently (requires us to make maximal progress at each step)
Ground Rules of Reconciliation
Clearly, we must not:
- Apply a transaction without having the data it depends on (i.e., we need its antecedents)
- Apply a transaction chain that causes constraint violations
- Apply two transaction chains that affect the same tuple in incompatible ways

Also, we believe we should:
- Exhibit consistent, predictable behavior to the user – monotonic treatment of updates: transaction acceptances are final
- Always prefer higher-priority transactions
- Make progress despite conflicts with no clear winner – allow user conflict resolutions to be deferred
[Example tables: snapshots of a reconciled instance with key D deferred – one holding {(A,4), (B,4), (C,5)}, the other {(A,4), (B,4)}]
Reconciliation in ORCHESTRA
Accept highest priority transactions (and any necessary antecedents)
[Diagram: two successive reconciliations over transactions of high, medium, and low priority – e.g., +(A,4)+(B,4), +(D,8), +(D,9), +(A,3), +(C,6), +(A,2), +(B,3)+(C,5) – under the key constraint R(X,Y): X → Y; each transaction is marked Accept, Reject, or Defer]
Possible problem: transient conflicts. We flatten chains of antecedent transactions.
Transaction Chains
[Diagram: example chains of antecedent transactions from multiple peers, combining insertions such as +(C,6), +(C,5), +(A,2), and +(A,1)+(B,4)+(F,4)+(E,5) with modifications such as (D,6) to (D,7) and (B,3) to (B,4)]
Flattening and Antecedents
[Diagram: the same example after flattening – each chain of antecedent transactions, such as +(C,5) followed by +(D,6), is collapsed into a single flattened transaction, e.g., +(A,1)+(B,3)+(F,4), before the Accept/Reject/Defer decision is made]
Reconciliation Algorithm: Greedy, Hence Efficient

Input: flattened, trusted, applicable transaction chains. Output: set A of accepted transactions.

For each priority p from p_max down to 1:
- Let C be the set of chains at priority p
- If some t in C conflicts with a non-subsumed u in A: REJECT t
- If some t in C uses a deferred value, or conflicts with a non-subsumed, non-rejected u in C: DEFER t
- Otherwise: ACCEPT t (add it to A)
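The following Java sketch implements this greedy loop under simplifying assumptions of ours: two chains conflict iff they touch a common key, and subsumption is ignored. It is illustrative, not ORCHESTRA's actual code:

```java
import java.util.*;

public class GreedyReconcile {
    record Chain(String id, int priority, Set<String> keysTouched) {
        boolean conflictsWith(Chain other) {
            // Simplification: two chains conflict if they touch a common key.
            return !Collections.disjoint(keysTouched, other.keysTouched);
        }
    }

    static Set<Chain> reconcile(List<Chain> chains, Set<String> deferredKeys) {
        Set<Chain> accepted = new LinkedHashSet<>();
        // Group chains by priority, highest priority first.
        TreeMap<Integer, List<Chain>> byPrio = new TreeMap<>(Comparator.reverseOrder());
        for (Chain c : chains)
            byPrio.computeIfAbsent(c.priority(), k -> new ArrayList<>()).add(c);

        for (List<Chain> level : byPrio.values()) {
            for (Chain t : level) {
                if (accepted.stream().anyMatch(t::conflictsWith)) continue;  // REJECT
                boolean defer =
                    t.keysTouched().stream().anyMatch(deferredKeys::contains)  // deferred value
                    || level.stream().anyMatch(u -> u != t && t.conflictsWith(u));
                if (defer) { deferredKeys.addAll(t.keysTouched()); continue; } // DEFER
                accepted.add(t);                                               // ACCEPT
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Chain> chains = List.of(
            new Chain("T1", 3, Set.of("A", "B")),
            new Chain("T2", 2, Set.of("D")),   // conflicts with T3 at same priority
            new Chain("T3", 2, Set.of("D")),
            new Chain("T4", 1, Set.of("A")));  // conflicts with accepted T1
        // Accepts T1; defers T2 and T3; rejects T4.
        System.out.println(reconcile(chains, new HashSet<>()));
    }
}
```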
ORCHESTRA Reconciliation Module
- Java reconciliation algorithm at each participant; poly-time in the size of the update load + antecedent chain length
- Distributed update store built upon Pastry + BerkeleyDB: stores updates persistently and computes antecedent chains
[Diagram: participants, each running an RDBMS and the reconciliation algorithm, publish new updates to and send reconciliation requests against the distributed update store]
Experimental Highlight: Performance Is Adequate for Periodic Reconciliation
- Simulated (Zipfian-skewed) update distribution over a subset of SWISS-PROT at each peer (insert/replace workload); 10 peers each publish 500 single-update transactions
- Infrequent reconciliation is more efficient; fetch times (i.e., network latency) dominate
[Chart: total reconciliation time per participant (sec), split into algorithm time and store time, vs. updates between reconciliations (4, 20, 50), for both the centralized and the distributed implementation]
Skewed Updates, Infrequent Changes Don’t Result in Huge Divergence
- Effect of the reconciliation interval on synchronicity (synchronicity = avg. no. of values per key); ten peers each publish 500 transactions of one update
- Infrequent reconciliation changes synchronicity only slowly
[Chart: synchronicity vs. updates between reconciliations]
Summary of Reconciliation
- The distributed implementation is practical: we don't really need "real-time" updates, and operation is reasonable (we are currently running 100s of virtual peers)
- Many opportunities for query processing research (caching, replication)
- Other experiments (in the SIGMOD06 paper): How much disagreement arises? Transactions with > 2 updates have negligible impact; adding more peers has a sublinear effect. Performance with more peers: execution time increases linearly.

Next, we need all of the data in one target schema…
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
Reconciling with Many Schemas
Reconciliation needs transactions over the target schema (a sketch of this pipeline follows the diagram below):
1. Break txns into constituent updates (deltas), tagged with txn IDs
2. Translate the deltas using schema mappings
3. Reassemble transactions by grouping deltas with the same txn ID
4. Reconcile!
[Diagram: as before – participant PC's CDSS layer (ORCHESTRA) receives updates from PE and PP and produces a consistent, trusted subset of them, in PC's schema, for the local RDBMS]
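A minimal Java sketch of the four-step pipeline, with a toy "mapping" that merely renames the relation (real translation applies the update translation mappings described later); all names are illustrative:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DeltaPipeline {
    record Delta(String txnId, boolean insert, String relation, List<Object> values) {}

    public static void main(String[] args) {
        // 1. Transactions broken into deltas tagged with their txn ID.
        List<Delta> deltas = List.of(
            new Delta("X1", true, "GUSv1.Study", List.of("a", "b")),
            new Delta("X1", true, "GUSv1.Study", List.of("c", "b")),
            new Delta("X2", false, "GUSv1.Study", List.of("c", "d")));

        // 2. Translate each delta through the schema mapping (toy rename here).
        Function<Delta, Delta> mapping =
            d -> new Delta(d.txnId(), d.insert(), "GUSv3.Study", d.values());
        List<Delta> translated = deltas.stream().map(mapping).toList();

        // 3. Reassemble transactions by grouping translated deltas by txn ID.
        Map<String, List<Delta>> txns = translated.stream()
            .collect(Collectors.groupingBy(Delta::txnId,
                                           LinkedHashMap::new, Collectors.toList()));

        // 4. Hand the regrouped transactions to reconciliation.
        txns.forEach((id, ds) -> System.out.println(id + " -> " + ds));
    }
}
```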
Given a Set of Mappings, What Data Should be in Each Peer’s Instance?
[Diagram: peers PC (CryptoDB), PP, and PE with relations RC, RP, and RE, connected by the mappings E<->C and CE->P]
PDMS Semantics [H+03]: each peer provides all certain answers
Schema Mappings from Data Exchange: A Basic Foundation
Data exchange (Clio group at IBM, esp. Popa and Fagin): schema mappings are tuple-generating dependencies (TGDs), e.g.:

  R(x,y), S(y,z) → ∃w T(x,w,z), U(z,w,y)

The chase [PT99] over the sources and TGDs computes target instances. The resulting instance is a canonical universal solution [FKMP03], and queries over it give all certain answers.

Our setting adds some important twists…
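Before those twists, a toy Java illustration of a single chase step for the TGD above: when R(x,y) joins S(y,z), the target facts are added with a fresh labeled null standing in for the existential w. This handles only this one mapping; Clio-style systems chase general sets of TGDs:

```java
import java.util.*;

public class ChaseStep {
    static int nullCounter = 0;
    static String freshNull() { return "_w" + (nullCounter++); }

    public static void main(String[] args) {
        // Source instances: R = {(1,2)}, S = {(2,3)}.
        List<int[]> r = List.of(new int[]{1, 2});
        List<int[]> s = List.of(new int[]{2, 3});
        List<Object[]> t = new ArrayList<>(), u = new ArrayList<>();

        // One chase step: whenever R(x,y) and S(y,z) join on y, the TGD
        // fires and adds T(x,w,z) and U(z,w,y) with a fresh labeled null w.
        for (int[] ra : r)
            for (int[] sa : s)
                if (ra[1] == sa[0]) {
                    String w = freshNull();                 // existential witness
                    t.add(new Object[]{ra[0], w, sa[1]});   // T(x,w,z)
                    u.add(new Object[]{sa[1], w, ra[1]});   // U(z,w,y)
                }
        t.forEach(a -> System.out.println("T" + Arrays.toString(a)));
        u.forEach(a -> System.out.println("U" + Arrays.toString(a)));
    }
}
```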
Semantics of Consistency: Input, Edit, and Output Relations
[Diagram: each peer PP, PE, PC splits its relation into an input relation (RPi, REi, RCi), an edit table of local updates, and an output relation (RPo, REo, RCo); the mappings CE->P, E->C, and E<-C connect the peers' relations]
Incremental Reconciliation in a CDSS [Green, Karvounarakis, Tannen, Ives submission]
- Re-compute each peer's instance individually, in accordance with the input-edit-output model
- Don't re-compute from scratch: translate all "new" updates into the target schema, maintaining transaction and sequencing info; then perform reconciliation as described previously

This problem requires new twists on view maintenance.
Mapping Updates: Starting Point
Given schema mappings such as:

  R(x,y), S(y,z) → ∃w T(x,z), U(z,w,y)

convert them into update translation mappings over "deltas" on the relations (similar to the rules of the [GM95] count algorithm):

  -R(x,y),  S(y,z) → ∃w -T(x,z), -U(z,w,y)
   R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
  -R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
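The pattern behind these rules: every nonempty subset of the body atoms is replaced by its delta version, yielding 2^n − 1 deletion rules for an n-atom body. A small generator, purely our own illustration:

```java
import java.util.*;

public class DeltaRules {
    public static void main(String[] args) {
        List<String> body = List.of("R(x,y)", "S(y,z)");
        String head = "-T(x,z), -U(z,w,y)";
        int n = body.size();
        // Each nonempty bitmask marks which body atoms become deltas.
        for (int mask = 1; mask < (1 << n); mask++) {
            List<String> atoms = new ArrayList<>();
            for (int i = 0; i < n; i++)
                atoms.add(((mask >> i) & 1) == 1 ? "-" + body.get(i) : body.get(i));
            System.out.println(String.join(", ", atoms) + " → ∃w " + head);
        }
    }
}
```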
A Wrinkle: Incremental Deletion
Suppose our mapping is R(x,y) → S(x), and we are given:

  R(x,y): (1,2), (1,3), (2,4)
  S(x):   1, 2

Then:
We want a deletion rule like -R(x,y) → -S(x), but this doesn't quite work:
- If we delete R(1,2), then S should be unaffected
- If we map -R(1,2) to -S(1), we can't delete S(1) yet: only if we also delete R(1,3) should we delete S(1)
- The source of the problem is that S(1) has several distinct derivations! (Similar to bag semantics)
A First Try… Counting [GM95]
Gupta and Mumick's counting algorithm: when computing S, add a count of the number of derivations:

  R(x,y): (1,2), (1,3), (2,4)
  S(x) with count c: (1, c=2), (2, c=1)

When we use -R(x,y) → -S(x), each deletion decrements the count, and we only remove the tuple when the count reaches 0.
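A runnable sketch of this counting scheme for the mapping R(x,y) → S(x): S keeps a derivation count per value, deletions decrement it, and the tuple disappears only at zero. The class and method names are ours; only the counting idea is [GM95]'s:

```java
import java.util.*;

public class CountingMaintenance {
    // Derivation count per S-value; absence means the tuple is not in S.
    static final Map<Integer, Integer> sCounts = new HashMap<>();

    static void insertR(int x, int y) { sCounts.merge(x, 1, Integer::sum); }

    static void deleteR(int x, int y) {
        // Decrement, removing the S-tuple only when the count hits zero.
        sCounts.computeIfPresent(x, (k, c) -> c == 1 ? null : c - 1);
    }

    public static void main(String[] args) {
        insertR(1, 2); insertR(1, 3); insertR(2, 4);
        System.out.println(sCounts);   // {1=2, 2=1}
        deleteR(1, 2);
        System.out.println(sCounts);   // {1=1, 2=1}: S(1) survives
        deleteR(1, 3);
        System.out.println(sCounts);   // {2=1}: S(1) now removed
    }
}
```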
Where this Fails…
Suppose we have a cyclic definition (two peers want to exchange data):

  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

How many times is each tuple derived? We need a finite fixpoint, or else this isn't implementable! And what happens if R deletes a tuple? If S does?

  Initially:        R: (1,2), (2,4)                 S: (2,1), (4,2)
  After M1 and M2:  R: (1,2), (2,4), (2,1), (4,2)   S: (2,1), (4,2), (1,2), (2,4)
  … and each further round of M1 and M2 derives every tuple yet again.
Desiderata for a Solution
- Record a trace of each distinct derivation of a tuple, w.r.t. its original relation and every mapping – different from, e.g., Cui & Widom's provenance traces, which only maintain source info
- In cyclic cases, only count "identical loops" a finite number of times (say, once); this gives us a least fixpoint in terms of tuples and their derivations
- It also requires a non-obvious solution, since we can't use sets, trees, etc. to define provenance
- An idea: think of the derivation as being analogous to a recurrence relation…
Our Approach: S-Tables
Trace tuple provenance as a semiring polynomial (S,+,*,0,1), to which we add mapping applications M(…), with the laws:

  x + 0 = x      x + x = x      x * 0 = 0      x + y = y + x
  (x+y)+z = x+(y+z)      (x*y)*z = x*(y*z)      x(y+z) = xy + xz
  M(x + y) = M(x) + M(y)

A tuple whose provenance simplifies to 0 is considered to not be part of the instance.
  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

Initially R holds (1,2) with p0 = t0 and (2,4) with p1 = t1. Applying the mappings to a fixpoint:

  R: (1,2)  p0 = t0 + M2(p6)      S: (2,1)  p4 = M1(p0)
     (2,4)  p1 = t1 + M2(p7)         (4,2)  p5 = M1(p1)
     (2,1)  p2 = M2(p4)              (1,2)  p6 = M1(p2)
     (4,2)  p3 = M2(p5)              (2,4)  p7 = M1(p3)
Incremental Insertion with S-tables
Inserting a tuple t:
- If there's already an identical tuple t', update the provenance of t' to prov(t) + prov(t'), then simplify – note the result may be no change!
- Else insert t with its provenance
Deletion
  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

  R: (1,2)  p0 = t0 + M2(p6)      S: (2,1)  p4 = M1(p0)
     (2,4)  p1 = t1 + M2(p7)         (4,2)  p5 = M1(p1)
     (2,1)  p2 = M2(p4)              (1,2)  p6 = M1(p2)
     (4,2)  p3 = M2(p5)              (2,4)  p7 = M1(p3)

Given -R(1,2) and -S(2,4), with the delta rules M1: -R(x,y) → -S(y,x) and M2: -S(x,y) → -R(x,y):

1. Set p0 and p7 := 0
2. Simplify (may be nontrivial under mutual recursion)

  R: (2,4)  p1 = t1               S: (4,2)  p5 = M1(p1)
     (4,2)  p3 = M2(p5)
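A minimal Java sketch of this deletion-with-simplification over the example above, treating "does this provenance simplify to 0?" as a least fixpoint: a provenance is nonzero only if some addend is a surviving base tuple or a mapping applied to a nonzero provenance. The representation (a sum is a list of term ids; the mapping name is dropped, since it doesn't affect zero-ness) is our own simplification, not ORCHESTRA's:

```java
import java.util.*;

public class STableSketch {
    // Each provenance id maps to its sum of terms; a term is either a base
    // tuple id ("t0") or a reference to another provenance id ("p0").
    static Map<String, List<String>> prov = new HashMap<>();
    static Set<String> baseAlive = new HashSet<>();

    // Least fixpoint: start everything at 0, then repeatedly mark a
    // provenance nonzero if some term is alive. Monotone, so it terminates.
    static Set<String> nonZero() {
        Set<String> alive = new HashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (var e : prov.entrySet()) {
                if (alive.contains(e.getKey())) continue;
                for (String term : e.getValue())
                    if (baseAlive.contains(term) || alive.contains(term)) {
                        alive.add(e.getKey()); changed = true; break;
                    }
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        // The example above: p0 = t0 + M2(p6), p1 = t1 + M2(p7), etc.
        prov.put("p0", List.of("t0", "p6")); prov.put("p1", List.of("t1", "p7"));
        prov.put("p2", List.of("p4"));       prov.put("p3", List.of("p5"));
        prov.put("p4", List.of("p0"));       prov.put("p5", List.of("p1"));
        prov.put("p6", List.of("p2"));       prov.put("p7", List.of("p3"));
        baseAlive.addAll(Set.of("t0", "t1"));
        System.out.println(nonZero());   // all of p0..p7 are nonzero

        // Apply -R(1,2) and -S(2,4): set p0 and p7 to 0, then simplify.
        prov.remove("p0"); prov.remove("p7");
        System.out.println(nonZero());   // p1, p3, p5: R(2,4), R(4,2), S(4,2)
    }
}
```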
Summary: S-Tables and Provenance
- More expressive than "why & where provenance" [Buneman+ 01], lineage tracing [Cui & Widom 01], and other formalisms; similar in spirit to mapping "routes" [Chiticariu+ 06] and irrelevant-rule elimination [Levy+ 92]
- If the set of mappings has a least fixpoint in datalog, it has one in our semantics
- Our polynomial captures all possible derivation paths "through the mappings" – a form of "how provenance" (Tannen)
- Gives us a means of performing incremental maintenance in a fully P2P model, even with cycles (that have least fixpoints)
Ongoing Work
- Implementing the provenance-based maintenance algorithm: the procedure can be cast as a set of datalog rules, but needs "slightly more" than SQL or stratified datalog semantics
- Inverse mappings: we propagate updates "down" a mapping – what about upwards? Necessary to support mirroring; provenance makes it quite different from the existing view update literature
- Performance! Lots of opportunities for caching antecedents, reusing computations across reconciliations, answering queries using views, multi-query optimization!
SHARQ [with Davidson, Tannen, Stoeckert, White]
ORCHESTRA is the core engine of a larger effort in bioinformatics information management: SHARQ (Sharing Heterogeneous, Autonomous Resources and Queries). The goal is a network of database instances, views, query forms, etc. that:
- Is incrementally extensible with new data, views, and query templates
- Supports search for "the right" query form to answer a question
- Accommodates a variety of different sub-communities
- Supports both browsing and searching modes of operation
- … and perhaps even supports text extraction and approximate matches
Related Work
- Incomplete information [Imielinski & Lipski 84], info source tracking [Sadri 98]
- Inconsistency repair [Bry 97], [Arenas+ 99]
- Provenance [Alagar+ 95], [Cui & Widom 01], [Buneman+ 01], [Widom+ 05]
- Distributed concurrency control: optimistic CC [KR 81], version vectors [PPR+ 83], …
- View update [Dayal & Bernstein 82], [Keller 84, 85], …
- Incremental maintenance [Gupta & Mumick 95], [Blakeley 86, 89], …
- File synchronization and distributed filesystems: Harmony [Foster+ 04], Unison [Pierce+ 01]; CVS, Subversion, etc.; Ivy [MMGC 02], Coda [Braam 98, KS 95], Bayou [TTP+ 96], …
- Uncertain data: Trio [Widom+], MystiQ [Suciu+]
- Peer data management systems: Piazza [Halevy+ 03, 04], Hyperion [Kementsietsidis+ 04], [Calvanese+ 04], peer data exchange [Fuxman+ 05], Trento/Toronto LRM [Bernstein+ 02]
Conclusions
ORCHESTRA focuses on coordinating disagreement, rather than enforcing agreement:
1. Accommodate disagreement with an extended data model and trust policies
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema

Ongoing work: implementing update mappings, caching, replication, biological applications.