Rapid, Collaborative Sharing of Dynamic Data
Zachary G. Ives, University of Pennsylvania
with Nicholas Taylor, T. J. Green, Grigoris Karvounarakis, Val Tannen
North Carolina State University, October 6, 2006
Funded by NSF IIS-0477972, IIS-0513778
An Elusive Goal: Building a Web of Structured Data
A longtime goal of the computer science field: creating a "smarter" Web, e.g., Tim Berners-Lee's "Semantic Web" and 15 years of Web data integration.

Envisioned capabilities:
- Link and correlate data from different sources to answer questions that need semantics
- Provide a convenient means of exchanging data with business partners, collaborators, etc.
Why Is this So Hard?
- Semantics is a fuzzy concept: different terminology, units, or ways of representing things (e.g., in real estate, "full + half baths" vs. "bathrooms")
- Equivalences are difficult to determine and specify (e.g., conference paper vs. publication – how do they relate precisely?)
- Linking isn't simply a matter of hyperlinking and counting on a human; instead we need to develop and specify mappings (converters, synonyms)
- Real data is messy, uncertain, inconsistent: typos, uncertain data, non-canonical names, data that doesn't fit into a standard form/schema

But (we believe): the data sharing architecture is the big bottleneck.
Data Sharing, DB-Style:One Instance to Rule Them All?
- Data warehouse/exchange: one schema, one consistent instance
- Data integration / peer data management systems: map heterogeneous data into one or a few virtual schemas; remove any data that's inconsistent [Arenas+]
[Diagram: a data integration system with a mediated schema and source catalog, connected by schema mappings to autonomous data sources, returning query results]
A Common Need: Partial, Peer-to-Peer Collaborative Data Exchange
Sometimes we need to exchange data in a less rigid fashion:
- Cell phone directories with a friend's – with different nicknames
- Citation DBs with different conference abbreviations
- Restaurant reviews and ratings
- Scientific databases, where inconsistency or uncertainty are common

"Peer to peer" in that no one DB is all-encompassing or authoritative:
- Participation is totally voluntary and must not impede local work
- Each must be able to override or supplement data from elsewhere
Target Domain: Data Exchange among Bioinformatics DBs & Biologists
- Bioinformatics groups and biologists want to share data in their databases and warehouses
- Data overlaps – some DBs are specialized, others general (but with data that is less validated)
- Each source is updated and curated locally; updates are published periodically

We are providing mechanisms to:
- Support local queries and edits to data in each DB
- Allow on-demand publishing of updates made to the local DB
- Import others' updates to each local DB despite different schemas
- Accommodate the fact that not all sites agree on the edits! (Not probabilistic – sometimes there is no consensus on the "right" answer!)
[Example peers: CryptoDB, PlasmoDB, EBI]
Challenges
- Multi-"everything": multiple schemas, multiple peers with instances, multiple possibilities for consistent overall instances
- Voluntary participation: a group may publish infrequently, drop off the network, etc. Inconsistency with "the rest of the world" must not prevent the user from doing an operation – unlike cvs or distributed DBs, where consistency with everyone else is always enforced
- Conflicts need to be captured at the right granularity: tuples aren't added independently – they are generally part of transactions, which may have causal dependencies
Collaborative Data Sharing
Philosophy: rather than enforcing a global instance, support many overlapping instances in many schemas. (Conflicts are localized!)

Collaborative Data Sharing System (CDSS):
- Accommodate disagreement with an extended data model; track provenance and support trust policies
- "Reconcile" databases by sharing transactions; detect conflicts via constraints and incompatible updates
- Define update translation mappings to get all data into the target schema, based on schema mappings and provenance

We are implementing the ORCHESTRA CDSS.
A Peer’s Perspective of the CDSS
- The user interacts with a standard database
- The CDSS coordinates with other participants: it ensures availability of published updates and finds a consistent set of trusted updates (reconciliation)
- Updates may first need to be mapped into the target schema
[Diagram: participant PC issues queries and answers against its local RDBMS; the CDSS layer (ORCHESTRA) exchanges updates with peers PE and PP and delivers a consistent, trusted subset of their updates (E, C) in PC's schema]
A CDSS Maps among Sources that Each Publish Updates in Transactions
[Diagram: three peers – CryptoDB (PC, schema RC/GUSv3), PlasmoDB (PP, schema RP/GUSv1), EBI (PE, schema RE/MIAME) – each publishing transactions of inserts (+) and deletes (-), connected by the schema mappings E<->C and CE->P]
Along with Schema Mappings, We Add Prioritized Trust Conditions
[Diagram: the same three peers and mappings, now annotated with prioritized trust conditions – e.g., "Priority 5" under a condition on one mapping, "Priority 3 always" and "Priority 1 always" on the others]
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
Multi-Viewpoint Tables (MVTs): Specialized Conditional Tables + Provenance
GUSv1:Study
  A  B  provenance   viewpoint       trans@time
  a  b  Peer1:tup1   {Peer1, Peer2}  X1@τ1
  c  b  Peer1:tup5   {Peer1}         X1@τ1
  c  d  Peer3:tup8   {Peer3, Peer4}  X2@τ2
Each peer’s instance is the subset of tuples in which the peer’s name appears in the viewpoint set
A reconciling peer's trust conditions assign priorities based on the data, its provenance, and its viewpoint set. Each Datalog-style rule body is paired with a priority:

  Peer2:Study(A,B) :- { (GUSv1:Study(A,B; prv, _, _) & contains(prv, Peer1:*); 5),
                        (GUSv1:Study(A,B; _, vpt, _) & contains(vpt, Peer3); 2) }
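As a concrete reading of such rules, here is a minimal Java sketch in which trust conditions are predicates over MVT tuples paired with priorities, and a tuple's priority is the highest one that matches. All type and method names are ours, not ORCHESTRA's API:

```java
import java.util.*;
import java.util.function.Predicate;

public class TrustDemo {
    // One row of an MVT: data values plus provenance, viewpoint set, txn@time.
    record MvtTuple(String a, String b, String provenance,
                    Set<String> viewpoint, String txnAtTime) {}

    // A trust condition pairs a test over the tuple with a priority.
    record TrustCondition(Predicate<MvtTuple> test, int priority) {}

    // Highest priority among matching conditions; 0 means "untrusted".
    static int priorityOf(MvtTuple t, List<TrustCondition> policy) {
        return policy.stream()
                     .filter(c -> c.test().test(t))
                     .mapToInt(TrustCondition::priority)
                     .max().orElse(0);
    }

    public static void main(String[] args) {
        // Mirrors the slide's rule for Peer2: priority 5 if the provenance
        // comes from Peer1, priority 2 if Peer3 is in the viewpoint set.
        List<TrustCondition> peer2 = List.of(
            new TrustCondition(t -> t.provenance().startsWith("Peer1:"), 5),
            new TrustCondition(t -> t.viewpoint().contains("Peer3"), 2));

        MvtTuple t = new MvtTuple("c", "d", "Peer3:tup8",
                                  Set.of("Peer3", "Peer4"), "X2@τ2");
        System.out.println(priorityOf(t, peer2));   // prints 2
    }
}
```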
Summary of MVTs
- Allow us to have one representation for disagreeing data instances – necessary for expressing constraints among different data sources
- Really, we focus on updates rather than data: relations of deltas (tuple edits), as opposed to the tuples themselves
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
CDSS Reconciliation [Taylor+Ives SIGMOD06]
- Operations are between one participant and "the system": publishing and reconciliation
- A participant applies a consistent subset of updates and may get its own unique instance
[Diagram: a participant publishes new updates from its local instance to the ORCHESTRA system's update log; reconciliation requests return the published updates]
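A small, hypothetical sketch of this publish/reconcile flow; the in-memory list stands in for ORCHESTRA's distributed update store, and none of these type or method names come from the actual system:

```java
import java.util.*;

public class CdssClientSketch {
    record Update(boolean insert, String relation, List<Object> values) {}
    record Transaction(String id, List<String> antecedents, List<Update> updates) {}

    // Stand-in for the shared update store (Pastry + BerkeleyDB in ORCHESTRA).
    static final List<Transaction> store = new ArrayList<>();

    static void publish(List<Transaction> newLocalTxns) {
        store.addAll(newLocalTxns);          // make updates durably available
    }

    static List<Transaction> reconcileRequest(int lastSeen) {
        // Return everything published since this peer last reconciled; the
        // peer then applies a consistent, trusted subset of it locally.
        return store.subList(lastSeen, store.size());
    }

    public static void main(String[] args) {
        publish(List.of(new Transaction("X1", List.of(),
            List.of(new Update(true, "Study", List.of("a", "b"))))));
        System.out.println(reconcileRequest(0));
    }
}
```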
Challenges of Reconciliation
- Updates occur in atomic transactions
- Transactions have causal dependencies (antecedents)
- Peers may participate intermittently (requires us to make maximal progress at each step)
Ground Rules of Reconciliation
Clearly, we must not:
- Apply a transaction without having the data it depends on (i.e., we need its antecedents)
- Apply a transaction chain that causes constraint violations
- Apply two transaction chains that affect the same tuple in incompatible ways

Also, we believe we should:
- Exhibit consistent, predictable behavior to the user – monotonic treatment of updates: transaction acceptances are final
- Always prefer higher-priority transactions
- Make progress despite conflicts with no clear winner – allow user conflict resolutions to be deferred
[Example tables: snapshots of a reconciled instance with key D deferred – one holding {(A,4), (B,4), (C,5)}, the other {(A,4), (B,4)}]
Reconciliation in ORCHESTRA
Accept highest priority transactions (and any necessary antecedents)
[Diagram: two successive reconciliations over transactions of high, medium, and low priority – e.g., +(A,4)+(B,4), +(D,8), +(D,9), +(A,3), +(C,6), +(A,2), +(B,3)+(C,5) – under the key constraint R(X,Y): X → Y; each transaction is marked Accept, Reject, or Defer]
Possible problem: transient conflicts. We flatten chains of antecedent transactions.
Transaction Chains
[Diagram: example chains of antecedent transactions from multiple peers, combining insertions such as +(C,6), +(C,5), +(A,2), and +(A,1)+(B,4)+(F,4)+(E,5) with modifications such as (D,6) to (D,7) and (B,3) to (B,4)]
Flattening and Antecedents
[Diagram: the same example after flattening – each chain of antecedent transactions, such as +(C,5) followed by +(D,6), is collapsed into a single flattened transaction, e.g., +(A,1)+(B,3)+(F,4), before the Accept/Reject/Defer decision is made]
Reconciliation Algorithm: Greedy, Hence Efficient

Input: flattened, trusted, applicable transaction chains. Output: set A of accepted transactions.

For each priority p from p_max down to 1:
- Let C be the set of chains at priority p
- If some t in C conflicts with a non-subsumed u in A: REJECT t
- If some t in C uses a deferred value, or conflicts with a non-subsumed, non-rejected u in C: DEFER t
- Otherwise: ACCEPT t (add it to A)
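The following Java sketch implements this greedy loop under simplifying assumptions of ours: two chains conflict iff they touch a common key, and subsumption is ignored. It is illustrative, not ORCHESTRA's actual code:

```java
import java.util.*;

public class GreedyReconcile {
    record Chain(String id, int priority, Set<String> keysTouched) {
        boolean conflictsWith(Chain other) {
            // Simplification: two chains conflict if they touch a common key.
            return !Collections.disjoint(keysTouched, other.keysTouched);
        }
    }

    static Set<Chain> reconcile(List<Chain> chains, Set<String> deferredKeys) {
        Set<Chain> accepted = new LinkedHashSet<>();
        // Group chains by priority, highest priority first.
        TreeMap<Integer, List<Chain>> byPrio = new TreeMap<>(Comparator.reverseOrder());
        for (Chain c : chains)
            byPrio.computeIfAbsent(c.priority(), k -> new ArrayList<>()).add(c);

        for (List<Chain> level : byPrio.values()) {
            for (Chain t : level) {
                if (accepted.stream().anyMatch(t::conflictsWith)) continue;  // REJECT
                boolean defer =
                    t.keysTouched().stream().anyMatch(deferredKeys::contains)  // deferred value
                    || level.stream().anyMatch(u -> u != t && t.conflictsWith(u));
                if (defer) { deferredKeys.addAll(t.keysTouched()); continue; } // DEFER
                accepted.add(t);                                               // ACCEPT
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Chain> chains = List.of(
            new Chain("T1", 3, Set.of("A", "B")),
            new Chain("T2", 2, Set.of("D")),   // conflicts with T3 at same priority
            new Chain("T3", 2, Set.of("D")),
            new Chain("T4", 1, Set.of("A")));  // conflicts with accepted T1
        // Accepts T1; defers T2 and T3; rejects T4.
        System.out.println(reconcile(chains, new HashSet<>()));
    }
}
```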
ORCHESTRA Reconciliation Module
- Java reconciliation algorithm at each participant; poly-time in the size of the update load + antecedent chain length
- Distributed update store built upon Pastry + BerkeleyDB: stores updates persistently and computes antecedent chains
[Diagram: participants, each running an RDBMS and the reconciliation algorithm, publish new updates to and send reconciliation requests against the distributed update store]
Experimental Highlight: Performance Is Adequate for Periodic Reconciliation
- Simulated (Zipfian-skewed) update distribution over a subset of SWISS-PROT at each peer (insert/replace workload); 10 peers each publish 500 single-update transactions
- Infrequent reconciliation is more efficient; fetch times (i.e., network latency) dominate
[Chart: total reconciliation time per participant (sec), split into algorithm time and store time, vs. updates between reconciliations (4, 20, 50), for both the centralized and the distributed implementation]
Skewed Updates, Infrequent Changes Don’t Result in Huge Divergence
- Effect of the reconciliation interval on synchronicity (synchronicity = avg. no. of values per key); ten peers each publish 500 transactions of one update
- Infrequent reconciliation changes synchronicity only slowly
[Chart: synchronicity vs. updates between reconciliations]
Summary of Reconciliation
- The distributed implementation is practical: we don't really need "real-time" updates, and operation is reasonable (we are currently running 100s of virtual peers)
- Many opportunities for query processing research (caching, replication)
- Other experiments (in the SIGMOD06 paper): How much disagreement arises? Transactions with > 2 updates have negligible impact; adding more peers has a sublinear effect. Performance with more peers: execution time increases linearly.

Next, we need all of the data in one target schema…
The ORCHESTRA Approach to the Challenges of Collaborative Data Sharing
1. Accommodate disagreement with an extended data model
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema
Reconciling with Many Schemas
Reconciliation needs transactions over the target schema (a sketch of this pipeline follows the diagram below):
1. Break txns into constituent updates (deltas), tagged with txn IDs
2. Translate the deltas using schema mappings
3. Reassemble transactions by grouping deltas with the same txn ID
4. Reconcile!
[Diagram: as before – participant PC's CDSS layer (ORCHESTRA) receives updates from PE and PP and produces a consistent, trusted subset of them, in PC's schema, for the local RDBMS]
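A minimal Java sketch of the four-step pipeline, with a toy "mapping" that merely renames the relation (real translation applies the update translation mappings described later); all names are illustrative:

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DeltaPipeline {
    record Delta(String txnId, boolean insert, String relation, List<Object> values) {}

    public static void main(String[] args) {
        // 1. Transactions broken into deltas tagged with their txn ID.
        List<Delta> deltas = List.of(
            new Delta("X1", true, "GUSv1.Study", List.of("a", "b")),
            new Delta("X1", true, "GUSv1.Study", List.of("c", "b")),
            new Delta("X2", false, "GUSv1.Study", List.of("c", "d")));

        // 2. Translate each delta through the schema mapping (toy rename here).
        Function<Delta, Delta> mapping =
            d -> new Delta(d.txnId(), d.insert(), "GUSv3.Study", d.values());
        List<Delta> translated = deltas.stream().map(mapping).toList();

        // 3. Reassemble transactions by grouping translated deltas by txn ID.
        Map<String, List<Delta>> txns = translated.stream()
            .collect(Collectors.groupingBy(Delta::txnId,
                                           LinkedHashMap::new, Collectors.toList()));

        // 4. Hand the regrouped transactions to reconciliation.
        txns.forEach((id, ds) -> System.out.println(id + " -> " + ds));
    }
}
```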
Given a Set of Mappings, What Data Should be in Each Peer’s Instance?
[Diagram: peers PC (CryptoDB), PP, and PE with relations RC, RP, and RE, connected by the mappings E<->C and CE->P]
PDMS Semantics [H+03]: each peer provides all certain answers
Schema Mappings from Data Exchange: A Basic Foundation
Data exchange (Clio group at IBM, esp. Popa and Fagin): schema mappings are tuple-generating dependencies (TGDs), e.g.:

  R(x,y), S(y,z) → ∃w T(x,w,z), U(z,w,y)

The chase [PT99] over the sources and TGDs computes target instances. The resulting instance is a canonical universal solution [FKMP03], and queries over it give all certain answers.

Our setting adds some important twists…
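Before those twists, a toy Java illustration of a single chase step for the TGD above: when R(x,y) joins S(y,z), the target facts are added with a fresh labeled null standing in for the existential w. This handles only this one mapping; Clio-style systems chase general sets of TGDs:

```java
import java.util.*;

public class ChaseStep {
    static int nullCounter = 0;
    static String freshNull() { return "_w" + (nullCounter++); }

    public static void main(String[] args) {
        // Source instances: R = {(1,2)}, S = {(2,3)}.
        List<int[]> r = List.of(new int[]{1, 2});
        List<int[]> s = List.of(new int[]{2, 3});
        List<Object[]> t = new ArrayList<>(), u = new ArrayList<>();

        // One chase step: whenever R(x,y) and S(y,z) join on y, the TGD
        // fires and adds T(x,w,z) and U(z,w,y) with a fresh labeled null w.
        for (int[] ra : r)
            for (int[] sa : s)
                if (ra[1] == sa[0]) {
                    String w = freshNull();                 // existential witness
                    t.add(new Object[]{ra[0], w, sa[1]});   // T(x,w,z)
                    u.add(new Object[]{sa[1], w, ra[1]});   // U(z,w,y)
                }
        t.forEach(a -> System.out.println("T" + Arrays.toString(a)));
        u.forEach(a -> System.out.println("U" + Arrays.toString(a)));
    }
}
```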
Semantics of Consistency: Input, Edit, and Output Relations
[Diagram: each peer PP, PE, PC splits its relation into an input relation (RPi, REi, RCi), an edit table of local updates, and an output relation (RPo, REo, RCo); the mappings CE->P, E->C, and E<-C connect the peers' relations]
Incremental Reconciliation in a CDSS [Green, Karvounarakis, Tannen, Ives submission]
- Re-compute each peer's instance individually, in accordance with the input-edit-output model
- Don't re-compute from scratch: translate all "new" updates into the target schema, maintaining transaction and sequencing info; then perform reconciliation as described previously

This problem requires new twists on view maintenance.
Mapping Updates: Starting Point
Given schema mappings such as:

  R(x,y), S(y,z) → ∃w T(x,z), U(z,w,y)

convert them into update translation mappings over "deltas" on the relations (similar to the rules of the [GM95] count algorithm):

  -R(x,y),  S(y,z) → ∃w -T(x,z), -U(z,w,y)
   R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
  -R(x,y), -S(y,z) → ∃w -T(x,z), -U(z,w,y)
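The pattern behind these rules: every nonempty subset of the body atoms is replaced by its delta version, yielding 2^n − 1 deletion rules for an n-atom body. A small generator, purely our own illustration:

```java
import java.util.*;

public class DeltaRules {
    public static void main(String[] args) {
        List<String> body = List.of("R(x,y)", "S(y,z)");
        String head = "-T(x,z), -U(z,w,y)";
        int n = body.size();
        // Each nonempty bitmask marks which body atoms become deltas.
        for (int mask = 1; mask < (1 << n); mask++) {
            List<String> atoms = new ArrayList<>();
            for (int i = 0; i < n; i++)
                atoms.add(((mask >> i) & 1) == 1 ? "-" + body.get(i) : body.get(i));
            System.out.println(String.join(", ", atoms) + " → ∃w " + head);
        }
    }
}
```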
A Wrinkle: Incremental Deletion
Suppose our mapping is R(x,y) → S(x), and we are given:

  R(x,y): (1,2), (1,3), (2,4)
  S(x):   1, 2

Then:
We want a deletion rule like -R(x,y) → -S(x), but this doesn't quite work:
- If we delete R(1,2), then S should be unaffected
- If we map -R(1,2) to -S(1), we can't delete S(1) yet: only if we also delete R(1,3) should we delete S(1)
- The source of the problem is that S(1) has several distinct derivations! (Similar to bag semantics)
A First Try… Counting [GM95]
Gupta and Mumick's counting algorithm: when computing S, add a count of the number of derivations:

  R(x,y): (1,2), (1,3), (2,4)
  S(x) with count c: (1, c=2), (2, c=1)

When we use -R(x,y) → -S(x), each deletion decrements the count, and we only remove the tuple when the count reaches 0.
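A runnable sketch of this counting scheme for the mapping R(x,y) → S(x): S keeps a derivation count per value, deletions decrement it, and the tuple disappears only at zero. The class and method names are ours; only the counting idea is [GM95]'s:

```java
import java.util.*;

public class CountingMaintenance {
    // Derivation count per S-value; absence means the tuple is not in S.
    static final Map<Integer, Integer> sCounts = new HashMap<>();

    static void insertR(int x, int y) { sCounts.merge(x, 1, Integer::sum); }

    static void deleteR(int x, int y) {
        // Decrement, removing the S-tuple only when the count hits zero.
        sCounts.computeIfPresent(x, (k, c) -> c == 1 ? null : c - 1);
    }

    public static void main(String[] args) {
        insertR(1, 2); insertR(1, 3); insertR(2, 4);
        System.out.println(sCounts);   // {1=2, 2=1}
        deleteR(1, 2);
        System.out.println(sCounts);   // {1=1, 2=1}: S(1) survives
        deleteR(1, 3);
        System.out.println(sCounts);   // {2=1}: S(1) now removed
    }
}
```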
Where this Fails…
Suppose we have a cyclic definition (two peers want to exchange data):

  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

How many times is each tuple derived? We need a finite fixpoint, or else this isn't implementable! And what happens if R deletes a tuple? If S does?

  Initially:        R: (1,2), (2,4)                 S: (2,1), (4,2)
  After M1 and M2:  R: (1,2), (2,4), (2,1), (4,2)   S: (2,1), (4,2), (1,2), (2,4)
  … and each further round of M1 and M2 derives every tuple yet again.
Desiderata for a Solution
- Record a trace of each distinct derivation of a tuple, w.r.t. its original relation and every mapping – different from, e.g., Cui & Widom's provenance traces, which only maintain source info
- In cyclic cases, only count "identical loops" a finite number of times (say, once); this gives us a least fixpoint in terms of tuples and their derivations
- It also requires a non-obvious solution, since we can't use sets, trees, etc. to define provenance
- An idea: think of the derivation as being analogous to a recurrence relation…
Our Approach: S-Tables
Trace tuple provenance as a semiring polynomial (S,+,*,0,1), to which we add mapping applications M(…), with the laws:

  x + 0 = x      x + x = x      x * 0 = 0      x + y = y + x
  (x+y)+z = x+(y+z)      (x*y)*z = x*(y*z)      x(y+z) = xy + xz
  M(x + y) = M(x) + M(y)

A tuple whose provenance simplifies to 0 is considered to not be part of the instance.
  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

Initially R holds (1,2) with p0 = t0 and (2,4) with p1 = t1. Applying the mappings to a fixpoint:

  R: (1,2)  p0 = t0 + M2(p6)      S: (2,1)  p4 = M1(p0)
     (2,4)  p1 = t1 + M2(p7)         (4,2)  p5 = M1(p1)
     (2,1)  p2 = M2(p4)              (1,2)  p6 = M1(p2)
     (4,2)  p3 = M2(p5)              (2,4)  p7 = M1(p3)
Incremental Insertion with S-tables
Inserting a tuple t:
- If there's already an identical tuple t', update the provenance of t' to prov(t) + prov(t'), then simplify – note the result may be no change!
- Else insert t with its provenance
Deletion
  M1: R(x,y) → S(y,x)    M2: S(x,y) → R(x,y)

  R: (1,2)  p0 = t0 + M2(p6)      S: (2,1)  p4 = M1(p0)
     (2,4)  p1 = t1 + M2(p7)         (4,2)  p5 = M1(p1)
     (2,1)  p2 = M2(p4)              (1,2)  p6 = M1(p2)
     (4,2)  p3 = M2(p5)              (2,4)  p7 = M1(p3)

Given -R(1,2) and -S(2,4), with the delta rules M1: -R(x,y) → -S(y,x) and M2: -S(x,y) → -R(x,y):

1. Set p0 and p7 := 0
2. Simplify (may be nontrivial under mutual recursion)

  R: (2,4)  p1 = t1               S: (4,2)  p5 = M1(p1)
     (4,2)  p3 = M2(p5)
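A minimal Java sketch of this deletion-with-simplification over the example above, treating "does this provenance simplify to 0?" as a least fixpoint: a provenance is nonzero only if some addend is a surviving base tuple or a mapping applied to a nonzero provenance. The representation (a sum is a list of term ids; the mapping name is dropped, since it doesn't affect zero-ness) is our own simplification, not ORCHESTRA's:

```java
import java.util.*;

public class STableSketch {
    // Each provenance id maps to its sum of terms; a term is either a base
    // tuple id ("t0") or a reference to another provenance id ("p0").
    static Map<String, List<String>> prov = new HashMap<>();
    static Set<String> baseAlive = new HashSet<>();

    // Least fixpoint: start everything at 0, then repeatedly mark a
    // provenance nonzero if some term is alive. Monotone, so it terminates.
    static Set<String> nonZero() {
        Set<String> alive = new HashSet<>();
        boolean changed = true;
        while (changed) {
            changed = false;
            for (var e : prov.entrySet()) {
                if (alive.contains(e.getKey())) continue;
                for (String term : e.getValue())
                    if (baseAlive.contains(term) || alive.contains(term)) {
                        alive.add(e.getKey()); changed = true; break;
                    }
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        // The example above: p0 = t0 + M2(p6), p1 = t1 + M2(p7), etc.
        prov.put("p0", List.of("t0", "p6")); prov.put("p1", List.of("t1", "p7"));
        prov.put("p2", List.of("p4"));       prov.put("p3", List.of("p5"));
        prov.put("p4", List.of("p0"));       prov.put("p5", List.of("p1"));
        prov.put("p6", List.of("p2"));       prov.put("p7", List.of("p3"));
        baseAlive.addAll(Set.of("t0", "t1"));
        System.out.println(nonZero());   // all of p0..p7 are nonzero

        // Apply -R(1,2) and -S(2,4): set p0 and p7 to 0, then simplify.
        prov.remove("p0"); prov.remove("p7");
        System.out.println(nonZero());   // p1, p3, p5: R(2,4), R(4,2), S(4,2)
    }
}
```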
Summary: S-Tables and Provenance
- More expressive than "why & where provenance" [Buneman+ 01], lineage tracing [Cui & Widom 01], and other formalisms; similar in spirit to mapping "routes" [Chiticariu+ 06] and irrelevant-rule elimination [Levy+ 92]
- If the set of mappings has a least fixpoint in datalog, it has one in our semantics
- Our polynomial captures all possible derivation paths "through the mappings" – a form of "how provenance" (Tannen)
- Gives us a means of performing incremental maintenance in a fully P2P model, even with cycles (that have least fixpoints)
Ongoing Work
- Implementing the provenance-based maintenance algorithm: the procedure can be cast as a set of datalog rules, but needs "slightly more" than SQL or stratified datalog semantics
- Inverse mappings: we propagate updates "down" a mapping – what about upwards? Necessary to support mirroring; provenance makes it quite different from the existing view update literature
- Performance! Lots of opportunities for caching antecedents, reusing computations across reconciliations, answering queries using views, multi-query optimization!
SHARQ [with Davidson, Tannen, Stoeckert, White]
ORCHESTRA is the core engine of a larger effort in bioinformatics information management: SHARQ (Sharing Heterogeneous, Autonomous Resources and Queries). The goal is a network of database instances, views, query forms, etc. that:
- Is incrementally extensible with new data, views, and query templates
- Supports search for "the right" query form to answer a question
- Accommodates a variety of different sub-communities
- Supports both browsing and searching modes of operation
- … and perhaps even supports text extraction and approximate matches
Related Work
- Incomplete information [Imielinski & Lipski 84], info source tracking [Sadri 98]
- Inconsistency repair [Bry 97], [Arenas+ 99]
- Provenance [Alagar+ 95], [Cui & Widom 01], [Buneman+ 01], [Widom+ 05]
- Distributed concurrency control: optimistic CC [KR 81], version vectors [PPR+ 83], …
- View update [Dayal & Bernstein 82], [Keller 84, 85], …
- Incremental maintenance [Gupta & Mumick 95], [Blakeley 86, 89], …
- File synchronization and distributed filesystems: Harmony [Foster+ 04], Unison [Pierce+ 01]; CVS, Subversion, etc.; Ivy [MMGC 02], Coda [Braam 98, KS 95], Bayou [TTP+ 96], …
- Uncertain data: Trio [Widom+], MystiQ [Suciu+]
- Peer data management systems: Piazza [Halevy+ 03, 04], Hyperion [Kementsietsidis+ 04], [Calvanese+ 04], peer data exchange [Fuxman+ 05], Trento/Toronto LRM [Bernstein+ 02]
Conclusions
ORCHESTRA focuses on coordinating disagreement, rather than enforcing agreement:
1. Accommodate disagreement with an extended data model and trust policies
2. Reconcile updates at the transaction level
3. Define update translation mappings to get all data into the target schema

Ongoing work: implementing update mappings, caching, replication, biological applications.