Mariposa and Data Integration I
Zachary G. Ives
University of Pennsylvania
CIS 650 – Database & Information Systems
October 6, 2008

Last Time: R*
- A generalization of System R to work on LANs
- Focus: cost modeling and different distributed join schemes
- But what about wide-area distributed DBs? Mariposa
- Later in the term: PIER and other Internet-scale processors

Mariposa: Wide-Area Distributed DBs
- Goal is to get rid of the single-administrative-domain assumption:
  - Dynamic data allocation
  - Multiple administrative structures
  - Heterogeneity of nodes
- And specifically to have:
  - Scalability to many cooperating sites
  - Data mobility
  - Lack of global sync
  - Local autonomy
  - Configurable policies

Core Idea of Mariposa
- Open markets: capitalism works quite efficiently in matching buyers and sellers
  - Different buyers have different needs and demands
  - Different sellers have different resources and costs
- Use this model as the basis of resource allocation:
  - Services have brokers
  - Participants (e.g., compute, data, storage providers) are sellers
  - Clients place bids
  - Data and updates can be fragmented, subscribed to, etc.

Mariposa Architecture
- Each node also has custom policy rules in a language called Rush
(Two slides of architecture diagrams; the figures are not included in the transcript)

Mariposa Services
- Storage: buyers may want to store their data
- Data: the same data may be available in many places, with different freshness levels
- Naming: data needs names and metadata
- Query execution: where does an optimized plan get executed?
- Brokers match service providers with buyers via bidding
- Most of the functionality is governed by local Rush rules

Storage
- Data can be:
  - Replicated in many places (with different guarantees)
  - Fragmented across multiple systems (vertical or horizontal partitions)
- Fragments can be split or coalesced as needed (never implemented?)
- Fragments are bought and sold to maximize value
- Might want to subscribe to updates to a data fragment that we own

Naming and Finding Data
- Internal name = address (where an object is now)
- Full name = object ID
- Common name = user-specific alias
- Name context = a namespace
- Lookup: go to the local cache, then to a name server
  - The name server is itself a service and requires bidding
  - It polls various local catalogs
  - It may have different QoS guarantees

Query Processing in Mariposa
- Distributed query optimization is REALLY hard:
  - Need to try all combinations of executing different parts of the query on different machines
  - Regular optimization is already O(3^n) or so
  - So nobody really does full distributed query optimization
- Mariposa heuristic:
  - Optimize as if the query were executing locally
  - Fragment the plan and break it into strides (parallelizable); a stride produces a materialized intermediate result
  - Conduct bids on fragments

Optimization: Bidding on Fragments
- For each computable fragment (each fragment in a stride), use one of:
  - Expensive bid protocol:
    - Send out a bid request
    - Get back triples (Cost, Delay, Expiration of bid)
    - Notify bidders of the winner, which must meet the cost and delay constraints
    - LOTS of messages
  - Purchase order protocol:
    - Send the fragment to the most probable winner (not clear how we know this)
    - The site returns the answer plus a bill (no negotiation allowed)
- Heuristics choose winners when there are many strides and bidders (e.g., consider each stride separately and use a greedy algorithm to balance cost vs. delay); a sketch follows below
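The winner-selection heuristic is only sketched in the lecture, so the following is a minimal illustration, assuming each stride is decided independently and that feasible bids are ranked by a weighted sum of cost and delay. The Bid record, the alpha weight, and the constraint parameters are all hypothetical, not Mariposa's actual interface.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Bid:
    site: str          # bidding processing site
    cost: float        # price quoted for the fragment
    delay: float       # promised completion time
    expiration: float  # time after which the bid is void

def pick_winner(bids: list[Bid], now: float, max_cost: float,
                max_delay: float, alpha: float = 0.5) -> Bid | None:
    """Greedily pick one winning bid for a single stride's fragment.

    A bid is feasible if it has not expired and meets the query's cost
    and delay constraints; among feasible bids, minimize a weighted
    combination of cost and delay (alpha is a made-up tuning knob).
    """
    feasible = [b for b in bids
                if b.expiration > now
                and b.cost <= max_cost and b.delay <= max_delay]
    if not feasible:
        return None  # re-bid, or fall back to the purchase order protocol
    return min(feasible, key=lambda b: alpha * b.cost + (1 - alpha) * b.delay)

# Consider each stride separately, per the slide's greedy heuristic.
strides = {
    "stride1": [Bid("siteA", 10.0, 2.0, 100.0), Bid("siteB", 7.0, 5.0, 100.0)],
    "stride2": [Bid("siteC", 3.0, 1.0, 100.0)],
}
winners = {s: pick_winner(bids, now=0.0, max_cost=20.0, max_delay=6.0)
           for s, bids in strides.items()}
print(winners)
```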
Advertising and Pricing Services
- Here it's not clear what really got implemented
- Service providers advertise in yellow pages
  - They may publish rates
  - They may need to provide coupons if overloaded
- Pricing is generally based on CPU and I/O resources
  - It can be adjusted by preference for certain data
  - It can be adjusted by average load

Mariposa Wrap-up
- Contributions:
  - Interesting ideas about applying economic models
  - One of the earliest systems to address the wide area
- But ultimately unsuccessful:
  - The system was never really deployed
  - Work ended by ~1997
- Today work continues in this space, but normally with an additional assumption of multiple schemas

A Problem
- We've seen that even with normalization and the same needs, different people will arrive at different schemas
- In fact, most people also have different needs!
- Often people build databases in isolation, then want to share their data:
  - Different systems within an enterprise
  - Different information brokers on the Web
  - Scientific collaborators
  - Sellers of products
  - Researchers who want to publish their data for others to use
- This is the goal of data integration: tie together different sources, controlled by many people, under a common schema

Integrating Data
- Several steps:
  - Getting data out from a data source, which may have its own query/retrieval interface
    - Sometimes it will need a query before it returns answers, e.g., a Web form
  - Getting all of the data into the same data model and format
  - Translating the data into the same schema (schema matching + mappings)
  - Finding mentions of the same item (entity resolution)
  - Answering queries (reformulation)
  - Cleaning!

Data Integration in Many Guises
- There are many different forms of integration, with different assumptions:
  - Enterprise Information Integration (EII): generally clean(er); mostly DB and spreadsheet sources
  - Web information integration: lots of HTML + XML, autonomy, entity resolution, ...
  - Best-effort information integration: give quick-and-dirty results easily, worry about perfection later
  - Collaborative data sharing: propagate conflicting data and updates among different databases
- Each of these has different levels of cleaning, entity resolution, and conflicting data

Example
- We want to build the UltimateMovieGuide(TM). Given:
  - TV Guide schema: ShowingAt(time, title, year); Shows(title, year, genre, rating); Director(title, year, name); Starring(title, year, name)
  - GoodMovies.com: FourStarMovie(title, year, genre); DecentMovie(title, year)
  - OscarWinners: WinningDirector(director, title, year)
  - Documentaries.org: Documentary(title, year, director, producer)

Data Warehouses: Offline Replication
- Get experts together and define a schema they think best captures all the info
- Define a database with this schema
- Define procedural mappings in an ETL tool to import the data; perhaps perform data cleaning
- Periodically copy all of the data from the data sources
- Note that the sources and the warehouse are basically independent at this point
(Figure: queries and results flow through the data warehouse, which is loaded from remote, autonomous data sources)

Pros and Cons of Data Warehouses
- Need to spend time designing the physical database layout, as well as the logical one; this actually takes a lot of effort!
- Data is generally not up-to-date (lazy or offline refresh)
- Queries over the warehouse don't disrupt the data sources
- Can run very heavy-duty computations, including data mining and cleaning
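To make the warehouse approach concrete, here is a toy ETL sketch, assuming a hypothetical warehouse with the Shows table from the example and GoodMovies.com's FourStarMovie relation as the source. The in-memory SQLite databases and the "****" rating convention are illustrative assumptions, not anything from the lecture.

```python
import sqlite3

# Hypothetical warehouse with the TV Guide-style schema from the example.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE Shows (title TEXT, year INT, genre TEXT, rating TEXT)")

# Hypothetical source: GoodMovies.com's FourStarMovie(title, year, genre).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE FourStarMovie (title TEXT, year INT, genre TEXT)")
source.execute("INSERT INTO FourStarMovie VALUES ('Monster''s Ball', 2001, 'drama')")

def etl_four_star_movies():
    """Extract from the source, transform to the warehouse schema, and load.

    The 'transform' step is the procedural mapping: a four-star movie
    becomes a Shows row with a (made-up) '****' rating.
    """
    rows = source.execute("SELECT title, year, genre FROM FourStarMovie")
    for title, year, genre in rows:
        warehouse.execute("INSERT INTO Shows VALUES (?, ?, ?, ?)",
                          (title, year, genre, "****"))
    warehouse.commit()

etl_four_star_movies()  # run periodically; the copy goes stale between runs
print(warehouse.execute("SELECT * FROM Shows").fetchall())
```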
An Alternative: Mediators (Virtual Integration Systems)
- Get experts together and define a schema they think best captures all the info
- Define it as a virtual mediated schema
- Create declarative mappings specifying how to get data from each source into the mediated schema
- Evaluate queries over the mediated schema on the fly, using the current data at the sources
(Figure: a data integration system with a mediated schema, schema mappings, and a source catalog, answering queries over remote, autonomous data sources)

Schema Mappings
- We generally use queries as the basis of mappings
- Goals:
  - Compose a query with a set of mappings
  - Intersect the constraints in the query and the mappings, returning only data matching the constraints
  - (Possibly) compose chains of mappings
  - (Possibly) invert mappings
- The basic formalism: mappings are conjunctive queries,
    Q(X) :- R1(X1), R2(X2), ..., c1(X1), ...
  and both queries and the overall set of mappings are unions of conjunctive queries
- Why: tractability!

The Job of Mappings
- Between different data sources:
  - They may have different numbers of tables (different decompositions)
  - Attributes may be broken down differently (rating vs. EbertThumb and RoeperThumb)
  - Metadata in one relation may be data in another
  - Values may not exactly correspond (shows vs. movies)
  - It may be unclear whether a value is the same ("COPPOLA" vs. "Francis Ford Coppola")
  - They may have different, but synonymous, terms (e.g., an ImdbID of 123456 vs. an SSN)
  - They might have sub/superclass relationships

General Techniques
- Value-value correspondences are accomplished using concordance tables: join through a table mapping values to values, e.g., Imdb_Actor(ID, SAG_actor_name)
- Table-to-multi-table correspondences are accomplished using joins (in one direction) and projections (in the other direction)
  - Key question: what happens if a needed attribute is missing? (e.g., DecentMovie has no genre)
- Super/subclass relationships generally must be captured using selection (in one direction) and union (in the other direction)
- And sometimes we just can't specify the correspondence!
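As a tiny illustration of joining through a concordance table, the sketch below rewrites ID-based cast facts into name-based ones. The tuples and the starring_by_id relation are made up; Imdb_Actor(ID, SAG_actor_name) is the concordance relation named on the slide.

```python
# Hypothetical instances; Imdb_Actor maps one identifier space to the other.
starring_by_id = [("tt0285742", "nm0000932")]   # (movie, IMDB actor ID), made up
imdb_actor = {"nm0000932": "Berry, Halle"}      # the concordance table

# "Join through the concordance table": translate ID-based facts into
# name-based facts so they line up with a source that uses names.
starring_by_name = [(movie, imdb_actor[actor_id])
                    for movie, actor_id in starring_by_id
                    if actor_id in imdb_actor]
print(starring_by_name)  # [('tt0285742', 'Berry, Halle')]
```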
Some Examples of Mappings
- Schemas: Show(ID, Title, Year, Lang, Genre); Movie(ID, Title, Year, Genre, Director, Star1, Star2); EnglishMovie(Title, Year, Genre, Rating); Docu(ID, Title, Year); Participant(ID, Name, Role)
- Example mappings:
    PieceOfArt(I, T, Y, English, G) :- EnglishMovie(T, Y, G, _), MovieIDFor(I, T, Y)
    Movie(I, T, Y, doc, D, S1, S2) :- Docu(I, T, Y), Participant(I, D, Dir), Participant(I, S1, Cast1), Participant(I, S2, Cast2)
- Two tables, T1 and T2, identifying the same actress differently:
    T1(ImdbID, CastOf): (1234, Catwoman)
    T2(Name, CastOf): (Berry, H., Monster's Ball)
  We need a concordance table from ImdbIDs to actress names

Query Answering with Mappings: Reformulation
- Inputs: a query Q, a set of mappings M, and a set of sources S
- Mappings can be written as conjunctive queries or, equivalently, as logical implications:
    M1(X) :- R1(X1, Y1), R2(X2, Y2), ..., c1(X1, Y1), ...
    ∀X: M2(X) ← ∃Y1, Y2: R1(X1, Y1) ∧ R2(X2, Y2) ∧ c1(X1, Y1)
- Goal: a set of rewritings Q', expressed as a union of conjunctive queries over S, which typically returns the set of all certain answers: those answers implied by the base data and the constraints expressed in the mappings

Kinds of Schema Mappings
- Global-As-View (GAV): each mediated-schema relation M is defined as a view over the source relations Ri:
    ∀X: M(X) ← ∃Y1, Y2: R1(X1, Y1) ∧ R2(X2, Y2) ∧ c1(X1, Y1)
  A user query over the mediated schema, Q(X) :- M_R(X1, Y1), M_S(X2, Y2), ..., is answered by unfolding each mediated relation into its definition
- Local-As-View (LAV): each source relation M_R is described as a view over the mediated-schema relations Ri:
    ∀X: M_R(X) → ∃Y1, Y2: R1(X1, Y1) ∧ R2(X2, Y2) ∧ c1(X1, Y1)
  Answering the same query now requires answering queries using views
- Global-Local-As-View (GLAV), a.k.a. tuple-generating dependencies (TGDs), generalizes both: a conjunctive query over one schema implies a conjunctive query over the other:
    ∀X: M_R(X1, Z1) ∧ M_S(X2, Z1) → ∃Y1, Y2: R1(W1, Y1) ∧ R2(W2, Y2) ∧ c1(X1, Y1)
  where X1 ∪ X2 = W1 ∪ W2

Query Reformulation in Global-As-View
- The most traditional scheme, implemented in most commercial systems
- The mediated schema is a view over the source data
- Example real-world systems: IBM DB2 / WebSphere Information Integrator; Oracle Fusion
- Reuses the query-unfolding capabilities of a DBMS: a query over a view over base data becomes a query over base data

Query Unfolding: Basic Procedure
    V1(x,y,z) :- R1(x,y,w), R2(w,u), R3(u,z)
    V2(x,y) :- R1(x,u), R2(u,y), R3(y,w), R4(w,z)
    Q(u) :- V1(u,v,w), V2(x,y)
- Substitute the body of V1 into Q, renaming variables appropriately; repeat for V2. Unfolding both views (with fresh names for the views' existential variables) yields:
    Q(u) :- R1(u,v,w1), R2(w1,u1), R3(u1,w), R1(x,u2), R2(u2,y), R3(y,w2), R4(w2,z2)

Challenges
- If there are multiple rules for a view, unfolding may generate an exponential number of queries
- Each query might be non-minimal
- This leads to reasoning about query containment and equivalence: if containment holds both ways between Q1 and Q2, then they are equivalent

Example of Containment
- Suppose we have two queries:
    q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, Systems)
    q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
- Intuitively, q1 must return the same or fewer answers than q2:
  - It has all of the same conditions, plus extra conjuncts (i.e., it's more restricted)
  - There's no union or any other way it can add more data
- We say that q2 contains q1 because this holds for any instance of our DB {Student, Takes, Course}

Checking Containment via Canonical DBs
- To test whether q1 ⊆ q2:
  - Create a canonical DB that contains a tuple for each subgoal in q1
  - Execute q2 over it
  - If q2 returns a tuple that matches the head of q1, then q1 ⊆ q2
- (This is an NP-complete algorithm in the size of the query. Testing containment for full first-order-logic queries is undecidable!)
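Here is a minimal sketch of this canonical-database test, assuming a conjunctive query is represented as a head tuple plus a list of subgoals. The brute-force backtracking matcher and the frozen-variable encoding are my additions; as a simplification, constants in query bodies are treated like frozen variables, which happens to suffice for this example.

```python
# A conjunctive query: head variables plus subgoals (relation, args).
q1 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X")), ("inCSE", ("C",)),
                   ("Course", ("C", "Systems"))])
q2 = (("S", "C"), [("Student", ("S", "N")), ("Takes", ("S", "C")),
                   ("Course", ("C", "X"))])

def contained_in(qa, qb):
    """Test qa ⊆ qb via the canonical database of qa.

    Freeze qa's variables into constants to build the canonical DB, then
    search for a homomorphism from qb's subgoals into it that maps qb's
    head onto qa's (frozen) head.
    """
    head_a, body_a = qa
    head_b, body_b = qb
    db = {}
    for rel, args in body_a:           # one frozen tuple per subgoal of qa
        db.setdefault(rel, []).append(args)

    def extend(bindings, subgoals):
        if not subgoals:               # all subgoals matched: check the head
            return tuple(bindings.get(v, v) for v in head_b) == head_a
        (rel, args), rest = subgoals[0], subgoals[1:]
        for fact in db.get(rel, []):
            new = dict(bindings)
            if all(new.setdefault(a, c) == c for a, c in zip(args, fact)):
                if extend(new, rest):
                    return True
        return False

    return extend({}, body_b)

print(contained_in(q1, q2))  # True: q1 is more restrictive
print(contained_in(q2, q1))  # False: q2 can return more answers
```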
Let's see this on the previous example.

Example Canonical DB
    q1(S,C) :- Student(S, N), Takes(S, C), Course(C, X), inCSE(C), Course(C, Systems)
    q2(S,C) :- Student(S, N), Takes(S, C), Course(C, X)
- The canonical DB for q1 (one tuple per subgoal, variables frozen to constants):
    Student: (S, N)
    Takes: (S, C)
    Course: (C, X), (C, Systems)
    inCSE: (C)
- We need to get the tuple (S, C) when executing q2 over this database, and we do, so q1 ⊆ q2

Global-As-View
- Very easy to implement: it doesn't require any new logic on the part of a regular DBMS engine
  - For instance, Starburst QGM rewrites would work
- But some drawbacks, primarily that:
  - We don't have a mechanism to describe when a source contains only a subset of the data in the mediated schema (e.g., "all books from this source are of type textbook")
  - The mediated schema often needs to change as we add sources; it is somewhat brittle because it's defined in terms of the sources

Next Time
- We'll look at alternatives to the Global-as-View model
- Please read the MiniCon paper (Pottinger & Halevy)