Data Integration Techniques
Zachary G. Ives
University of Pennsylvania
CIS 550 – Database & Information Systems
October 30, 2003
Some slide content may be courtesy of Susan Davidson, Dan Suciu, & Raghu Ramakrishnan
We Left Off with TSIMMIS
“The Stanford-IBM Manager of Multiple Information Sources” … or, a Yiddish stew
An instance of a “global-as-view” mediation system
One of the first systems to support semi-structured data, which predated XML by several years
This system, like the Information Manifold, focused on querying web sources
  Real-world integration companies (IBM, BEA, Actuate, …) are focusing on the enterprise – more $$$!
Queries in TSIMMIS
Specified in an OQL-style language called Lorel
  OQL was an object-oriented query language
  Lorel is a predecessor to XQuery; OEM is a predecessor to XML
Based on path expressions over OEM structures:
  select book
  where book.title = “DB2 UDB” and book.author = “Chamberlin”
This is basically like XQuery, which we’ll use in place of Lorel and the MSL template language. Restating the query above:
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b
Query Answering in TSIMMIS
Basically, it’s view unfolding, i.e., composing a query with a view
  The query is the one being asked
  The views are the MSL templates for the wrappers
Some of the views may actually require parameters, e.g., an author name, before they’ll return answers
  These are called input bindings
  Common for web forms (see Amazon, Google, …)
  XQuery functions (XQuery’s version of views) support parameters as well, so we’ll use these to illustrate
A Wrapper Definition in MSL, Translated to XQuery
Wrappers have templates and binding patterns ($X) in MSL:
B :- B: <book {<author $X>}> // $$ = “select * from book where author=“ $X //
This reformats a SQL query over Book(author, year, title)
In XQuery, this might look like:
  define function GetBook($x AS xsd:string) as book* {
    for $b in sql(“select * from book where author=‘” + $x + “’”)
    return <book>$b<author>$x</author></book>
  }
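A wrapper with an input binding can be sketched in Python. This is only an illustration, not TSIMMIS code: the names `make_source` and `get_book` are hypothetical, the in-memory SQLite table stands in for the real Book(author, year, title) source, and the three rows are the instance listed on the “What Is the Answer?” slide.

```python
import sqlite3

def make_source():
    """Build a throwaway in-memory Book(author, year, title) table (assumed data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE book (author TEXT, year TEXT, title TEXT)")
    conn.executemany(
        "INSERT INTO book VALUES (?, ?, ?)",
        [
            ("Chamberlin", "1992", "DB2 UDB"),
            ("Chamberlin", "1995", "DB2/CS"),
            ("Bernstein", "1997", "Transaction Processing"),
        ],
    )
    return conn

def get_book(conn, author):
    """Wrapper with one input binding: the caller must supply the author."""
    rows = conn.execute(
        "SELECT author, year, title FROM book WHERE author = ?", (author,)
    )
    # Reformat each SQL row as a semi-structured "book" object.
    return [{"author": a, "year": y, "title": t} for (a, y, t) in rows]

books = get_book(make_source(), "Chamberlin")
```

The sketch passes the author as a bound SQL parameter rather than concatenating it into the query string as the slide does; the effect is the same, but it avoids quoting problems.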
How to Answer the Query
Given our query:
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b
We want to find all wrapper definitions that:
  Either output enough information that we can evaluate all of our conditions over the output
    They return a book’s title and author, so we can test against these
  Or have already “enforced” the conditions for us!
    They already do a selection on author=“Chamberlin,” etc.
Query Composition with Views
We find all views that define book with author and title, and we compose the query with each of these
In our example, we find one wrapper definition that matches:
  define function GetBook($x AS xsd:string) as book* {
    for $b in sql(“select * from book where author=‘” + $x + “’”)
    return <book>$b<author>$x</author></book>
  }
  for $b in document(“mediated-schema”)/book
  where $b/title/text() = “DB2 UDB” and $b/author/text() = “Chamberlin”
  return $b
Matching View Output to Our Query’s Conditions
Determine that the query tests for $x=“Chamberlin” by matching the query’s XPath, $b/author/text(), on the function’s output:
  define function GetBook($x AS xsd:string) as book* {
    for $b in sql(“select * from book where author=‘” + $x + “’”)
    return <book>$b<author>$x</author></book>
  }
  let $x := “Chamberlin”
  for $b in GetBook($x)/book
  where $b/title/text() = “DB2 UDB”
  return $b
The Final Step: Unfolding
The expression:
  let $x := “Chamberlin”
  for $b in { for $b in sql(“select * from book where author=‘” + $x + “’”)
              return <book>$b<author>$x</author></book> }/book
  where $b/title/text() = “DB2 UDB”
  return $b
Can be unnested (“unfolded”) and simplified to:
  for $b in sql(“select * from book where author=‘Chamberlin’”)
  where $b/title/text() = “DB2 UDB”
  return $b
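The equivalence between the composed and the unfolded form can be checked on a toy source. The sketch below is an assumption-laden illustration: `BOOK`, `sql_books_by_author`, `get_book`, `composed`, and `unfolded` are hypothetical names, and the rows are the instance listed on the “What Is the Answer?” slide.

```python
# Hypothetical source table Book(author, year, title), as plain tuples.
BOOK = [
    ("Chamberlin", "1992", "DB2 UDB"),
    ("Chamberlin", "1995", "DB2/CS"),
    ("Bernstein", "1997", "Transaction Processing"),
]

def sql_books_by_author(author):
    """Stand-in for sql("select * from book where author='...'")."""
    return [row for row in BOOK if row[0] == author]

def get_book(x):
    """The view (wrapper): requires the author as an input binding."""
    return [
        {"author": x, "year": y, "title": t}
        for (_, y, t) in sql_books_by_author(x)
    ]

def composed(title, author):
    """Query composed with the view: call the view, then filter its output."""
    return [b for b in get_book(author) if b["title"] == title]

def unfolded(title, author):
    """View body inlined into the query: one pass over the source rows."""
    return [
        {"author": a, "year": y, "title": t}
        for (a, y, t) in sql_books_by_author(author)
        if t == title
    ]

answer = unfolded("DB2 UDB", "Chamberlin")
```

Both functions return the same single book; the unfolded version simply avoids materializing the view's intermediate output.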
What Is the Answer?
Given schema book(author, year, title) and Datalog rules defining an instance:
  book(“Chamberlin”, “1992”, “DB2 UDB”)
  book(“Chamberlin”, “1995”, “DB2/CS”)
  book(“Bernstein”, “1997”, “Transaction Processing”)
TSIMMIS is an instance of a global-as-view mediator with a semistructured data model
Can also have GAV mediators using Datalog or SQL, which work on similar principles
Queries and mappings are unfolded (macro-expanded + simplified)
Limitations of Global-As-View
Some data sources may contain data that falls within certain ranges or has certain known properties
  “Books by Aho”, “Students at UPenn”, …
  How do we express these? (Important so we reduce the number of sources we query!)
Mediated schema is basically the union of the various MSL templates – as they change, so may the mediated schema!
  Not good for scalability or flexibility
Observations of Levy et al. in the Information Manifold Paper
When you integrate something, you have a conceptual model of the integrated domain
  Define that as a basic frame of reference – not the data that’s in the sources
May have overlapping/incomplete sources
  Define each source as the subset of a query over the mediated schema
  We can use selection or join predicates to specify that a source contains a range of values:
    ComputerBooks(…) ⊆ Books(Title, …, Subj), Subj = “Computers”
The Information Manifold
Defines the mediated schema independently of the sources!
  “Local-as-view” instead of “global-as-view”
  Assumes that we can only see a small subset of all the possible facts – “open-world assumption”
  Allows us to specify information about data sources
  Focuses on relations (with OO extensions), Datalog
Guarantees soundness of answers, completeness of “certain answers” – those tuples that must exist
  Maximal set of tuples in query answer that are logically implied by data at the sources, plus all mappings’ constraints
The Local-as-View Model
Properties:
  “Local” sources are views over the mediated schema
  Sources have the data – mediated schema is virtual
  Sources may not have all the data from the domain – “open-world assumption”
The system must use the sources (views) to answer queries over the mediated schema
  “Answering queries using views” …
Answering Queries Using Views
Our assumption for today: conjunctive queries, set semantics
Suppose we have a mediated schema:
  author(aID, isbn, year), book(isbn, title, publisher)
A conjunctive query might be:
  q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
Recall intuitions about this class of queries:
  Adding a conjunct to a query removes answers from the result but never adds any
  Any conjunctive query with at least the same constraints & conjuncts will give valid answers
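The monotonicity intuition can be illustrated with set comprehensions under set semantics. This is a minimal sketch: the relation contents, the ISBN placeholders `i1`..`i3`, and the publisher names `pub1`/`pub2` are made up for illustration and do not come from the slides.

```python
# Hypothetical instance of the mediated schema:
# author(aID, isbn, year) and book(isbn, title, publisher).
AUTHOR = {
    ("Chamberlin", "i1", "1992"),
    ("Chamberlin", "i2", "1995"),
    ("Bernstein", "i3", "1997"),
}
BOOK = {
    ("i1", "DB2 UDB", "pub1"),
    ("i2", "DB2/CS", "pub1"),
    ("i3", "Transaction Processing", "pub2"),
}

def q_base():
    """q0(a, t, p) :- author(a, i, _), book(i, t, p)"""
    return {
        (a, t, p)
        for (a, i, _year) in AUTHOR
        for (i2, t, p) in BOOK
        if i == i2  # join on the shared variable i
    }

def q_with_selection():
    """q(a, t, p) :- author(a, i, _), book(i, t, p), t = "DB2 UDB" """
    return {(a, t, p) for (a, t, p) in q_base() if t == "DB2 UDB"}

base = q_base()
restricted = q_with_selection()
```

Here `restricted` is a subset of `base`: the extra conjunct `t = "DB2 UDB"` filters tuples out and cannot introduce new ones.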
Query Answering
Suppose we have the same query:
  q(a, t, p) :- author(a, i, _), book(i, t, p), t = “DB2 UDB”
and sources:
  s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
  s2(a,t) ⊆ author(a, i, _), book(i, t, p), t = “DB2 UDB”
  s3(a,t,p) ⊆ author(a, i, _), book(i, t, p), t = “123”
  s4(a,i) ⊆ author(a, i, _), a = “Smith”
  s5(a,i) ⊆ author(a, i, _)
  s6(i,p) ⊆ book(i, t, p)
We want to compose the query with the source mappings – but they’re in the wrong direction!
Inverse Rules
We can take every mapping and “invert” it, though sometimes we may have insufficient information:
  If s5(a,i) ⊆ author(a, i, _)
  then we can also infer that: author(a, i, ???) ⊇ s5(a,i)
But how to handle the absence of the 3rd attribute?
  We know that there must be AT LEAST one instance of ??? in author for each (a,i) pair
  So we might simply insert a NULL and define that NULL means “unknown” (as opposed to “missing”)…
But NULLs Lose Information
Suppose we take these rules and ask for:
  q(a,t) :- author(a, i, _), book(i, t, p)
If we look at the rule:
  s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
Clearly q(a,t) :- s1(a,t)
But if we apply our inversion procedure, we get:
  author(a, NULL, NULL) ⊇ s1(a,t)
  book(NULL, t, p) ⊇ s1(a,t), t = “123”
and there’s no way to figure out how to join author and book on NULL!
  We need “a special NULL for each a-t combo” so we can figure out which a’s and t’s go together
The Solution: “Skolem Functions”
Skolem functions:
  “Perfect” hash functions
  Each function returns a unique, deterministic value for each combination of input values
  Every function returns a non-overlapping set of values (Skolem function F will never return a value that matches any of Skolem function G’s values)
Skolem functions won’t ever be part of the answer set or the computation
  They’re just a way of logically generating “special NULLs”
Revisiting Our Example
Query:
  q(a,t) :- author(a, i, _), book(i, t, p)
Mapping rule:
  s1(a,t) ⊆ author(a, i, _), book(i, t, p), t = “123”
Inverse rules:
  author(a, f(a,t), NULL) ⊇ s1(a,t)
  book(f(a,t), t, p) ⊇ s1(a,t), t = “123”
We can now expand the query:
  q(a,t) :- author(a, i, NULL), book(i, t, p), i = f(a,t)
  q(a,t) :- s1(a,t), s1(a,t), t = “123”, i = f(a,t)
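The difference between plain NULLs and Skolem terms can be sketched as follows. This is a rough illustration under stated assumptions: the contents of `S1` are invented, a Skolem term f(a,t) is modelled as the tuple `("f", a, t)`, `None` plays the role of NULL and never joins, and the publisher (which s1 does not export) is simply left as `None`.

```python
# Hypothetical contents of source s1(a, t); every tuple has t = "123".
S1 = {("Smith", "123"), ("Jones", "123")}

def invert_with_nulls(s1):
    """Inverse rules that fill the missing isbn with a plain NULL."""
    author = {(a, None, None) for (a, t) in s1}
    book = {(None, t, None) for (a, t) in s1}
    return author, book

def invert_with_skolems(s1):
    """Inverse rules that fill the missing isbn with the Skolem term f(a, t)."""
    author = {(a, ("f", a, t), None) for (a, t) in s1}
    book = {(("f", a, t), t, None) for (a, t) in s1}
    return author, book

def q(author, book):
    """q(a, t) :- author(a, i, _), book(i, t, p); NULL never equals anything."""
    return {
        (a, t)
        for (a, i, _year) in author
        for (i2, t, _pub) in book
        if i is not None and i == i2
    }

with_nulls = q(*invert_with_nulls(S1))
with_skolems = q(*invert_with_skolems(S1))
```

With NULLs the join produces nothing, while the Skolem terms pair each author with exactly the title it came with, recovering the contents of s1. The Skolem terms themselves never appear in the answer because q projects only a and t.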
Query Answering Using Inverse Rules
Invert all rules using the procedures described
Take the query and the possible rule expansions and execute them in a Datalog interpreter
  In the previous query, we expand with all combinations of expansions of book and of author – every possible way of combining and cross-correlating info from different sources
Then we throw away all unsatisfiable rewritings (some expansions will be logically inconsistent)
Levy et al. Alternative Approach: The Bucket Algorithm
Given a query Q with relations and predicates
  Create a bucket for each subgoal in Q
  Iterate over each view (source mapping)
    If source includes bucket’s subgoal:
      Create mapping between Q’s vars and the view’s var at the same position
      If satisfiable with substitutions, add to bucket
  Do cross-product of buckets, see if result is contained in the query (recall we saw an algorithm to do that)
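The bucket-building step and the cross-product can be sketched in a few lines. This is a deliberately simplified illustration, not the full algorithm: it ignores selection predicates such as t = “123”, it only checks that a subgoal's distinguished variables are exported by the view, and it omits the final containment check. The tuple encoding and the names `build_buckets` and `candidate_rewritings` are assumptions.

```python
from itertools import product

# A query or view is (name, head_vars, body); body is a list of (relation, args).
QUERY = ("q", ("a", "t", "p"),
         [("author", ("a", "i", "y")), ("book", ("i", "t", "p"))])
VIEWS = [
    ("s1", ("a", "t"), [("author", ("a", "i", "y")), ("book", ("i", "t", "p"))]),
    ("s3", ("a", "t", "p"), [("author", ("a", "i", "y")), ("book", ("i", "t", "p"))]),
    ("s5", ("a", "i"), [("author", ("a", "i", "y"))]),
    ("s6", ("i", "p"), [("book", ("i", "t", "p"))]),
]

def build_buckets(query, views):
    """For each query subgoal, collect the views that can supply it."""
    _, q_head, q_body = query
    buckets = []
    for rel, q_args in q_body:
        bucket = []
        for v_name, v_head, v_body in views:
            for v_rel, v_args in v_body:
                if v_rel != rel:
                    continue
                # Map each query variable to the view variable at the same position.
                mapping = dict(zip(q_args, v_args))
                # Distinguished query variables must be exported by the view's head.
                if all(mapping[x] in v_head for x in q_args if x in q_head):
                    bucket.append(v_name)
                    break
        buckets.append((rel, bucket))
    return buckets

def candidate_rewritings(buckets):
    """Cross-product of the buckets: one view per query subgoal."""
    return list(product(*(names for _, names in buckets)))

buckets = build_buckets(QUERY, VIEWS)
candidates = candidate_rewritings(buckets)
```

Each candidate would still have to be tested for containment in the query; for example, the pair (s5, s3) cannot be joined on the isbn because s3 does not export it, which is exactly what the containment check is meant to catch.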
Source Capabilities
The simplest form is to annotate the attributes of a relation:
  Book^bff(auth, title, pub), where b marks an attribute that must be bound and f one that may be free
But many data integration efforts had more sophisticated models
  Can a data source support joins between its relations?
  Can a data source be sent a relation that it should join with?
In the end, we need to perform parts of the query in the mediator, and other parts at the sources
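A binding-pattern check of this simplest kind might look like the sketch below. The function name `can_call` and the string encoding of the adornment are assumptions made for illustration.

```python
def can_call(adornment, attributes, bound):
    """Return True if every attribute marked 'b' in the adornment has a known value.

    adornment  -- string of 'b' (must be bound) and 'f' (may be free), one per attribute
    attributes -- attribute names, in the same order as the adornment
    bound      -- set of attribute names for which the mediator already has values
    """
    if len(adornment) != len(attributes):
        raise ValueError("adornment and attribute list must have the same length")
    return all(
        attr in bound
        for flag, attr in zip(adornment, attributes)
        if flag == "b"
    )

BOOK_ATTRS = ("auth", "title", "pub")
ok = can_call("bff", BOOK_ATTRS, {"auth"})
not_ok = can_call("bff", BOOK_ATTRS, {"title"})
```

A mediator could use such a check to decide whether a source can be queried directly or must wait until another part of the plan produces the required author values.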
Contributions of the Info Manifold
More robust way of defining mediated schemas and sources
  Mediated schema is clearly defined, less likely to change
  Sources can be more accurately described
Relatively efficient algorithms for query reformulation, creating executable plans
Still requires standardization on a single schema
  Can be hard to get consensus
Some other aspects were captured in related papers
  Overlap between sources; coverage of data at sources
  Semi-automated creation of mappings
  Semi-automated construction of wrappers
Later Integration Systems Focused on Better Performance
Tukwila/Piazza [Ives+99,Halevy+02] – Washington
  Descendants of the Information Manifold
  Similar capabilities, but with adaptive processing of XML as it is read across streams
Niagara [DeWitt+99] – Wisconsin
  XML querying of web sources
  Giving answers a screenful at a time
TelegraphCQ [Chandrasekaran+03] – Berkeley
  Adaptive, select-project-join queries over infinite streams