iTrails: Pay-as-you-go Information Integration in Dataspaces
Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi (ETH Zurich). VLDB 2007. Anat Heilper, Jan. 2009, CS Seminar in Databases (236826).
![Page 1: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/1.jpg)
iTrails: Pay-as-you-go Information Integration in Dataspaces
Authors: Marcos Vaz Salles, Jens Dittrich, Shant Karakashian, Olivier Girard, Lukas Blunschi
ETH Zurich
VLDB 2007
Anat Heilper
Jan. 2009
CS Seminar in Databases (236826)
1
![Page 2: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/2.jpg)
Problem: Querying heterogeneous data sources
Data sources: Laptop, Email Server, Web Server, DB Server
Query: What is the impact of the global depression in Israel?
Systems: ? ? ? ?
2
![Page 3: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/3.jpg)
Solution 1: Use a Search Engine
Data sources: Laptop, Email Server, Web Server, DB Server (each contributes text and links)
System: Graph IR search engine, e.g. TopX [VLDB05], FleXPath [SIGMOD04], XSearch [VLDB03], XRank [SIGMOD03]
Query: global depression Israel
Query semantics are not precise!
3
![Page 4: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/4.jpg)
Solution 2: Use an Information Integration System
Query and result pass through a query interface over a global schema; each of Data source 1, Data source 2, and Data source 3 has its own source schema.
[Figure: schema elements price index, countries, unemployment, and crime rate; the mapping between the global schema and the source schemas is marked "?"]
Too much effort to provide schema mappings!
4
![Page 5: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/5.jpg)
Querying heterogeneous data sources: 2 opposite approaches
• Schema-first approach (SFA)
– Semantically integrated view over the data sources
– Mappings between source schemas and mediated schema
– Queries have clearly defined semantics
– Expensive to construct and maintain
– Not all data sources have schemas
• No-schema approach (NSA)
– Keyword search
– Requires good result ranking methods
– Performs no integration
– Query semantics are not well defined
5
![Page 6: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/6.jpg)
Motivation of iTrails
Can we find an integration solution in between these two extremes?
[Figure: a dataspace system ("?") placed between a Graph IR search engine (text, links) and a data integration system (Temps, Cities, CO2, Sunspots, ...)]
Pay-as-you-go information integration: the more effort you pay, the more query power you have.
6
![Page 7: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/7.jpg)
iTrails Core Idea: Add Integration Hints Incrementally
1) Provide a search service over the data
– Use a general graph data model (iDM)
– Handles unstructured documents, XML, and relations
2) Add integration semantics via hints (trails)
3) If more semantics are needed, apply trails
– Smooth transition between search and data integration
– Semantics added incrementally to improve precision / recall
7
![Page 8: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/8.jpg)
Example of an iDM
X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .
[Figure: folder graph; labelled nodes: 1 = home, 2 = mike, 5 = SIGMOD42.pdf]
home
  mike
    papers
      PIM
        SIGMOD42.pdf
        SIGMOD44.pdf
      QP
        VLDB12.pdf
        VLDB10.pdf
    projects
      PIM
        SIGMOD42.pdf
8
![Page 9: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/9.jpg)
General graph data model - iDM
iDM (iMeMex Data Model) represents every structural component of the input data as a node.
Supports unstructured, semi-structured, and structured data, e.g., files and folders, XML, relations
9
![Page 10: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/10.jpg)
iMeMex – integrated memex
Vannevar Bush introduced the concept of the "memex" in 1945: a "device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility."
Bush predicted: "Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified."
10
![Page 11: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/11.jpg)
Data model
Data is represented by a directed graph G = (RV, E)
RV: set of nodes {V1, . . . , Vn}, termed resource views
E: ordered pairs (Vi, Vj) of resource views
Vi ⇝ Vj: Vj is reachable from Vi by traversing the edges in E
11
![Page 12: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/12.jpg)
Resource view
A resource view Vi has three components: name, tuple, and content
Component     Type
Vi.name       string
Vi.tuple      sequence of attribute-value pairs ((att0, val0), (att1, val1), …)
Vi.content    text
Example: { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
12
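As a rough illustration of the data model and resource views described above, the following Python sketch stores resource views in a small directed graph. The class names `ResourceView` and `Dataspace`, the field name `tup` (for the tuple component), and the use of names as node identifiers are assumptions made for this example; they are not part of iDM or iMeMex.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceView:
    """One node V_i with its three components: name, tuple, content."""
    name: str
    tup: dict = field(default_factory=dict)  # attribute -> value pairs
    content: str = ""


class Dataspace:
    """Directed graph G = (RV, E) over resource views (illustrative only)."""

    def __init__(self):
        self.rv = {}        # node id -> ResourceView
        self.edges = set()  # ordered pairs (source id, target id)

    def add(self, node_id, view, parent=None):
        """Register a resource view and, optionally, an edge from its parent."""
        self.rv[node_id] = view
        if parent is not None:
            self.edges.add((parent, node_id))

    def children(self, node_id):
        """Nodes W with a direct edge (node_id, W)."""
        return {w for (v, w) in self.edges if v == node_id}

    def reachable(self, node_id):
        """All nodes reachable from node_id by following one or more edges."""
        seen, stack = set(), [node_id]
        while stack:
            for child in self.children(stack.pop()):
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return seen


g = Dataspace()
g.add("home", ResourceView("home", {"owner": "root", "lastmodified": "05.01.2000"}))
g.add("mike", ResourceView("mike", {"owner": "root", "lastmodified": "04.17.2008"}),
      parent="home")
g.add("papers", ResourceView("papers"), parent="mike")
g.add("PIM", ResourceView("PIM"), parent="papers")
g.add("SIGMOD42.pdf",
      ResourceView("SIGMOD42.pdf",
                   {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
                   "@PDF . . ."),
      parent="PIM")

print(sorted(g.reachable("mike")))  # everything below 'mike'
```

Using names as node identifiers only works here because the names in this tiny example are unique; a real graph would need separate identifiers, since the same name (e.g. PIM) can occur more than once.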
![Page 13: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/13.jpg)
Query model
Query expression:
– Query Q selects nodes R := Q(G) ⊆ G.RV
– Example: //mike/papers
Component projection:
– C ∈ {.name, .tuple.<atti>, .content}: projection of the set of resource views selected by query Q onto component C, i.e. the set of components R' := {Vi.C | Vi ∈ Q(G)}
13
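The component projection defined above can be sketched in a few lines of Python. This is only an illustration: resource views are written as plain dicts, the function name `project` is made up, the second view's attribute values are hypothetical, and skipping views that lack the requested tuple attribute is an assumption rather than something the slides state.

```python
def project(views, component):
    """Return R' = {V.C | V in views} for one component C.

    `component` is '.name', '.content', or '.tuple.<attribute>'.
    Views lacking the requested tuple attribute contribute nothing (assumption).
    """
    if component == ".name":
        return {v["name"] for v in views}
    if component == ".content":
        return {v["content"] for v in views}
    if component.startswith(".tuple."):
        att = component[len(".tuple."):]
        return {v["tuple"][att] for v in views if att in v["tuple"]}
    raise ValueError(f"unknown component: {component}")


# Resource views assumed to be selected by some query Q
selected = [
    {"name": "SIGMOD42.pdf",
     "tuple": {"size": "10k", "owner": "mike", "lastmodified": "04.01.2007"},
     "content": "@PDF . . ."},
    # Hypothetical second view without a lastmodified attribute
    {"name": "SIGMOD44.pdf", "tuple": {"owner": "mike"}, "content": ""},
]

print(project(selected, ".tuple.lastmodified"))
print(project(selected, ".tuple.owner"))
```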
![Page 14: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/14.jpg)
Component projection example
Example: //mike//PIM/*.tuple.lastmodified
X1 = { .name = 'home', .tuple = {.owner = 'root', .lastmodified = '05.01.2000'}, .content = '' }
X2 = { .name = 'mike', .tuple = {.owner = 'root', .lastmodified = '04.17.2008'}, .content = '' }
. . .
X5 = { .name = 'SIGMOD42.pdf', .tuple = {size = 10k, .owner = 'mike', .lastmodified = '04.01.2007'}, .content = '@PDF . . .' }
. . .
[Figure: folder graph; labelled nodes: 1 = home, 2 = mike, 5 = SIGMOD42.pdf]
home
  mike
    papers
      PIM
        SIGMOD42.pdf
        SIGMOD44.pdf
      QP
        VLDB12.pdf
        VLDB10.pdf
    projects
      PIM
        SIGMOD42.pdf
14
![Page 15: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/15.jpg)
Syntax of query expression
QUERY_EXPRESSION := (PATH | KT_PREDICATE) (union QUERY_EXPRESSION)*
PATH := (LOCATION_STEP)+
LOCATION_STEP := LS_SEP NAME_PREDICATE (`[` KT_PREDICATE `]`)?
LS_SEP := `//` | `/`
NAME_PREDICATE := `*` | (`*`)? VALUE (`*`)?
KT_PREDICATE := (KEYWORD | TUPLE) (LOGOP KT_PREDICATE)*
KEYWORD := `"` VALUE (WHITESPACE VALUE)* `"` | VALUE (WHITESPACE KEYWORD)*
TUPLE := ATTRIBUTE_IDENTIFIER OPERATOR VALUE
OPERATOR := `=` | `<` | `>`
LOGOP := `AND` | `OR`
15
![Page 16: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/16.jpg)
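Semantics of query expressions
Query expression    Semantics
//*                 All nodes in the graph
a                   All nodes in the graph that have 'a' in their content
a b                 All nodes in the graph that have 'a' and 'b' in their content
//A                 All nodes V in the graph such that V.name = 'A'
//A/B               All nodes V such that V.name = 'B' and there is an edge (W, V) from a node W with W.name = 'A'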
16
![Page 17: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/17.jpg)
Logical algebra for query expressions
Operator    Name                 Semantics
G           All resource views   {V | V ∈ G.RV}
σP(I)       Selection            {V | V ∈ I ∧ P(V)}
μ(I)        Shallow unnest       {W | (V, W) ∈ G.E ∧ V ∈ I}
ω(I)        Deep unnest          {W | V ⇝ W ∧ V ∈ I}
I1 ∩ I2     Intersection         {V | V ∈ I1 ∧ V ∈ I2}
I1 ∪ I2     Union                {V | V ∈ I1 ∨ V ∈ I2}
17
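To make the operators of the logical algebra concrete, here is a minimal Python sketch over sets of node ids. The graph, the node numbering beyond 1, 2, and 5, and all function names are assumptions for illustration; intersection and union are simply Python's `&` and `|` on sets.

```python
# Hypothetical mini graph: node id -> name, plus directed edges (V, W)
NAMES = {1: "home", 2: "mike", 3: "papers", 4: "PIM", 5: "SIGMOD42.pdf", 6: "projects"}
EDGES = {(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)}


def all_views():
    """G: all resource views."""
    return set(NAMES)


def select(views, pred):
    """Selection: {V | V in I and P(V)}."""
    return {v for v in views if pred(v)}


def shallow_unnest(views):
    """Shallow unnest: {W | (V, W) in E and V in I}."""
    return {w for (v, w) in EDGES if v in views}


def deep_unnest(views):
    """Deep unnest: every W reachable from some V in I via one or more edges."""
    result, frontier = set(), shallow_unnest(views)
    while frontier:
        result |= frontier
        frontier = shallow_unnest(frontier) - result  # stops on cycles too
    return result


def named(name):
    """Predicate P(V): V.name equals `name`."""
    return lambda v: NAMES[v] == name


mike = select(all_views(), named("mike"))

# //mike/papers: children of 'mike' that are named 'papers'
print(shallow_unnest(mike) & select(all_views(), named("papers")))  # → {3}

# //mike//PIM: descendants of 'mike' that are named 'PIM'
print(deep_unnest(mike) & select(all_views(), named("PIM")))  # → {4}
```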
![Page 18: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/18.jpg)
Example
18
![Page 19: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/19.jpg)
What have we seen so far?
Problem: querying heterogeneous data sources
Find a solution between SFA and NSA
– Generic graph data model to describe the data
– Queries describe paths in the graph
19
![Page 20: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/20.jpg)
How do iTrails help?
Queries are modified by hints (trails), which add or modify the search paths to look at.
Example: yesterday → //*[date = today() – 1]
20
![Page 21: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/21.jpg)
iTrails: Defining Trails
Basic Form of a Trail
QL [.CL] → QR [.CR]
– QL, QR: queries (keyword and path expressions)
– .CL, .CR: attribute projections
Intuition: When I query for QL [.CL], you should also query for QR [.CR]
![Page 22: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/22.jpg)
iTrails: Defining Trails
Unidirectional trail: QL [.CL] → QR [.CR]
– Intuition: when I query for QL [.CL], also query for QR [.CR]
Bidirectional trail: QL [.CL] ↔ QR [.CR]
– Example: ψi := //*.tuple.date ↔ //*.tuple.modified
QL, QR: queries (keyword and path expressions)
– Query examples: global warming zurich, or //Temperatures/*[celsius>10]
.CL, .CR: attribute projections
22
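The two trail forms can be captured in a small Python record. This is a sketch under assumptions: the class name `Trail`, its field names, and the `directed` helper are invented here, and treating a bidirectional trail as a pair of unidirectional trails in opposite directions is the reading used for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trail:
    """QL[.CL] -> QR[.CR]; the projections CL and CR are optional."""
    ql: str
    qr: str
    cl: str = ""
    cr: str = ""
    bidirectional: bool = False

    def directed(self):
        """Expand into unidirectional trails (two if bidirectional)."""
        forward = Trail(self.ql, self.qr, self.cl, self.cr)
        if not self.bidirectional:
            return [forward]
        return [forward, Trail(self.qr, self.ql, self.cr, self.cl)]


# The bidirectional example from the slide: date <-> modified
date_modified = Trail("//*", "//*", ".tuple.date", ".tuple.modified",
                      bidirectional=True)
for t in date_modified.directed():
    print(f"{t.ql}{t.cl} -> {t.qr}{t.cr}")
```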
![Page 23: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/23.jpg)
Trail Examples: Global Warming Zurich
Query: global warming zurich
Trail for implicit meaning: "When I query for global warming, also query temperature data above 10 degrees"
global warming → //Temperatures/*[celsius > 10]
Trail for an entity: "When I query for zurich, also query for references to Zurich as a region"
zurich → //*[region = "ZH"]
Temperatures
city      celsius   date     region
Bern      20        24-Sep   BE
Uster     15        24-Sep   ZH
Zurich    14        25-Sep   ZH
Zurich    9         26-Sep   ZH
23
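The effect of the two trails on the query global warming zurich can be traced with a short Python sketch. The rows are the Temperatures values read off the slide's figure; matching keywords against row values and combining the two keywords by intersection are simplifying assumptions, and the helper names `keyword` and `where` are made up.

```python
# Rows of the Temperatures relation as read from the slide's figure
TEMPERATURES = [
    {"city": "Bern", "celsius": 20, "date": "24-Sep", "region": "BE"},
    {"city": "Uster", "celsius": 15, "date": "24-Sep", "region": "ZH"},
    {"city": "Zurich", "celsius": 14, "date": "25-Sep", "region": "ZH"},
    {"city": "Zurich", "celsius": 9, "date": "26-Sep", "region": "ZH"},
]


def keyword(word):
    """Indices of rows in which any value contains `word` (case-insensitive)."""
    return {i for i, row in enumerate(TEMPERATURES)
            if any(word.lower() in str(value).lower() for value in row.values())}


def where(pred):
    """Indices of rows satisfying a predicate on the row."""
    return {i for i, row in enumerate(TEMPERATURES) if pred(row)}


# Plain keyword search: no row mentions "global warming", so nothing is found
plain = keyword("global warming") & keyword("zurich")

# With both trails, each keyword is unioned with its trail's right-hand side
with_trails = (
    (keyword("global warming") | where(lambda r: r["celsius"] > 10))
    & (keyword("zurich") | where(lambda r: r["region"] == "ZH"))
)

print(sorted(plain))
for i in sorted(with_trails):
    print(TEMPERATURES[i])
```

Under these assumptions the plain query returns nothing, while the rewritten query returns the rows for Uster (15 degrees) and Zurich (14 degrees), which is the kind of recall gain the trails are meant to provide.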
![Page 24: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/24.jpg)
Trail Example: Deep Web Bookmarks
Query: train home
Trail for a bookmark: "When I query for train home, also query the train website with origin = Tel Aviv Uni and destination = Haifa Hof Hacarmel"
train home → //trainCompany.com//*[origin = "Tel Aviv Uni" and dest = "Haifa Hof Hacarmel"]
Data source: Web Server
24
![Page 25: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/25.jpg)
Trail Examples: Thesauri, Dictionaries, Language-agnostic Search
Trail for thesauri: query for car, also query for auto
car → auto
Trails for dictionary: query for car, also query for carro and vice versa
car → automobile
automobile → car
Data sources: Laptop, Email Server
25
![Page 26: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/26.jpg)
Trail Examples: Schema Equivalences
Data source: DB Server, with relations Employee(empId, empName, salary) and Person(SSN, name, age, income)
Trail for schema match on names: query for Employee.empName, also query for Person.name
//Employee//*.tuple.empName → //Person//*.tuple.name
Trail for schema match on salaries: query for Employee.salary, also query for Person.income
//Employee//*.tuple.salary → //Person//*.tuple.income
26
![Page 27: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/27.jpg)
How are Trails Created?
Given by the user
– Explicitly
– Via relevance feedback
(Semi-)automatically
– Automatic schema matching
– Ontologies and thesauri (e.g., WordNet)
– User communities (e.g., trails on gene data, bookmarks)
27
![Page 28: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/28.jpg)
Uncertainty and Trails
Probabilistic trails:
– Model uncertain trails
– Probabilities are used to rank trails
QL [.CL] → QR [.CR], 0 ≤ p ≤ 1
– Example: car → auto, p = 0.9
The probability p reflects the likelihood that results obtained by the trail are correct.
28
![Page 29: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/29.jpg)
Uncertainty and Trails (continued)
Scored trails:
– Give higher value to certain trails
– Scoring factors: boost the scores of results obtained by the trail
QL [.CL] →sf QR [.CR], sf ≥ 1
Examples:
– T1: weather →sf //Temperatures/*, sf ≥ 1
– T2: yesterday →sf //*[date = today() – 1], sf ≥ 1
Intuition: sf reflects the relevance of the trail.
– Results obtained by the trail are scored sf times higher than the results obtained without the trail.
– If no scoring factor is available, sf = 1
29
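A small Python sketch can show how the probability and the scoring factor might be used. It follows only what the slides state (probabilities rank trails, sf multiplies result scores, sf defaults to 1); the class and function names, the default p = 1.0, and the concrete p and sf values for the weather and yesterday trails are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class WeightedTrail:
    ql: str
    qr: str
    p: float = 1.0   # likelihood that results obtained via the trail are correct
    sf: float = 1.0  # scoring factor; 1 means no boost


def top_k(trails, k):
    """Rank trails by probability, highest first, and keep the best k."""
    return sorted(trails, key=lambda t: t.p, reverse=True)[:k]


def boosted(base_score, trail):
    """Score of a result obtained through `trail`: sf times the base score."""
    return base_score * trail.sf


trails = [
    WeightedTrail("car", "auto", p=0.9),                          # from the slide
    WeightedTrail("weather", "//Temperatures/*", p=0.7, sf=2.0),  # values made up
    WeightedTrail("yesterday", "//*[date = today() - 1]", p=0.8, sf=1.5),
]

print([t.ql for t in top_k(trails, 2)])
print(boosted(0.4, trails[1]))
```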
![Page 30: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/30.jpg)
Rewriting Queries with Trails
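Query: weather yesterday, i.e. the query tree weather ∩ yesterday
Trail T2: yesterday → //*[date = today() – 1]
(1) Matching: T2 matches the leaf yesterday
(2) Transformation: the matched leaf becomes yesterday ∪ //*[date = today() – 1]
(3) Merging: the transformed subtree is merged into the query, giving weather ∩ (yesterday ∪ //*[date = today() – 1])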
30
![Page 31: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/31.jpg)
Replacing Trails
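Trails that use replace instead of union semantics
Query: weather yesterday, i.e. the query tree weather ∩ yesterday
Trail T2 (replacing): yesterday is replaced by //*[date = today() – 1]
(1) Matching: T2 matches the leaf yesterday
(2) Transformation: the matched leaf is replaced by //*[date = today() – 1]
(3) Merging: weather ∩ //*[date = today() – 1]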
31
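The three rewrite steps (matching, transformation, merging) can be sketched for both union and replace semantics. This is a simplified illustration, not the iTrails implementation: query trees are nested tuples, the query weather yesterday is assumed to be an intersection of two keyword leaves, matching is plain string equality on leaves, and the function name `rewrite` is made up.

```python
def rewrite(query, ql, qr, replace=False):
    """Apply the trail ql -> qr once to every matching leaf of `query`.

    Query trees are nested tuples (op, left, right) with op in {'∩', '∪'};
    leaves are strings. Matching is plain string equality (a simplification).
    """
    if isinstance(query, str):
        if query != ql:            # (1) matching: leaf does not match the trail
            return query
        if replace:                # (2) transformation with replace semantics
            return qr
        return ("∪", query, qr)    # (2) transformation with union semantics
    op, left, right = query
    # (3) merging: rebuild the parent node around the transformed children
    return (op, rewrite(left, ql, qr, replace), rewrite(right, ql, qr, replace))


q = ("∩", "weather", "yesterday")
t2 = ("yesterday", "//*[date = today() - 1]")

print(rewrite(q, *t2))
print(rewrite(q, *t2, replace=True))
```

Because each call rewrites the tree only once, this sketch does not re-match the leaves it introduces.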
![Page 32: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/32.jpg)
Problem: Recursive Matches (1/2)
T2: yesterday → //*[date = today() – 1]
After applying T2 once: weather ∩ (yesterday ∪ //*[date = today() – 1])
The new query still matches T2, so T2 could be applied again:
weather ∩ ((yesterday ∪ //*[date = today() – 1]) ∪ //*[date = today() – 1])
and again:
weather ∩ (((yesterday ∪ //*[date = today() – 1]) ∪ //*[date = today() – 1]) ∪ //*[date = today() – 1])
...
Infinite recursion
32
![Page 33: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/33.jpg)
Problem: Recursive Matches (2/2)
Trails may be mutually recursive
T3: //*.tuple.date → //*.tuple.modified
T10: //*.tuple.modified → //*.tuple.date
Query after applying T2: weather ∩ (yesterday ∪ //*[date = today() – 1])
T3 matches: weather ∩ (yesterday ∪ (//*[date = today() – 1] ∪ //*[modified = today() – 1]))
T10 matches: weather ∩ (yesterday ∪ (//*[date = today() – 1] ∪ (//*[modified = today() – 1] ∪ //*[date = today() – 1])))
We again match T3 and enter an infinite loop
33
![Page 34: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/34.jpg)
Algorithm to solve recursion - MMCA
Multiple Match Coloring Algorithm (MMCA):
– Keep a history of all trails matched or introduced
– Given a set of trails Y, for every trail t in Y: apply t to Q iteratively and color the query tree nodes in Q according to the trails that have already touched those nodes
34
![Page 35: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/35.jpg)
Multiple Match Coloring Algorithm
T1: weather → //Temperatures/*
T2: yesterday → //*[date = today() – 1]
T3: //*.tuple.date → //*.tuple.modified
T4: //*.tuple.date → //*.tuple.received
Query: weather ∩ yesterday
First level (T1 and T2 match): (weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1])
Second level (T3 and T4 match): (weather ∪ //Temperatures/*) ∩ (yesterday ∪ //*[date = today() – 1] ∪ //*[modified = today() – 1] ∪ //*[received = today() – 1])
35
![Page 36: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/36.jpg)
Multiple Match Coloring Algorithm (continued)
MMCA is exponential in the number of levels
– Any of the trails can be applied to every leaf, and each trail can generate additional leaves.
Solution: trail pruning
– Number of levels: punish recursive rewrites
– Top-K trails matched in each level, ranked by probability / certainty / weight
– Other: timeout, progressively compute query results
36
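The coloring idea and level-based pruning can be sketched in Python as follows. This is a simplified illustration in the spirit of MMCA, not the algorithm from the paper: trails are `(name, lhs, rhs)` triples, matching is whole-word textual matching, a trail's effect is a textual substitution, and all leaves derived at one level inherit the colors of every trail applied there. Only the limit on the number of levels is implemented; top-K selection and timeouts are left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    text: str
    colors: frozenset = frozenset()  # names of trails that already touched this leaf


def apply_level(node, trails):
    """Apply one rewrite level; return (new tree, number of trail applications)."""
    if not isinstance(node, Leaf):
        op, left, right = node
        new_left, n_left = apply_level(left, trails)
        new_right, n_right = apply_level(right, trails)
        return (op, new_left, new_right), n_left + n_right

    introduced = []
    for name, lhs, rhs in trails:
        pattern = rf"\b{re.escape(lhs)}\b"  # whole-word match, a simplification
        if name not in node.colors and re.search(pattern, node.text):
            introduced.append((name, re.sub(pattern, lambda m: rhs, node.text)))
    if not introduced:
        return node, 0

    # Color the matched leaf and every leaf derived from it at this level
    colors = node.colors | {name for name, _ in introduced}
    tree = Leaf(node.text, colors)
    for _, text in introduced:
        tree = ("∪", tree, Leaf(text, colors))
    return tree, len(introduced)


def mmca(query, trails, max_levels=None):
    """Rewrite level by level until nothing matches or max_levels is reached."""
    levels = 0
    while max_levels is None or levels < max_levels:
        query, applied = apply_level(query, trails)
        if applied == 0:
            break
        levels += 1
    return query, levels


def leaves(node):
    """Leaf texts of a query tree, left to right."""
    if isinstance(node, Leaf):
        return [node.text]
    return leaves(node[1]) + leaves(node[2])


TRAILS = [
    ("T1", "weather", "//Temperatures/*"),
    ("T2", "yesterday", "//*[date = today() - 1]"),
    ("T3", "date", "modified"),
    ("T4", "date", "received"),
    ("T10", "modified", "date"),  # mutually recursive with T3
]

q = ("∩", Leaf("weather"), Leaf("yesterday"))
full, levels = mmca(q, TRAILS)
pruned, pruned_levels = mmca(q, TRAILS, max_levels=2)
print(levels, leaves(full))
print(pruned_levels, leaves(pruned))
```

Every leaf derived from a match carries strictly more colors than the leaf it came from, so the rewrite terminates even though T3 and T10 are mutually recursive; `max_levels` cuts it off earlier.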
![Page 37: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/37.jpg)
iTrails Evaluation in iMeMex
Main questions in the evaluation
– Quality: top-K precision and recall
– Performance: use of materialization
– Scalability: query-rewrite time vs. number of trails
37
![Page 38: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/38.jpg)
iTrails Evaluation in iMeMex
Scenario 1: Few high-quality trails
– Closer to information integration use cases
– Obtained real datasets and indexed them
– 18 hand-crafted trails
– 14 hand-crafted queries
Scenario 2: Many low-quality trails
– Closer to search use cases
– Randomly generated up to 10,000 trails and queries with a mutual uniform match probability of 1%
38
![Page 39: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/39.jpg)
iTrails Evaluation in iMeMex: Scenario 1
Configured iMeMex to act in three modes
– Baseline: graph / IR search engine
– iTrails: rewrite search queries with trails
– Perfect Query: semantics-aware query
Data: shipped to a central index
[Table: sizes in MB of the data sources Laptop, Email Server, Web Server, DB Server]
39
![Page 40: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/40.jpg)
Trails and queries used in Scenario 1
– Max original tree size: 14
– Max final tree size after applying trails: 35
– Max # of trails applied: 5
40
![Page 41: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/41.jpg)
Quality: Top-K Precision and Recall (k=20)
Scenario 1: few high-quality trails (18 trails)
[Figure: top-K precision and recall per query]
– The search engine misses relevant results
– The search query is partially semantics-aware
– The perfect query always has precision and recall equal to 1
41
![Page 42: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/42.jpg)
Performance: Use of Materialization
Trail merging adds overhead to query execution
Trail materialization improves performance for almost all queries
Scenario 1: few high-quality trails (18 trails)
42
![Page 43: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/43.jpg)
Scalability: Query-rewrite Time vs. Number of Trails – scenario 2
• Without pruning: exponential growth in the query plan sizes
• Query-rewrite time can be controlled with pruning
43
![Page 44: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/44.jpg)
Summary
First framework to explore pay-as-you-go information integration in dataspaces
iTrails: a generic method to model semantic relationships gradually
iTrails are used to rewrite queries
Algorithm to control recursive query rewrites
44
![Page 45: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/45.jpg)
Personal opinion - advantages
The method is incremental
– Integrators can collect statistics, find the most common queries, and define trails for popular queries first.
– Dynamic system: if the popular queries change over time, trails for less popular queries can be disabled to reduce system workload.
Trails can be defined independently by domain experts for each data domain.
45
![Page 46: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/46.jpg)
Personal opinion - disadvantages
Trails are global: every rewritten query is evaluated over every data source.
– A trail can have a different meaning for different data sources.
For good query result quality, trails have to be defined manually, which is a problem for large systems.
– Solution: use machine learning techniques to improve automatic trail creation
Overlaps and inconsistencies between trails are possible, since a query returns the union of the results satisfying all trails.
– Solution: trail mining and weighting would be helpful here.
46
![Page 47: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/47.jpg)
Questions?
47
![Page 48: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/48.jpg)
Bibliography
– Marcos Antonio Vaz Salles, Jens-Peter Dittrich, Shant Kirakos Karakashian, Olivier René Girard, Lukas Blunschi. iTrails: Pay-as-you-go Information Integration in Dataspaces. ETH Zurich, 8092 Zurich, Switzerland. dbis.ethz.ch | iMeMex.org
– Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), David Maier (Portland State University). From Databases to Dataspaces: A New Abstraction for Information Management.
– Wikipedia, dataspace: http://en.wikipedia.org/wiki/Data_Spaces; memex: http://en.wikipedia.org/wiki/Vannevar_Bush
– iMeMex information: http://imemex.ethz.ch/
48
![Page 49: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/49.jpg)
Backup slides
49
![Page 50: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/50.jpg)
Multiple Match Coloring Algorithm Analysis
Algorithm runtime:
– L: number of leaves in query Q
– M: maximum number of leaves in a query introduced by a trail
– N: number of trails
– d ∈ {1, . . . , N}: number of levels
Theorem: The maximum number of trail applications performed by MMCA and the maximum number of leaves in the merged query tree are both bounded by O(L · M^d)
50
![Page 51: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/51.jpg)
MMCA run time analysis (O(L•M^d ) )
– If trail t is matched in query Q, it colors the matched leaf nodes of Q. A subtree containing only these nodes is not matched again by t.
– Worst case: in each level, only one of the trails matches for each of the leaves.
– 1st level: the trail match introduces M new leaves for each of those leaves, i.e. a total of LM new nodes plus L old nodes. This gives L(M+1) leaves and L trail applications for the first level.
– 2nd level: t does not match any of the leaves anymore (they were colored in the 1st level). However, all leaves may still be matched against the remaining N − 1 colors. Worst case, again, only one of the trails matches for each of the existing leaf nodes.
– In the d-th level, this leads to L(M+1)^(d−1) trail applications and a total of L(M+1)^d leaves.
51
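The worst-case counts in the analysis above can be checked numerically with a few lines of Python. The function name `worst_case` is made up; it simply replays the argument level by level (one trail application per existing leaf, M new leaves per application).

```python
def worst_case(L, M, d):
    """Worst-case counts after d levels, one trail match per leaf per level.

    Returns (total trail applications, number of leaves).
    """
    leaves, applications = L, 0
    for _ in range(d):
        applications += leaves  # one trail application per existing leaf
        leaves += leaves * M    # each application introduces M new leaves
    return applications, leaves


L, M = 2, 1
for d in range(1, 4):
    apps, leaves = worst_case(L, M, d)
    # The leaf count matches the closed form L(M+1)^d from the analysis
    print(d, apps, leaves, L * (M + 1) ** d)
```

For L = 2 and M = 1 the second level performs L(M+1)^(2−1) = 4 applications, for 2 + 4 = 6 in total, and leaves L(M+1)^2 = 8 leaves.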
![Page 52: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/52.jpg)
iDM: Lazily Computed Graph
Nodes and edges are lazily computed
Each node is a Resource View
52
![Page 53: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/53.jpg)
iDM: Lazily Computed Graph
iDM is not a static model
– Every component of every Resource View may be created on demand
– Every Resource View may be created on demand
Behind the scenes, obtaining the content may:
– Read a file on the filesystem
– Access a page on the web
– Fetch the data from an index structure
Behind the scenes, obtaining the group may:
– Get the children of a folder in the filesystem
– Look up an edge replica
– Obtain the sections of a document
53
![Page 54: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/54.jpg)
How to implement iDM: Architectural Perspective
[Figure: architecture of the iMeMex PDSMS, located between the Application Layer (Search & Browse, Email, Office Tools, ...) and the Data Source Layer (File System, IMAP, DBMS, ...). It consists of three layers:]
– Complex operators (query algebra): iQL Query Processor, Physical Algebra Operators, Result Cache Data Store, Catalog
– Indexes & Replicas access (warehousing): iDM Query Processor, Data Cleaning Operators, Indexes & Replicas Data Store, Catalog
– Data source access (mediation): Data Source Query Processor, Content Converters Operators, Data Source Plugins, Catalog
54
![Page 55: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/55.jpg)
Data management approaches
Features             Search               Dataspaces           Data Integration
Integration effort   Low                  Pay-as-you-go        High
Query semantics      Precision / Recall   Precision / Recall   Precise
Need for schema      Schema-never         Schema-later         Schema-first
55
![Page 56: iTrails: Pay-as-you-go Information Integration in Dataspaces](https://reader036.vdocument.in/reader036/viewer/2022081516/56814a88550346895db79b47/html5/thumbnails/56.jpg)
Canonical form
The canonical form Γ(Q) of a query Q is obtained by decomposing Q into location step separators (LS_SEP) and predicates (P) according to the grammar. Γ(Q) is constructed by the following recursion:
tree := G               if tree is empty
        ω(tree)         if LS_SEP = // and not first location step
        μ(tree)         if LS_SEP = / and not first location step
        tree ∩ σP(G)    otherwise
56
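One possible reading of this recursion can be sketched in Python by building the canonical form as a string. Several things here are assumptions: the function name `canonical_form`, the symbols ω (deep unnest), μ (shallow unnest), and σ (selection) used in the output, the restriction to plain name predicates without `[...]` parts, and the choice to skip the intersection for `*` because it selects every node.

```python
import re


def canonical_form(path):
    """Build the canonical form of a simple path expression as a string.

    Each location step contributes its separator first (G, ω, or μ) and then
    its name predicate (an intersection with a selection over G).
    """
    steps = re.findall(r"(//|/)([^/\[\]]+)", path)
    if not steps:
        raise ValueError(f"not a path expression: {path!r}")
    tree = None
    for sep, name in steps:
        if tree is None:
            tree = "G"               # tree is empty: start from all views
        elif sep == "//":
            tree = f"ω({tree})"      # deep unnest for a later '//' step
        else:
            tree = f"μ({tree})"      # shallow unnest for a later '/' step
        if name != "*":              # '*' matches every name, so no selection
            tree = f"{tree} ∩ σ[name={name}](G)"
    return tree


print(canonical_form("//mike/papers"))
print(canonical_form("//mike//PIM/*"))
```

For //mike/papers this yields μ(G ∩ σ[name=mike](G)) ∩ σ[name=papers](G): the children of nodes named mike, restricted to nodes named papers.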