merging ranks from heterogeneous internet sources hector garcia-molina luis gravano stanford...
Post on 20-Dec-2015
![Page 1: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/1.jpg)
Merging Ranks from Heterogeneous Internet Sources
Hector Garcia-Molina
Luis Gravano
Stanford University
![Page 2: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/2.jpg)
Users Have Many Available Information Sources
User Query: “Houses near Palo Alto for around $300K.”
Query Results:
Source 1: $h_{11}, h_{12}, h_{13}, \ldots$
Source 2
...
Nothing!
![Page 3: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/3.jpg)
Challenges
• Sources are too numerous
• Sources are heterogeneous (query language, model, results)
• Users want a single query result
![Page 4: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/4.jpg)
Metasearcher
• Selects the good sources for a query
• Extracts and combines the query results from the sources
![Page 5: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/5.jpg)
Text Sources Rank Query Results
Query: “Distributed Databases”
Text Source:
Doc 1: 0.8
Doc 2: 0.6
...
![Page 6: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/6.jpg)
Structured Sources on the Internet also Rank Results
A real-estate agent receives queries on Location and Price:
Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”
![Page 7: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/7.jpg)
The Agent Ranks its Houses Based on its Own Scoring Function
Q: “Houses with preferred location in Palo Alto and preferred price around $300K.”

| Rank | House ID | Source Score | Location | Price |
|---|---|---|---|---|
| 1 | MV1 | 0.43 | Mountain View | $350K |
| 2 | MV2 | 0.42 | Mountain View | $360K |
| 3 | PA1 | 0.28 | Palo Alto | $600K |
![Page 8: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/8.jpg)
A Metasearcher then Faces Two Problems
• Extracting the top objects from the underlying sources
• Merging the results from the various sources
![Page 9: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/9.jpg)
Merging Query Results is Easy with Enough Information
Given a record like:

| Rank | House ID | Source Score | Location | Price |
|---|---|---|---|---|
| 1 | MV1 | 0.43 | Mountain View | $350K |

the metasearcher ignores the Source score and computes its Target score from the Location and Price.
![Page 10: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/10.jpg)
Extracting the Top Objects from a Source is Hard
The metasearcher’s scoring function might be different from the source’s!

| Rank | House ID | Target Score | Location | Price |
|---|---|---|---|---|
| 1 | PA1 | 1 | Palo Alto | $600K |
| 2 | MV1 | 0.51 | Mountain View | $350K |
| 3 | MV2 | 0.5 | Mountain View | $360K |
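
To make the mismatch concrete, here is a minimal sketch that sorts the three houses by each score. The scores are copied from the source and target rankings above; the function name `ranking` is only illustrative.

```python
# (house_id, source_score, target_score), copied from the slides' rankings.
houses = [
    ("MV1", 0.43, 0.51),
    ("MV2", 0.42, 0.50),
    ("PA1", 0.28, 1.00),
]


def ranking(records, key_index):
    """Return house IDs ordered by the chosen score column, best first."""
    ordered = sorted(records, key=lambda r: r[key_index], reverse=True)
    return [r[0] for r in ordered]


source_order = ranking(houses, 1)  # order the agent returns
target_order = ranking(houses, 2)  # order the metasearcher wants

# Asking the source only for its top object would miss the metasearcher's
# top object whenever the two orders disagree at the head of the list.
top_source_only = source_order[:1]
missed_best = target_order[0] not in top_source_only
```

With these three records the source's best house is MV1 while the metasearcher's best is PA1, so retrieving only the source's top object is not enough.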
![Page 11: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/11.jpg)
We Want to Avoid Extracting All the Source’s Contents
Assume a house h with:
• Source(Q, h) = 0 (worst for source)
• Target(Q, h) = 1 (best for metasearcher)
Problem!
![Page 12: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/12.jpg)
The Example Query is Not Manageable at the Agent
A query Q is manageable at a source if $\exists\, \epsilon < 1$ such that:
$|\mathrm{Source}(Q, h) - \mathrm{Target}(Q, h)| \le \epsilon$
[Figure: plot of Source score against Target score, with axes running from (0,0) to (1,1).]
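
The manageability condition can be checked on a finite sample of score pairs, as in the sketch below. This is only an illustration: the definition ranges over every house the source might hold, while the code looks only at the pairs it is given. The function names are hypothetical.

```python
def manageability_epsilon(pairs):
    """Smallest epsilon with |source - target| <= epsilon over the given pairs."""
    return max(abs(source - target) for source, target in pairs)


def is_manageable(pairs):
    """True if the sampled pairs admit some epsilon strictly below 1."""
    return manageability_epsilon(pairs) < 1


# (Source(Q, h), Target(Q, h)) for MV1, MV2 and PA1 from the slides.
listed = [(0.43, 0.51), (0.42, 0.50), (0.28, 1.00)]

# The worst case from the previous slide: a house the source scores 0
# and the metasearcher scores 1.
worst_case = listed + [(0.0, 1.0)]

eps_listed = manageability_epsilon(listed)
ok_listed = is_manageable(listed)
ok_worst = is_manageable(worst_case)
```

On the three listed houses alone the gap stays below 1, but once a house with Source(Q, h) = 0 and Target(Q, h) = 1 is possible, no epsilon below 1 exists and the query is not manageable.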
![Page 13: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/13.jpg)
Single-Attribute Queries Are More Likely to be Manageable
Single-attribute queries for Q:
• $Q_1$: Location = Palo Alto
• $Q_2$: Price = $300K
![Page 14: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/14.jpg)
The Example Becomes Tractable!
… if the top Target objects for Q are among the top Source objects for $Q_1$ and $Q_2$
![Page 15: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/15.jpg)
A Cover Bounds the Target Scores for Q
$Q_1, \ldots, Q_m$ single-attribute queries form a cover for Q if $\exists\, g_1, \ldots, g_m, G$ such that:
$\mathrm{Target}(Q_i, h) \le g_i$ for $i = 1, \ldots, m$ $\implies$ $\mathrm{Target}(Q, h) \le G$
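
As a worked illustration of the cover condition, assume the target score for Q is the minimum or the maximum of the single-attribute target scores (the Min and Max cases named in the performance slides; treating them as plain min and max over the $Q_i$ is an assumption here). The bound G then follows directly from the thresholds:

```latex
% Assumption: Target(Q, h) combines the single-attribute scores with min or max.
\begin{align*}
\mathrm{Target}(Q,h) = \min_i \mathrm{Target}(Q_i,h)
  \ \text{and}\ \mathrm{Target}(Q_i,h) \le g_i \ \forall i
  &\implies \mathrm{Target}(Q,h) \le \min_i g_i = G \\
\mathrm{Target}(Q,h) = \max_i \mathrm{Target}(Q_i,h)
  \ \text{and}\ \mathrm{Target}(Q_i,h) \le g_i \ \forall i
  &\implies \mathrm{Target}(Q,h) \le \max_i g_i = G
\end{align*}
% Example: g_1 = 0.7, g_2 = 0.5 gives G = 0.5 for min and G = 0.7 for max.
```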
![Page 16: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/16.jpg)
Having a Manageable Cover for a Query is Sufficient...
Manageable cover for query Q at source S $\Rightarrow$ “efficient” executions possible at S
![Page 17: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/17.jpg)
Having a Manageable Cover for a Query is Sufficient...
(1) Pick a manageable cover $C = \{Q_1, \ldots, Q_m\}$ for Q at S
(2) For i = 1 to m: Find $\epsilon_i$ for $Q_i$
(3) Pick $0 \le g_1, \ldots, g_m, G < 1$ for cover C
(4) For i = 1 to m
(5)     Retrieve all objects t with $\mathrm{Source}(Q_i, t) \ge G_i = g_i - \epsilon_i$
(6) Compute Target(Q, t) for all objects t retrieved
(7) If $\exists\, i$ such that $G_i \le 0$ Then Go to Step (11)
(8) If for all t retrieved, $\mathrm{Target}(Q, t) < G$ Then
(9)     Find new, lower $0 \le g_1, \ldots, g_m, G < 1$ for C
(10)    Go to Step (4)
(11) Output those objects retrieved with the highest Target score
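
The steps above can be sketched in Python as follows. This is a toy under stated assumptions, not the authors' implementation: the source is an in-memory dictionary rather than a remote service, the thresholds are lowered by a fixed `step`, the comparison in step (8) is taken to be strict, and the names `extract_top_objects`, `bound`, `start` and `step` are hypothetical. The house data is invented for the example.

```python
from typing import Callable, Dict, List, Sequence


def extract_top_objects(
    source_scores: Sequence[Dict[str, float]],
    epsilons: Sequence[float],
    target: Callable[[str], float],
    bound: Callable[[Sequence[float]], float],
    start: float = 0.9,
    step: float = 0.1,
) -> List[str]:
    """Return the retrieved objects, highest Target score first.

    source_scores[i][t] plays the role of Source(Q_i, t); bound(g) returns
    the G that the cover guarantees for thresholds g.
    """
    g = [start] * len(source_scores)                         # step (3)
    while True:
        G = bound(g)
        cutoffs = [gi - ei for gi, ei in zip(g, epsilons)]   # G_i = g_i - eps_i
        retrieved = set()
        for scores, cutoff in zip(source_scores, cutoffs):   # steps (4)-(5)
            retrieved.update(t for t, s in scores.items() if s >= cutoff)
        target_scores = {t: target(t) for t in retrieved}    # step (6)
        exhausted = any(c <= 0 for c in cutoffs)             # step (7)
        all_below = all(v < G for v in target_scores.values())  # step (8)
        if exhausted or not all_below:
            break
        # Step (9): lower every threshold; rounding keeps the values tidy.
        g = [max(0.0, round(gi - step, 10)) for gi in g]
    return sorted(retrieved, key=target_scores.get, reverse=True)  # step (11)


# Hypothetical single-attribute scores for four houses (epsilon = 0, so the
# source's single-attribute scores equal the metasearcher's).
location = {"A": 1.0, "B": 0.6, "C": 0.3, "D": 0.1}
price = {"A": 0.2, "B": 0.7, "C": 0.9, "D": 0.1}

ranked = extract_top_objects(
    source_scores=[location, price],
    epsilons=[0.0, 0.0],
    target=lambda t: min(location[t], price[t]),  # Target = Min
    bound=min,                                    # G = min(g_1, ..., g_m)
)
```

In this run the thresholds drop from 0.9 to 0.6 before a retrieved house (B, with Target 0.6) reaches the bound G, and house D is never retrieved. Stopping as soon as one retrieved object reaches G is enough to guarantee the single best object; a top-k variant would need at least k such objects.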
![Page 18: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/18.jpg)
Algorithm to Extract Top Target Objects
[Figure: score axes from 0 to 1 for $Q_1$ and $Q_2$ with thresholds $g_1$ and $g_2$; objects h below both thresholds have $\mathrm{Target}(Q, h) \le G$.]
![Page 19: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/19.jpg)
Algorithm to Extract Top Target Objects
[Figure: score axes from 0 to 1 for $Q_1$ and $Q_2$ with lowered thresholds $g_1'$ and $g_2'$; objects h below both thresholds have $\mathrm{Target}(Q, h) \le G'$, and a retrieved object $h'$ has $\mathrm{Target}(Q, h') \ge G'$!]
![Page 20: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/20.jpg)
Preliminary Performance Results for our Algorithm
10,000 objects, 4 query attributes, $\epsilon = 0$
• Target=Min: 14% objects retrieved
• Target=Max: 4% objects retrieved
![Page 21: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/21.jpg)
Preliminary Performance Results for our Algorithm
10,000 objects, 4 query attributes, $\epsilon = 0.10$
• Target=Min: 25% objects retrieved
• Target=Max: 44% objects retrieved
![Page 22: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/22.jpg)
Having a Manageable Cover for a Query is Also Necessary...
No manageable cover for query Q at source S $\Rightarrow$ efficient executions impossible at S
![Page 23: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/23.jpg)
A Manageable Cover is Necessary: Proof
Consider $Q_1, Q_2, Q_3$ minimal cover for Q with: $Q_1, Q_2$ manageable, $Q_3$ not manageable
For any “efficient” execution, build h such that:
• h is not retrieved
• $\mathrm{Target}(Q, h) > G = \max\{\mathrm{Target}(Q, o) \mid o \text{ retrieved}\}$
![Page 24: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/24.jpg)
A Manageable Cover is Necessary: Proof
[Figure: score axes from 0 to 1 for $Q_1$, $Q_2$ and $Q_3$ with thresholds $g_1$, $g_2$ and $g_3$.]
![Page 25: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/25.jpg)
[Figure: object $h'$ marked on the $Q_1$, $Q_2$ and $Q_3$ axes; $\mathrm{Target}(Q, h') > G$!]
![Page 26: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/26.jpg)
[Figure: objects $h'$ and $h$ marked on the $Q_1$, $Q_2$ and $Q_3$ axes; $\mathrm{Target}(Q, h) > G$! The figure also relates $\mathrm{Target}(Q_3, h)$ to $\mathrm{Target}(Q, h')$, with $\mathrm{Target}(Q, h') > G$.]
![Page 27: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/27.jpg)
We Studied Two Metasearching Problems
• Extracting the top objects from the underlying sources
• Merging the results from the various sources
![Page 28: Merging Ranks from Heterogeneous Internet Sources Hector Garcia-Molina Luis Gravano Stanford University](https://reader030.vdocument.in/reader030/viewer/2022032704/56649d4a5503460f94a27c53/html5/thumbnails/28.jpg)
Related Work: Collection Fusion
• Voorhees et al.
• Callan/Lu/Croft
• Gauch/Wang