SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on
Inter-Source Agreement
Raju Balakrishnan, Subbarao Kambhampati
Arizona State University
Deep Web Integration Scenario

[Figure: a mediator forwards the user query to many deep-web databases (Web DBs) and collects the answer tuples returned by each.]

Millions of sources containing structured tuples; an uncontrolled collection of redundant information.
Search engines have only nominal access to the deep web: we don't Google for a "Honda Civic 2008 Tampa".
Why Another Ranking?

Example query: "Godfather Trilogy" on Google Base.

Importance: the ranking searches for titles matching the query, yet none of the results is the classic Godfather.

Trustworthiness (bait and switch): the titles and cover images match exactly, the prices are low, an amazing deal! But when you proceed to check out, you realize the product is a different one (or when you open the mail package, if you are really unlucky).

Existing rankings are oblivious to both the importance and the trustworthiness of the results.
Agenda

1. Problem Definition
2. SourceRank: Ranking Based on Agreement
3. Computing Agreement
4. Computing Source Collusion
5. System Implementation and Results
Source Selection in the Deep Web

Problem: given a user query, select a subset of sources that provide important and trustworthy answers.

Surface web search combines link analysis with query relevance to account for the trustworthiness and relevance of results. Unfortunately, deep web records do not have hyperlinks.
Source Agreement

Observations: many sources return answers to the same query, and the structure of the tuples makes it easier to compare the semantics of those answers.

Idea: compute the importance and trustworthiness of sources based on the agreement of the answers returned by different sources.
Agreement Implies Trust & Importance

Important results are likely to be returned by a large number of sources. For the query "Godfather", hundreds of sources return the classic "The Godfather", while only a few return the little-known movie "Little Godfather".

Two independent sources are unlikely to agree on corrupt or untrustworthy answers. A wrong author for a book (e.g. listing the Godfather author as "Nino Rota") would not be corroborated by other sources. As we know, truth is one (or a few), but lies are many.
Agreement Implies Trust & Relevance

The probability that two independently selected irrelevant/false tuples agree is

P_a(f_1, f_2) ≈ 1 / |U|

while the probability that two independently picked relevant and true tuples agree is

P_a(r_1, r_2) ≈ 1 / |R_T|

Since the universe of tuples is far larger than the set of relevant true tuples, |U| >> |R_T|, and hence P_a(r_1, r_2) >> P_a(f_1, f_2).
Method: Sampling-based Agreement

[Figure: agreement graph over sources S1, S2, S3 with asymmetric directed edge weights (0.86/0.14, 0.78/0.22, 0.6/0.4).]

Link semantics from S_i to S_j with weight w: S_i acknowledges a fraction w of the tuples in S_j. Since the weight is a fraction, the links are asymmetric:

W(S_1 → S_2) = β + (1 − β) · A(R_1, R_2) / |R_2|

where β induces the smoothing links that account for unseen samples, and R_1, R_2 are the result sets of S_1 and S_2.
Agreement is computed using keyword queries: partial titles of movies/books are used as the queries, and the mean agreement over all queries is used as the final agreement.
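A minimal sketch of how these edge weights might be computed, assuming an `agreement(R1, R2)` function (sketched on the following slides) that returns the agreed mass of R2's tuples, and a hypothetical `source.query(q)` interface; the smoothing constant beta and the names are illustrative, not the authors' code.

```python
def edge_weight(agreement, results_s1, results_s2, beta=0.1):
    """Weight of the link S1 -> S2: S1 'acknowledges' a fraction of S2's tuples.

    W(S1 -> S2) = beta + (1 - beta) * A(R1, R2) / |R2|,
    where beta adds the smoothing link that accounts for unseen samples.
    """
    if not results_s2:
        return beta
    return beta + (1 - beta) * agreement(results_s1, results_s2) / len(results_s2)

def mean_edge_weight(agreement, source1, source2, sampling_queries, beta=0.1):
    """Mean agreement of S1 with S2 over all sampling queries (partial titles)."""
    weights = [
        edge_weight(agreement, source1.query(q), source2.query(q), beta)
        for q in sampling_queries
    ]
    return sum(weights) / len(weights)
```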
Method: Calculating SourceRank

How can the agreement graph be used for improved search?

• The agreement graph is viewed as a Markov chain, with the edges as transition probabilities between the sources.
• The prestige of a source, accounting for the transitive nature of agreement, can be computed by a Markov random walk: SourceRank is the stationary visit probability of the random walk on the database vertex.
• This static SourceRank may be combined with a query-specific source-relevance measure for the final ranking.
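A minimal sketch of the random-walk computation via ordinary power iteration, assuming the agreement graph is supplied as a matrix of smoothed edge weights; this is a generic stationary-distribution computation, not the authors' exact implementation.

```python
import numpy as np

def source_rank(weights, tol=1e-9, max_iter=1000):
    """Stationary visit probabilities of the Markov random walk on the agreement graph.

    weights[i][j] is the smoothed agreement weight of the edge i -> j; because of the
    smoothing term beta every entry is positive, so the chain is ergodic and the
    stationary distribution exists and is unique.
    """
    w = np.asarray(weights, dtype=float)
    transition = w / w.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    rank = np.full(len(w), 1.0 / len(w))            # start from the uniform distribution
    for _ in range(max_iter):
        new_rank = rank @ transition                # one step of the random walk
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank                                     # SourceRank of each source
```

The resulting static vector can then be linearly combined with a query-specific relevance score, in the spirit of the SR-Coverage and SR-CORI combinations shown later in the talk.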
Computing Agreement is Hard

Computing the semantic agreement between two records is the record linkage problem, which is known to be hard: semantically identical entities may be represented syntactically differently by two databases (non-common domains).

Example "Godfather" tuples from two web sources (note that the titles and the cast are denoted differently):

- Godfather, The: The Coppola Restoration | James Caan / Marlon Brando | $9.99
- The Godfather - The Coppola Restoration Giftset [Blu-ray] | Marlon Brando, Al Pacino | 13.99 USD
Method: Computing Agreement

Agreement computation has three levels:

1. Comparing attribute values. SoftTF-IDF with Jaro-Winkler as the similarity measure is used.
2. Comparing records. We do not assume a predefined schema matching; this is an instance of a bipartite matching problem. Optimal matching is O(v^3), so greedy matching, which is O(v^2), is used: each value is greedily matched against the most similar value in the other record. Attribute importance is weighted by IDF (e.g. matching titles ("Godfather") is more important than matching formats ("paperback")).
3. Comparing result sets. Using the record similarities computed above, result-set similarities are computed with the same greedy approach.
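A minimal sketch of the greedy matching idea for one pair of records, assuming a `value_similarity` function standing in for SoftTF-IDF with Jaro-Winkler and an `idf` weight table; both names are placeholders rather than the paper's code.

```python
def record_agreement(record1, record2, value_similarity, idf):
    """Greedy O(v^2) approximation of the optimal O(v^3) bipartite matching.

    Each attribute value of record1 is matched to its most similar unmatched value
    in record2; matches are weighted by IDF so that rare values (e.g. the title
    "Godfather") count more than common ones (e.g. the format "paperback").
    """
    unmatched = list(record2)
    total, weight_sum = 0.0, 0.0
    for value in record1:
        if not unmatched:
            break
        best = max(unmatched, key=lambda other: value_similarity(value, other))
        unmatched.remove(best)
        weight = idf.get(value, 1.0)
        total += weight * value_similarity(value, best)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0
```

Result-set agreement can reuse the same greedy loop, with records in place of values and `record_agreement` in place of `value_similarity`.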
Detecting Source Collusion

Observation 1: even non-colluding sources in the same domain may contain the same data, e.g. movie databases may all contain every Hollywood movie.

Observation 2: the top-k answers of even non-colluding sources may be similar, e.g. the answers to the query "Godfather" may all contain the three movies of the Godfather trilogy.

Sources may copy data from each other, or set up mirrors, boosting the SourceRank of the group.
Source Collusion (continued)

Basic method: if two sources return the same top-k answers to queries with a large number of answers (e.g. queries like "the" or "DVD"), they are likely to be colluding. We compute the degree of collusion of two sources as their agreement on such large-answer queries; the words with the highest DF in the crawl are used as the queries. The agreement between two databases is then adjusted for collusion by multiplying it by (1 - collusion), as sketched below.
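A minimal sketch of the collusion adjustment, assuming the same hypothetical `source.query(q)` interface and an `agreement_fn` over result sets as above; the probe queries stand in for the highest-DF words from the crawl.

```python
def collusion(agreement_fn, source1, source2, probe_queries):
    """Degree of collusion: mean top-k agreement on large-answer probe queries.

    Probe queries are the highest-DF words in the crawl (e.g. "the", "DVD"), for which
    even unrelated sources return many answers; high agreement on these queries
    indicates copying or mirroring rather than genuine relevance.
    """
    scores = [
        agreement_fn(source1.query(q), source2.query(q)) for q in probe_queries
    ]
    return sum(scores) / len(scores)

def adjusted_agreement(raw_agreement, collusion_score):
    """Discount the raw agreement so that near-mirrors do not boost each other."""
    return raw_agreement * (1.0 - collusion_score)
```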
Factal: Search Based on SourceRank

http://factal.eas.asu.edu

"I personally ran a handful of test queries this way and got much better results [than Google Products] using Factal" --- Anonymous WWW'11 Reviewer.
Evaluation

Precision and DCG are compared against the following baseline methods:

1) CORI: adapted from text database selection. The union of sample documents from the sources is indexed, and the sources with the highest number of term hits are selected [Callan et al. 1995].
2) Coverage: adapted from relational databases. The mean relevance of the top-5 results to the sampling queries [Nie et al. 2004].
3) Google Products: the product search used over Google Base.

All experiments distinguish SourceRank from the baseline methods at the 0.95 confidence level.
Online Top-4 Sources: Movies

[Figure: precision and DCG of the top-4 selected sources in the movie domain, comparing Coverage, SourceRank, CORI, SR-Coverage, and SR-CORI; SourceRank shows a 29% improvement.]

Though the combinations (SR-Coverage, SR-CORI) are not our competitors, note that they are not better:
1. SourceRank implicitly considers query relevance, since the selected sources fetch answers by query similarity; combining again with query similarity may be an "overweighting".
2. The search is vertical.
Online Top-4 Sources: Books

[Figure: precision and DCG of the top-4 selected sources in the book domain; SourceRank shows a 48% improvement.]
Google Base Top-5 Precision: Books (675 Sources)

[Figure: top-5 precision over Google Base book sources; SourceRank shows a 24% improvement.]

The 675 Google Base sources responding to a set of book queries are used as the book-domain sources. GBase-Domain is Google Base searching only these 675 domain sources. Source selection by SourceRank (or Coverage) is followed by ranking by Google Base.
Google Base Top-5 Precision: Movies (209 Sources)

[Figure: top-5 precision comparing GBase, GBase-Domain, SourceRank, and Coverage over 209 Google Base movie sources; SourceRank shows a 25% improvement.]
Trustworthiness of Source Selection (Google Base Movies)

[Figure: decrease in rank (%) of corrupted sources as the corruption level rises from 0 to 0.9, for SourceRank, Coverage, and CORI.]

1. We corrupted the results in the sample crawl by replacing attribute values not specified in the queries with random strings (since partial titles are the queries, we corrupted all attributes except the titles).
2. If the source selection is sensitive to corruption, the ranks of the corrupted sources should decrease with the corruption level.

Any relevance measure based purely on query similarity is oblivious to the corruption of attributes left unspecified in the queries. A sketch of the corruption procedure follows.
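A minimal sketch of the corruption step, assuming records are dictionaries mapping attribute names to values; the protected-attribute name and the random-string generator are illustrative, not taken from the experiment code.

```python
import random
import string

def corrupt_record(record, corruption_level, protected=("title",)):
    """Replace attribute values not specified in the query with random strings.

    Since partial titles are the sampling queries, the title is left intact and every
    other attribute is corrupted with probability `corruption_level`.
    """
    corrupted = {}
    for attribute, value in record.items():
        if attribute not in protected and random.random() < corruption_level:
            corrupted[attribute] = "".join(
                random.choices(string.ascii_lowercase, k=len(str(value)) or 5)
            )
        else:
            corrupted[attribute] = value
    return corrupted
```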
Trustworthiness: Google Base Books

[Figure: decrease in rank (%) of corrupted sources as the corruption level rises from 0 to 0.9, for SourceRank, Coverage, and CORI.]
Collusion: Ablation Study

[Figure: collusion, agreement, and adjusted agreement plotted against the rank correlation (0 to 1) of the two sources.]

Two databases with the same one million tuples from IMDB are created, and the correlation between their ranking functions is reduced step by step. The goal is to preserve natural agreement while catching near-mirrors.

Observations:
1. At high correlation, the adjusted agreement is very low.
2. At low correlations, the adjusted agreement is almost the same as the pure agreement.
Computation Time

The random walk is known to be feasible at large scale. The time to compute the agreements is evaluated against the number of sources; note that this computation is offline and easy to parallelize.
Contributions
1. Agreement based trust assessment for the deep web
2. Agreement based relevance assessment for the deep web
3. Collusion detection between the web sources
4. Evaluations in Google Base sources and online web databases
The search using SourceRank is demonstrated on Friday, 10-15:30.