Metasearch
Mathematics of Knowledge and Search Engines: Tutorials @ IPAM, 9/13/2007
Zhenyu (Victor) Liu
Software Engineer
Google Inc.
2
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
3
Metasearch – the problem
[Figure: the query “applied mathematics” is sent to the Metasearch engine, which returns search results; two elements of the diagram are labeled “???”.]
4
Subproblems
- Database content modeling: how does a Metasearch engine “perceive” the content of each database?
- Database selection: selectively issue the query to the “best” databases
- Query translation: different databases have different query formats, e.g., “a+b” / “a AND b” / “title:a AND body:b”
- Result merging: for the query “applied mathematics”, given the top-10 results from both science.com and nature.com, how should they be presented?
5
Database content modeling and selection: a simplified example
- A “content summary” of each database
- Selection based on # of matching docs
- Assuming independence between words

First database (Total #: 10,000)
Word w        # of documents that use w    Pr(w)
applied       4000                         0.4
mathematics   2500                         0.25
10,000 × 0.4 × 0.25 = 1000 documents match “applied mathematics”

Second database (Total #: 60,000)
Word w        # of documents that use w    Pr(w)
applied       200                          0.00333
mathematics   300                          0.005
60,000 × 0.00333 × 0.005 ≈ 1 document matches “applied mathematics”

1000 > 1, so the first database is preferred for this query.
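The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration of the word-independence estimate; the function name `estimate_matches` and the labels "first" and "second" are made up here and do not come from the slides.

```python
def estimate_matches(total_docs, doc_freqs, query_words):
    """Estimate the # of docs containing all query words, assuming the
    words occur independently of each other."""
    estimate = float(total_docs)
    for word in query_words:
        # Pr(w) = df(w) / |db|; a word missing from the summary gives 0.
        estimate *= doc_freqs.get(word, 0) / total_docs
    return estimate

# Content summaries of the two databases in the example (labels are hypothetical).
summaries = {
    "first": (10_000, {"applied": 4000, "mathematics": 2500}),
    "second": (60_000, {"applied": 200, "mathematics": 300}),
}
query = ["applied", "mathematics"]
scores = {name: estimate_matches(n, df, query) for name, (n, df) in summaries.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

The ranking puts the first database ahead of the second, matching the comparison on the slide.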
6
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
7
Database content modeling
- Able to replicate the entire text database: most storage demanding; fully cooperative database
- Download part of a text database: more storage demanding; non-cooperative database
- Able to obtain a full content summary: less storage demanding; fully cooperative database
- Approximate the content summary via sampling: least storage demanding; non-cooperative database
8
Replicate the entire database
- E.g., www.google.com/patents, a replica of the entire USPTO patent document database
9
Download a non-cooperative database
- Objective: download as much as possible
- Basic idea: “probing” (querying with short queries) and downloading all results
- Practically, can only issue a fixed # of probes (e.g., 1000 queries per day)
[Figure: the Metasearch engine sends probes such as “applied” and “mathematics” through the search interface of a text database.]
10
Harder than the “set-coverage” problem
- All docs in a database db as the universe (assuming all docs are equal)
- Each probe corresponds to a subset
- Find the least # of subsets (probes) that covers db, or the max coverage with a fixed # of subsets (probes)
- NP-complete
- The greedy algorithm is proved to be the best-possible polynomial-time approximation algorithm
- Cardinality of each subset (# of matching docs for each probe) unknown!
[Figure: the subsets of db matched by the probes “mathematics” and “applied”.]
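As a point of comparison, here is a minimal sketch of the textbook greedy algorithm for max coverage with a fixed probe budget. It assumes every probe's result set is known in advance, which is exactly what a Metasearch engine does not have; the function name and the toy document ids are made up for illustration.

```python
def greedy_max_coverage(subsets, budget):
    """Pick up to `budget` probes, each time taking the probe that adds
    the most not-yet-covered docs (the max "cardinality gain")."""
    covered = set()
    chosen = []
    remaining = dict(subsets)
    for _ in range(budget):
        best = max(remaining, key=lambda w: len(remaining[w] - covered), default=None)
        if best is None or not (remaining[best] - covered):
            break  # no probe left that adds a new doc
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

# Toy universe of docs 1..6; each probe word maps to the docs it matches.
probes = {
    "applied": {1, 2, 3, 4},
    "mathematics": {3, 4, 5},
    "search": {5, 6},
}
chosen, covered = greedy_max_coverage(probes, budget=2)
print(chosen, sorted(covered))
```

After "applied" is chosen, "search" adds two new docs while "mathematics" adds only one, so the greedy choice depends on overlap and not on raw subset size.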
11
Pseudo-greedy algorithms [NPC05]
- Greedy set coverage: choose subsets with the max “cardinality gain”
- When the cardinality of subsets is unknown:
  - Assume the cardinality of subsets is proportionally the same across databases, e.g., build a database with Web pages crawled from the Internet and rank single words according to their frequency
  - Start with certain “seed” queries and adaptively choose query words within the docs returned; the choice of probing words then varies from database to database
12
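An adaptive method
- D(w_i): the subset of docs returned by the probe with word w_i
- w_1, w_2, …, w_n already issued
- Next probe word w_{i+1}: among the words used by the docs in $\bigcup_{j=1}^{n} D(w_j)$, the one with the max cardinality gain:
  $w_{i+1} = \arg\max_{w} \left| D(w) - \bigcup_{j=1}^{n} D(w_j) \right|$
- The gain is rewritten as |db|·Pr(w_{i+1}) - |db|·Pr(w_{i+1} ∧ (w_1 ∨ … ∨ w_n))
- Pr(w): prob. of w appearing in a doc of db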
13
An adaptive method (cont’d)
- How to estimate P̃r(w_{i+1})?
- Zipf’s law: $\Pr(w) = \alpha\,(R(w)+\beta)^{-\gamma}$, where R(w) is the rank of w in descending order of Pr(w)
- Assuming the ranking of w_1, w_2, …, w_n and other words remains the same in the downloaded subset and in db
- Interpolate the estimate onto the fitted curve
[Figure: single words ranked by Pr(w) in the downloaded documents; a Zipf’s law curve is fitted to the Pr(w) values for w_1, w_2, …, w_n, and the interpolated “P̃r(w)” of the other words is read off the fitted curve.]
14
Obtain an exact content summary
- C(db) for a database db: statistics about words in db, e.g., df (document frequency)

  w             df
  mathematics   2500
  applied       4000
  research      1000

- Standards and proposals for cooperative databases to follow to export C(db)
  - STARTS [GCM97]: initiated by Stanford; attracted the main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.
  - SDARTS [GIG01]: initiated by Columbia U.
15
Approximate the content-summary
Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy
Basic idea: probing and downloading sample docs [CC01]
Example: df as the content summary statistic
1. Pick a single word as the query, probe the database
2. Download a fraction of results, e.g., top-k
3. If the terminating condition is unsatisfied, go to 1.
4. Output <w, df̃> based on the sample docs downloaded
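The four steps above can be sketched against a toy in-memory database. Everything here is an assumption made for illustration: the `search` function stands in for a real search interface, the next probe word is taken first-come from the words seen in downloaded docs, and the terminating condition is a simple probe budget.

```python
def search(db, word, k):
    """Hypothetical search interface: ids of up to k docs containing `word`."""
    return [i for i, doc in enumerate(db) if word in doc.split()][:k]

def sample_content_summary(db, seed_word, k=2, max_probes=3):
    sampled = {}              # doc id -> set of words in that doc
    candidates = [seed_word]  # words we may probe with next
    used = set()
    probes = 0
    while candidates and probes < max_probes:   # step 3: terminating condition
        word = candidates.pop(0)                # step 1: pick a single word
        if word in used:
            continue
        used.add(word)
        probes += 1
        for doc_id in search(db, word, k):      # step 2: download top-k results
            if doc_id not in sampled:
                sampled[doc_id] = set(db[doc_id].split())
                candidates.extend(sorted(sampled[doc_id] - used))
    # Step 4: document frequency of each word within the sample.
    df_sample = {}
    for words in sampled.values():
        for w in words:
            df_sample[w] = df_sample.get(w, 0) + 1
    return df_sample, len(sampled)

toy_db = [
    "applied mathematics research",
    "applied physics",
    "pure mathematics",
    "search engines research",
]
df_sample, n_sampled = sample_content_summary(toy_db, seed_word="applied")
print(n_sampled, df_sample)
```

The frequencies returned are counts within the sample only; scaling them to the whole database is a separate estimation step.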
16
Vocabulary coverage
- Can a small sample of docs cover the vocabulary of a big database?
- Yes, based on Heaps’ law [Hea78]: $|W| = K n^{\beta}$
  - n: # of words scanned
  - W: set of distinct words encountered
  - K: constant, typically in [10, 100]
  - β: constant, typically in [0.4, 0.6]
- Empirically verified [CC01]
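A quick numeric illustration of why the sublinear growth helps sampling. The constants K = 50 and β = 0.5 are picked from inside the typical ranges quoted above purely for illustration; they are not measured values.

```python
def heaps_vocabulary_size(n_words, K=50.0, beta=0.5):
    """Predicted # of distinct words after scanning n_words words."""
    return K * n_words ** beta

full = heaps_vocabulary_size(1_000_000)   # whole (hypothetical) database
sample = heaps_vocabulary_size(10_000)    # a 1% sample of the text
coverage = sample / full
print(full, sample, coverage)
```

With these constants, scanning 1% of the text is predicted to reach about 10% of the vocabulary, since the vocabulary grows with the square root of the text scanned.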
17
Estimate document frequency
- How to identify the df̃ of w in the entire database?
  - w used as a query during sampling: df typically revealed in search results
  - w’ appearing in the sampled docs: need to estimate df̃ based on the docs sample
- Apply Zipf’s law & interpolate [IG02]
  1. Rank w and w’ based on their frequency in the sample
  2. Curve-fit based on the true df of those w
  3. Interpolate the estimated df̃ of w’ onto the fitted curve
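A minimal sketch of the rank, fit, and interpolate steps. It simplifies the curve to a pure power law df = α·rank^(-γ), fitted by least squares in log-log space, which is an assumption made here for brevity and not necessarily the exact form fitted in [IG02]. The sample counts and "true" df values are invented.

```python
import math

def fit_power_law(ranks, values):
    """Least-squares fit of log(value) = log(alpha) - gamma * log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope   # alpha, gamma

def estimate_df(sample_freq, true_df):
    # Step 1: rank all words by their frequency in the sample.
    ranked = sorted(sample_freq, key=lambda w: (-sample_freq[w], w))
    rank = {w: i + 1 for i, w in enumerate(ranked)}
    # Step 2: fit the curve on words whose true df is known (the query words).
    known = [w for w in ranked if w in true_df]
    alpha, gamma = fit_power_law([rank[w] for w in known],
                                 [true_df[w] for w in known])
    # Step 3: interpolate the remaining words onto the fitted curve.
    return {w: true_df.get(w, alpha * rank[w] ** -gamma) for w in ranked}

sample_freq = {"applied": 40, "mathematics": 25, "research": 10, "search": 5}
true_df = {"applied": 4000, "search": 1000}   # revealed by the search interface
estimates = estimate_df(sample_freq, true_df)
print(estimates)
```

At least two words with known df and distinct ranks are needed for the fit to be defined.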
18
What if db changes over time?
- So does its content summary C(db), and C̃(db) [INC05]
- Empirical study
  - 152 Web databases, a snapshot downloaded weekly, for 1 year
  - df as the statistics measure
  - Kullback-Leibler divergence as the “change” measure between the “latest” snapshot and the snapshot time t ago
- db does change!
- How do we model the change?
- When to resample, and get a new C̃(db)?
[Figure: Kullback-Leibler divergence (vertical axis) plotted against t (horizontal axis).]
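One way to compute such a change measure from two df tables is sketched below. How [INC05] normalizes and smooths the statistics is not stated on the slide, so the additive smoothing and the normalization over the joint vocabulary are assumptions made here.

```python
import math

def kl_divergence(df_new, df_old, smoothing=1e-6):
    """KL(new || old) between word distributions derived from df tables."""
    vocab = set(df_new) | set(df_old)

    def to_distribution(df):
        total = sum(df.get(w, 0) + smoothing for w in vocab)
        return {w: (df.get(w, 0) + smoothing) / total for w in vocab}

    p, q = to_distribution(df_new), to_distribution(df_old)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

latest = {"applied": 4000, "mathematics": 2500, "research": 1000}
older = {"applied": 3000, "mathematics": 2500, "research": 2000}
unchanged = kl_divergence(latest, latest)
changed = kl_divergence(latest, older)
print(unchanged, changed)
```

Identical snapshots give a divergence of zero, and any shift in the relative frequencies gives a positive value.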
19
Model the change
- KL_db(t): the KL divergence between the current C̃(db) and C̃(db, t) time t ago
- T: time when KL_db(t) exceeds a pre-specified τ
- Applying principles of Survival Analysis
  - Survival function: $S_{db}(t) = 1 - \Pr(T \le t)$
  - Hazard function: $h_{db}(t) = -\dfrac{dS_{db}(t)/dt}{S_{db}(t)}$
- How to compute h_db(t) and then S_db(t)?
20
Learn the h_db(t) of database change
- Cox proportional hazards regression model: $\ln(h_{db}(t)) = \ln(h_{base}(t)) + \beta_1 x_1 + \dots$, where x_i is some predictor variable
- Predictors
  - Pre-specified threshold τ
  - Web domain of db: “.com”, “.edu”, “.gov”, “.org”, “others” (5 binary “domain variables”)
  - ln(|db|)
  - avg KL_db(1 week) measured in the training period
  - …
21
Train the Cox model
- Stratified Cox model being applied
  - Domain variables didn’t satisfy the Cox proportional hazards assumption
  - Stratifying on each domain, i.e., a separate h_base(t) / S_base(t) for each domain
- Training S_base(t) for each domain
  - Assuming a Weibull distribution, $S_{base}(t) = e^{-\lambda t^{\gamma}}$
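To make the pieces concrete, here is a small sketch of a Weibull baseline and of how a proportional hazards model turns it into a per-database survival function: since h_db(t) = h_base(t)·exp(Σ β_i x_i), it follows that S_db(t) = S_base(t)^exp(Σ β_i x_i). All parameter values below are hypothetical.

```python
import math

def weibull_survival(t, lam, gamma):
    """S_base(t) = exp(-lam * t**gamma)."""
    return math.exp(-lam * t ** gamma)

def weibull_hazard(t, lam, gamma):
    """h_base(t) = -S'(t) / S(t) = lam * gamma * t**(gamma - 1), for t > 0."""
    return lam * gamma * t ** (gamma - 1)

def cox_survival(t, lam, gamma, betas, xs):
    """S_db(t) under proportional hazards with predictor values xs."""
    risk = math.exp(sum(b * x for b, x in zip(betas, xs)))
    return weibull_survival(t, lam, gamma) ** risk

base = weibull_survival(4.0, lam=0.05, gamma=0.8)
# One hypothetical predictor with a positive coefficient raises the hazard,
# so the database is less likely to have "survived" unchanged by time t.
faster = cox_survival(4.0, lam=0.05, gamma=0.8, betas=[0.5], xs=[1.0])
print(base, faster)
```

With γ = 1 the hazard is the constant λ and the baseline reduces to the exponential distribution; other values of γ make the hazard change over time.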
22
Training result
[Figure: S_base(t) plotted against t.]
- γ ranges in (0.57, 1.08)
- S_base(t) is not an exponential distribution
23
Training result (cont’d)
- A larger db takes less time to have KL_db(t) exceed τ
- Databases that change faster during a short period are more likely to change later on

predictor   ln(|db|)   avg KL_db(1 week)   τ
β value     0.094      6.762               -1.305
24
How to use the trained model?
- The model gives S_db(t), the likelihood that db “has not changed much”
- An update policy to periodically resample each db
  - Intuitively, maximize ∑_db S_db(t)
  - More precisely, maximize $S = \lim_{t \to \infty} \frac{1}{t} \int_0^t \Big[ \sum_{db} S_{db}(t') \Big]\, dt'$
  - A policy: {f_db}, where f_db is the update frequency of db, e.g., 2/week
  - Subject to practical constraints, e.g., total update cap per week
25
Derive an optimal update policy
- Find {f_db} that maximizes S under the constraint ∑_db f_db = F, where F is a global frequency limit
- Solvable by the Lagrange-multiplier method
- Sample results:

db                 λ       F=4/week   F=15/week
tomshardware.com   0.088   1/46       1/5
Usps.com           0.023   1/34       1/12
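The Lagrange-multiplier idea can be sketched numerically under a strong simplification: assume an exponential survival function S_db(t) = exp(-λt), i.e., γ = 1, which the training results say does not hold in general. Updating db every 1/f time units then gives a time-averaged survival of (f/λ)(1 - exp(-λ/f)); at the optimum every updated database has the same marginal gain μ, and μ is found by bisection so that the frequencies sum to F. The function names are invented, the time units are left unspecified, and the code is not expected to reproduce the table above.

```python
import math

def marginal_gain(f, lam):
    """d/df of the time-averaged survival (f/lam) * (1 - exp(-lam/f))."""
    x = lam / f
    return (1.0 - math.exp(-x) * (1.0 + x)) / lam

def frequency_for(mu, lam, f_max=1e6):
    """Frequency at which the marginal gain equals mu (0 if it never does)."""
    if mu >= 1.0 / lam:            # gain is below mu for every f > 0
        return 0.0
    lo, hi = 1e-12, f_max          # marginal_gain is decreasing in f
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if marginal_gain(mid, lam) > mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def optimal_frequencies(lams, total_frequency):
    """Bisect on the multiplier mu until the frequencies sum to the cap."""
    lo, hi = 0.0, max(1.0 / lam for lam in lams)
    for _ in range(200):
        mu = (lo + hi) / 2.0
        if sum(frequency_for(mu, lam) for lam in lams) > total_frequency:
            lo = mu                # too many updates: raise the "price"
        else:
            hi = mu
    return [frequency_for((lo + hi) / 2.0, lam) for lam in lams]

lams = [0.088, 0.023]              # λ values borrowed from the table for illustration
freqs = optimal_frequencies(lams, total_frequency=4.0)
print(freqs, sum(freqs))
```

The nested bisection avoids any external solver; it works because the averaged survival is concave in f, so the marginal gain is monotone.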
26
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
27
Database selection
- Select the databases to which a given query is issued
  - Necessary when the Metasearch engine does not have an entire replica of each database, most likely only a content summary
  - Reduces query load in the entire system
- Formalization
  - Query q = <w1, …, wm>, databases db1, …, dbn
  - Rank databases according to their “relevancy score” r(dbi, q) to query q
28
Relevancy score
- # of matching docs in db
- Similarity between q and top docs returned by db
  - Typically vector-space similarity (dot-product) between q and a doc
  - Sum / Avg of similarities of top-k docs of each db, e.g., top-10
  - Sum / Avg of similarities of top docs of each db exceeding a similarity threshold
- Relevancy of db as judged by users
  - Explicit relevance feedback
  - User click behavior data
29
Estimating r(db,q)
- Typically, r(db, q) is unavailable
- Estimate r̃(db, q) based on C(db), or C̃(db)
30
Estimating r(db,q), example 1 [GGT99]
- r(db, q): # of matching docs in db
- Independence assumption: query words w1, …, wm appear independently in db
- r̃(db, q): $\tilde{r}(db, q) = |db| \times \prod_{w_j \in q} \dfrac{df(db, w_j)}{|db|}$
- df(db, wj): document frequency of wj in db; could be df̃(db, wj) from C̃(db)
31
Estimating r(db,q), example 2 [GGT99]
- r(db, q): $\sum_{\{d \in db \,\mid\, sim(d, q) > l\}} sim(d, q)$
  - d: a doc in db
  - sim(d, q): vector dot-product between d & q; each word in d & q weighted with common tf-idf weighting
  - l: a pre-specified threshold
32
Estimating r(db,q), example 2 (cont’d)
- Content summary, C(db), required:
  - df(db, w): doc frequency
  - v(db, w): $\sum_{d \in db}$ weight of w in d’s vector
  - <v(db, w1), v(db, w2), …>: “centroid” of the entire db as a “cluster of doc vectors”
33
Estimating r(db,q), example 2 (cont’d)
- l = 0: sum of all q-doc similarity values of db
  - r(db, q) = $\sum_{d \in db} sim(d, q)$
  - r̃(db, q) = r(db, q) = <v(q, w1), …> · <v(db, w1), v(db, w2), …>
  - v(q, w): weight of w in the query vector
- l > 0?
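The l = 0 case is exact because the dot product is linear: summing sim(d, q) over all docs equals one dot product between q and the summed doc vectors. A small check with made-up weights, where vectors are stored as word-to-weight dictionaries (an implementation choice for this sketch):

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {word: weight}."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())

docs = [
    {"applied": 0.5, "mathematics": 0.8},
    {"applied": 0.3},
    {"search": 0.9},
]
query = {"applied": 1.0, "mathematics": 1.0}

# v(db, w): sum over all docs of the weight of w, i.e., the "centroid" vector.
centroid = {}
for doc in docs:
    for word, weight in doc.items():
        centroid[word] = centroid.get(word, 0.0) + weight

exact = sum(dot(query, doc) for doc in docs)   # needs every doc vector
from_summary = dot(query, centroid)            # needs only the content summary
print(exact, from_summary)
```

Both values agree up to floating-point rounding. Once a threshold l > 0 filters docs, the sum is no longer linear in the doc vectors and this shortcut stops being exact.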
34
Estimating r(db,q), example 2 (cont’d)
- Assuming uniform weight of w among all docs using w, i.e., weight of w in any doc = v(db, w) / df(db, w)
- Highly-correlated query words scenario
  - If df(db, wi) < df(db, wj), every doc using wi also uses wj
  - Words in q sorted s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm)
  - $\tilde{r}(db, q) = \sum_{i=1}^{p} v(q, w_i)\, v(db, w_i) + df(db, w_p) \Big[ \sum_{j=p+1}^{m} \dfrac{v(q, w_j)\, v(db, w_j)}{df(db, w_j)} \Big]$, where p is determined by some criteria [GGT99]
- Disjoint query words scenario
  - No doc using wi uses wj
  - $\tilde{r}(db, q) = \sum_{i = 1 \dots m \,\mid\, df(db, w_i) > 0 \,\wedge\, v(q, w_i)\, v(db, w_i) / df(db, w_i) > l} v(q, w_i)\, v(db, w_i)$
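The disjoint scenario is the simpler of the two to sketch: each query word contributes its whole mass v(q, w)·v(db, w) when the per-doc similarity v(q, w)·v(db, w)/df(db, w) clears the threshold l, and nothing otherwise. The function name and the numbers below are invented for illustration.

```python
def estimate_disjoint(query_weights, v_db, df_db, threshold):
    """r~(db, q) when no doc contains more than one of the query words."""
    total = 0.0
    for word, vq in query_weights.items():
        df = df_db.get(word, 0)
        if df == 0:
            continue                      # the word does not occur in db
        mass = vq * v_db.get(word, 0.0)   # summed similarity over docs using word
        if mass / df > threshold:         # per-doc similarity under uniform weights
            total += mass
    return total

query = {"applied": 1.0, "mathematics": 1.0}
v_db = {"applied": 10.0, "mathematics": 2.0}
df_db = {"applied": 20, "mathematics": 10}

loose = estimate_disjoint(query, v_db, df_db, threshold=0.0)
strict = estimate_disjoint(query, v_db, df_db, threshold=0.3)
print(loose, strict)
```

Raising the threshold to 0.3 drops "mathematics", whose per-doc similarity is only 0.2, while "applied" at 0.5 still counts.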
35
Estimating r(db,q), example 2 (cont’d) Ranking of databases based on r̃(db, q)
empirically evaluated [GGT99]
36
A probabilistic model for errors in estimation [LLC04]
- Any estimation makes errors
- An (observed) error distribution for each db; the distribution of db1 ≠ the distribution of db2
- Definition of error (relative): $err(db, q) = \dfrac{r(db, q) - \tilde{r}(db, q)}{\tilde{r}(db, q)}$
37
Modeling the errors: a motivating experiment
- db_PMC: PubMedCentral, www.pubmedcentral.nih.gov
- Two query sets, Q1 and Q2 (healthcare related): |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅
- Compute err(db_PMC, q) for each sample query q ∈ Q1 or Q2
- Further verified through statistical tests (Pearson χ²)
[Figure: two histograms of the error probability distribution, one for err(db_PMC, q) with q ∈ Q1 and one for err(db_PMC, q) with q ∈ Q2.]
38
Implications of the experiment
- On a text database
  - Similar error behavior among sample queries
  - Can sample a database and summarize the error behavior into an Error Distribution (ED)
  - Use the ED to predict the error for a future unseen query
- Sampling size study [LLC04]
  - A few hundred sample queries are good enough
39
From an Error Distribution (ED) to a Relevancy Distribution (RD)
Database: db1. Query: q_new
- The ED for db1 (from sampling): err(db1, q_new) is -50%, 0%, +50% with prob. 0.4, 0.5, 0.1
- r̃(db1, q_new) = 1000 (existing estimation method)
- $r(db_1, q_{new}) = \tilde{r}(db_1, q_{new}) \times (err(db_1, q_{new}) + 1)$ (by definition)
- A Relevancy Distribution (RD) for r(db1, q_new): 500, 1000, 1500 with prob. 0.4, 0.5, 0.1
RD-based selection
- Estimation-based: r̃(db1, q_new) = 1000 and r̃(db2, q_new) = 650, so db1 > db2
- db1: err(db1, q_new) is -50%, 0%, +50% with prob. 0.4, 0.5, 0.1, so r(db1, q_new) is 500, 1000, 1500 with the same probabilities
- db2: err(db2, q_new) is 0%, +100% with prob. 0.1, 0.9, so r(db2, q_new) is 650, 1300 with the same probabilities
- RD-based: db1 < db2 (Pr(db1 < db2) = 0.85)
41
Correctness metric
- Terminology:
  - DB_k: k databases returned by some method
  - DB_topk: the actual answer
- How correct is DB_k compared to DB_topk?
  - Absolute correctness: Cor_a(DB_k) = 1 if DB_k = DB_topk, 0 otherwise
  - Partial correctness: $Cor_p(DB_k) = \dfrac{|DB_k \cap DB_{topk}|}{k}$
  - Cor_a(DB_k) = Cor_p(DB_k) for k = 1
42
Effectiveness of RD-based selection
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the ED of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection

Selection method                                           k = 1: Avg(Cor_a) = Avg(Cor_p)   k = 3: Avg(Cor_a)   k = 3: Avg(Cor_p)
Estimation-based selection (term-independence estimator)   0.471                            0.301               0.699
RD-based selection                                         0.651 (+38.2%)                   0.478 (+58.8%)      0.815 (+30.9%)
43
Probing to improve correctness
- RD-based selection:
  0.85 = Pr(db2 > db1)
       = Pr({db2} = DB_top1)
       = 1·Pr({db2} = DB_top1) + 0·Pr({db2} ≠ DB_top1)
       = E[Cor_a({db2})]
- Probe db_i: contact db_i to obtain its exact relevancy
- After probing db1 and finding r(db1, q) = 500: E[Cor_a({db2})] = Pr(db2 > db1) = 1
[Figure: the RD of db1 (500, 1000, 1500 with prob. 0.4, 0.5, 0.1) collapses to the single value r(db1, q) = 500 after probing; the RD of db2 stays at 650, 1300 with prob. 0.1, 0.9.]
44
Computing the expected correctness
- Expected absolute correctness:
  E[Cor_a(DB_k)] = 1·Pr(Cor_a(DB_k) = 1) + 0·Pr(Cor_a(DB_k) = 0)
                 = Pr(Cor_a(DB_k) = 1)
                 = Pr(DB_k = DB_topk)
- Expected partial correctness:
  $E[Cor_p(DB_k)] = \sum_{0 \le l \le k} \dfrac{l}{k} \cdot \Pr\Big(Cor_p(DB_k) = \dfrac{l}{k}\Big) = \sum_{0 \le l \le k} \dfrac{l}{k} \cdot \Pr\big(|DB_k \cap DB_{topk}| = l\big)$
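For small discrete RDs the expectations can be computed by brute-force enumeration of all joint outcomes. The sketch below assumes the databases' relevancies are independent and breaks ties by database name; both are assumptions of this illustration, and the function names are invented.

```python
from itertools import product

def cor_a(db_k, db_topk):
    """Absolute correctness: 1 only if the two sets are identical."""
    return 1.0 if set(db_k) == set(db_topk) else 0.0

def cor_p(db_k, db_topk):
    """Partial correctness: fraction of db_k that is in the true top-k."""
    return len(set(db_k) & set(db_topk)) / len(db_k)

def expected_correctness(rds, db_k, metric):
    """E[metric(db_k)] over all joint outcomes of independent RDs.
    rds maps a database name to {relevancy value: probability}."""
    names = sorted(rds)
    k = len(db_k)
    total = 0.0
    for combo in product(*(rds[n].items() for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        outcome = dict(zip(names, (value for value, _ in combo)))
        # True top-k for this outcome; sorted() is stable, so ties keep name order.
        ranked = sorted(names, key=lambda n: -outcome[n])
        total += prob * metric(db_k, ranked[:k])
    return total

rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},
    "db2": {650: 0.1, 1300: 0.9},
}
e_abs = expected_correctness(rds, ["db2"], cor_a)
e_par = expected_correctness(rds, ["db2"], cor_p)
print(e_abs, e_par)
```

The enumeration grows exponentially with the number of databases, so it only illustrates the definitions.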
45
Adaptive probing algorithm: APro
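[Figure: APro flowchart. The databases db1, …, dbn are split into probed and unprobed ones. Given the RDs of the probed and unprobed databases, check whether any DB_k has E[Cor(DB_k)] ≥ t, where t is the user-specified correctness threshold. If YES, return this DB_k; if NO, probe another database and repeat.]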
46
Which database to probe?
- A greedy strategy
  - The stopping condition: E[Cor(DB_k)] ≥ t
  - Once probed, which database leads to the highest E[Cor(DB_k)]?
- Suppose we will probe db3:
  - if r(db3, q) = r_a, max E[Cor(DB_k)] = 0.85
  - if r(db3, q) = r_b, max E[Cor(DB_k)] = 0.8
  - if r(db3, q) = r_c, max E[Cor(DB_k)] = 0.9
- Probe the database that leads to the largest “expected” max E[Cor(DB_k)]
[Figure: the RDs of db1, db2, db3 and db4; the RD of db3 has the possible values r_a, r_b and r_c.]
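The greedy choice can be sketched for the k = 1 case, where E[Cor_a({db})] is the probability that db is the top database. For each candidate, the sketch averages, over the values a probe could reveal, the best correctness achievable afterwards. Independence between databases, tie-breaking by name, and the two-database RDs reused from the earlier selection example are assumptions of this illustration.

```python
from itertools import product

def prob_top1(rds):
    """Pr(db is the top-1 database) for each db, assuming independent RDs."""
    names = sorted(rds)
    win = {n: 0.0 for n in names}
    for combo in product(*(rds[n].items() for n in names)):
        prob = 1.0
        for _, p in combo:
            prob *= p
        # max() returns the first maximum, so ties go to the earlier name.
        best = max(range(len(names)), key=lambda i: combo[i][0])
        win[names[best]] += prob
    return win

def best_expected_correctness(rds):
    """max over single databases of E[Cor_a({db})]."""
    return max(prob_top1(rds).values())

def choose_probe(rds, unprobed):
    """Pick the database whose probing gives the largest expected max E[Cor_a]."""
    scores = {}
    for db in unprobed:
        expected = 0.0
        for value, prob in rds[db].items():
            after = dict(rds)
            after[db] = {value: 1.0}   # probing reveals the exact relevancy
            expected += prob * best_expected_correctness(after)
        scores[db] = expected
    return max(scores, key=scores.get), scores

rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},
    "db2": {650: 0.1, 1300: 0.9},
}
before = best_expected_correctness(rds)
choice, scores = choose_probe(rds, ["db1", "db2"])
print(before, choice, scores)
```

Probing db1 is expected to raise the achievable correctness further than probing db2, so the greedy strategy contacts db1 first.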
47
Effectiveness of adaptive probing
- 20 healthcare-related text databases on the Web
- Q1 (training, 1000 queries) to learn the RD of each database
- Q2 (testing, 1000 queries) to test the correctness of database selection
[Figure: three plots of average correctness (0 to 1) against the # of databases probed (0 to 5), each comparing adaptive probing (APro) with the term-independence estimator: avg Cor_a for k = 1, avg Cor_a for k = 3, and avg Cor_p for k = 3.]
48
The “lazy TA problem”
- Same problem, generalized & “humanized”
- After the final exam, the TA wants to find out the top scoring students
- The TA is “lazy” and doesn’t want to score all exam sheets
- Input: every student’s score as a known distribution, observed from previous quizzes and mid-term exams
- Output: a scoring strategy that maximizes the correctness of the “guessed” top-k students
49
Further study of this problem [LSC05]
- Proves greedy probing is optimal under special cases
- More interesting factors to be explored:
  - “Optimal” probing strategy in general cases
  - Non-uniform probing cost
  - Time-variant distributions
50
Roadmap
- The problem
- Database content modeling
- Database selection
- Summary
51
Summary
- Metasearch: a challenging problem
- Database content modeling
  - Sampling enhanced by proper application of Zipf’s law and Heaps’ law
  - Content change modeled using Survival Analysis
- Database selection
  - Estimation of database relevancy based on assumptions
  - A probabilistic framework that models the error as a distribution
  - “Optimal” probing strategy for a collection of distributions as input
52
References
[CC01] J.P. Callan and M. Connell, “Query-Based Sampling of Text Databases,” ACM Trans. on Information Systems, 19(2), 2001
[GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, “STARTS: Stanford Proposal for Internet Meta-searching,” in Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, 1997
[GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, “GlOSS: Text Source Discovery over the Internet,” ACM Trans. on Database Systems, 24(2), 1999
[GIG01] N. Green, P. Ipeirotis, L. Gravano, “SDLIP+STARTS=SDARTS: A Protocol and Toolkit for Metasearching,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001
[Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978
[IG02] P. Ipeirotis, L. Gravano, “Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection,” in Proc. of the 28th VLDB Conf., 2002
53
References (cont’d)
[INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, “Modeling and Managing Content Changes in Text Databases,” in Proc. of the 21st IEEE Int’l Conf. on Data Eng. (ICDE), 2005
[LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, “A Probabilistic Approach to Metasearching with Adaptive Probing,” in Proc. of the 20th IEEE Int’l Conf. on Data Eng. (ICDE), 2004
[LSC05] Z. Liu, K.C. Sia, J. Cho, “Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty,” in Proc. of ACM Annual Symposium on Applied Computing, 2005
[NPC05] A. Ntoulas, P. Zerfos, J. Cho, “Downloading Hidden Web Content,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005