metasearch mathematics of knowledge and search engines: tutorials @ ipam 9/13/2007 zhenyu (victor)...

Metasearch

Mathematics of Knowledge and Search Engines: Tutorials

@ IPAM, 9/13/2007

Zhenyu (Victor) Liu

Software Engineer

Google Inc.


2

Roadmap

The problem

Database content modeling

Database selection

Summary

3

Metasearch – the problem

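[Figure: a user submits the query “applied mathematics” to the Metasearch Engine, which must decide which of the underlying databases (marked “???”) to forward it to, then returns the search results.]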

4

Subproblems

Database content modeling: how does a Metasearch engine “perceive” the content of each database?

Database selection: selectively issue the query to the “best” databases

Query translation: different databases have different query formats, e.g., “a+b” / “a AND b” / “title:a AND body:b” / etc.

Result merging: given the top-10 results for the query “applied mathematics” from both science.com and nature.com, how should they be presented?

5

Database content modeling and selection: a simplified example

A “content summary” of each database

Selection based on # of matching docs

Assuming independence between words

Database 1 (Total #: 10,000)

Word w         # of documents that use w    Pr(w)
applied        4000                         0.4
mathematics    2500                         0.25

10,000 × 0.4 × 0.25 = 1000 documents match “applied mathematics”

Database 2 (Total #: 60,000)

Word w         # of documents that use w    Pr(w)
applied        200                          0.00333
mathematics    300                          0.005

60,000 × 0.00333 × 0.005 = 1 document matches “applied mathematics”

1000 > 1, so Database 1 is the better choice for this query.
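The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under the slide's independence assumption; the function name `estimate_matches` and the dictionary layout of the content summaries are assumptions made for this example.

```python
def estimate_matches(total_docs, doc_freqs, query_words):
    """Estimate the # of docs matching all query words, assuming word independence."""
    estimate = float(total_docs)
    for word in query_words:
        # Pr(w) = df(w) / |db|; a word missing from the summary contributes probability 0.
        estimate *= doc_freqs.get(word, 0) / total_docs
    return estimate


# Content summaries copied from the two tables above.
summary_1 = {"applied": 4000, "mathematics": 2500}   # 10,000 documents in total
summary_2 = {"applied": 200, "mathematics": 300}     # 60,000 documents in total
query = ["applied", "mathematics"]

est_1 = estimate_matches(10_000, summary_1, query)
est_2 = estimate_matches(60_000, summary_2, query)
print(round(est_1), round(est_2))
```

The database with the larger estimate is ranked first, which is how the comparison on the slide is read.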

6


7

Database content modeling

Replicate the entire text database: most storage demanding; requires a fully cooperative database

Download part of a text database: more storage demanding; works with a non-cooperative database

Obtain a full content summary: less storage demanding; requires a fully cooperative database

Approximate the content summary via sampling: least storage demanding; works with a non-cooperative database

8

Replicate the entire database

E.g., www.google.com/patents, a replica of the entire USPTO patent document database

9

Download a non-cooperative database

Objective: download as much as possible

Basic idea: “probing” (querying with short queries) and downloading all results

Practically, can only issue a fixed # of probes (e.g., 1000 queries per day)

[Figure: the Metasearch Engine sends probe queries such as “applied” and “mathematics” to the Search Interface of a text database.]

10

Harder than the “set-coverage” problem

All docs in a database db as the universe, assuming all docs are equal

Each probe corresponds to a subset

Find the least # of subsets (probes) that covers db or, the max coverage with a fixed # of subsets (probes)

NP-complete; the greedy algorithm is proved to be the best-possible polynomial-time approximation algorithm

Harder, because the cardinality of each subset (# of matching docs for each probe) is unknown!

[Figure: overlapping subsets of db returned by the probes “mathematics” and “applied”.]
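For reference, the classical greedy heuristic for the fixed-budget (max-coverage) variant can be sketched as follows. The sketch assumes that the result set of every candidate probe is already known, which is exactly what does not hold for a real non-cooperative database; the probe words and document ids are made up for illustration.

```python
def greedy_max_coverage(probe_results, budget):
    """Pick up to `budget` probes, each time taking the largest cardinality gain."""
    covered = set()
    chosen = []
    remaining = dict(probe_results)
    for _ in range(budget):
        if not remaining:
            break
        # Cardinality gain of a probe = # of docs it returns that are not yet covered.
        best = max(remaining, key=lambda w: len(remaining[w] - covered))
        if not remaining[best] - covered:
            break  # no remaining probe adds a new document
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered


# Hypothetical probe words mapped to the ids of the documents they match.
probe_results = {
    "applied": {1, 2, 3, 4},
    "mathematics": {3, 4, 5},
    "research": {5, 6},
    "theory": {1, 6},
}
chosen, covered = greedy_max_coverage(probe_results, budget=2)
print(chosen, sorted(covered))
```

After “applied” is chosen, “research” adds two new documents while “mathematics” adds only one, so the second pick is driven by gain rather than by raw subset size.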

11

Pseudo-greedy algorithms [NPC05]

Greedy-set-coverage: choose subsets with the max “cardinality gain”

When the cardinality of subsets is unknown, either:

  Assume the cardinality of subsets is the same across databases, proportionally, e.g., build a database with Web pages crawled from the Internet and rank single words according to their frequency

  Start with certain “seed” queries and adaptively choose query words within the docs returned; the choice of probing words varies from database to database

12

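An adaptive method

D(wi): the subset returned by the probe with word wi

w1, w2, …, wn already issued; the next word is chosen among the words used by the docs downloaded so far:

$$w_{i+1} = \arg\max_{w \text{ used by } \bigcup_{i=1}^{n} D(w_i)} \left| D(w) - \bigcup_{i=1}^{n} D(w_i) \right|$$

The objective can be rewritten as |db|Pr(wi+1) - |db|Pr(wi+1 Λ (w1 V … V wn))

Pr(w): prob. of w appearing in a doc of db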

13

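An adaptive method (cont’d)

How to estimate P̃r(wi+1)?

Zipf’s law: $\Pr(w) = \alpha\,(R(w)+\beta)^{-\gamma}$, where R(w) is the rank of w in a descending order of Pr(w)

Assuming the ranking of w1, w2, …, wn and other words remains the same in the downloaded subset and in db

Interpolate: fit a Zipf’s law curve to the known Pr(w) values for w1, w2, …, wn, then read the interpolated P̃r(w) of the other words off the fitted curve

[Figure: Pr(w) values for w1, w2, …, wn plotted against single words ranked by Pr(w) in the downloaded documents, with the fitted Zipf’s law curve and an interpolated “P̃r(w)”.]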

14

Obtain an exact content summary C(db) for a database db

Statistics about words in db, e.g., df (document frequency):

w              df
mathematics    2500
applied        4000
research       1000

Standards and proposals for cooperative databases to follow to export C(db):

STARTS [GCM97]: initiated by Stanford, attracted main search engine players by 1997: Fulcrum, Infoseek, PLS, Verity, WAIS, Excite, etc.

SDARTS [GIG01]: initiated by Columbia U.

15

Approximate the content-summary

Objective: C̃(db) of a database db, with high vocabulary coverage & high accuracy

Basic idea: probing and downloading sample docs [CC01]

Example: df as the content summary statistics

1. Pick a single word as the query, probe the database

2. Download a fraction of results, e.g., top-k

3. If terminating condition unsatisfied, go to 1.

4. Output <w, df̃> based on the sample docs downloaded
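The four steps above can be sketched as a small loop. Everything here is a toy under stated assumptions: `TOY_DB` and `toy_search` stand in for a real database and its search interface, the terminating condition is a fixed probe budget, and the next query word is drawn at random from the sampled vocabulary. The output is the raw document frequency within the sample, not yet scaled to the whole database.

```python
import random
from collections import Counter

# Hypothetical four-document database used only for this illustration.
TOY_DB = {
    1: "applied mathematics research",
    2: "applied physics",
    3: "pure mathematics",
    4: "applied mathematics education",
}


def toy_search(word):
    """Hypothetical search interface: (doc_id, words) for every doc containing `word`."""
    return [(i, text.split()) for i, text in sorted(TOY_DB.items()) if word in text.split()]


def query_based_sample(search, seed_word, top_k=4, max_probes=20, rng=None):
    """Probe, download the top-k results, and pick the next word from the sample."""
    rng = rng or random.Random(0)
    sample = {}                       # doc id -> set of words
    issued = set()
    word = seed_word
    for _ in range(max_probes):       # terminating condition: a fixed probe budget
        issued.add(word)
        for doc_id, words in search(word)[:top_k]:
            sample[doc_id] = set(words)
        # Candidate query words: seen in the sample but not issued yet.
        vocabulary = sorted(set().union(*sample.values()) - issued)
        if not vocabulary:
            break
        word = rng.choice(vocabulary)
    sample_df = Counter(w for words in sample.values() for w in words)
    return sample, sample_df


sample, sample_df = query_based_sample(toy_search, "applied")
print(sorted(sample), dict(sample_df))
```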

16

Vocabulary coverage

Can a small sample of docs cover the vocabulary of a big database?

Yes, based on Heaps’ law [Hea78]: $|W| = K n^{\beta}$

n: # of words scanned
W: set of distinct words encountered
K: constant, typically in [10, 100]
β: constant, typically in [0.4, 0.6]

Empirically verified [CC01]
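A quick numerical reading of the law, as a sketch: the values K = 50 and β = 0.5 are illustrative picks from the typical ranges quoted above, not measurements.

```python
def heaps_vocabulary_size(n_words_scanned, K=50.0, beta=0.5):
    """|W| = K * n**beta, with illustrative mid-range constants."""
    return K * n_words_scanned ** beta


sizes = {n: heaps_vocabulary_size(n) for n in (10_000, 1_000_000, 100_000_000)}
for n, size in sizes.items():
    print(n, round(size))
```

With β = 0.5, scanning 100 times more text grows the vocabulary only 10-fold, which is why a sample covering a small fraction of the text can still see a much larger fraction of the vocabulary.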

17

Estimate document frequency

How to identify the df̃ of w in the entire database?

w used as a query during sampling: df typically revealed in search results

w’ appearing in the sampled docs: need to estimate df̃ based on the docs sample

Apply Zipf’s law & interpolate [IG02]

1. Rank w and w’ based on their frequency in the sample

2. Curve-fit based on the true df of those w

3. Interpolate the estimated df̃ of w’ onto the fitted curve
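Steps 2 and 3 can be sketched with a plain least-squares fit in log space. This is one possible way to do the curve fit, not necessarily the procedure of [IG02]: it fits ln df = ln α − γ ln(rank + β) for each β in a small hypothetical grid and keeps the best one. The “true” df values below are synthetic, generated from a Zipf curve with α = 5000, β = 1, γ = 1.

```python
import math


def fit_zipf(ranks, dfs, betas=(0.0, 0.5, 1.0, 2.0, 5.0, 10.0)):
    """Fit df = alpha * (rank + beta) ** (-gamma) by linear regression in log space."""
    best = None
    ys = [math.log(d) for d in dfs]
    n = len(ys)
    my = sum(ys) / n
    for beta in betas:
        xs = [math.log(r + beta) for r in ranks]
        mx = sum(xs) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, math.exp(intercept), beta, -slope)
    _, alpha, beta, gamma = best
    return alpha, beta, gamma


def zipf_df(rank, alpha, beta, gamma):
    """Interpolated df of the word at the given rank on the fitted curve."""
    return alpha * (rank + beta) ** (-gamma)


# Words used as probes: sample rank and true df (synthetic values).
known_ranks = [1, 4, 10, 25]
known_dfs = [2500.0, 1000.0, 5000.0 / 11, 5000.0 / 26]

alpha, beta, gamma = fit_zipf(known_ranks, known_dfs)
estimated_df = zipf_df(7, alpha, beta, gamma)   # a word w' ranked 7th in the sample
print(round(alpha), beta, round(gamma, 3), round(estimated_df))
```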

18

What if db changes over time?

So does its content summary C(db), and C̃(db) [INC05]

Empirical study: 152 Web databases, a snapshot downloaded weekly, for 1 year

df as the statistics measure

Kullback-Leibler divergence as the “change” measure between the “latest” snapshot and the snapshot time t ago

db does change!

How do we model the change?

When to resample, and get a new C̃(db)?

[Figure: Kullback-Leibler divergence (y-axis) plotted against t (x-axis).]
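As an illustration of the change measure, the sketch below computes a KL divergence between two df-based content summaries. Several choices are assumptions of this example rather than details from [INC05]: df counts are normalized into a word distribution, add-one smoothing handles words missing from one snapshot, and the divergence is taken from the newer snapshot to the older one. The snapshot numbers are invented.

```python
import math


def kl_divergence(df_new, df_old, smoothing=1.0):
    """KL(P_new || P_old) over word distributions derived from df counts."""
    vocab = set(df_new) | set(df_old)
    total_new = sum(df_new.values()) + smoothing * len(vocab)
    total_old = sum(df_old.values()) + smoothing * len(vocab)
    kl = 0.0
    for w in vocab:
        # Smoothing keeps both probabilities positive, so the log is always defined.
        p = (df_new.get(w, 0) + smoothing) / total_new
        q = (df_old.get(w, 0) + smoothing) / total_old
        kl += p * math.log(p / q)
    return kl


snapshot_old = {"applied": 4000, "mathematics": 2500, "research": 1000}
snapshot_new = {"applied": 4100, "mathematics": 2300, "research": 1500, "metasearch": 200}

change = kl_divergence(snapshot_new, snapshot_old)
no_change = kl_divergence(snapshot_old, snapshot_old)
print(change, no_change)
```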

19

Model the change

KLdb(t): the KL divergence between the current C̃(db) and C̃(db, t) time t ago

T: time when KLdb(t) exceeds a pre-specified τ

Applying principles of Survival Analysis

Survival function: Sdb(t) = 1 - Pr(T ≤ t)

Hazard function: hdb(t) = - (dSdb(t)/dt) / Sdb(t)

How to compute hdb(t) and then Sdb(t)?

20

Learn the hdb(t) of database change

Cox proportional hazards regression model: ln( hdb(t) ) = ln( hbase(t) ) + β1x1 + … , where xi is some predictor variable

Predictors:

  Pre-specified threshold τ

  Web domain of db (“.com”, “.edu”, “.gov”, “.org”, “others”): 5 binary “domain variables”

  ln( |db| )

  avg KLdb(1 week) measured in the training period

  …

21

Train the Cox model

Stratified Cox model being applied

Domain variables didn’t satisfy the Cox proportional-hazards assumption

Stratifying on each domain, i.e., a separate hbase(t) / Sbase(t) for each domain

Training Sbase(t) for each domain: assuming a Weibull distribution, $S_{base}(t) = e^{-\lambda t^{\gamma}}$
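For readers who want the link between the hazard and survival functions spelled out, the following short derivation uses only the definitions given earlier plus standard survival-analysis identities; it is a worked note, not a step taken from the slides.

```latex
h(t) = -\frac{dS(t)/dt}{S(t)} = -\frac{d}{dt}\ln S(t)
\;\Longrightarrow\;
S(t) = \exp\!\Big(-\int_0^t h(u)\,du\Big), \qquad S(0) = 1.

S_{base}(t) = e^{-\lambda t^{\gamma}}
\;\Longrightarrow\;
h_{base}(t) = \lambda\,\gamma\, t^{\gamma - 1}.
```

For γ = 1 the hazard is the constant λ and the survival function is exponential; for γ < 1 the hazard decreases with t.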

22

Training result

[Figure: the trained Sbase(t) (y-axis) plotted against t (x-axis).]

γ ranges in (0.57, 1.08): Sbase(t) is not an exponential distribution

23

Training result (cont’d)

A larger db takes less time to have KLdb(t) exceed τ

Databases that change faster during a short period are more likely to change later on

predictor            β value
ln( |db| )           0.094
avg KLdb(1 week)     6.762
τ                    -1.305

24

How to use the trained model?

The model gives Sdb(t): the likelihood that db “has not changed much”

An update policy to periodically resample each db

Intuitively, maximize ∑db Sdb(t)

More precisely, maximize the time average

$$S = \lim_{t\to\infty} \frac{1}{t}\int_0^t \Big[\sum_{db} S_{db}(t)\Big]\,dt$$

A policy: {fdb}, where fdb is the update frequency of db, e.g., 2/week

Subject to practical constraints, e.g., total update cap per week

25

Derive an optimal update policy

Find {fdb} that maximizes S under the constraint ∑db fdb = F, where F is a global frequency limit

Solvable by the Lagrange-multiplier method

Sample results:

db                  λ        F=4/week    F=15/week
tomshardware.com    0.088    1/46        1/5
usps.com            0.023    1/34        1/12

26


27

Database selection

Select the databases to which a given query is issued

Necessary when the Metasearch engine does not have an entire replica of each database, most likely a content summary only

Reduces query load in the entire system

Formalization: query q = <w1, …, wm>, databases db1, …, dbn

Rank databases according to their “relevancy score” r(dbi, q) to query q

28

Relevancy score

# of matching docs in db

Similarity between q and top docs returned by db

  Typically vector-space similarity (dot-product) between q and a doc

  Sum / Avg of similarities of top-k docs of each db, e.g., top-10

  Sum / Avg of similarities of top docs of each db exceeding a similarity threshold

Relevancy of db as judged by users

  Explicit relevance feedback

  User click behavior data

29

Estimating r(db,q)

Typically, r(db, q) is unavailable

Estimate r̃(db, q) based on C(db), or C̃(db)

30

Estimating r(db,q), example 1 [GGT99]

r(db, q): # of matching docs in db

Independence assumption: query words w1, …, wm appear independently in db

r̃(db, q):

$$\tilde{r}(db, q) = |db| \times \prod_{w_j \in q} \frac{df(db, w_j)}{|db|}$$

df(db, wj): document frequency of wj in db; could be df̃(db, wj) from C̃(db)

31

Estimating r(db,q), example 2 [GGT99]

r(db, q): ∑{d ∈ db | sim(d, q) > l} sim(d, q)

d: a doc in db

sim(d, q): vector dot-product between d & q; each word in d & q weighted with common tfidf weighting

l: a pre-specified threshold

32

Estimating r(db,q), example 2 (cont’d)

Content summary, C(db), required:

df(db, w): doc frequency

v(db, w): ∑{d ∈ db} weight of w in d’s vector

<v(db, w1), v(db, w2), …>: “centroid” of the entire db as a “cluster of doc vectors”

33

Estimating r(db,q), example 2 (cont’d)

l = 0: sum of all q-doc similarity values of db

r(db, q) = ∑{d ∈ db} sim(d, q)

r̃(db, q) = r(db, q) = <v(q, w1), …> · <v(db, w1), v(db, w2), …>

v(q, w): weight of w in the query vector

l > 0?
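The l = 0 identity follows from the linearity of the dot product, and a small numerical check makes it concrete. The document vectors and weights below are made-up non-negative numbers standing in for tf-idf weights.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as word -> weight dictionaries."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


# Hypothetical weighted document vectors of a three-document database.
docs = [
    {"applied": 0.5, "mathematics": 0.8},
    {"applied": 0.3, "research": 0.9},
    {"mathematics": 0.6},
]
query = {"applied": 1.0, "mathematics": 1.0}

# v(db, w): the sum over all docs of the weight of w, i.e., the content summary.
v_db = {}
for d in docs:
    for word, weight in d.items():
        v_db[word] = v_db.get(word, 0.0) + weight

sum_of_similarities = sum(dot(query, d) for d in docs)   # r(db, q) with l = 0
via_summary = dot(query, v_db)                           # computed from C(db) alone
print(sum_of_similarities, via_summary)
```

Both numbers agree, so for l = 0 the content summary alone reproduces r(db, q) without any access to the individual documents.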


34

Estimating r(db,q), example 2 (cont’d)

Assuming uniform weight of w among all docs using w, i.e., weight of w in any doc = v(db, w) / df(db, w)

Highly-correlated query words scenario

If df(db, wi) < df(db, wj), every doc using wi also uses wj

Words in q sorted s.t. df(db, w1) ≤ df(db, w2) ≤ … ≤ df(db, wm)

$$\tilde{r}(db, q) = \sum_{i=1}^{p} v(q, w_i)\,v(db, w_i) + df(db, w_p) \left[ \sum_{j=p+1}^{m} \frac{v(q, w_j)\,v(db, w_j)}{df(db, w_j)} \right]$$

where p is determined by some criteria [GGT99]

Disjoint query words scenario

No doc using wi uses wj

$$\tilde{r}(db, q) = \sum_{\substack{i = 1 \ldots m \\ df(db, w_i) > 0 \,\wedge\, v(q, w_i)\,v(db, w_i)/df(db, w_i) > l}} v(q, w_i)\,v(db, w_i)$$

35

Estimating r(db,q), example 2 (cont’d) Ranking of databases based on r̃(db, q)

empirically evaluated [GGT99]

36

A probabilistic model for errors in estimation [LLC04]

Any estimation makes errors

An error (observed) distribution for each db: distribution of db1 ≠ distribution of db2

Definition of error: relative

$$err(db, q) = \frac{r(db, q) - \tilde{r}(db, q)}{\tilde{r}(db, q)}$$

37

Modeling the errors: a motivating experiment

dbPMC: PubMedCentral, www.pubmedcentral.nih.gov

Two query sets, Q1 and Q2 (healthcare related): |Q1| = |Q2| = 1000, Q1 ∩ Q2 = ∅

Compute err(dbPMC, q) for each sample query q ∈ Q1 or Q2

Further verified through statistical tests (Pearson χ2)

[Figure: two histograms of the error probability distribution, one of err(dbPMC, q) for q ∈ Q1 and one for q ∈ Q2; y-axis from 0 to 0.7.]

38

Implications of the experiment

On a text database, similar error behavior among sample queries

Can sample a database and summarize the error behavior into an Error Distribution (ED)

Use the ED to predict the error for a future unseen query

Sampling size study [LLC04]: a few hundred sample queries are good enough

39

From an Error Distribution (ED) to a Relevancy Distribution (RD)

Database: db1. Query: qnew

The ED for db1, from sampling: err(db1, qnew) takes the values -50%, 0%, +50% with probabilities 0.4, 0.5, 0.1

r̃(db1, qnew) = 1000, from an existing estimation method

By definition, r(db1, qnew) = (err(db1, qnew) + 1) × r̃(db1, qnew)

A Relevancy Distribution (RD) for r(db1, qnew): the values 500, 1000, 1500 with probabilities 0.4, 0.5, 0.1

40

RD-based selection

db1: r̃(db1, qnew) = 1000; err(db1, qnew) takes -50%, 0%, +50% with probabilities 0.4, 0.5, 0.1, so r(db1, qnew) takes 500, 1000, 1500

db2: r̃(db2, qnew) = 650; err(db2, qnew) takes 0%, +100% with probabilities 0.1, 0.9, so r(db2, qnew) takes 650, 1300

Estimation-based: db1 > db2

RD-based: db1 < db2 ( Pr(db1 < db2) = 0.85 )
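The 0.85 on this slide can be reproduced with a short sketch. Two things are assumptions here: the relevancies of the two databases are treated as independent, and the bar heights are assigned to error values as -50%: 0.4, 0%: 0.5, +50%: 0.1 for db1 and 0%: 0.1, +100%: 0.9 for db2, an assignment inferred from the figure because it is the one that yields 0.85.

```python
def relevancy_distribution(estimate, error_distribution):
    """RD: r = estimate * (1 + err) for each relative error err in the ED."""
    return {estimate * (1 + err): p for err, p in error_distribution.items()}


def prob_less(rd_a, rd_b):
    """Pr(r_a < r_b), assuming the two relevancies are independent."""
    return sum(pa * pb
               for ra, pa in rd_a.items()
               for rb, pb in rd_b.items()
               if ra < rb)


ed_db1 = {-0.5: 0.4, 0.0: 0.5, 0.5: 0.1}   # relative error -> probability (inferred)
ed_db2 = {0.0: 0.1, 1.0: 0.9}

rd_db1 = relevancy_distribution(1000, ed_db1)   # values 500, 1000, 1500
rd_db2 = relevancy_distribution(650, ed_db2)    # values 650, 1300

p = prob_less(rd_db1, rd_db2)
print(rd_db1, rd_db2, p)
```

Although the point estimate ranks db1 first (1000 > 650), db2 is the more likely winner once the error distributions are taken into account.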

41

Correctness metric

Terminology:

DBk: k databases returned by some method

DBtopk: the actual answer

How correct is DBk compared to DBtopk?

Absolute correctness: Cora(DBk) = 1 if DBk = DBtopk; 0 otherwise

Partial correctness:

$$Cor_p(DB_k) = \frac{|DB_k \cap DB_{topk}|}{k}$$

Cora(DBk) = Corp(DBk) for k = 1
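Both metrics are simple set comparisons, as the following sketch shows; the function names and database labels are chosen for this example only.

```python
def absolute_correctness(selected, true_top_k):
    """Cor_a: 1 if the selected set equals the true top-k set, else 0."""
    return 1 if set(selected) == set(true_top_k) else 0


def partial_correctness(selected, true_top_k):
    """Cor_p: fraction of the true top-k databases that were selected."""
    k = len(true_top_k)
    return len(set(selected) & set(true_top_k)) / k


true_top3 = ["db1", "db2", "db3"]
selected = ["db1", "db2", "db5"]

cor_a = absolute_correctness(selected, true_top3)
cor_p = partial_correctness(selected, true_top3)
print(cor_a, cor_p)
```

For k = 1 the overlap is either 0 or 1, so the two metrics coincide, as stated on the slide.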

42

Effectiveness of RD-based selection

20 healthcare-related text databases on the Web

Q1 (training, 1000 queries) to learn the ED of each database

Q2 (testing, 1000 queries) to test the correctness of database selection

Method                                                      k = 1: Avg(Cora), Avg(Corp)    k = 3: Avg(Cora)    k = 3: Avg(Corp)
Estimation-based selection (term-independence estimator)    0.471                          0.301               0.699
RD-based selection                                          0.651 (+38.2%)                 0.478 (+58.8%)      0.815 (+30.9%)

43

Probing to improve correctness

RD-based selection:

0.85 = Pr(db2 > db1)
     = Pr({db2} = DBtop1)
     = 1·Pr({db2} = DBtop1) + 0·Pr({db2} ≠ DBtop1)
     = E[Cora({db2})]

Probe dbi: contact a dbi to obtain its exact relevancy

After probing db1, which reveals r(db1, q) = 500:

E[Cora({db2})] = Pr(db2 > db1) = 1

[Figure: before probing, the RD of db1 covers 500, 1000, 1500 and the RD of db2 covers 650, 1300; after probing, the RD of db1 collapses to the single value 500.]

44

Computing the expected correctness

Expected absolute correctness:

E[Cora(DBk)] = 1·Pr(Cora(DBk) = 1) + 0·Pr(Cora(DBk) = 0)
             = Pr(Cora(DBk) = 1)
             = Pr(DBk = DBtopk)

Expected partial correctness:

$$E[Cor_p(DB_k)] = \sum_{0 \le l \le k} \frac{l}{k} \cdot \Pr\!\left(Cor_p(DB_k) = \frac{l}{k}\right) = \sum_{0 \le l \le k} \frac{l}{k} \cdot \Pr\left(|DB_k \cap DB_{topk}| = l\right)$$
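For small discrete RDs, both expectations can be computed by brute-force enumeration of every joint outcome. The sketch below assumes the relevancies of different databases are independent and breaks ties by the order of the databases in the input; the two RDs are the db1/db2 example from the earlier slides, with the probability assignment inferred from the figure.

```python
from itertools import product


def expected_correctness(rds, selected, k):
    """Return (E[Cor_a], E[Cor_p]) of `selected`; `selected` must contain k databases."""
    names = list(rds)
    exp_abs = exp_part = 0.0
    for outcome in product(*(rds[n].items() for n in names)):
        prob = 1.0
        for _, p in outcome:
            prob *= p                      # independence across databases
        scores = {n: r for n, (r, _) in zip(names, outcome)}
        top_k = set(sorted(names, key=lambda n: scores[n], reverse=True)[:k])
        overlap = len(top_k & set(selected))
        exp_abs += prob * (1 if overlap == k else 0)
        exp_part += prob * overlap / k
    return exp_abs, exp_part


rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},   # relevancy -> probability
    "db2": {650: 0.1, 1300: 0.9},
}
e_abs, e_part = expected_correctness(rds, ["db2"], k=1)
print(e_abs, e_part)
```

Enumeration is exponential in the number of databases, so this is only meant to make the definitions concrete.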

45

Adaptive probing algorithm: APro

[Flowchart: the input is the RD’s of the probed databases (db1 … dbi) and the unprobed databases (dbi+1 … dbn), plus a user-specified correctness threshold t. Is there any DBk with E[Cor(DBk)] ≥ t? If YES, return this DBk. If NO, probe one more unprobed database and check again.]

46

Which database to probe?

A greedy strategy:

The stopping condition: E[Cor(DBk)] ≥ t

Once probed, which database leads to the highest E[Cor(DBk)]?

Suppose we will probe db3, whose RD has the possible values ra, rb, rc:

if r(db3,q) = ra, max E[Cor(DBk)] = 0.85

if r(db3,q) = rb, max E[Cor(DBk)] = 0.8

if r(db3,q) = rc, max E[Cor(DBk)] = 0.9

Probe the database that leads to the largest “expected” max E[Cor(DBk)]

[Figure: the RD’s of db1, db2, db3, db4, with the three possible probe outcomes r(db3, q) = ra, rb, rc marked on the RD of db3.]
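The greedy choice can be sketched for the simplest case, k = 1 with absolute correctness. This is an illustration of the idea, not the APro implementation of [LLC04]: it assumes independent RDs, enumerates all joint outcomes, and reuses the db1/db2 example with the probability assignment inferred from the earlier figure.

```python
from itertools import product


def max_expected_cor(rds):
    """max over single databases of Pr(that database is the true top-1)."""
    names = list(rds)
    win = dict.fromkeys(names, 0.0)
    for outcome in product(*(rds[n].items() for n in names)):
        prob = 1.0
        for _, p in outcome:
            prob *= p
        best = max(range(len(names)), key=lambda i: outcome[i][0])
        win[names[best]] += prob
    return max(win.values())


def choose_probe(rds, unprobed):
    """Pick the database whose probing gives the largest expected max E[Cor]."""
    def expected_after(name):
        total = 0.0
        for r, p in rds[name].items():
            revealed = dict(rds)
            revealed[name] = {r: 1.0}      # probing replaces the RD by the exact value
            total += p * max_expected_cor(revealed)
        return total
    return max(unprobed, key=expected_after)


rds = {
    "db1": {500: 0.4, 1000: 0.5, 1500: 0.1},
    "db2": {650: 0.1, 1300: 0.9},
}
before = max_expected_cor(rds)
probe = choose_probe(rds, ["db1", "db2"])
print(before, probe)
```

In this example, probing db1 raises the expected best correctness from 0.85 to 0.95, while probing db2 raises it only to 0.87, so db1 is probed first.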

47

Effectiveness of adaptive probing

20 healthcare-related text databases on the Web

Q1 (training, 1000 queries) to learn the RD of each database

Q2 (testing, 1000 queries) to test the correctness of database selection

[Figure: three plots of average correctness (y-axis, 0 to 1) against # of databases probed (x-axis, 0 to 5), comparing adaptive probing (APro) with the term-independence estimator: avg Cora for k = 1, avg Cora for k = 3, and avg Corp for k = 3.]

48

The “lazy TA problem”

Same problem, generalized & “humanized”

After the final exam, the TA wants to find out the top scoring students

The TA is “lazy” and doesn’t want to score all exam sheets

Input: every student’s score as a known distribution, observed from previous quizzes and mid-term exams

Output: a scoring strategy that maximizes the correctness of the “guessed” top-k students

49

Further study of this problem [LSC05]

Proves greedy probing is optimal under special cases

More interesting factors to be explored:

  “Optimal” probing strategy in general cases

  Non-uniform probing cost

  Time-variant distributions

50


51

Summary

Metasearch: a challenging problem

Database content modeling

  Sampling enhanced by proper application of Zipf’s law and Heaps’ law

  Content change modeled using Survival Analysis

Database selection

  Estimation of database relevancy based on assumptions

  A probabilistic framework that models the error as a distribution

  “Optimal” probing strategy for a collection of distributions as input

52

References

[CC01] J.P. Callan and M. Connell, “Query-Based Sampling of Text Databases,” ACM Trans. on Information Systems, 19(2), 2001

[GCM97] L. Gravano, C-C. K. Chang, H. Garcia-Molina, A. Paepcke, “STARTS: Stanford Proposal for Internet Meta-searching,” in Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, 1997

[GGT99] L. Gravano, H. Garcia-Molina, A. Tomasic, “GlOSS: Text Source Discovery over the Internet,” ACM Trans. on Database Systems, 24(2), 1999

[GIG01] N. Green, P. Ipeirotis, L. Gravano, “SDLIP+STARTS=SDARTS: A Protocol and Toolkit for Metasearching,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), 2001

[Hea78] H.S. Heaps, Information Retrieval: Computational and Theoretical Aspects, Academic Press, 1978

[IG02] P. Ipeirotis, L. Gravano, “Distributed Search over the Hidden Web: Hierarchical Database Sampling and Selection,” in Proc. of the 28th VLDB Conf., 2002

53

References (cont’d)

[INC05] P. Ipeirotis, A. Ntoulas, J. Cho, L. Gravano, “Modeling and Managing Content Changes in Text Databases,” in Proc. of the 21st IEEE Int’l Conf. on Data Eng. (ICDE), 2005

[LLC04] Z. Liu, C. Luo, J. Cho, W.W. Chu, “A Probabilistic Approach to Metasearching with Adaptive Probing,” in Proc. of the 20th IEEE Int’l Conf. on Data Eng. (ICDE), 2004

[LSC05] Z. Liu, K.C. Sia, J. Cho, “Cost-Efficient Processing of Min/Max Queries over Distributed Sensors with Uncertainty,” in Proc. of ACM Annual Symposium on Applied Computing, 2005

[NPC05] A. Ntoulas, P. Zerfos, J. Cho, “Downloading Hidden Web Content,” in Proc. of the Joint Conf. on Digital Libraries (JCDL), June 2005
