query routing in peer-to-peer web search engine speaker: pavel serdyukov supervisors: gerhard weikum...

Query Routing in Peer-to-Peer Web Search Engine

Speaker: Pavel Serdyukov
Supervisors: Gerhard Weikum, Christian Zimmer, Matthias Bender

International Max Planck Research School for Computer Science

Talk Outline

Motivation
Proposed Search Engine architecture
Query routing and database selection
    Similarity-based measures
        Example: GlOSS
    Document-frequency-based measures
        Example: CORI
Evaluation of methods
Proposals
Conclusion

Problems of present Web Search Engines

Size of the indexable Web:
    The Web is huge; it is difficult to cover all of it
    Timely re-crawls are required
    Technical limits
Deep Web
Monopoly of Google:
    Controls 80% of web search requests
    Paid sites get updated more frequently and get higher ratings
    Sites may be censored by the engine

Make use of Peer-to-Peer technology

Exploit previously unused CPU/memory/disk power
Provide up-to-date results for small portions of Web
Conquer Deep Web by personalized and specialized web crawlers

Global Directory: ranking of peer usefulness (richness) for keyword
Global directory must be shared among peers!

[Figure: Global Directory holding, for each of the keywords "cancer", "elephant" and "computer", a ranked list of Peers 1 to 4; the directory is spread over a Chord Ring with nodes 0 to 7]
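A minimal sketch of how such a shared directory could be partitioned over a ring. It is only an illustration: the names `key_id`, `successor`, `Directory`, `publish` and `lookup` are hypothetical, SHA-1 as the hash function is an assumption, and every node is assumed to know the full ring membership (real Chord routes through finger tables in a logarithmic number of hops instead).

```python
import hashlib
from bisect import bisect_left

RING_BITS = 3  # identifier space 0..7, like the eight-node ring in the figure


def key_id(keyword, bits=RING_BITS):
    """Hash a keyword into the ring's identifier space (SHA-1 is an assumption)."""
    digest = hashlib.sha1(keyword.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def successor(ident, node_ids):
    """Return the first node at or clockwise after ident, wrapping around the ring."""
    nodes = sorted(node_ids)
    pos = bisect_left(nodes, ident)
    return nodes[pos % len(nodes)]


class Directory:
    """Keyword -> {peer: usefulness}, with each keyword stored on one ring node."""

    def __init__(self, node_ids):
        self.node_ids = list(node_ids)
        self.store = {n: {} for n in self.node_ids}  # per-node slice of the directory

    def publish(self, keyword, peer, usefulness):
        """A peer posts its usefulness for a keyword to the responsible node."""
        node = successor(key_id(keyword), self.node_ids)
        self.store[node].setdefault(keyword, {})[peer] = usefulness

    def lookup(self, keyword):
        """Return the peers for a keyword, most useful first."""
        node = successor(key_id(keyword), self.node_ids)
        scores = self.store[node].get(keyword, {})
        return sorted(scores, key=scores.get, reverse=True)


directory = Directory([0, 2, 4, 6])
directory.publish("cancer", "Peer 1", 0.2)
directory.publish("cancer", "Peer 3", 0.9)
directory.publish("cancer", "Peer 2", 0.5)
print(directory.lookup("cancer"))  # ['Peer 3', 'Peer 2', 'Peer 1']
```

The usefulness values in the usage lines are made up; in the talk they would come from one of the measures discussed later.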

Query routing

Goal: find peers with relevant documents
Previously known as the Database Selection Problem
    Not all techniques are applicable to P2P

Database Selection Problem

1st inference: Is this document relevant?
    It is a subjective user judgment; we model it
    We use only representations of user needs and documents (keywords, inverted indices)

2nd inference: A database has the potential to satisfy a query if it
    has many documents (naive size-based approach)
    has many documents containing all query words
        a high number of them with a given similarity
        a high summed similarity over them

Measuring usefulness

The number of documents with all query words is unknown
    No full document representations are available, only database summaries (representatives)

3rd inference (usefulness) is built on top of the previous two

Steps of database selection
i. Rely on sensible 1st and 2nd inferences
ii. Choose database representatives for the 3rd inference
iii. Calculate usefulness measures
iv. Choose the most useful databases
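Steps iii and iv can be sketched as a generic ranking routine. This is only an illustration: `select_databases` and `df_sum` are hypothetical names, and the summed document frequency used as the usefulness measure is a deliberately naive stand-in, not one of the measures from the talk.

```python
def select_databases(query_terms, summaries, usefulness, k):
    """Score every database from its summary and return the k most useful ones."""
    scores = {db: usefulness(query_terms, summary) for db, summary in summaries.items()}
    # Sort by descending score; ties are broken by name so the result is deterministic.
    ranked = sorted(scores, key=lambda db: (-scores[db], db))
    return ranked[:k]


def df_sum(query_terms, summary):
    """Naive stand-in measure: summed document frequency of the query terms."""
    return sum(summary.get(term, 0) for term in query_terms)


# Database representatives here are just term -> document frequency maps (made-up numbers).
summaries = {
    "A": {"web": 10, "search": 2},
    "B": {"web": 3},
    "C": {"web": 5, "search": 5},
}
print(select_databases(["web", "search"], summaries, df_sum, k=2))  # ['A', 'C']
```

Passing the measure in as a function keeps the selection step independent of which usefulness definition is plugged in.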

Similarity-based measures

Definition: usefulness is the sum of document similarities that exceed a threshold l

\[
\mathrm{Usefulness}(l, q, DB) = \sum_{doc \in DB,\; sim(q, doc) > l} sim(q, doc)
\]

Simplest case: the summed weight of query terms across the collection
    no assumptions about word co-occurrence
    l = 0
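The definition can be written out directly when full document representations are at hand, which a real peer directory would not have. A minimal sketch, assuming documents and queries are term-to-weight maps and sim is a plain dot product (both are assumptions, as are the function names):

```python
def sim(query, doc):
    """Dot-product similarity between a query and a document (term -> weight maps)."""
    return sum(weight * doc.get(term, 0.0) for term, weight in query.items())


def usefulness(query, docs, threshold=0.0):
    """Sum the similarities of all documents whose similarity exceeds the threshold l."""
    return sum(s for s in (sim(query, doc) for doc in docs) if s > threshold)


query = {"peer": 1.0, "search": 1.0}
docs = [
    {"peer": 0.5, "search": 0.25},  # sim = 0.75
    {"search": 0.125},              # sim = 0.125
    {"cancer": 0.5},                # sim = 0.0, never counted
]
print(usefulness(query, docs))                 # 0.875
print(usefulness(query, docs, threshold=0.5))  # 0.75
```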

GlOSS

High-correlation assumption:
    Sort all n query terms T_i in descending order of their DFs (DF_1 ≥ DF_2 ≥ … ≥ DF_n)
    DF_n documents contain T_n, T_{n-1}, …, T_1
    DF_{n-1} − DF_n documents contain T_{n-1}, T_{n-2}, …, T_1
    …
    DF_1 − DF_2 documents contain only T_1
    Use averaged term weights to calculate document similarity

l > 0
    l is query dependent
    l is collection dependent
        Usually because of differences in local IDFs
        Proposal: use global term importance
    Usually l is set to 0 in experiments
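The high-correlation estimate can be sketched as follows. This is an illustration under stated assumptions, not the GlOSS implementation: each query term is given as a hypothetical triple (DF, average term weight, query weight), and the similarity of a document is the sum of query weight times average weight over the terms it is assumed to contain.

```python
def gloss_high_correlation(terms, threshold=0.0):
    """Estimate usefulness under the high-correlation assumption.

    terms: list of (df, avg_weight, query_weight), one triple per query term.
    """
    ordered = sorted(terms, key=lambda t: t[0], reverse=True)  # DF_1 >= ... >= DF_n
    total = 0.0
    similarity = 0.0
    for p, (df, avg_weight, query_weight) in enumerate(ordered):
        # Estimated similarity of a document that contains exactly T_1 .. T_p.
        similarity += query_weight * avg_weight
        next_df = ordered[p + 1][0] if p + 1 < len(ordered) else 0
        group_size = df - next_df  # documents assumed to contain exactly T_1 .. T_p
        if similarity > threshold:
            total += group_size * similarity
    return total


terms = [(5, 0.25, 1.0), (20, 0.5, 1.0)]  # made-up statistics for two query terms
print(gloss_high_correlation(terms))                 # 11.25
print(gloss_high_correlation(terms, threshold=0.6))  # 3.75
```

With l = 0 the result equals the plain sum of DF times average weight (20 · 0.5 + 5 · 0.25), so the co-occurrence assumption only matters once the threshold is positive.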

Problems of similarity-based measures

Is this inference good?

A few high-scored documents and many low-scored documents are regarded as equal
    Proposal: sum only the top K similarities

Highly scored documents can be a bad indicator of usefulness
    Most relevant documents have moderate scores
    Highly scored documents can be non-relevant
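The first problem and the top-K proposal can be illustrated with two made-up similarity lists whose plain sums are identical. The function name and the numbers are hypothetical.

```python
import heapq


def top_k_usefulness(similarities, k):
    """Sum only the k largest document similarities."""
    return sum(heapq.nlargest(k, similarities))


few_high = [8.0, 1.0, 1.0]       # one strong document, two weak ones
many_low = [2.5, 2.5, 2.5, 2.5]  # only moderate documents

print(sum(few_high), sum(many_low))   # 10.0 10.0  (regarded as equal)
print(top_k_usefulness(few_high, 2))  # 9.0
print(top_k_usefulness(many_low, 2))  # 5.0
```

Whether the database with the single strong document should actually win is exactly the open question raised by the second problem.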

Document-frequency-based measures

Do not use term frequencies (actual similarities)
Exploit document frequencies only
Exploit a global measure of term importance
    Average IDF
    ICF (inverse collection frequency):

\[
ICF_i = \log \frac{|C|}{cf_i}
\]

    where cf_i is the number of collections containing term i and |C| is the number of collections

Main assumption: many documents with rare terms
    have more meaning for the user
    most likely contain other query terms

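CORI: Using TFIDF normalization

\[
\mathrm{Usefulness} = \sum_{q \in Q} \left( 0.4 + 0.6 \cdot \widetilde{DF}_q \cdot \widetilde{ICF}_q \right)
\]

\[
\widetilde{DF} = \frac{\log(DF + 0.5)}{\log(DF_{MAX} + 1.0)}
\qquad
\widetilde{ICF} = \frac{\log\left(\frac{|C| + 0.5}{CF}\right)}{\log(|C| + 1.0)}
\]

DF: document frequency of the query term
DF_MAX: maximum document frequency among all terms in the collection
CF: number of collections containing the query term
|C|: number of collections in the system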
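A minimal sketch of a CORI-style score built from the four quantities defined above (DF, DF_MAX, CF, |C|). The function name and the input layout are hypothetical, the constants 0.4 and 0.6 are the usual CORI defaults, and summing the per-term beliefs (rather than averaging them) is an assumption.

```python
import math


def cori_score(query_stats, num_collections, default_belief=0.4):
    """Sum per-term beliefs computed from document frequencies only.

    query_stats: list of (df, df_max, cf) triples, one per query term.
    """
    total = 0.0
    for df, df_max, cf in query_stats:
        # Normalized document frequency of the term within this collection.
        df_norm = math.log(df + 0.5) / math.log(df_max + 1.0)
        # Normalized inverse collection frequency of the term.
        icf_norm = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
        total += default_belief + (1.0 - default_belief) * df_norm * icf_norm
    return total


# Made-up statistics: the same term in two collections of a 1000-collection system.
print(cori_score([(40, 60, 120)], 1000) > cori_score([(20, 60, 120)], 1000))  # True
```

Each per-term belief stays between 0.4 and 1.0, so no single term can dominate the score.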

CORI Issues

Pure document frequencies make CORI better
    The fewer statistics, the simpler
    Smaller variance
    Better estimates the ranking, not the actual database summaries

No use of document richness

To be normalized or not to be?
    Small databases are not necessarily better
    A collection may specialize well in several topics

Using usefulness measures

Information: CF = 120
Retrieval: CF = 40
|C| = 1000

Per-peer statistics for the two query terms:

Peer     DF   avg_tf   DFmax
Peer1    20   12       60
Peer2    60   6        400
Peer3    20   15       60

Peer     DF   avg_tf   DFmax
Peer1    5    8        60
Peer2    10   4        400
Peer3    5    10       60

Resulting scores:

CORI              GlOSS
Peer3   0.5681    Peer2   845
Peer1   0.5681    Peer3   784
Peer2   0.5634    Peer1   627

[Figure: directory rankings of Peer1, Peer2 and Peer3 for the terms "Inform." and "Retrieval"]

Analysis of experiments

CORI is the best, but
    Only when choosing more than 50 out of 236 databases
    Only 10% better when choosing more than 90 databases

Test collections are strange
    Chronologically or even randomly separated documents
    No topic specificity
    No actual Web data used
    No overlap among collections

Experiments are unrealistic; it is unclear
    Which method is better
    Whether there is any satisfactory method

Possible solutions

Most measures can be unified in one framework

\[
\sum_{q \in Q} DF_q \cdot average\_TF_q \cdot Importance_q
\]

[Formula annotations on the slide: GlOSS, CORI]

We can play with it and try
    Various normalization schemes
    Different notions of term importance (ICF, local IDF)
    Using statistics of top documents
    Changing the power of factors
        DF·ICF^4 is not worse than CORI
    Changing the form of the expression
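The unified expression (sum over query terms of DF times average TF times importance) can be sketched with adjustable powers. The statistics are read from the tables on the "Using usefulness measures" slide; assigning the larger document frequencies to "information", using log(|C| / CF) as the importance, and the function name are all assumptions. The importance weights behind the slide's own scores are not given, so only the resulting order is comparable, not the absolute values.

```python
import math


def unified_usefulness(stats, importance, df_power=1.0, tf_power=1.0, imp_power=1.0):
    """Sum over query terms of DF^a * average_TF^b * Importance^c."""
    return sum(
        (df ** df_power) * (avg_tf ** tf_power) * (importance[term] ** imp_power)
        for term, (df, avg_tf) in stats.items()
    )


NUM_COLLECTIONS = 1000
icf = {
    "information": math.log(NUM_COLLECTIONS / 120),
    "retrieval": math.log(NUM_COLLECTIONS / 40),
}
peers = {  # term -> (DF, avg_tf) per peer
    "Peer1": {"information": (20, 12), "retrieval": (5, 8)},
    "Peer2": {"information": (60, 6), "retrieval": (10, 4)},
    "Peer3": {"information": (20, 15), "retrieval": (5, 10)},
}

# All three factors with power 1: a GlOSS-like measure with l = 0.
full = {p: unified_usefulness(s, icf) for p, s in peers.items()}
print(sorted(full, key=full.get, reverse=True))  # ['Peer2', 'Peer3', 'Peer1']

# DF * ICF^4: term frequencies are ignored, importance is emphasized.
df_icf4 = {p: unified_usefulness(s, icf, tf_power=0.0, imp_power=4.0) for p, s in peers.items()}
print(df_icf4["Peer1"] == df_icf4["Peer3"])  # True: both peers have the same DFs
```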

Conclusion

What has been done:
    Measures are analytically evaluated
    A sensible subset of measures is chosen
    Measures are implemented

What could be done next:
    Carry out new sensible experiments
    Choose an appropriate usefulness measure
    Experiment with database representatives
    Build our own measure
    Try to exploit collection metadata
        Bookmarks, authoritative documents, collection descriptions

Thank you for your attention!
