Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino, December 8th 2004


Page 1:

Concept-based P2P Search
How to find more relevant documents

Ingmar Weber
Max-Planck-Institute for Computer Science

Joint work with Holger Bast

Torino, December 8th 2004

Page 2:

What's wrong with [Google]?

First ... it's centralized.

• Global authority – can we trust it?
  – "presidential election Ukraine"
  – "buy luxury car"

• Dependency – I need it.
  – single point of failure

• Size – even Google is only human.
  – unlikely to index everything that's of interest (deep web)
  – infeasible to run expensive algorithms on 8 billion documents
  – difficult to input human knowledge

=> Peer-to-peer search

Page 3:

What's wrong with [Google]?

Second ... it's term-based.

Searching for "matrix factorization" ... but what about ... "matrix decomposition" ... or ... "decompose linear system" ... or even ... "probabilistic latent semantic indexing"

First result: [screenshot not preserved]

Does not find: [screenshot not preserved]

• We get some good hits but far from all
• No big problem if looking for popular things
• Then still enough good hits

A personal dilemma: Not a "Britney Spears"-like query

=> Concept-based search

Page 4:

Peer-to-peer search: Approach 0

• Each peer has a local crawler and index

• Nobody posts any information about local indices

• Search can only be done by (limited) flooding

• No way to know where to find information in advance

• Very low recall for unpopular queries

[Figure labels: query "Matrix factorization"; "Relevant nerd"]
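The effect of (limited) flooding on recall can be illustrated with a small sketch. The overlay graph, the peer names and the hop limit `ttl` below are all hypothetical; the slides do not specify a flooding protocol, so this is only one plausible reading.

```python
from collections import deque

def flood_search(graph, local_hits, start, ttl):
    """Breadth-first flooding: forward the query at most `ttl` hops from `start`.

    graph: peer -> list of neighbouring peers (hypothetical overlay)
    local_hits: peer -> documents that match the query in that peer's local index
    """
    seen = {start}
    frontier = deque([(start, 0)])
    results = []
    while frontier:
        peer, hops = frontier.popleft()
        results.extend(local_hits.get(peer, []))
        if hops == ttl:
            continue  # hop limit reached, do not forward any further
        for neighbour in graph[peer]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, hops + 1))
    return results

# Toy overlay: a chain of peers; the only relevant peer ("E") is 4 hops away.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
local_hits = {"E": ["doc on matrix factorization"]}

near = flood_search(graph, local_hits, "A", ttl=2)  # query dies before reaching "E"
far = flood_search(graph, local_hits, "A", ttl=4)   # query reaches "E"
```

With the small hop limit the relevant peer is never asked, which is the low-recall situation described for unpopular queries.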

Page 5:

Peer-to-peer search: Approach 1

• Again local crawler and index
• Union of all indices stored via a distributed hash table (DHT)
• Each peer responsible for a few terms (i.e. keys in the DHT)
• To search for "matrix factorization" retrieve the corresponding document lists and merge them to give result ranking
• Nice idea but infeasible
• Far too much data traffic even for medium local collections

A joining peer posts his full local index:
– For each term in his collection he sends the inverted list to the corresponding peer
– The receiving peer merges this list with his current list
– The new peer will also become responsible for some terms

[Figure: DHT terms Britney, Spears, Matrix, Factorization, Linear, Decomposition, System]

Inverted lists:
Linear: doc 10, doc 6, doc 17
Factorization: doc 9, doc 7, doc 13
Matrix: doc 7, doc 10, doc 5

Merged result:
1. doc 7
2. doc 9
...
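A minimal sketch of this approach, assuming a toy in-memory stand-in for the DHT. The class `ToyDHT`, the hash-modulo assignment of terms to peers and the reciprocal-position merge rule are illustrative assumptions, not taken from the slides.

```python
import hashlib
from collections import defaultdict

class ToyDHT:
    """Toy stand-in for a DHT: each term is hashed to one responsible peer."""

    def __init__(self, peers):
        self.peers = list(peers)
        # peer -> term -> merged inverted list (document ids, best first)
        self.store = {p: defaultdict(list) for p in self.peers}

    def responsible_peer(self, term):
        digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
        return self.peers[int(digest, 16) % len(self.peers)]

    def post_index(self, local_index):
        """A joining peer sends every inverted list to the responsible peer."""
        for term, doc_list in local_index.items():
            merged = self.store[self.responsible_peer(term)][term]
            merged.extend(d for d in doc_list if d not in merged)

    def search(self, query_terms):
        """Fetch one inverted list per query term and merge them into a ranking."""
        scores = defaultdict(float)
        for term in query_terms:
            doc_list = self.store[self.responsible_peer(term)][term]
            for position, doc in enumerate(doc_list, start=1):
                scores[doc] += 1.0 / position  # assumed merge rule
        return sorted(scores, key=lambda d: (-scores[d], d))

dht = ToyDHT(["peer A", "peer B", "peer C"])
dht.post_index({
    "linear": ["doc 10", "doc 6", "doc 17"],
    "factorization": ["doc 9", "doc 7", "doc 13"],
    "matrix": ["doc 7", "doc 10", "doc 5"],
})
top = dht.search(["matrix", "factorization"])[:2]  # ['doc 7', 'doc 9']
```

Note that `post_index` ships every complete inverted list over the network, which is where the prohibitive data traffic comes from.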

Page 6:

Peer-to-peer search: Approach 2

• Again local crawler and index
• Peers share a distributed hash table (DHT)
• Each peer responsible for a few terms (i.e. keys in the DHT)
• For each term we maintain a peer list with statistics
• To search for "matrix factorization" retrieve the peer lists and select the most promising peers
• Send the query to these peers
• The peers perform a local search and return their best results
• Merge the results
• This works but ...
• Performance heavily depends on term-based peer selection
• Still low recall

A joining peer posts only statistics about his terms:
– For each term in his collection he sends the short statistic to the corresponding peer
– The receiving peer merges this statistic with his current peer statistics for this term
– The new peer will also become responsible for some terms

[Figure: DHT terms Britney, Spears, Matrix, Factorization, Linear, Decomposition, System]

Posted statistic, e.g.:
Linear: 20 docs, max tf 10

Peer lists:
Factorization: peer 9, peer 13
Matrix: peer 7, peer 9

Selected peers for "matrix factorization": peer 9, peer 7

Local results returned by the two peers:
doc 2, doc 5, doc 7
doc 8, doc 7, doc 11

Merged result:
1. doc 7
2. doc 2
...

"Minerva" [Weikum et al.]
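Term-based peer selection from per-term peer lists might look roughly like the sketch below. The statistics (document count, maximum term frequency) follow the kind shown on the slide, but the concrete numbers and the scoring function are invented for illustration and are not Minerva's actual method.

```python
from collections import defaultdict

# term -> {peer: (number of docs containing the term, max term frequency)}
# All numbers are invented for illustration.
peer_lists = {
    "factorization": {"peer 9": (12, 8), "peer 13": (3, 2)},
    "matrix": {"peer 7": (15, 6), "peer 9": (20, 10)},
}

def select_peers(query_terms, peer_lists, k):
    """Rank peers by a simple score summed over the query terms, keep the top k."""
    scores = defaultdict(float)
    for term in query_terms:
        for peer, (num_docs, max_tf) in peer_lists.get(term, {}).items():
            scores[peer] += num_docs * max_tf  # assumed scoring function
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

selected = select_peers(["matrix", "factorization"], peer_lists, k=2)
# ['peer 9', 'peer 7']
```

A peer whose documents use different words for the same topic never appears in these lists, which is why recall stays low.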

Page 7:

Concept-based search: Improving recall

Basic idea: Don't directly compare queries with documents at the word level, but introduce one level of abstraction.

Query: "Probabilistic latent semantic indexing"

Document: "Non-negative matrix factorization"

Not similar at word level

Strong link at concept level

Concept: Approximately decomposing a non-negative matrix into a product of two smaller non-negative matrices

Page 8:

Concept-based search: Improving recall

How to derive concepts?

• Noise reduction
  – Replace documents by combinations of "simpler" prototypes

• Document clustering
  – Partition documents into subsets and map the query to the best matching set

• Query expansion
  – Study co-occurrence pattern of terms and add "suitable" terms to the query

Note: In all cases the automatically derived concepts depend on the individual corpus.
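Of the three options, query expansion is the easiest to sketch. The version below counts document-level co-occurrences and adds the terms that co-occur most often with the query terms; the toy corpus, the function names and the "most frequent co-occurrence" criterion are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count in how many documents each unordered pair of terms co-occurs."""
    pairs = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        pairs.update(combinations(terms, 2))
    return pairs

def expand_query(query, documents, extra_terms=2):
    """Append the terms that co-occur most often with the query terms."""
    pairs = cooccurrence_counts(documents)
    query_terms = query.lower().split()
    candidates = Counter()
    for (a, b), count in pairs.items():
        if a in query_terms and b not in query_terms:
            candidates[b] += count
        elif b in query_terms and a not in query_terms:
            candidates[a] += count
    best = sorted(candidates, key=lambda t: (-candidates[t], t))[:extra_terms]
    return query_terms + best

corpus = [
    "matrix factorization linear system",
    "matrix decomposition linear system",
    "matrix decomposition methods",
    "britney spears concert",
]
expanded = expand_query("matrix factorization", corpus, extra_terms=2)
```

Running the same query against a different corpus yields different added terms, which is the corpus dependence pointed out in the note.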

Page 9:

Concept-based P2P search: Approach 1

• Again local crawler and index
• Peers share a distributed hash table (DHT)
• Each peer responsible for a few terms (i.e. keys in the DHT)
• For each term we maintain a peer list with statistics
• Same peer selection process using query terms as in Minerva
• The only difference ...
• Peers locally employ a concept-based retrieval scheme of their choice
• Locally better recall
• We still merge the individual results
• This works but ...
• Performance still heavily depends on term-based peer selection
• Still low recall

[Figure: DHT terms Britney, Spears, Matrix, Factorization, Linear, Decomposition, System]

Page 10:

Concept-based P2P search: A blueprint

What we would like to do:
1. Map a query to the most relevant concepts, either individually or with help from peers
2. Find out which peers have documents related to these concepts
3. Send the query to these peers and retrieve ranking for the concept
4. Merge the results

Difficult as we have to post summaries of concepts which can then be found by others.

Not clear how to universally and uniquely represent concepts.

Page 11:

A new "concept"-based scheme: Using documents as concepts

Experimentally, the following works very well for a local collection:

1. Precompute a ranking of documents for each document (doc-doc similarities) using cosine similarity
2. For a query find a few documents which in their combination are most likely to generate the query (EM algorithm)
3. Merge the corresponding rankings to give the final output ranking

Precomputed rankings:
doc 11: doc 11, doc 7, doc 8, ...
doc 9: doc 9, doc 1, doc 7, ...
doc 5: doc 5, doc 8, doc 3, ...
doc 2: doc 2, doc 5, doc 1, ...
doc 4: doc 4, doc 3, doc 8, ...

Rankings selected for the query "matrix factorization":
doc 5, doc 8, doc 3, ...
doc 11, doc 7, doc 8, ...

Merged output:
1. doc 8
2. doc 11
3. ...

Observation 1: highest ranked documents do not have to contain query terms
Observation 2: doc-doc similarities more content-based than doc-query
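The three steps can be sketched as follows. The slides name cosine similarity and an EM algorithm but give no details, so everything else here is an assumption: documents are treated as smoothed unigram models, EM fits mixture weights over documents so that the mixture best generates the query terms, and the rankings of the highest-weighted documents are merged with a reciprocal-position rule.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def doc_rankings(vectors):
    """Step 1: for each document, all documents sorted by cosine similarity."""
    return {
        d: sorted(vectors, key=lambda e: (-cosine(vectors[d], vectors[e]), e))
        for d in vectors
    }

def em_weights(query_terms, vectors, iterations=50, eps=0.01):
    """Step 2: fit mixture weights over documents that best generate the query."""
    vocab = {t for v in vectors.values() for t in v} | set(query_terms)

    def p(term, d):  # smoothed unigram probability of `term` under document d
        v = vectors[d]
        return (v.get(term, 0) + eps) / (sum(v.values()) + eps * len(vocab))

    weights = {d: 1.0 / len(vectors) for d in vectors}
    for _ in range(iterations):
        new = dict.fromkeys(vectors, 0.0)
        for term in query_terms:
            total = sum(weights[d] * p(term, d) for d in vectors)
            for d in vectors:  # E-step: responsibility of d for this query term
                new[d] += weights[d] * p(term, d) / total
        weights = {d: new[d] / len(query_terms) for d in vectors}  # M-step
    return weights

def concept_search(query, vectors, rankings, num_concepts=2, depth=3):
    """Step 3: merge the rankings of the highest-weighted 'concept' documents."""
    weights = em_weights(query.lower().split(), vectors)
    concepts = sorted(weights, key=lambda d: (-weights[d], d))[:num_concepts]
    scores = defaultdict(float)
    for c in concepts:
        for position, d in enumerate(rankings[c][:depth], start=1):
            scores[d] += weights[c] / position  # assumed merge rule
    return sorted(scores, key=lambda d: (-scores[d], d))

texts = {
    "doc 1": "matrix factorization methods",
    "doc 2": "matrix decomposition methods",
    "doc 3": "decomposition of linear systems",
    "doc 4": "britney spears concert",
}
vectors = {d: Counter(t.split()) for d, t in texts.items()}
rankings = doc_rankings(vectors)
result = concept_search("matrix factorization", vectors, rankings)
```

In this toy collection "doc 3" enters the result although it contains neither query term, because it is similar to a selected concept document; this is the kind of behaviour Observation 1 refers to.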

Page 12:

Concept-based P2P search: Approach 2

• As before, post short statistics about each term in collection ...
• ... but also about each document
• E.g., number of documents with cosine similarity > 0.7
• Still use term statistics for first round of peer selection (as before)
• Send query to these peers (as before)
• Each selected peer locally selects documents which are most likely to generate query
• Sends back only these few documents (term vectors)
• Initiating peer then selects a few of them (EM algorithm)

[Figure: DHT terms Britney, Spears, Matrix, Factorization, Linear, Decomposition, System; documents doc 1, doc 2, doc 3, doc 4, doc 5, doc 7, doc 8, doc 9, doc 11]

Peer lists:
Factorization: peer 9, peer 13
Matrix: peer 7, peer 9

Selected peers for "matrix factorization": peer 9, peer 7

Documents sent back by the two peers:
doc 2 w/ doc 5
doc 8 w/ doc 5

Selected by the initiating peer:
doc 5 w/ doc 8

Page 13:

Concept-based P2P search: Approach 2

• Then the peer lists for these documents are retrieved and a new set of peers is selected
• We then send these peers our query in terms of selected documents, i.e., "concepts"
• The peers send back the most relevant documents
• Merge the individual results

[Figure: DHT terms Britney, Spears, Matrix, Factorization, Linear, Decomposition, System; documents doc 1, doc 2, doc 3, doc 4, doc 5, doc 7, doc 8, doc 9, doc 11]

Query in terms of selected documents: doc 5 w/ doc 8

Peer lists:
doc 5: peer 2, peer 1
doc 8: peer 3, peer 2

Selected peers: peer 2, peer 3

Results returned by the two peers:
doc 10, doc 5, doc 2
doc 2, doc 3, doc 8

Merged result:
1. doc 2
2. doc 3
3. ...
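The second round of peer selection, now keyed by concept documents instead of terms, might be sketched like this. It assumes that each document's peer list is ordered with the most promising peer first and scores peers by reciprocal list position; both are assumptions, as the slides only show the lists and the selected peers.

```python
from collections import defaultdict

# concept document -> peers posted for it, assumed ordered most promising first
doc_peer_lists = {
    "doc 5": ["peer 2", "peer 1"],
    "doc 8": ["peer 3", "peer 2"],
}

def select_peers_for_concepts(concept_docs, doc_peer_lists, k):
    """Pick the k peers that rank best across the peer lists of the concept docs."""
    scores = defaultdict(float)
    for doc in concept_docs:
        for position, peer in enumerate(doc_peer_lists.get(doc, []), start=1):
            scores[peer] += 1.0 / position  # assumed scoring rule
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

second_round = select_peers_for_concepts(["doc 5", "doc 8"], doc_peer_lists, k=2)
# ['peer 2', 'peer 3']
```

Under these assumptions the sketch picks the same two peers as the slide's example; how the slide's final ranking is merged from the returned lists is not specified and is not modelled here.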

Page 14:

Concept-based P2P search

Advantages of our approach

• Users only have to "agree" on documents
  – No need for common taxonomy

• If we can find some relevant documents we can find more => increases recall
  – Allows content-based "More documents like this" button

• Uses non-trivial doc-doc similarities
  – Infeasible to compute for 8 billion documents, easy to do for a few thousand