Improving Web Search Results Using Affinity Graph

Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma

Microsoft Research Asia, SIGIR 2005

INTRODUCTION

The top search results can hardly cover a sufficient variety of topics (redundant)

re-ranking method based on MMR

There is no indication of how informative a returned document is on the query topic (coverage)

subtopic retrieval method

Two novel metrics, diversity and information richness

BACKGROUND

The most famous works on link analysis: PageRank and the HITS algorithm

Explicit link analysis and implicit link analysis

Two web pages are implicitly linked if they are visited sequentially by the same end-user.

DirectHit and Small Web Search

AFFINITY RANKING

Diversity: Given a set of documents R, we use diversity Div(R) to denote the number of different topics contained in R.

Information Richness: Given a document collection D = {d1, …, dn}, we use information richness InfoRich(di) to denote the richness of information contained in the document di with respect to the entire collection D.

Affinity Graph Construction

According to the vector space model, the similarity between a document pair di and dj can be calculated as

To further measure the significance of the similarity between each document pair, we define the affinity of dj to di as
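Neither formula is preserved in this transcript. The sketch below is a minimal illustration under stated assumptions: similarity is taken to be the usual cosine similarity of the vector space model, affinity is assumed to be the same dot product normalized only by the length of di (which makes it asymmetric), and an edge is kept only when the affinity reaches a hypothetical `threshold`. The function names and the dict-based term vectors are illustrative choices, not taken from the slides.

```python
import math


def dot(u, v):
    """Dot product of two sparse term-weight vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def norm(u):
    """Euclidean length of a sparse term-weight vector."""
    return math.sqrt(sum(w * w for w in u.values()))


def cosine_sim(di, dj):
    """Cosine similarity: (di . dj) / (|di| * |dj|)."""
    denom = norm(di) * norm(dj)
    return dot(di, dj) / denom if denom else 0.0


def affinity(di, dj):
    """ASSUMED affinity of dj to di: (di . dj) / |di|. Not symmetric."""
    length = norm(di)
    return dot(di, dj) / length if length else 0.0


def build_affinity_graph(docs, threshold):
    """Matrix M with M[i][j] = affinity(di, dj) when it reaches the
    (hypothetical) threshold, else 0. The diagonal stays 0."""
    n = len(docs)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                a = affinity(docs[i], docs[j])
                if a >= threshold:
                    M[i][j] = a
    return M
```

For example, with `docs = [{"a": 3.0, "b": 4.0}, {"a": 1.0}]` the cosine similarity is the same in both directions, while the assumed affinity differs depending on which document is used for normalization, so the resulting graph is directed.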

Information Richness Computation

After obtaining the Affinity Graph, we apply a link analysis algorithm similar to PageRank

M (the adjacency matrix of the Affinity Graph) is normalized so that each row sums to 1.

The score of document di can be deduced from those of all other documents linked to it

With a damping factor c (similar to the random jumping factor in PageRank):
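The equation for this step is not preserved in the transcript. A form consistent with the row-normalized matrix and with the two flow rules stated for c is given here as an assumption, with \tilde{M} denoting the row-normalized M and n the number of documents (both notations are introduced only for this sketch):

```latex
\mathrm{InfoRich}(d_i) \;=\; c \sum_{j \neq i} \mathrm{InfoRich}(d_j)\,\tilde{M}_{ji} \;+\; \frac{1 - c}{n}
```

The first term collects the score flowing in along links that point to di; the second is the uniform share every document receives from the random jump.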

Information can choose where to flow according to the following two rules:

With a probability of c, the information will flow into one of the document nodes that di links to

With a probability of 1 − c, the information will randomly flow into any document in the collection.
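The two flow rules describe a PageRank-style random walk, which can be solved by power iteration. The following is a minimal sketch, not the authors' implementation: the function names, the default c = 0.85 (the value customarily used for PageRank), the tolerance, and the handling of rows with no outgoing links are all assumptions.

```python
def row_normalize(M):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in M:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out


def info_rich(M, c=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the assumed update
    score[i] = (1 - c) / n + c * sum_j score[j] * P[j][i],
    where P is the row-normalized affinity matrix."""
    n = len(M)
    P = row_normalize(M)
    score = [1.0 / n] * n
    for _ in range(max_iter):
        new = [
            (1.0 - c) / n + c * sum(score[j] * P[j][i] for j in range(n))
            for i in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, score))
        score = new
        if delta < tol:
            break
    return score
```

In a three-document graph where document 0 links to documents 1 and 2 and both link back to document 0, document 0 ends up with the highest score because it receives all the information flowing out of the other two. Documents without outgoing links leak score in this simple version, so the scores then no longer sum to 1.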

Diversity Penalty
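The body of this slide is not preserved in the transcript. One greedy scheme that fits the heading is sketched below purely as an assumption: repeatedly pick the document with the highest remaining score, then lower the score of every remaining document in proportion to its normalized affinity to the document just picked. The function name `affinity_rank` and the exact penalty term are hypothetical.

```python
def affinity_rank(info, P):
    """Greedy diversity penalty (assumed scheme).

    info: information richness score per document.
    P:    row-normalized affinity matrix, P[j][i] = weight of link j -> i.
    Returns a list of (doc_index, penalized_score) in selection order.
    """
    scores = list(info)
    remaining = set(range(len(info)))
    ranked = []
    while remaining:
        # Highest remaining score wins; ties go to the lower index.
        best = max(remaining, key=lambda k: (scores[k], -k))
        ranked.append((best, scores[best]))
        remaining.remove(best)
        # Penalize documents that are close to the one just selected.
        for j in remaining:
            scores[j] -= P[j][best] * info[best]
    return ranked
```

With `info = [0.5, 0.4, 0.3]` and document 1 pointing entirely at document 0, document 1 drops below the unrelated document 2 once document 0 has been selected, so near-duplicates of an already selected result are pushed down.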

Re-ranking Method

The re-ranking mechanism is a combination of results from full-text search and Affinity Ranking

score-combination

rank-combination
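The combination formulas are not preserved in the transcript. The sketch below assumes the simplest reading of the two names: a weighted sum of normalized scores, and a weighted sum of rank positions, with weights α and β as used in the experiments. The max-normalization, the function names, and the assumption that the best score is positive are illustrative choices, not the authors' definitions.

```python
def ranks(scores):
    """Map doc id -> 1-based rank position (highest score gets rank 1)."""
    order = sorted(scores, key=lambda d: (-scores[d], d))
    return {d: pos for pos, d in enumerate(order, start=1)}


def score_combination(sim, ar, alpha, beta):
    """Order docs by alpha * sim / max(sim) + beta * ar / max(ar), descending.
    Assumes the largest score in each dict is positive."""
    sim_max = max(sim.values()) or 1.0
    ar_max = max(ar.values()) or 1.0
    combined = {
        d: alpha * sim[d] / sim_max + beta * ar[d] / ar_max for d in sim
    }
    return sorted(combined, key=lambda d: (-combined[d], d))


def rank_combination(sim, ar, alpha, beta):
    """Order docs by alpha * rank_sim + beta * rank_ar, ascending."""
    r_sim, r_ar = ranks(sim), ranks(ar)
    combined = {d: alpha * r_sim[d] + beta * r_ar[d] for d in sim}
    return sorted(combined, key=lambda d: (combined[d], d))
```

Under this reading, calling `rank_combination(sim, ar, alpha=0, beta=1)` orders the candidate set purely by Affinity Rank, and `alpha=1, beta=0` reproduces the full-text order.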

EXPERIMENTS

Yahoo! Directory

Contained a total of 292,216 categories (including leaf categories and non-leaf categories)

All categories are organized into a 16-level hierarchy.

We downloaded 792,601 documents in total.

ODP (Open Directory Project)

We downloaded the directory in August 2004.

ODP includes a total of 172,565 categories

We downloaded 1,547,000 documents in total.

Newsgroup dataset

The Newsgroup data is composed of 256,449 posts collected from 117 commercial applications, with a total size of about 400M

The title and content of each post are given a 3:1 weighting ratio in the indexing process

There are no explicit links among the posts

A large number of posts are very likely to be devoted to the same topic
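How the 3:1 title-to-content ratio enters the index is not stated. One simple reading, assumed here, is to scale term frequencies so that each occurrence in the title counts three times as much as an occurrence in the body. The function name and the whitespace tokenization are illustrative only.

```python
from collections import Counter


def weighted_term_freq(title, content, title_weight=3.0, content_weight=1.0):
    """Term frequencies with title terms weighted 3:1 against body terms
    (assumed interpretation of the weighting ratio)."""
    tf = Counter()
    for term in title.lower().split():
        tf[term] += title_weight
    for term in content.lower().split():
        tf[term] += content_weight
    return dict(tf)
```

For a post titled "Outlook print error" with the body "print fails in outlook", the terms "outlook" and "print" each get a weight of 4.0, "error" gets 3.0, and the body-only terms get 1.0.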

Affinity Ranking vs. K-Means Clustering

The top 1000 search results of each query are passed to the AR or K-Means algorithm to re-rank the top 10 results

For the K-Means algorithm, we set K=10 and use the top 1 document of each cluster to construct the top 10 results
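The selection step of the K-Means baseline can be sketched as follows. This assumes the cluster labels have already been produced by some K-Means implementation, that "top 1" means the document with the highest original relevance score in its cluster, and that the chosen representatives are then ordered by that score. The function name and these conventions are assumptions.

```python
def top_per_cluster(labels, relevance):
    """Pick the most relevant document of each cluster.

    labels:    doc id -> cluster id (from any K-Means run).
    relevance: doc id -> original full-text relevance score.
    Returns the representatives ordered by descending relevance.
    """
    best = {}
    for doc, cluster in labels.items():
        if cluster not in best or relevance[doc] > relevance[best[cluster]]:
            best[cluster] = doc
    return sorted(best.values(), key=lambda d: (-relevance[d], d))
```

With K=10 clusters this returns at most ten documents, one per non-empty cluster.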

Affinity Ranking in Newsgroup dataset

Query

We compare our approach with the Okapi system in three aspects: diversity, information richness, and relevance

Four researchers were hired to label the top 50 search results for each of the 20 queries based on the following steps:

N is the number of users

X could be the diversity, information richness, or relevance of the top search results

A and F represent results from our ranking scheme and full-text search, respectively

Improvement in Top 10 Search Results

The top 10 search results always receive the most attention from end-users

In this experiment, we use the rank-combination scheme with α = 0 and β = 1

Improvement within Top 50 Search Results

A Case Study

This example is extracted from our experiments on the Newsgroup search for the query “Outlook print error”

CONCLUSIONS

Proposed two new metrics, diversity and information richness

A novel ranking scheme, Affinity Ranking, is proposed to re-rank the search results

Our experiments showed that the proposed metrics and new ranking method can effectively improve the search performance

Future work includes scaling our Affinity Ranking computation, for example, to the Web scale