Affinity Rank – Yi Liu, Benyu Zhang, Zheng Chen, MSRA
Affinity Rank
Yi Liu, Benyu Zhang, Zheng Chen
MSRA
Outline
- Motivation
- Related Work
- Model & Algorithm
- Evaluation
- Conclusion & Future Work
Search for Useful Information
- Full-text search
- Importance judgment
- Manual compilation
Failures still exist
Example – “Spielberg” Search
Example – “Spielberg” Search (Cont.)
Motivation
Existing problems in IR applications:
- Similar search results dominate the top one or two pages
- Users grow tired of near-duplicate results on the same topic
- Users cannot find what they need among those similar results
Situations where the problem is, or will be, intensified:
- Highly repetitive corpora, e.g. newsgroups, news archives, specialized websites
- Generalized or short queries
Diversity & Informativeness
Diversity: the coverage of different topics by a group of documents
Informativeness: to what extent a document can represent its topic locality (high informativeness: inclusive)
Why?
Traditional IR evaluation measures:
- Maximize relevance between query & results
- Return the most important results
To end-users, relevant + important ≠ desirable
A way out:
- Increase diversity in the top results
- Increase the informativeness of each single result
Basic Idea
- Build a similarity-based link map
- Link analysis → Affinity Rank, indicating the informativeness of each document
- Rank adjustment: only the most informative document of each topic can rank high
- Re-rank with Affinity Rank → more diversified, more informative top results
Related Work – Link Analysis
Explicit links (web author’s perspective, subjective):
- PageRank (Page et al. 1998)
- HITS (Kleinberg, 1998)
Implicit links (end-user’s perspective, objective):
- DirectHit (http://www.directhit.com)
- Small Web Search (Xue et al. 2003)
Related Work – Clustering

| Algorithm | Complexity | Naming |
|---|---|---|
| Scatter/Gather* | O(kn) | Centroid + ranked words |
| TopCat | High | Set of named entities |
| WBSC* | O(m²+n) | Ranked words |
| STC* | O(n) | Sets of N-grams |
| IF | O(kn) | - |
| PRSA | O(knm) | Ranked words |
| Bipartite | O(nm)? | Ranked words |

n: #docs, k: #clusters, m: #words
* applied to clustering search results
Our Proposed IR Framework
[Framework diagram: the Document Collection is turned into an Affinity Graph, from which Informativeness scores and a Diversity Penalty are computed query-independently (Affinity Rank); at query time, Relevance to the Query is combined with Affinity Rank in a Re-rank step to produce the Output.]
Link Construction
- Similarity → directed link → directed graph
- Threshold: saves storage space and reduces the noise brought by the overwhelmingly large number of weak-similarity links
[Diagram: two document nodes A and B connected by directed links in both directions]
$$\mathrm{sim}(A, B) = \cos(\vec{A}, \vec{B})$$
$$\mathrm{aff}(A \rightarrow B) = \mathrm{sim}(A, B), \qquad \mathrm{aff}(B \rightarrow A) = \mathrm{sim}(A, B)$$
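A minimal sketch of this link-construction step in NumPy (the function name and the threshold value 0.2 are illustrative; the slides do not give a concrete threshold):

```python
import numpy as np

def build_affinity_graph(doc_vectors, threshold=0.2):
    """Build a thresholded, row-normalized affinity matrix from document vectors.

    doc_vectors: (n_docs, n_terms) array of tf-idf vectors.
    threshold:   directed links with affinity below this value are dropped,
                 which saves storage and removes weak-similarity noise.
    """
    # Cosine similarity between every pair of documents: sim(A, B) = cos(A, B)
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / np.maximum(norms, 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)  # no self-links

    # Keep only links whose affinity reaches the threshold: aff(A -> B) = sim(A, B)
    aff = np.where(sim >= threshold, sim, 0.0)

    # Row-normalize to obtain the matrix used by the link-analysis step
    row_sums = aff.sum(axis=1, keepdims=True)
    return np.divide(aff, row_sums, out=np.zeros_like(aff), where=row_sums > 0)
```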
Assumption
Observation: relations among documents vary
- Some documents are similar, others are not; the degree of similarity varies
Assumptions:
- The more relatives a document has, the more informative it is
- The more informative a document’s relatives are, the more informative it is
Link Analysis
- Link map → adjacency matrix → row-normalize to $\tilde{M}$
- Based on the two assumptions
- Principal eigenvector → rank score
- Implementation: power method
$$AR_i = \sum_{\text{all } j} AR_j \,\tilde{M}_{j,i}$$
With a damping factor $c$ over $n$ documents:
$$AR_i = \frac{c}{n} + (1-c)\sum_{\text{all } j} AR_j \,\tilde{M}_{j,i},
\qquad
\tilde{M}^{*} = \frac{c}{n}\,\vec{e}\,\vec{e}^{\,T} + (1-c)\,\tilde{M}$$
where $\tilde{M}$ is the row-normalized affinity matrix and $\vec{e}$ is the all-ones vector.
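A short power-method sketch for the damped update above, taking the row-normalized matrix from the previous step (the damping value 0.15 and the tolerance are our assumptions):

```python
import numpy as np

def affinity_rank(M, c=0.15, tol=1e-8, max_iter=100):
    """Compute Affinity Rank scores by the power method.

    M: row-normalized affinity matrix (e.g. from build_affinity_graph).
    c: damping factor.
    Returns the principal eigenvector of the damped transition matrix,
    interpreted as the informativeness score of each document.
    """
    n = M.shape[0]
    ar = np.full(n, 1.0 / n)            # uniform initial scores
    for _ in range(max_iter):
        # AR_i = c/n + (1 - c) * sum_j AR_j * M[j, i]
        new_ar = c / n + (1.0 - c) * (M.T @ ar)
        new_ar /= new_ar.sum()          # keep the scores normalized
        if np.abs(new_ar - ar).sum() < tol:
            break
        ar = new_ar
    return ar
```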
“Random Transform” Model
- A transforming document jumps from doc. to doc. at each time step
- Markov chain: stationary transition probabilities; principal eigenvector → informativeness
[Diagram: from the current doc., the walk follows an affinity link to a “relative” doc., or with probability c jumps to a randomly picked doc.]
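Written as a worked equation (our restatement of the slide’s Markov-chain claim, using the matrix $\tilde{M}^{*}$ from the previous slide):

$$\pi^{T} = \pi^{T}\,\tilde{M}^{*}, \qquad AR_i = \pi_i$$

i.e. the Affinity Rank vector is the stationary distribution (principal eigenvector) of the damped transition matrix.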
Rank Adjustment
Greedy-like algorithm: once the most informative document i of a topic is picked, decrease the score of every other document j by the part conveyed from i:
$$AR_j \leftarrow AR_j - \tilde{M}_{i,j}\,AR_i$$
[Diagram: documents grouped into topics T1 (T1-1 … T1-6) and T2 (T2-1 … T2-3); only the most informative document of each topic keeps a high score.]
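A greedy sketch of this rank adjustment, applying the penalty above each time a document is picked (the selection size k and the function name are ours):

```python
import numpy as np

def diversity_penalty_rerank(M, ar, k=10):
    """Greedy rank adjustment: pick documents one by one, penalizing
    documents that are already covered by what has been picked.

    M:  row-normalized affinity matrix.
    ar: Affinity Rank scores from the power method.
    k:  number of documents to select.
    """
    scores = ar.astype(float).copy()
    picked = []
    candidates = set(range(len(ar)))
    while candidates and len(picked) < k:
        # Pick the currently most informative remaining document
        i = max(candidates, key=lambda d: scores[d])
        picked.append(i)
        candidates.remove(i)
        # Decrease each remaining score by the part conveyed from i
        for j in candidates:
            scores[j] -= M[i, j] * scores[i]
    return picked
```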
Re-rank
Score-combine scheme:
$$\mathrm{Score}(q, d_i) = \alpha\,\frac{\mathrm{Sim}(q, d_i)}{\overline{\mathrm{Sim}}(q)} + (1-\alpha)\,\frac{\log AR_{d_i}}{\log \overline{AR}}$$
where $\overline{\mathrm{Sim}}(q) = \max_{d_i}\mathrm{Sim}(q, d_i)$ and $\overline{AR} = \max_{d_i} AR_{d_i}$.
Rank-combine scheme:
$$\mathrm{Score}(q, d_i) = \alpha\,\mathrm{Rank}^{-1}_{\mathrm{Sim}(q, d_i)} + (1-\alpha)\,\mathrm{Rank}^{-1}_{AR_{d_i}}$$
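A sketch of the score-combine re-ranking along the lines of the formula above; the weight alpha, the max-normalization of relevance, and the rescaling of AR scores before taking the log are assumptions on our part:

```python
import numpy as np

def score_combine_rerank(sim_q, ar, alpha=0.5):
    """Blend query relevance with Affinity Rank (score-combine scheme).

    sim_q: relevance scores Sim(q, d_i) for the answer set (e.g. Okapi scores).
    ar:    Affinity Rank scores of the same documents.
    alpha: combination weight; alpha = 1 keeps the relevance-only ranking,
           alpha = 0 re-ranks purely by Affinity Rank.
    Returns document indices sorted best-first by the combined score.
    """
    sim_q = np.asarray(sim_q, dtype=float)
    ar = np.asarray(ar, dtype=float)
    rel = sim_q / sim_q.max()          # normalize relevance to [0, 1]
    ar = ar / ar.min()                 # rescale so every AR score is >= 1
    denom = np.log(ar.max())
    info = np.log(ar) / denom if denom > 0 else np.ones_like(ar)
    combined = alpha * rel + (1.0 - alpha) * info
    return np.argsort(-combined)
```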
Advantages of Affinity Rank
- Gives attention to both diversity and informativeness
- Implicitly expands the query towards multiple topics
- Automatically picks the representative documents for each chosen topic
- Most of the computation can be done OFFLINE
Experiment Setup
Dataset
- Microsoft Newsgroup: 117 Office-product-related newsgroups
- 256,449 posts (mainly within 4 months), about 400 MB
Preprocessing
- Title & text body (citations, signatures, etc. stripped)
- Stemming, stop-word removal, tf-idf weighting
Queries
- 20 randomly picked query scenarios with query words
Search Results
- Okapi; top 50 results as the answer set
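One plausible way to set up this kind of preprocessing in Python (scikit-learn and NLTK are our choice of tools; the example posts are made up):

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Example posts: title + text body with citations/signatures already stripped
posts = [
    "excel formula returns wrong value when cell is empty",
    "outlook rules stop working after moving pst file",
]

stemmer = PorterStemmer()
base_analyzer = TfidfVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(text):
    # Tokenize, drop stop words, then stem each remaining term
    return [stemmer.stem(token) for token in base_analyzer(text)]

# tf-idf weighting over the stemmed, stop-word-filtered vocabulary
vectorizer = TfidfVectorizer(analyzer=stemmed_analyzer)
doc_vectors = vectorizer.fit_transform(posts).toarray()
```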
Evaluation – Ground Truth
User study
- 4 users independently evaluated all results
- For each query:
  - First, manually cluster all results into different topics
  - Then, score each result by its informativeness within the corresponding topic
  - Finally, score each result by its relevance to the query
Evaluation
- Compare the original ranking with the new ranking (re-ranked by Affinity Rank)
- 3 aspects of ranking are considered: diversity, informativeness & relevance in the top n results
Definitions
Diversity: the number of different topics in a document group
Informativeness: 3 - very informative, 2 - informative, 1 - somewhat informative, 0 - not informative
Relevance: 1 - relevant, 0 - hard to tell, -1 - irrelevant
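A small sketch of how these per-query measures might be computed for a top-n result list (the data layout is an assumption):

```python
def evaluate_top_n(results, topic_of, informativeness, relevance, n=10):
    """Compute diversity, average informativeness and average relevance
    of the top-n results, following the definitions above.

    results:         ranked list of document ids.
    topic_of:        dict doc id -> manually assigned topic label.
    informativeness: dict doc id -> score in {0, 1, 2, 3}.
    relevance:       dict doc id -> score in {-1, 0, 1}.
    """
    top = results[:n]
    diversity = len({topic_of[d] for d in top})   # number of distinct topics
    avg_info = sum(informativeness[d] for d in top) / len(top)
    avg_rel = sum(relevance[d] for d in top) / len(top)
    return diversity, avg_info, avg_rel
```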
Experiment Result (1)
Top 10 search results, compared to traditional IR results:

| | Diversity | Informativeness | Relevance |
|---|---|---|---|
| Relative change | +31.02% | +11.97% | +0.72% |
| p value (t-test) | 0.004632 | 0.002225 | 0.067255 |

Significant improvement in diversity & informativeness without loss in relevance.
Experiment Result (2)
[Charts: diversity improvement and informativeness improvement, re-ranking all top 50 results by Affinity Rank]
Affinity Rank effectively improves both the diversity & informativeness of top search results.
Experiment Result (3) – Parameter Tuning
Top 10 search results.
Affinity Rank is robust:
1. The parameter does not matter much once enough weight is given
2. No over-tuning problem: simply re-ranking everything by Affinity Rank is nearly optimal
Experiment Result (4) – Parameter Tuning
Overview of the improvement subject to weight adjustment:
Affinity Rank CONSISTENTLY exerts a positive influence on diversity & informativeness.
Conclusion
- A new IR framework: Affinity Rank helps to improve the diversity & informativeness of search results, especially the TOP ones
- Affinity Rank is computed offline, and therefore adds little burden to online retrieval
Future Work
- Metrics for measuring information quantity
- Scaling to large collections
Thanks