TRANSCRIPT
Yin Yang (Hong Kong University of Science and Technology), Nilesh Bansal (University of Toronto), Wisam Dakka (Google), Panagiotis Ipeirotis (New York University), Nick Koudas (University of Toronto), Dimitris Papadias (Hong Kong University of Science and Technology)
Explosion of Web 2.0 content: blogs, micro-blogs, social networking
Need for "cross referencing" on the web: after reading a news article, we wonder whether any blogs discuss it, and vice versa
A service of the BlogScope system, a real blog search engine serving 20K users/day
Input: a text document
Output: relevant blog posts
Methodology:
extract key phrases from the input document
use these phrases to query BlogScope
Novel Query-by-Document (QBD) model
Practical phrase extractor
Phrase set enhancement with Wikipedia knowledge (QBD-W)
Evaluation of all proposed methods using Amazon Mechanical Turk
Human annotators take the tasks seriously because they are paid
Example of relevance feedback (RF)
Distinctions between RF and QBD:
RF involves interaction, while QBD does not
RF is most effective for improving recall, whereas QBD aims at both high precision and recall
RF starts with a keyword query; QBD directly takes a document as input
Two classes of methods:
Very slow but accurate, from the machine learning community
Practical but less accurate (our method falls in this category)
Phrase extraction in QBD has distinct goals:
Document retrieval accuracy is more important than the accuracy of the phrase set itself
A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments
Query expansion: used when the user's keyword set does not express her information need properly
PageRank, TrustRank, …: QBD-W follows this framework
Wikipedia mining
Recall that Query-by-Document:
extracts key phrases from the input document
and then queries them against a search engine
Idea: given a query document D:
identify all phrases in D
score each individual phrase
obtain the set of phrases with the highest scores, and refine it
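The three steps above can be sketched end to end. This is a toy stand-in: the extractor here just keeps capitalized words and the score is raw frequency, whereas the real system uses POS patterns and the scoring functions described later; `extract_phrases` and `query_by_document` are illustrative names.

```python
import re
from collections import Counter

def extract_phrases(doc):
    # Toy stand-in for the POS-pattern extractor: keep capitalized words
    return [w for w in re.findall(r"[A-Za-z]+", doc) if w[0].isupper()]

def query_by_document(doc, search, k=3):
    # Steps 1-2: identify candidate phrases and score them (here: frequency)
    counts = Counter(extract_phrases(doc))
    # Step 3: keep the top-k phrases and query the search engine with them
    top = [p for p, _ in counts.most_common(k)]
    return search(top)

# `search` would call BlogScope; the identity function shows the query set
posts = query_by_document("Nintendo Wii beats Sony. Nintendo wins.", lambda qs: qs)
# posts == ["Nintendo", "Wii", "Sony"]
```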
Process the document with a Part-of-Speech (POS) tagger: nouns, adjectives, verbs, …
We compiled a list of POS patterns, indexed by a POS trie forest
Each term sequence following such a POS pattern is considered a phrase
Pattern   Instance
N         Nintendo
JN        global warming
NN        Apple computer
JJN       declarative approximate selection
NNN       computer science department
JCJN      efficient and effective algorithm
JNNN      Junior United States Senator
NNNN      Microsoft Host Integration Server
…         …
NNNNN     United States President Barack Obama
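A minimal sketch of matching a tagged term sequence against a trie of POS patterns, assuming tags N (noun), J (adjective), C (conjunction) and a pre-tagged input; `build_trie` and `extract` are illustrative names, not the system's actual API.

```python
def build_trie(patterns):
    # Each pattern is a string of tags, e.g. "JN"; "$" marks end-of-pattern
    root = {}
    for pat in patterns:
        node = root
        for tag in pat:
            node = node.setdefault(tag, {})
        node["$"] = True
    return root

def extract(tagged, trie):
    # Collect every term sequence whose tag sequence matches a pattern
    phrases = []
    for i in range(len(tagged)):
        node, j = trie, i
        while j < len(tagged) and tagged[j][1] in node:
            node = node[tagged[j][1]]
            j += 1
            if "$" in node:
                phrases.append(" ".join(w for w, _ in tagged[i:j]))
    return phrases

trie = build_trie(["N", "JN", "NN", "NNN"])
tagged = [("global", "J"), ("warming", "N"), ("report", "N")]
found = extract(tagged, trie)
# found == ["global warming", "warming", "warming report", "report"]
```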
Two scoring functions:
f_t, based on TF/IDF
f_l, based on the concept of mutual information
f_t(c) = \sum_{i=1}^{|c|} tfidf(w_i) \cdot coherence(c)

coherence(c) = \frac{(1 + \log tf(c)) \cdot tf(c)}{\frac{1}{|c|} \sum_{i=1}^{|c|} tf(w_i)}
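The two formulas above can be sketched directly in Python; the term-frequency and tf-idf values are assumed inputs here (in practice they would come from corpus statistics), and the function names are illustrative.

```python
from math import log

def coherence(tf_c, tf_words):
    # (1 + log tf(c)) * tf(c), divided by the mean term frequency of c's words
    mean_tf = sum(tf_words) / len(tf_words)
    return (1 + log(tf_c)) * tf_c / mean_tf

def f_t(tfidf_words, tf_c, tf_words):
    # Sum of per-word tf-idf values, weighted by the phrase's coherence
    return sum(tfidf_words) * coherence(tf_c, tf_words)

# A phrase seen once whose words each occur once has coherence 1,
# so f_t reduces to the plain tf-idf sum
score = f_t([2.0, 3.0], tf_c=1, tf_words=[1, 1])
```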
Extract the most characteristic phrases from the input document D
But we may obtain term sequences that are not really phrases
Example: "moment Dow Jones" in "at this moment Dow Jones"
Pointwise mutual information (PMI): the log-ratio of the joint probability of a pair of events to the product of their individual probabilities
Eliminates non-phrases
PMI(x, y) = \log \frac{prob(x, y)}{prob(x) \cdot prob(y)}

PMI(c) = \log \frac{prob(c)}{\prod_{i=1}^{|c|} prob(w_i)}

prob(c) = \frac{tf(c)}{tf(POS)}, \quad prob(w_i) = \frac{tf(w_i)}{tf(POS)}
f_l(c) = \Big( \sum_{i=1}^{|c|} prob(c) \cdot idf(w_i) \Big) \cdot \log prob(c) \cdot PMI(c)
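The phrase-level PMI can be sketched as follows, assuming term frequencies measured over all POS-matched sequences (tf(POS)); `pmi_phrase` is an illustrative name.

```python
from math import log

def pmi_phrase(tf_c, tf_words, tf_pos):
    # PMI(c) = log( prob(c) / prod_i prob(w_i) ), with probabilities
    # estimated as term frequency over the total POS-sequence frequency
    prob_c = tf_c / tf_pos
    prod_words = 1.0
    for tf_w in tf_words:
        prod_words *= tf_w / tf_pos
    return log(prob_c / prod_words)

# A bigram whose words always co-occur scores higher than chance predicts
score = pmi_phrase(tf_c=10, tf_words=[10, 10], tf_pos=100)
```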
Take the top-k phrases with the highest scores
Eliminate duplicates: two different phrases may carry similar meanings
Remove phrases that are
▪ subsumed by another phrase with a higher score
▪ different from a better phrase only in the last term
▪ and other rules …
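The first two pruning rules can be sketched as follows; `refine` is an illustrative name, and the remaining rules are omitted.

```python
def refine(scored_phrases):
    # Walk phrases in descending score order, dropping each one that is
    # subsumed by, or differs only in the last term from, a kept phrase
    kept = []
    for score, phrase in sorted(scored_phrases, reverse=True):
        words = phrase.split()
        duplicate = any(
            phrase in better  # subsumed by a higher-scoring phrase
            or (len(words) > 1 and words[:-1] == better.split()[:-1])
            for _, better in kept)
        if not duplicate:
            kept.append((score, phrase))
    return [p for _, p in kept]

result = refine([(3, "global warming"), (2, "warming"),
                 (1, "global warming trend")])
# result == ["global warming", "global warming trend"]
```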
Motivation: the user may also be interested in web documents related to the given one that do not contain the same key phrases
Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents
Main idea:
Obtain an initial phrase set with QBD
Use Wikipedia knowledge to identify phrases related to the initial phrases
Our method follows the spreading-activation framework
Given an initial phrase set:
Locate the nodes corresponding to these phrases on the Wiki graph
Assign weights to these nodes
Iteratively spread node weights to neighbors
▪ Assume the random-surfer model
▪ With a certain probability, return to one of the initial nodes
S is the initial phrase set; initial weights are normalized; s(c_v) is the score of c_v, assigned by QBD

RR^0(v) = \frac{s(c_v)}{\sum_{v' \in S} s(c_{v'})} if v \in S, and 0 otherwise
              Wii   Sony  Nintendo  Play Station  Tomb Raider
Wii           0     2/10  7/10      1/10          0
Sony          0     0     0         4/4           0
Nintendo      5/6   1/6   0         0             0
Play Station  2/11  6/11  1/11      0             2/11
Tomb Raider   0     0     0         1/1           0
T[v, v'] = \frac{wt_e}{\sum_{e' = [v, w] \in E} wt_{e'}} if e = [v, v'] \in E, and 0 otherwise
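Row-normalizing the edge weights, as in the Wii row of the example (weights 2, 7, 1 become 2/10, 7/10, 1/10), can be sketched as follows; the edge set below covers only part of the example graph.

```python
def transition_matrix(edges, nodes):
    # T[v][w] = wt(v, w) / total weight of v's outgoing edges; 0 if no edge
    T = {v: {w: 0.0 for w in nodes} for v in nodes}
    out_weight = {v: 0.0 for v in nodes}
    for (v, w), wt in edges.items():
        out_weight[v] += wt
    for (v, w), wt in edges.items():
        T[v][w] = wt / out_weight[v]
    return T

nodes = ["Wii", "Sony", "Nintendo", "Play Station"]
edges = {("Wii", "Sony"): 2, ("Wii", "Nintendo"): 7,
         ("Wii", "Play Station"): 1, ("Nintendo", "Wii"): 5,
         ("Nintendo", "Sony"): 1}
T = transition_matrix(edges, nodes)
# T["Wii"]["Nintendo"] == 7/10
```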
With probability α_{v'}, proceed to a neighbor; otherwise, return to one of the initial nodes
α_{v'} is a function of the node v'
RR^{i+1}(v) = \sum_{e=[v',v] \in E} \alpha_{v'} \, T[v', v] \, RR^i(v') + RR^0(v) \sum_{v'} (1 - \alpha_{v'}) \, RR^i(v')
α_v is not a constant, unlike in other algorithms (e.g., TrustRank)
α_v gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones
This reduces the CPU overhead of RelevanceRank computation, since only a subset of nodes is considered
Important, as RelevanceRank is calculated online
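A sketch of the iteration under one reading of the random-surfer update: with probability α_{v'} the surfer follows an outgoing edge, otherwise it returns to an initial node in proportion to RR^0. The two-node graph and the α values below are invented for illustration.

```python
def relevance_rank(T, rr0, alpha, iters=50):
    # T: row-normalized transition matrix; rr0: normalized initial weights
    rr = dict(rr0)
    for _ in range(iters):
        pushed = {v: 0.0 for v in T}
        returned = 0.0
        for v_prev, mass in rr.items():
            # Mass that jumps back to the initial nodes this round
            returned += (1 - alpha[v_prev]) * mass
            # Mass that follows outgoing edges of v_prev
            for v, t in T[v_prev].items():
                pushed[v] += alpha[v_prev] * t * mass
        rr = {v: pushed[v] + rr0[v] * returned for v in T}
    return rr

T = {"A": {"A": 0.0, "B": 1.0}, "B": {"A": 1.0, "B": 0.0}}
rr = relevance_rank(T, rr0={"A": 1.0, "B": 0.0},
                    alpha={"A": 0.5, "B": 0.5})
```

Total mass is conserved each round (rows of T sum to 1, RR^0 sums to 1), so the scores converge to a fixed point, mirroring the iteration table below.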
Iteration  Wii   Sony  Nintendo  Play Station
0          0     0     1         0
1          0.67  0.13  0.1       0
2          0.13  0.06  0.74      0.06
3          0.49  0.11  0.38      0.02
4          0.25  0.08  0.62      0.05
5          0.41  0.10  0.46      0.03
…          …     …     …         …
Infinite   0.35  0.09  0.52      0.03
Methodology: employ human annotators at Amazon Mechanical Turk
Dataset: a random sample of news articles from the New York Times, the Economist, Reuters, and the Financial Times during Aug–Sep 2007
Competitors for phrase extraction:
QBD-TFIDF (tf-idf scoring)
QBD-MI (mutual information scoring)
QBD-YAHOO (Yahoo! phrase extractor)
Quality of Phrase Retrieval
Quality of Document Retrieval
Efficiency
The total running time of QBD is negligible
l_max  Time (seconds)
1      0.160
2      1.142
3      10.262
4      57.915
5      143.828
We propose the query-by-document model:
two effective phrase extraction algorithms
enhancing the phrase set with the Wikipedia graph
Future work:
more sophisticated phrase extraction (e.g., with additional background knowledge)
blog matching using key phrases