Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University) Nick Koudas (University of Toronto) Dimitris Papadias (Hong Kong University of Science and Technology)

Page 2:

Explosion of Web 2.0 content
  blogs, micro-blogs, social networking

Need for “cross reference” on the web
  after we read a news article, we wonder if there are any blogs discussing it, and vice versa

Page 3:

A service of the BlogScope system
  a real blog search engine serving 20K users/day
Input: a text document
Output: relevant blog posts
Methodology
  extract key phrases from the input document
  use these phrases to query BlogScope

Page 6:

Novel Query-by-Document (QBD) model
Practical phrase extractor
Phrase set enhancement with Wikipedia knowledge (QBD-W)
Evaluation of all proposed methods using Amazon Mechanical Turk
  Human annotators take the tasks seriously because they are paid for them

Page 7:

Example of RF (relevance feedback)

Distinctions between RF and QBD
  RF involves user interaction, while QBD does not
  RF is most effective for improving recall, whereas QBD aims at both high precision and high recall
  RF starts with a keyword query; QBD directly takes a document as input

Page 8:

Two classes of methods
  Very slow but accurate, from the machine learning community
  Practical, but less accurate than the above (our method falls in this category)
Phrase extraction in QBD has distinct goals
  Document retrieval accuracy is more important than the accuracy of the phrase set itself
  A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments

Page 9:

Query expansion
  Used when the user's keyword set does not properly express what she is looking for
PageRank, TrustRank, …
  QBD-W follows this framework
Wikipedia mining

Page 10:

Recall that Query-by-Document
  Extracts key phrases from the input document
  And then queries a search engine with them
Idea: given a query document D
  Identify all phrases in D
  Score each individual phrase
  Obtain the set of phrases with the highest scores, and refine it

Page 11:

Process the document with a Part-of-Speech (POS) tagger
  Nouns, adjectives, verbs, …
We compiled a list of POS patterns
  Indexed by a POS trie forest
  Each term sequence following such a POS pattern is considered a phrase

Page 12:

Pattern   Instance
N         Nintendo
JN        global warming
NN        Apple computer
JJN       declarative approximate selection
NNN       computer science department
JCJN      efficient and effective algorithm
JNNN      Junior United States Senator
NNNN      Microsoft Host Integration Server
…         …
NNNNN     United States President Barack Obama
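
The pattern lookup can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: a single trie stands in for the POS trie forest, the pattern list is only the subset shown in the table, and the one-letter tags (N = noun, J = adjective, C = conjunction, V = verb) and the names `build_trie` and `extract_phrases` are assumptions made for the example.

```python
# Subset of POS patterns taken from the table; the full list is not given in the slides.
PATTERNS = ["N", "JN", "NN", "JJN", "NNN", "JCJN", "JNNN", "NNNN", "NNNNN"]

END = "$"  # marker meaning "a complete pattern ends at this trie node"


def build_trie(patterns):
    """Index the POS patterns in a nested-dict trie, one tag per level."""
    root = {}
    for pattern in patterns:
        node = root
        for tag in pattern:
            node = node.setdefault(tag, {})
        node[END] = True
    return root


def extract_phrases(tagged, trie):
    """Return every term sequence whose tag sequence spells a pattern.

    tagged is a list of (term, tag) pairs, as a POS tagger would produce.
    """
    phrases = []
    for start in range(len(tagged)):
        node = trie
        for end in range(start, len(tagged)):
            tag = tagged[end][1]
            if tag not in node:
                break  # no pattern continues with this tag
            node = node[tag]
            if END in node:
                phrases.append(" ".join(term for term, _ in tagged[start:end + 1]))
    return phrases


# Hypothetical tagged input, for illustration only.
TAGGED = [("global", "J"), ("warming", "N"), ("affects", "V"),
          ("computer", "N"), ("science", "N"), ("department", "N")]
PHRASES = extract_phrases(TAGGED, build_trie(PATTERNS))
```

Walking the trie from each start position means every candidate is checked against all patterns at once, instead of once per pattern.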

Page 14:

Two scoring functions
  f_t, based on TF/IDF
  f_l, based on the concept of mutual information

Page 15:

$$ f_t(c) = \sum_{i=1}^{|c|} \mathrm{tfidf}(w_i) \times \mathrm{coherence}(c) $$

$$ \mathrm{coherence}(c) = \frac{tf(c) \times (1 + \log tf(c))}{\frac{1}{|c|} \sum_{i=1}^{|c|} tf(w_i)} $$

Extracts the most characteristic phrases from the input document D

But may obtain term sequences that are not really phrases
  Example: “moment Dow Jones” in “at this moment Dow Jones”
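
A small sketch of the TF/IDF-based score, assuming it is the sum of the tf-idf values of the phrase's terms multiplied by a coherence factor, where coherence is tf(c) × (1 + log tf(c)) divided by the average term frequency of the terms in c. The natural logarithm, the function names, and the sample numbers are assumptions for illustration.

```python
import math


def coherence(tf_phrase, term_tfs):
    """tf(c) * (1 + log tf(c)) divided by the average tf of the terms of c."""
    average_term_tf = sum(term_tfs) / len(term_tfs)
    return tf_phrase * (1 + math.log(tf_phrase)) / average_term_tf


def f_t(tf_phrase, term_tfs, term_tfidfs):
    """Sum of per-term tf-idf values, scaled by the coherence of the phrase."""
    return sum(term_tfidfs) * coherence(tf_phrase, term_tfs)


# Hypothetical counts: a phrase seen 3 times whose two terms occur 3 and 5 times.
score_frequent = f_t(3, [3, 5], [0.8, 0.6])
# The same terms, but the sequence itself occurs only once in the document.
score_rare = f_t(1, [3, 5], [0.8, 0.6])
```

With identical terms, the sequence that occurs together more often gets the larger coherence and therefore the larger score.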

Page 16:

MI: the joint probability of a pair of events, relative to their individual probabilities

Eliminates non-phrases

$$ PMI(x, y) = \log \frac{prob(x, y)}{prob(x)\, prob(y)} $$

$$ PMI(c) = \log \frac{prob(c)}{\prod_{i=1}^{|c|} prob(w_i)} $$

$$ prob(c) = \frac{tf(c)}{tf(POS_c)} $$

$$ prob(w_i) = \frac{tf(w_i)}{tf(POS_{w_i})} $$

f_l(c) is computed from prob(c), the idf(w_i) of the terms w_1, …, w_{|c|} of c, and PMI(c)
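
A rough sketch of the phrase-level PMI, assuming PMI(c) is the log of prob(c) over the product of the prob(w_i), with each probability being a term frequency divided by a POS-based frequency. Reading tf(POS) as the number of sequences in the document with the same POS pattern is an assumption, as are the function names and the sample counts.

```python
import math


def prob(tf, tf_pos):
    """Frequency of an item relative to the frequency of its POS pattern."""
    return tf / tf_pos


def pmi(tf_c, tf_pos_c, terms):
    """PMI of a phrase c.

    terms is a list of (tf_w, tf_pos_w) pairs, one per term of c.
    """
    p_c = prob(tf_c, tf_pos_c)
    p_terms = math.prod(prob(tf_w, tf_pos_w) for tf_w, tf_pos_w in terms)
    return math.log(p_c / p_terms)


# Hypothetical counts: two terms that almost always occur together ...
pmi_real_phrase = pmi(4, 20, [(4, 40), (5, 40)])
# ... versus two frequent terms that are adjacent only once.
pmi_accidental = pmi(1, 20, [(10, 40), (12, 40)])
```

Terms that co-occur more often than their individual frequencies predict get a high PMI, which is what lets this score filter out accidental sequences.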

Page 17:

Take the top-k phrases with the highest scores

Eliminate duplicates
  Two different phrases may carry similar meanings
  Remove phrases that
    ▪ Are subsumed by another phrase with a higher score
    ▪ Differ from a better phrase only in the last term
    ▪ And other rules …
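
The two stated elimination rules can be sketched as a greedy pass over the phrases in descending score order. This is only an illustration: the slides do not say whether duplicates are removed before or after the top-k cut (the sketch fills k slots after removal), "subsumed" is read as "contiguous sub-sequence of", the last-term rule is applied only to multi-term phrases, and the unlisted "other rules" are not modelled.

```python
def contains(longer, shorter):
    """True if the term list `shorter` occurs contiguously inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def refine(scored, k):
    """Keep up to k phrases, skipping those redundant with a better one.

    scored is a list of (phrase, score) pairs.
    """
    kept = []
    for phrase, _ in sorted(scored, key=lambda item: -item[1]):
        terms = phrase.split()
        redundant = False
        for better in kept:
            better_terms = better.split()
            # Rule 1: subsumed by a phrase with a higher score.
            if contains(better_terms, terms):
                redundant = True
            # Rule 2: differs from a better phrase only in the last term.
            elif (len(terms) > 1 and len(terms) == len(better_terms)
                  and terms[:-1] == better_terms[:-1]):
                redundant = True
            if redundant:
                break
        if not redundant:
            kept.append(phrase)
            if len(kept) == k:
                break
    return kept


# Hypothetical scores, for illustration only.
SCORED = [("computer science department", 5.0), ("computer science", 4.0),
          ("computer science faculty", 3.0), ("global warming", 2.0),
          ("Nintendo", 1.0)]
TOP = refine(SCORED, 3)
```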

Page 18:

Motivation:
  The user may also be interested in web documents that are related to the given one but do not contain the same key phrases
  Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents
Main idea:
  Obtain an initial phrase set with QBD
  Use Wikipedia knowledge to identify phrases that are related to the initial phrases
  Our method follows the spreading-activation framework

Page 20:

Given an initial phrase set
  Locate the nodes corresponding to these phrases on the Wiki Graph
  Assign weights to these nodes
  Iteratively spread node weights to neighbors
    ▪ Assume the random surfer model
    ▪ With a certain probability, return to one of the initial nodes

Page 21:

S is the initial phrase set
Initial weights are normalized
s(c_v) is the score of c_v, assigned by QBD

$$ RR^0_v = \begin{cases} \dfrac{s(c_v)}{\sum_{v' \in S} s(c_{v'})} & \text{if } v \in S \\ 0 & \text{otherwise} \end{cases} $$

Page 22:

               Wii    Sony   Nintendo  Play Station  Tomb Raider
Wii            0      2/10   7/10      1/10          0
Sony           0      0      0         4/4           0
Nintendo       5/6    1/6    0         0             0
Play Station   2/11   6/11   1/11      0             2/11
Tomb Raider    0      0      0         1/1           0

$$ T[v, v'] = \begin{cases} \dfrac{wt_e}{\sum_{e' = (v, w)} wt_{e'}} & \text{if } e = (v, v') \in E \\ 0 & \text{otherwise} \end{cases} $$
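
As an illustration of how such a transition matrix could be built, the sketch below divides each outgoing edge weight by the total outgoing weight of its source node. The edge weights are read off the numerators of the example fractions, which is an interpretation of the table, and `transition_matrix` is a name chosen for this sketch.

```python
from fractions import Fraction

# Outgoing edge weights per node, taken from the numerators in the example table.
WEIGHTS = {
    "Wii": {"Sony": 2, "Nintendo": 7, "Play Station": 1},
    "Sony": {"Play Station": 4},
    "Nintendo": {"Wii": 5, "Sony": 1},
    "Play Station": {"Wii": 2, "Sony": 6, "Nintendo": 1, "Tomb Raider": 2},
    "Tomb Raider": {"Play Station": 1},
}


def transition_matrix(weights):
    """Row-normalize edge weights; absent entries are implicitly zero."""
    matrix = {}
    for source, outgoing in weights.items():
        total = sum(outgoing.values())
        matrix[source] = {target: Fraction(wt, total)
                          for target, wt in outgoing.items()}
    return matrix


T = transition_matrix(WEIGHTS)
```

Using `Fraction` keeps the entries exact, so each row can be checked to sum to exactly 1.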

Page 23:

With probability α_{v'}, proceed to a neighbor;
otherwise, return to one of the initial nodes

α_{v'} is a function of the node v'

$$ RR^i_v = \sum_{e=(v',v)} \alpha_{v'} \, RR^{i-1}_{v'} \, T[v', v] + RR^0_v \sum_{e=(v',v)} (1 - \alpha_{v'}) \, RR^{i-1}_{v'} $$

Page 24:

α_v is not a constant, unlike in other algorithms (e.g., TrustRank)

α_v gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones
  Reduces the CPU overhead of RelevanceRank computation, since only a subset of nodes is considered
  Important, as RelevanceRank is calculated online
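
A minimal sketch of this kind of spreading activation with a node-dependent α, under several assumptions: the restart mass is gathered from every node so that the total weight stays at 1, α decays linearly with hop distance from the initial nodes and reaches zero at `l_max` hops (the slides do not give the actual function), and the graph is the five-node example with edge weights read off the numerators of its fractions. The function names are hypothetical.

```python
from collections import deque

WEIGHTS = {
    "Wii": {"Sony": 2, "Nintendo": 7, "Play Station": 1},
    "Sony": {"Play Station": 4},
    "Nintendo": {"Wii": 5, "Sony": 1},
    "Play Station": {"Wii": 2, "Sony": 6, "Nintendo": 1, "Tomb Raider": 2},
    "Tomb Raider": {"Play Station": 1},
}
# Row-normalized transition probabilities.
T = {v: {w: wt / sum(out.values()) for w, wt in out.items()}
     for v, out in WEIGHTS.items()}


def alpha_by_distance(T, seeds, alpha_max=0.8, l_max=3):
    """Assumed decay: alpha shrinks linearly with hop distance, zero at l_max."""
    dist = {v: 0 for v in seeds}
    queue = deque(seeds)
    while queue:  # breadth-first search from the initial nodes
        v = queue.popleft()
        for w in T[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return {v: alpha_max * max(0.0, 1 - dist.get(v, l_max) / l_max) for v in T}


def relevance_rank(T, seed_scores, alpha, iterations=50):
    """Spread weight to neighbors with prob. alpha, else restart at a seed."""
    total = sum(seed_scores.values())
    rr0 = {v: seed_scores.get(v, 0.0) / total for v in T}  # normalized start
    rr = dict(rr0)
    for _ in range(iterations):
        nxt = {v: 0.0 for v in T}
        restart = 0.0
        for v_prev, mass in rr.items():
            for v, t in T[v_prev].items():
                nxt[v] += alpha[v_prev] * mass * t
            restart += (1 - alpha[v_prev]) * mass
        for v in T:
            nxt[v] += rr0[v] * restart  # restart mass goes back to the seeds
        rr = nxt
    return rr


ALPHA = alpha_by_distance(T, ["Nintendo"])
RR_1 = relevance_rank(T, {"Nintendo": 1.0}, ALPHA, iterations=1)
RR = relevance_rank(T, {"Nintendo": 1.0}, ALPHA)
```

A node whose α is zero sends all of its weight back to the initial nodes, so nodes beyond it never receive weight and can be skipped.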

Page 25:

Iteration  Wii    Sony   Nintendo  Play Station
0          0      0      1         0
1          0.67   0.13   0.1       0
2          0.13   0.06   0.74      0.06
3          0.49   0.11   0.38      0.02
4          0.25   0.08   0.62      0.05
5          0.41   0.10   0.46      0.03
…          …      …      …         …
Infinite   0.35   0.09   0.52      0.03

Page 26:

Methodology
  Employ human annotators at Amazon MTurk
Dataset
  A random sample of news articles from the New York Times, the Economist, Reuters, and the Financial Times during Aug-Sep 2007
Competitors for phrase extraction
  QBD-TFIDF (tf-idf scoring)
  QBD-MI (mutual information scoring)
  QBD-YAHOO (Yahoo! phrase extractor)

Page 27:

Quality of Phrase Retrieval
Quality of Document Retrieval
Efficiency
  The total running time of QBD is negligible

Page 32:

lmax   Time (seconds)
1      0.160
2      1.142
3      10.262
4      57.915
5      143.828

Page 33:

We propose
  the query-by-document model
  two effective phrase extraction algorithms
  enhancing the phrase set with the Wikipedia graph
Future work
  more sophisticated phrase extraction (e.g., with additional background knowledge)
  blog matching using key phrases
