TRANSCRIPT
Yin Yang (Hong Kong University of Science and Technology), Nilesh Bansal (University of Toronto), Wisam Dakka (Google), Panagiotis Ipeirotis (New York University), Nick Koudas (University of Toronto), Dimitris Papadias (Hong Kong University of Science and Technology)
Explosion of Web 2.0 content: blogs, micro-blogs, social networking
Need for "cross referencing" on the web: after reading a news article, we wonder whether any blogs discuss it, and vice versa
A service of the BlogScope system, a real blog search engine serving 20K users/day
Input: a text document
Output: relevant blog posts
Methodology:
extract key phrases from the input document
use these phrases to query BlogScope
Novel Query-by-Document (QBD) model
Practical phrase extractor
Phrase set enhancement with Wikipedia knowledge (QBD-W)
Evaluation of all proposed methods using Amazon Mechanical Turk
Human annotators take the tasks seriously because they are paid
Example of relevance feedback (RF)
Distinctions between RF and QBD:
RF involves interaction, while QBD does not
RF is most effective for improving recall, whereas QBD aims at both high precision and recall
RF starts with a keyword query; QBD directly takes a document as input
Two classes of methods:
Very slow but accurate, from the machine learning community
Practical but less accurate (our method falls in this category)
Phrase extraction in QBD has distinct goals:
Document retrieval accuracy is more important than the accuracy of the phrase set itself
A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments
Query expansion: used when the user's keyword set does not express her information need properly
PageRank, TrustRank, …: QBD-W follows this framework
Wikipedia mining
Recall that Query-by-Document:
extracts key phrases from the input document
and then queries them against a search engine
Idea: given a query document D:
identify all phrases in D
score each individual phrase
obtain the set of phrases with the highest scores, and refine it
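The three steps above can be sketched end to end. This is a toy stand-in: the extractor here just keeps capitalized words and the score is raw frequency, whereas the real system uses POS patterns and the scoring functions described later; `extract_phrases` and `query_by_document` are illustrative names.

```python
import re
from collections import Counter

def extract_phrases(doc):
    # Toy stand-in for the POS-pattern extractor: keep capitalized words
    return [w for w in re.findall(r"[A-Za-z]+", doc) if w[0].isupper()]

def query_by_document(doc, search, k=3):
    # Steps 1-2: identify candidate phrases and score them (here: frequency)
    counts = Counter(extract_phrases(doc))
    # Step 3: keep the top-k phrases and query the search engine with them
    top = [p for p, _ in counts.most_common(k)]
    return search(top)

# `search` would call BlogScope; the identity function shows the query set
posts = query_by_document("Nintendo Wii beats Sony. Nintendo wins.", lambda qs: qs)
# posts == ["Nintendo", "Wii", "Sony"]
```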
Process the document with a Part-of-Speech (POS) tagger: nouns, adjectives, verbs, …
We compiled a list of POS patterns, indexed by a POS trie forest
Each term sequence following such a POS pattern is considered a phrase
Pattern   Instance
N         Nintendo
JN        global warming
NN        Apple computer
JJN       declarative approximate selection
NNN       computer science department
JCJN      efficient and effective algorithm
JNNN      Junior United States Senator
NNNN      Microsoft Host Integration Server
…         …
NNNNN     United States President Barack Obama
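A minimal sketch of matching a tagged term sequence against a trie of POS patterns, assuming tags N (noun), J (adjective), C (conjunction) and a pre-tagged input; `build_trie` and `extract` are illustrative names, not the system's actual API.

```python
def build_trie(patterns):
    # Each pattern is a string of tags, e.g. "JN"; "$" marks end-of-pattern
    root = {}
    for pat in patterns:
        node = root
        for tag in pat:
            node = node.setdefault(tag, {})
        node["$"] = True
    return root

def extract(tagged, trie):
    # Collect every term sequence whose tag sequence matches a pattern
    phrases = []
    for i in range(len(tagged)):
        node, j = trie, i
        while j < len(tagged) and tagged[j][1] in node:
            node = node[tagged[j][1]]
            j += 1
            if "$" in node:
                phrases.append(" ".join(w for w, _ in tagged[i:j]))
    return phrases

trie = build_trie(["N", "JN", "NN", "NNN"])
tagged = [("global", "J"), ("warming", "N"), ("report", "N")]
found = extract(tagged, trie)
# found == ["global warming", "warming", "warming report", "report"]
```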
Two scoring functions:
f_t, based on TF/IDF
f_l, based on the concept of mutual information
f_t(c) = \sum_{i=1}^{|c|} tfidf(w_i) \cdot coherence(c)

coherence(c) = \frac{(1 + \log tf(c)) \cdot tf(c)}{\frac{1}{|c|} \sum_{i=1}^{|c|} tf(w_i)}
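The two formulas above can be sketched directly in Python; the term-frequency and tf-idf values are assumed inputs here (in practice they would come from corpus statistics), and the function names are illustrative.

```python
from math import log

def coherence(tf_c, tf_words):
    # (1 + log tf(c)) * tf(c), divided by the mean term frequency of c's words
    mean_tf = sum(tf_words) / len(tf_words)
    return (1 + log(tf_c)) * tf_c / mean_tf

def f_t(tfidf_words, tf_c, tf_words):
    # Sum of per-word tf-idf values, weighted by the phrase's coherence
    return sum(tfidf_words) * coherence(tf_c, tf_words)

# A phrase seen once whose words each occur once has coherence 1,
# so f_t reduces to the plain tf-idf sum
score = f_t([2.0, 3.0], tf_c=1, tf_words=[1, 1])
```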
Extract the most characteristic phrases from the input document D
But we may obtain term sequences that are not really phrases
Example: "moment Dow Jones" in "at this moment Dow Jones"
Pointwise mutual information (PMI): the log-ratio of the joint probability of a pair of events to the product of their individual probabilities
Eliminates non-phrases
PMI(x, y) = \log \frac{prob(x, y)}{prob(x) \cdot prob(y)}

PMI(c) = \log \frac{prob(c)}{\prod_{i=1}^{|c|} prob(w_i)}

prob(c) = \frac{tf(c)}{tf(POS)}, \quad prob(w_i) = \frac{tf(w_i)}{tf(POS)}
f_l(c) = \Big( \sum_{i=1}^{|c|} prob(c) \cdot idf(w_i) \Big) \cdot \log prob(c) \cdot PMI(c)
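The phrase-level PMI can be sketched as follows, assuming term frequencies measured over all POS-matched sequences (tf(POS)); `pmi_phrase` is an illustrative name.

```python
from math import log

def pmi_phrase(tf_c, tf_words, tf_pos):
    # PMI(c) = log( prob(c) / prod_i prob(w_i) ), with probabilities
    # estimated as term frequency over the total POS-sequence frequency
    prob_c = tf_c / tf_pos
    prod_words = 1.0
    for tf_w in tf_words:
        prod_words *= tf_w / tf_pos
    return log(prob_c / prod_words)

# A bigram whose words always co-occur scores higher than chance predicts
score = pmi_phrase(tf_c=10, tf_words=[10, 10], tf_pos=100)
```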
Take the top-k phrases with the highest scores
Eliminate duplicates: two different phrases may carry similar meanings
Remove phrases that are
▪ subsumed by another phrase with a higher score
▪ different from a better phrase only in the last term
▪ and other rules …
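The first two pruning rules can be sketched as follows; `refine` is an illustrative name, and the remaining rules are omitted.

```python
def refine(scored_phrases):
    # Walk phrases in descending score order, dropping each one that is
    # subsumed by, or differs only in the last term from, a kept phrase
    kept = []
    for score, phrase in sorted(scored_phrases, reverse=True):
        words = phrase.split()
        duplicate = any(
            phrase in better  # subsumed by a higher-scoring phrase
            or (len(words) > 1 and words[:-1] == better.split()[:-1])
            for _, better in kept)
        if not duplicate:
            kept.append((score, phrase))
    return [p for _, p in kept]

result = refine([(3, "global warming"), (2, "warming"),
                 (1, "global warming trend")])
# result == ["global warming", "global warming trend"]
```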
Motivation: the user may also be interested in web documents related to the given one that do not contain the same key phrases
Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents
Main idea:
Obtain an initial phrase set with QBD
Use Wikipedia knowledge to identify phrases related to the initial phrases
Our method follows the spreading-activation framework
Given an initial phrase set:
Locate the nodes corresponding to these phrases on the Wiki graph
Assign weights to these nodes
Iteratively spread node weights to neighbors
▪ Assume the random-surfer model
▪ With a certain probability, return to one of the initial nodes
S is the initial phrase set; initial weights are normalized; s(c_v) is the score of c_v, assigned by QBD

RR^0(v) = \frac{s(c_v)}{\sum_{v' \in S} s(c_{v'})} if v \in S, and 0 otherwise
              Wii   Sony  Nintendo  Play Station  Tomb Raider
Wii           0     2/10  7/10      1/10          0
Sony          0     0     0         4/4           0
Nintendo      5/6   1/6   0         0             0
Play Station  2/11  6/11  1/11      0             2/11
Tomb Raider   0     0     0         1/1           0
T[v, v'] = \frac{wt_e}{\sum_{e' = [v, w] \in E} wt_{e'}} if e = [v, v'] \in E, and 0 otherwise
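Row-normalizing the edge weights, as in the Wii row of the example (weights 2, 7, 1 become 2/10, 7/10, 1/10), can be sketched as follows; the edge set below covers only part of the example graph.

```python
def transition_matrix(edges, nodes):
    # T[v][w] = wt(v, w) / total weight of v's outgoing edges; 0 if no edge
    T = {v: {w: 0.0 for w in nodes} for v in nodes}
    out_weight = {v: 0.0 for v in nodes}
    for (v, w), wt in edges.items():
        out_weight[v] += wt
    for (v, w), wt in edges.items():
        T[v][w] = wt / out_weight[v]
    return T

nodes = ["Wii", "Sony", "Nintendo", "Play Station"]
edges = {("Wii", "Sony"): 2, ("Wii", "Nintendo"): 7,
         ("Wii", "Play Station"): 1, ("Nintendo", "Wii"): 5,
         ("Nintendo", "Sony"): 1}
T = transition_matrix(edges, nodes)
# T["Wii"]["Nintendo"] == 7/10
```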
With probability α_{v'}, proceed to a neighbor; otherwise, return to one of the initial nodes
α_{v'} is a function of the node v'
RR^{i+1}(v) = \sum_{e=[v',v] \in E} \alpha_{v'} \, T[v', v] \, RR^i(v') + RR^0(v) \sum_{v'} (1 - \alpha_{v'}) \, RR^i(v')
α_v is not a constant, unlike in other algorithms (e.g., TrustRank)
α_v gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones
This reduces the CPU overhead of RelevanceRank computation, since only a subset of nodes is considered
Important, as RelevanceRank is calculated online
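A sketch of the iteration under one reading of the random-surfer update: with probability α_{v'} the surfer follows an outgoing edge, otherwise it returns to an initial node in proportion to RR^0. The two-node graph and the α values below are invented for illustration.

```python
def relevance_rank(T, rr0, alpha, iters=50):
    # T: row-normalized transition matrix; rr0: normalized initial weights
    rr = dict(rr0)
    for _ in range(iters):
        pushed = {v: 0.0 for v in T}
        returned = 0.0
        for v_prev, mass in rr.items():
            # Mass that jumps back to the initial nodes this round
            returned += (1 - alpha[v_prev]) * mass
            # Mass that follows outgoing edges of v_prev
            for v, t in T[v_prev].items():
                pushed[v] += alpha[v_prev] * t * mass
        rr = {v: pushed[v] + rr0[v] * returned for v in T}
    return rr

T = {"A": {"A": 0.0, "B": 1.0}, "B": {"A": 1.0, "B": 0.0}}
rr = relevance_rank(T, rr0={"A": 1.0, "B": 0.0},
                    alpha={"A": 0.5, "B": 0.5})
```

Total mass is conserved each round (rows of T sum to 1, RR^0 sums to 1), so the scores converge to a fixed point, mirroring the iteration table below.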
Iteration  Wii   Sony  Nintendo  Play Station
0          0     0     1         0
1          0.67  0.13  0.1       0
2          0.13  0.06  0.74      0.06
3          0.49  0.11  0.38      0.02
4          0.25  0.08  0.62      0.05
5          0.41  0.10  0.46      0.03
…          …     …     …         …
Infinite   0.35  0.09  0.52      0.03
Methodology: employ human annotators at Amazon Mechanical Turk
Dataset: a random sample of news articles from the New York Times, the Economist, Reuters, and the Financial Times during Aug–Sep 2007
Competitors for phrase extraction:
QBD-TFIDF (tf-idf scoring)
QBD-MI (mutual information scoring)
QBD-YAHOO (Yahoo! phrase extractor)
Quality of Phrase Retrieval
Quality of Document Retrieval
Efficiency
The total running time of QBD is negligible
l_max  Time (seconds)
1      0.160
2      1.142
3      10.262
4      57.915
5      143.828
We propose the query-by-document model:
two effective phrase extraction algorithms
enhancing the phrase set with the Wikipedia graph
Future work:
more sophisticated phrase extraction (e.g., with additional background knowledge)
blog matching using key phrases