Behavior-driven clustering of queries into topics
Behavior-driven clustering of queries into topics. Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer. CIKM 2011, Glasgow. User profiling in search engines. Granularity levels: query, session, goal, mission, topic. Concise representation. Aggregation. Meaningful semantics.
CIKM 2011, Glasgow
Behavior-driven clustering of queries into topics
Luca Maria Aiello, Debora Donato, Umut Ozertem, Filippo Menczer
USER PROFILING IN SEARCH ENGINES
Granularity levels: Query, Session, Goal, Mission, Topic
• Aggregation
• Concise representation
• Meaningful semantics
27/10/2011
MISSIONS AND TOPICS
A topic is a mental object or cognitive content, i.e., the sum of what can be perceived, discovered or learned about any real or abstract entity.
A search mission is a set of queries that express a complex search need, possibly articulated into smaller goals.
QUERY STREAM DECOMPOSITION
• Queries in the same mission: same topic
• Queries in consecutive missions: different topic
Taxonomies
User behavior and intent
Donato et al.: Do you want to take notes? Identifying research missions in Y! search pad. WWW'10
MERGING MISSIONS
TOPIC DETECTOR STATS
• Gradient Boosted Decision Tree (GBDT)
• Aggregation (min, max, avg, std) of 62 query pair features
AUC 0.95
10-fold cross-validation on 500K pairs
Lexical features      | Behavioral features
Trigrams/terms cosine | Probability fwd
Common prefix/suffix  | Session total click avg
Length difference     | Session total time avg
…                     | …
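To make the aggregation step concrete, here is a minimal Python sketch that computes a handful of the lexical features named on the slide for every cross pair of two query sets and aggregates each one with min, max, avg and std. It is only an illustration under assumptions: the function names (`pair_features`, `aggregate`) are hypothetical, only five lexical features are shown rather than the 62 the slide mentions, the behavioral features are omitted because they need log data, and the exact feature definitions used by the authors are not given in the slides.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def char_trigrams(query):
    """Character trigrams of a query (empty list for queries shorter than 3 chars)."""
    q = query.lower()
    return [q[i:i + 3] for i in range(len(q) - 2)]


def cosine(a, b):
    """Cosine similarity between two Counters; 0.0 if either one is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def common_prefix_len(a, b):
    """Number of leading characters shared by the two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pair_features(q1, q2):
    """Assumed lexical features for one query pair (illustrative definitions)."""
    return {
        "trigram_cosine": cosine(Counter(char_trigrams(q1)), Counter(char_trigrams(q2))),
        "term_cosine": cosine(Counter(q1.split()), Counter(q2.split())),
        "common_prefix": common_prefix_len(q1, q2),
        "common_suffix": common_prefix_len(q1[::-1], q2[::-1]),
        "length_diff": abs(len(q1) - len(q2)),
    }


def aggregate(set_a, set_b):
    """Aggregate every pair feature over all cross pairs of two non-empty query sets."""
    rows = [pair_features(a, b) for a in set_a for b in set_b]
    out = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        out[f"{name}_min"] = min(values)
        out[f"{name}_max"] = max(values)
        out[f"{name}_avg"] = mean(values)
        out[f"{name}_std"] = pstdev(values)  # population standard deviation
    return out


features = aggregate(["cheap flights", "flights to rome"], ["cheap flights rome"])
```

The resulting fixed-length vector is the kind of input a classifier such as a GBDT could consume, regardless of how many queries each set contains.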
GREEDY AGGLOMERATIVE TOPIC EXTRACTION (GATE)
• Topic detector applied to pairs of query sets
• O(log|M|·|M|²) (heavily parallelizable)
1. Missions of the same user → supermissions
2. Query sets of different users → higher-level topics
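The greedy agglomerative idea can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: `jaccard_terms` is a hypothetical stand-in for the GBDT topic detector, the `threshold` value is arbitrary, and the simple pass-until-stable loop does not attempt to reproduce the O(log|M|·|M|²) scheme or the parallelization mentioned on the slide.

```python
def jaccard_terms(set_a, set_b):
    """Toy stand-in for the topic detector: term overlap between two query sets."""
    terms_a = {t for q in set_a for t in q.split()}
    terms_b = {t for q in set_b for t in q.split()}
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def greedy_merge(query_sets, score=jaccard_terms, threshold=0.3):
    """Greedily merge query sets whose pairwise score reaches the threshold.

    Each pass merges a set into the first already-kept set it matches;
    passes repeat until no merge happens, so the result is stable.
    """
    clusters = [frozenset(s) for s in query_sets]
    merged = True
    while merged:
        merged = False
        kept = []
        for cluster in clusters:
            for i, other in enumerate(kept):
                if score(cluster, other) >= threshold:
                    kept[i] = other | cluster  # greedy: take the first match
                    merged = True
                    break
            else:
                kept.append(cluster)
        clusters = kept
    return clusters


missions = [
    {"cheap flights rome", "rome flights"},
    {"rome hotels", "hotels rome center"},
    {"python tutorial"},
    {"flights rome"},
]
topics = greedy_merge(missions)
```

Under this sketch the two stages on the slide would correspond to calling `greedy_merge` first on the missions of each single user (yielding supermissions) and then on the pooled query sets of different users (yielding higher-level topics).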
EVALUATION
40K users
3 months Y! log
EVALUATION: BASELINE
• OSLOM community detection algorithm
– Weighted undirected graph
– Maximizing local fitness function of clusters
– Automatic hierarchy detection
Lancichinetti et al.: Finding statistically significant communities in networks. PLoS ONE, 2011.
2URL cover graph
EVALUATION: QUERY SET COVERAGE
Fraction of queries considered in the clustering phase
URL cover graph connected components size distribution
GATE: 1, OSLOM: 0.2
EVALUATION: SINGLETON RATIO
Fraction of queries that remain isolated in singleton clusters
GATE: 0.55-0.27, OSLOM: 0.88
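Both query set coverage and singleton ratio are simple fractions over a clustering. The following Python sketch shows one way to compute them, assuming a clustering is represented as a list of disjoint sets of query strings. The function names are hypothetical and the representation is an assumption; the slides do not specify how the measurements were implemented.

```python
def query_set_coverage(all_queries, clusters):
    """Fraction of the distinct queries that appear in at least one cluster."""
    universe = set(all_queries)
    clustered = set()
    for cluster in clusters:
        clustered |= set(cluster)
    return len(clustered & universe) / len(universe) if universe else 0.0


def singleton_ratio(clusters):
    """Fraction of clustered queries that sit alone in a cluster of size one.

    Assumes the clusters are disjoint, so each query is counted once.
    """
    total = sum(len(c) for c in clusters)
    singles = sum(1 for c in clusters if len(c) == 1)
    return singles / total if total else 0.0


clusters = [{"rome flights", "flights rome"}, {"python tutorial"}, {"weather"}]
queries = ["rome flights", "flights rome", "python tutorial", "weather", "news"]
coverage = query_set_coverage(queries, clusters)   # 4 of 5 queries are clustered
isolated = singleton_ratio(clusters)               # 2 of 4 clustered queries are alone
```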
EVALUATION: AGGREGATION ABILITY
Topics aggregated in two consecutive steps or levels
GATE: 500K, OSLOM: 100K
EVALUATION: PURITY vs. COVERAGE
• Coverage
– Number of unique clicked URLs for the query
• Purity
– Average pointwise mutual information of pairs of query-related relevant terms
• Relevant terms are extracted from top clicked results using a predefined dictionary
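As an illustration of the purity measure, the Python sketch below averages the pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), over all pairs of relevant terms. How the probabilities were estimated is not stated in the slides, so this sketch assumes document-level co-occurrence counts over a small hypothetical corpus `docs` (a list of term sets); the function names are also hypothetical.

```python
import math
from itertools import combinations


def pmi(x, y, docs):
    """PMI of two terms, estimated from document-level co-occurrence.

    Returns -inf when the two terms never co-occur in the corpus.
    """
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


def purity(terms, docs):
    """Average PMI over all unordered pairs of distinct relevant terms."""
    pairs = list(combinations(sorted(set(terms)), 2))
    if not pairs:
        return 0.0  # fewer than two distinct terms: nothing to average
    return sum(pmi(x, y, docs) for x, y in pairs) / len(pairs)


docs = [
    {"rome", "flight", "hotel"},
    {"rome", "flight"},
    {"python", "code"},
    {"python", "tutorial"},
]
score = purity(["rome", "flight", "hotel"], docs)
```

A single pair that never co-occurs drives the average to negative infinity in this sketch; a practical variant would need smoothing or a floor value, which is a design choice the slides do not address.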
USER PROFILING
USER PROFILING FROM TOPICS
[Figure labels: Topic detector, Missions, Topics, User topical profile. Values shown: 0.0, 0.0, 0.0, 0.7; 2.9, 3.2, 1.9; 0.35, 0.41, 0.24]
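One plausible reading of the profiling step is that per-mission topic scores are accumulated for a user and then normalized into a distribution over topics. The Python sketch below illustrates that reading only. It is an assumption: the slides do not state how the scores are combined or normalized, and `topical_profile` and its input format (one dict of topic scores per mission) are hypothetical.

```python
def topical_profile(mission_topic_scores):
    """Sum per-mission topic scores and normalize them so they add up to 1.

    mission_topic_scores: list of dicts mapping topic name -> detector score,
    one dict per mission of the user (assumed input format).
    """
    totals = {}
    for scores in mission_topic_scores:
        for topic, value in scores.items():
            totals[topic] = totals.get(topic, 0.0) + value
    norm = sum(totals.values())
    if norm == 0:
        return totals  # no evidence for any topic: leave the zeros as they are
    return {topic: value / norm for topic, value in totals.items()}


profile = topical_profile([
    {"travel": 1.0, "sports": 0.0},
    {"travel": 0.5, "sports": 0.5},
])
```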
PROFILES FOR “PREDICTION”
• Sequence of missions of the profiled user vs. sequence of a random one
• Sequence-profile match using topic detector
• Success: 0.65 (0.72 less frequent, 0.55 most frequent)
CONCLUSIONS
• New behavior-driven notion of topics
• Bottom-up topic extraction algorithm
• Favorable comparison with graph-based clustering
• Effective user profiling
• Other baselines
• More accurate predictions
ACKNOWLEDGMENTS
Fil Menczer, Prof. Informatics @ IU, Director CNetS @ IU
Umut Ozertem, Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Emre Velisapaoglu, Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale
Debora Donato, Yahoo! Search Sciences, Yahoo! Labs @ Sunnyvale