identifying and ranking topic clusters in the blogosphere
DESCRIPTION
Slides presented at the COLING 2010 workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources.
TRANSCRIPT
Identifying and Ranking Topic Clusters in the Blogosphere
Muhammad Atif Qureshi, Korea Advanced Institute of Science and Technology
Arjumand Younus, Korea Advanced Institute of Science and Technology
Muhammad Saeed, University of Karachi
Nasir Touheed, Institute of Business Administration
Outline
1 Introduction
2 Approach
3 Experiments and Results
4 Conclusions
1COLING 2010 CCSR WORKSHOP
Web 1.0 to Web 2.0
Paradigm shift
  From a read-only Web to a read-write Web
  Increased user participation
  User generated content
    Wikis (Wikipedia, Wiktionary)
    Social networking sites (Facebook, Myspace, Twitter)
    Digital media sharing websites (YouTube, Flickr)
    Blogs (Blogspot, Wordpress)
The Blogosphere
Blogs empower people to voice their opinions and share their ideas.
Bloggers can also link to other blogs, forming a social network of bloggers who share interests in the same topics.
How can we identify these topic clusters?
Who is the most influential blogger in a given cluster?
Problem Definition
Given the blogosphere with blogs containing diverse information on a broad range of topics:
  Find the cluster of blogs to read that have interest in some particular topic.
  Which blog holds the greatest influence for the particular topic?
Link Based Methods for the Blogosphere
Link based methods don’t work well for the blogosphere
  Weakly linked nature of blog pages
  Blog posts need some time to get in-links
  Bloggers try to exploit the link based methods by assuming role of spammers
Blog Communities vs. Topic Clusters
Blog community
  Discovered by following blog threads’ discussions
Topic clusters
  Role of blogs as conversational medium diminished
  Bloggers having interest in a specific topic form socially linked network with other bloggers writing about same topic
Blog Dimensions
Blog considered along three dimensions:
  Part of speech
  Occurrence
  Blog post number
Topic Discussion Isolation Rank
Metric used to discover the topic clusters
  Based on set of given topic words and some linguistic rules
We define the TDIR score of a blog as follows:
\[
\mathrm{TDIR} = \frac{1}{\text{Number of total posts}} \times \left[ (n_{noun} \times w_{n}) + (n_{adjective} \times w_{adj}) + (n_{adverb} \times w_{adv}) \right]
\]
n_noun, n_adjective and n_adverb are, respectively, the number of times a noun, adjective or adverb for a specific topic is found in all the blog posts
w_n, w_adj and w_adv are the respective weights assigned to the noun, adjective and adverb for a specific topic
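The TDIR definition above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the `TopicCounts` container, the `tdir` function name, the default weights and the example counts are all assumptions, since the slides do not state the weight values that were used.

```python
from dataclasses import dataclass


@dataclass
class TopicCounts:
    """Topic-word matches summed over all posts of one blog (hypothetical container)."""
    nouns: int
    adjectives: int
    adverbs: int


def tdir(counts, total_posts, w_n=1.0, w_adj=0.5, w_adv=0.25):
    """Weighted topic-word matches per post, following the TDIR definition.

    The default weights are placeholders chosen for illustration only.
    """
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    weighted = (counts.nouns * w_n
                + counts.adjectives * w_adj
                + counts.adverbs * w_adv)
    return weighted / total_posts


# Made-up counts for a blog with 20 posts.
score = tdir(TopicCounts(nouns=30, adjectives=8, adverbs=4), total_posts=20)
print(score)
```

Dividing by the number of posts keeps a blog with many posts from scoring highly on volume alone.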
Topic Discussion Rank
Metric used to rank the blogs within a topic cluster
  Based on hyperlinked social network of blogs and blog post contents
We define the TDR score of a blog as follows:
\[
\mathrm{TDR}[b] =
\begin{cases}
\mathrm{TDIR}[b], & \text{if zero outlinks from blog } b \\
\mathrm{TDIR}[b] + \dfrac{\text{Matching\_Outlinks}}{\text{Total\_Outlinks}} \times \sum_{o:(o,b)} \mathrm{TDIR}[o] \times \mathrm{damp}, & \text{otherwise}
\end{cases}
\]
Matching_Outlinks represents outlinks to blogs that are part of the topic cluster
o : (o,b) denotes the outlinks from blog b
damp is the damping factor
Role of Damping Factor
Assume TDIR of blog A is 2 and TDIR of blog B is 1
TDR without damping factor
  A: 2 + (1/1 x 1) = 3
  B: 1 + (1/1 x 2) = 3
TDR with damping factor
  A: 2 + (1/1 x 1 x 0.9) = 2.9
  B: 1 + (1/1 x 2 x 0.9) = 2.8
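The arithmetic on this slide can be reproduced with a small Python sketch of the TDR definition. Several things here are assumptions rather than facts from the slides: that blogs A and B each have exactly one outlink, pointing at the other (which is what the 1/1 ratio suggests), that a blog counts as part of the topic cluster when it has a TDIR entry, and that the sum runs over the matching outlinks only. The function and variable names are hypothetical.

```python
def tdr(blog, tdir_scores, outlinks, damp=1.0):
    """TDR of `blog`: its own TDIR plus a damped share of linked blogs' TDIR.

    tdir_scores: dict mapping blogs in the topic cluster to their TDIR (assumed layout).
    outlinks:    dict mapping each blog to the list of blogs it links to.
    """
    own = tdir_scores[blog]
    links = outlinks.get(blog, [])
    if not links:
        # Zero outlinks: TDR is just the blog's own TDIR.
        return own
    # Assumption: an outlink "matches" when its target is in the cluster.
    matching = [o for o in links if o in tdir_scores]
    fraction = len(matching) / len(links)
    return own + fraction * sum(tdir_scores[o] for o in matching) * damp


tdir_scores = {"A": 2.0, "B": 1.0}
outlinks = {"A": ["B"], "B": ["A"]}

undamped = {b: tdr(b, tdir_scores, outlinks, damp=1.0) for b in tdir_scores}
damped = {b: tdr(b, tdir_scores, outlinks, damp=0.9) for b in tdir_scores}
print(undamped)
print(damped)
```

Without damping the two blogs tie, while with damp = 0.9 blog A, which has the higher TDIR of its own, ranks above blog B.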
Experimental Setup
Experimental data
  Real blog data collected during crawling of blogspot domain
  102 blog sites comprising 50,471 blog posts
Experimental topics
  “compute”, “democracy”, “secularism”, “bioinformatics”, “Haiti”, “Obama”
Experimental Measures
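Precision
\[
\text{Precision} = \frac{|C_a \cap C_t|}{|C_a|}
\]
Recall
\[
\text{Recall} = \frac{|C_a \cap C_t|}{|C_t|}
\]
C_a represents the topic cluster set found using our algorithm
C_t represents the true topic cluster set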
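As a minimal sketch, both measures can be computed with Python set operations. The blog identifiers below are made up for illustration and are not taken from the experimental data; the function name and the convention of returning 0.0 for an empty set are also assumptions.

```python
def precision_recall(found, truth):
    """Precision and recall of a found cluster set against the true cluster set."""
    overlap = len(found & truth)
    # Assumption: report 0.0 rather than fail when a set is empty.
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(truth) if truth else 0.0
    return precision, recall


c_a = {"blog1", "blog2", "blog3", "blog4"}  # cluster found by the algorithm
c_t = {"blog1", "blog2", "blog3"}           # true topic cluster
p, r = precision_recall(c_a, c_t)
print(p, r)
```

In this toy case the found cluster contains every true member plus one extra blog, so recall is perfect while precision drops.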
Experimental Results - Precision
Average precision found to be 0.87
Experimental Results - Recall
Average recall found to be 0.971
Conclusions
This work presents the concept of “topic clusters” to solve the blog categorization problem for the Information Retrieval domain.
The proposed method takes into account both blog posts’ content and link structure.
Natural language processing techniques incorporated into the method ensure high coverage.
The method was evaluated using a real-world dataset from the blogspot domain.
Appendix
Additional Experiments
Experiment on topic “Obama” repeated with additional term “Democrats”
  Precision increased from 0.907 to 0.95
  Ranks of some blogs higher than ranks obtained previously
Two more experiments on fine-grained topics
  Healthcare bill: Precision was found to be 0.857 and recall obtained was 1; additional term “obamacare” was used
  Avatar: Precision was found to be 0.47 and recall obtained was 1; additional terms had no effect