UMBC, an Honors University in Maryland
Characterizing the Splogosphere
Tim Finin
http://ebiquity.umbc.edu/paper/html/id/299/
Pranam Kolari, Akshay Java and Tim Finin
University of Maryland, Baltimore County
3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
22 May 2006
Outline
• Introduction
• Motivation
• BlogPulse Dataset
• Weblogs.com Dataset
• Implications
The Blogosphere
• 57% of online US teens generate content, 40% read blogs, 20% have them! (Pew, Nov. 2005)
• 53% of companies are blogging (Guideware Oct. 2005)
• MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!
• But … the Blogosphere is awash in spam
Source: Wikipedia
Blogosphere/Splogosphere
Spam in the Blogosphere
• Types: comment spam, ping spam, spam blogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites”
• “Spings, or ping spam, are pings that are sent from spam blogs”
1 Wikipedia
Motivation: host ads
Motivation: index affiliates, promote PageRank
Spings from weblogs.com
Where do Splogs come from?
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”
$197
Our splog bait was picked up and used by dozens of sploggers
Our feed is RSSjacked by at least one splogger
Why are splogs a problem?
• Splogs undermine ranking algorithms
• Splogs water down search results
• Splogs threaten the Web advertising model
• Splogs indulge in “plagiarism”
• Splogs skew results of market research tools
• Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.
Splog Detection
• SVM based probabilistic splog detection (Kolari et al., 2006)
• Hand verified training set of blogs and splogs
• Precision/Recall of 87%
• Bag-of-words based feature using text on blog home-page, O(x)
• Some additional local features
P( x is a splog | O(x) )
P( x is a blog | O(x) )
Top features
blogs: we, what, was, my, org, flickr, paper, 600, open, words, weblog, motion, me, thank, go, january, trackback, archives, now, political
splogs: find, info, news, your, 27, another, website, best, articles, on, perfect, products, uncategorized, 280, hot, resources, inc, 60, three, copyright
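As a rough illustration of the idea on this slide, the sketch below builds a binary bag-of-words O(x) from home-page text and turns a linear decision score into a value that can be read as P( x is a splog | O(x) ). It is not the trained SVM from Kolari et al.: the weights, the bias, the sigmoid mapping and the two sample pages are all made up for the example, and only a handful of the top-feature words are used.

```python
import math
import re

# Hypothetical weights, loosely inspired by the top-feature lists.
# Positive values pull towards "splog", negative values towards "blog".
# A real model would learn these from the hand-verified training set.
WEIGHTS = {
    "find": 1.0, "info": 1.0, "products": 1.2, "best": 0.8, "copyright": 0.6,
    "we": -0.8, "my": -1.0, "me": -0.9, "trackback": -1.1, "thank": -0.7,
}
BIAS = 0.0


def bag_of_words(text):
    """Binary bag-of-words O(x): the set of lowercase tokens on the page."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def splog_probability(text):
    """Squash a linear decision score into (0, 1) with a sigmoid."""
    features = bag_of_words(text)
    score = BIAS + sum(WEIGHTS.get(token, 0.0) for token in features)
    return 1.0 / (1.0 + math.exp(-score))


# Invented sample pages, not taken from either dataset.
blog_page = "Thank you for the trackback, we updated my archives."
splog_page = "Find the best info on perfect products. Copyright inc."

p_blog = splog_probability(blog_page)
p_splog = splog_probability(splog_page)
print(f"blog page:  {p_blog:.3f}")
print(f"splog page: {p_splog:.3f}")
```

Using set membership rather than term counts keeps the example close to a simple word-presence model; whether the original features were binary or weighted is not stated on the slide.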
BlogPulse Dataset
• 21 days of July 2005
• 1.3 million blogs
• Eliminated LiveJournal
• Re-fetched blog home-pages; many spam blogs were non-existent since spam blogs are short lived
• Arrived at 500K samples
• Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
• Identified 27K splogs
• Sampled for 27K authentic blogs
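The two-threshold labeling step can be sketched as follows. The 0.2 and 0.8 cut-offs come from the slide; treating the boundaries as inclusive, and leaving everything in between unlabeled, are assumptions made for this example, as are the placeholder URLs and scores.

```python
AUTHENTIC_MAX = 0.2  # from the slide: threshold for authentic blogs
SPLOG_MIN = 0.8      # from the slide: threshold for splogs


def label(p_splog):
    """Label a page from its estimated P(splog); the middle band stays unlabeled."""
    if p_splog >= SPLOG_MIN:
        return "splog"
    if p_splog <= AUTHENTIC_MAX:
        return "blog"
    return "uncertain"


def split_samples(scored):
    """Partition (url, probability) pairs into confident splogs and blogs."""
    splogs = [url for url, p in scored if label(p) == "splog"]
    blogs = [url for url, p in scored if label(p) == "blog"]
    return splogs, blogs


# Placeholder URLs and scores, for illustration only.
scored = [
    ("http://a.example", 0.05),
    ("http://b.example", 0.50),
    ("http://c.example", 0.93),
]
splogs, blogs = split_samples(scored)
print(splogs, blogs)
```

Keeping a wide gap between the two thresholds trades coverage for cleaner labels: only pages the classifier is confident about end up in either sample.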
Splogs vs. Blogs – Word Count
[Figure: word count, shown for blogs and splogs]
Splogs vs. Blogs – In-degree
Top 5 blogs
url count
http://www.engadget.com 1942
http://www.huffingtonpost.com/theblog 905
http://www.crooksandliars.com 637
http://blogs.guardian.co.uk/news 616
http://www.littlegreenfootballs.com/weblog 611
Top 5 splogs
url count
http://spaces.msn.com/members/pony-girl 505
http://spaces.msn.com/members/black-puss 505
http://spaces.msn.com/members/amputee-women 505
http://spaces.msn.com/members/free-stories 505
http://spaces.msn.com/members/first-time-girl 505
Splogs vs. Blogs – Out-degree
Top 5 blogs
url count
http://www.xanga.com/home.aspx?user=hit_me_layoutz 273
http://www.xanga.com/home.aspx?user=i_jock_layouts 271
http://www.xanga.com/home.aspx?user=slp_layouts_slp 198
http://spaces.msn.com/members/cyrustse1986 193
http://www.xanga.com/home.aspx?user=layouts_n_codes2005 180
Top 5 splogs
url count
http://worldseriesofpokerchipscardguard.blogspot.com 898
http://rule-wsop.blogspot.com 898
http://worldseries-ofpoler.blogspot.com 898
http://qsopcom-1.blogspot.com 898
http://weopcom.blogspot.com 898
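In-degree and out-degree rankings like the ones above can be derived from a list of (source, target) links. The following is a minimal sketch under assumptions: the edge list is toy data with placeholder URLs, and counting each distinct link only once is a choice made here, not something the slides specify.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters from (source, target) link pairs."""
    unique_edges = set(edges)  # count each distinct link once
    out_degree = Counter(src for src, _ in unique_edges)
    in_degree = Counter(dst for _, dst in unique_edges)
    return in_degree, out_degree


# Toy link data; the URLs are placeholders, not from the dataset.
edges = [
    ("http://a.example", "http://hub.example"),
    ("http://b.example", "http://hub.example"),
    ("http://c.example", "http://hub.example"),
    ("http://a.example", "http://b.example"),
    ("http://a.example", "http://b.example"),  # duplicate link, counted once
]

in_degree, out_degree = degree_counts(edges)
top_in = in_degree.most_common(1)
top_out = out_degree.most_common(1)
print(top_in, top_out)
```

Groups of pages that share an identical degree, such as the splog rows with equal counts, would show up here as ties at the top of `most_common`.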
Weblogs.com Dataset
• 20 Nov 2005 – 11 Dec 2005
• 16 million update pings
• Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
• Heuristics to identify Japanese, Chinese, Korean
• Set threshold of 0.5 to separate out authentic blogs from splogs.
1 Thanks to James Mayfield, JHU APL
Ping times – Italian Blogs
Sping vs. Ping times
Spings vs. Pings: frequency
Blogs vs. their ping frequency follows a power law, but splogs vs. spings does not
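One rough way to look for power-law behaviour in ping data is to count how many URLs sent exactly k pings and fit a straight line on log-log axes. The sketch below does that on synthetic data built to follow count = 64 · k^-2; the data, the URLs and the helper names are invented for the example. A least-squares fit on log-log axes is only a quick diagnostic, not a rigorous power-law test, and it is not claimed to be the analysis behind the slide.

```python
import math
from collections import Counter


def frequency_distribution(ping_urls):
    """Map k -> number of URLs that sent exactly k pings."""
    pings_per_url = Counter(ping_urls)
    return Counter(pings_per_url.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(k).

    Needs at least two distinct values of k.
    """
    xs = [math.log(k) for k in distribution]
    ys = [math.log(n) for n in distribution.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic ping log: 64 URLs ping once, 16 twice, 4 four times, 1 eight times.
synthetic = {1: 64, 2: 16, 4: 4, 8: 1}
ping_urls = []
url_id = 0
for k, n_urls in synthetic.items():
    for _ in range(n_urls):
        ping_urls.extend([f"http://blog{url_id}.example"] * k)
        url_id += 1

distribution = frequency_distribution(ping_urls)
slope = loglog_slope(distribution)
print(f"log-log slope: {slope:.2f}")
```

A distribution that does not follow a power law, as the slide reports for spings, would typically show points that bend away from any single straight line on the same axes.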
• Close to 40% spings
• Among English blogs
– 75% pings are spings
– Authentic blogs are 13% of all pings
• Including Info domain
– 50% of all pings are spings
url count
http://www.wiccapaganblog.com 1491
http://www.freecancerfacts.com/wp 1452
http://www.myaquariumiplace.com 1375
http://www.criss-angel.biz 1215
http://www.microdermabrasion-secrets.com 1211
http://www.tipstohealth.com/blog 1207
http://www.countrymusicdigest.com 1191
All Pings – 16 Million
Implications (1)
• BlogPulse dataset
– Local word models most effective for fast splog detection
– If splogs escape filters, in-link and out-link distribution point to link-based classification
• Weblogs.com dataset
– Ping frequency can be useful
– Splogs probably not a big problem in most European languages. Yet.
• The nature of the domain points to spam filters employing a multi-step and adaptive approach, which we are currently pursuing
Implications (2) – Filter Design
[Diagram: Spam Blog Filter with four numbered stages (1 2 3 4). Components: Heuristics, Language Identifiers, Blog Identifier, Spam Blog Detectors, IP Blacklists. Outputs: Authentic Blogs, Spam Blogs. Supporting Info (OPTIONAL).]
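A multi-step filter of this kind can be sketched as a chain of stages, where each stage either decides or passes the ping on. The stage order, the stage logic, the `Ping` record and the verdict names below are all assumptions for illustration: the diagram does not survive well enough to recover the actual order, and the checks here are trivial stand-ins for real language identifiers and learned detectors.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Ping:
    url: str
    ip: str
    text: str


# A stage returns "blog", "splog", "drop", or None (no decision, pass it on).
Stage = Callable[[Ping], Optional[str]]

BLACKLISTED_IPS = {"203.0.113.7"}  # placeholder address from a documentation range


def language_stage(ping: Ping) -> Optional[str]:
    # Placeholder check: a real system would use a language identifier.
    return None if ping.text.isascii() else "drop"


def blacklist_stage(ping: Ping) -> Optional[str]:
    return "splog" if ping.ip in BLACKLISTED_IPS else None


def detector_stage(ping: Ping) -> Optional[str]:
    # Stand-in for a learned splog detector; the keyword rule is made up.
    return "splog" if "best products" in ping.text.lower() else "blog"


def run_filter(ping: Ping, stages: Sequence[Stage]) -> str:
    """Apply stages in order and stop at the first one that decides."""
    for stage in stages:
        verdict = stage(ping)
        if verdict is not None:
            return verdict
    return "blog"


stages = [language_stage, blacklist_stage, detector_stage]
pings = [
    Ping("http://a.example", "198.51.100.1", "My weekend notes"),
    Ping("http://b.example", "203.0.113.7", "Hello"),
    Ping("http://c.example", "198.51.100.2", "Find the BEST products here"),
]
verdicts = [run_filter(p, stages) for p in pings]
print(verdicts)
```

Putting cheap checks first means the more expensive detector only runs on pings that the earlier stages could not settle, and individual stages can be swapped or retrained without touching the rest of the chain.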
Conclusions
• Blog spam is a serious problem
– Classic arms race, e.g., increased plagiarism, feedjacking
• Blog spam identification requires different tactics than used for email and Web spam
– Local features effective, but not sufficient
– Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities) but dynamism reduces effectiveness of analysis
• Getting good training sets expensive, especially in a multilingual environment.
– Minute or more a judgment
• Good opportunities for infrastructure insertion, e.g., sping free ping servers
For more information
http://ebiquity.umbc.edu/
Annotated in OWL
Questions?
Blogs – A Specialized Domain
[Diagram with numbered steps 1 to 4. Labels: Update Pings, Ping Stream, Update Stream, Fetch Content.]