1/20 svms for the blogosphere: blog identification and splog detection pranam kolari | tim finin |...

20
1/20 SVMs for the SVMs for the Blogosphere: Blog Blogosphere: Blog Identification Identification and Splog Detection and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County AAAI Spring 2006 Symposia on Computational Approaches to Analysing Weblogs

Upload: arline-powell

Post on 31-Dec-2015

217 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

1/20

SVMs for the Blogosphere: SVMs for the Blogosphere: Blog Identification Blog Identification

and Splog Detectionand Splog Detection

Pranam Kolari | Tim Finin | Anupam JoshiUniversity of Maryland, Baltimore County

AAAI Spring 2006 Symposia onComputational Approaches to Analysing Weblogs

Page 2: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

2/20

Abstract

• BlogosphereBlogosphere becomes popular becomes popular

• Key requirement for a blog search engine– Crawl the web– Index blogs only (Blog Identifications)– Remove Spam Blogs (splogs)

• This paper– Incorporate SVM– Introduce new features

Page 3: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

3/20

Introduction

• Highly dynamic and dated content

• Blog search engine Capabilities– Recognize blog sites– Understand the structure– Identify the constituent part– Extract relevant data– Detect and eliminate Splogs

Page 4: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

4/20

Introduction (cont.)

• blog search engines identify blogs and index content based on– Update pings received from ping servers

會吐一種叫 changes.xml– Directly from blogs, or through crawling blog

directories and blog hosting services– Continue to crawl the Web to discover,

identify and index blogs T– The source of the pings need to be verified as

a blog prior to indexing content

Page 5: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

5/20

Data Collection

4 種部落格資料來源• Processing streams of pings from blog ping

servers containing URLs of new posts• Crawling directories which allow robot indexing

of their site content• Using developer API’s provided by popular blog

search engines• Regular monitoring existing ping update services

and blog hosting services (“changes.xml”) for republished pings

Page 6: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

6/20

抓 BLOG 的 API• http://www.technorati.com/developers/• 隨機從字典選詞 query 了四個月得到 50 萬筆來篩選

• Blog ping update services– http://weblogs.com/

Page 7: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

7/20

Preliminaries 分兩類 正面的• Wikipedia Definition

– “A blog, originally (and still) known as a weblog, is a web-based publication consisting primarily of periodic articles (normally, but not always, in reverse chronological order).”• a blog post be dated• a blog post either provide one of comments or a

trackback, or be uniquely addressable through permalinks

• not be a forum or traditional news website (edited content)

1. In addition to a home page listing the most recent blogs, a blog also consists of other pages for category, author specific and individual blog posts

Page 8: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

8/20

名詞補充• Permalink - permanent link

– 讓 URL 看起來更漂亮更易懂 – http://ying.homedns.org/wp/archives/2005/03/26/20/53/24/– 伺服器要有這個功能需 URL rewrite 的模組

• Trackback– 寫一項 blog 時伺服器會通知另一伺服器你寫了一篇她可能有興趣

的 blog• 在人家的 blog ,想作出詳細回應。久了也難找回。寫在自己

的 ,對方或對方 blog 讀者很難知道• 在自己 blog 站貼完,再把自已 blog 的連結及簡短內容貼在

人家 blog 站的回應上• 見對方 blog 有個 trackback url 時,抄下這網址 。當你寫

blog 回應該文時,將那網址填入你 blog 軟件的 trackback url 欄

• 聯播檔案 (Syndication feeds)– RSS 用來交換內容的格式 , 提供外界定期下載 , 1.0, 2.0

Page 9: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

9/20

Splog 負面的• Web Spam

– Any deliberate human action meant to trigger an unjustifiably favorable relevance or importance for some page, considering the page’s true value

• Splog is any such web spam page that is a blog– link spam blogs and fake blogs

最後篩選出 2600 個 BLOG 首頁

從內部連結找到 2100 子頁

700 個 Splog

Page 10: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

10/20

Using SVMs

• identification of both blog home pages and all blog pages (e.g. category page, user page, post page)

• many blogs can be identified by simple URL pattern matching

• identification problem for the subset of blogosphere consisting of self-hosted blogs and the many less popular hosting services and compare them against human baselines

大部落格網站的不是這邊要對付的

Page 11: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

11/20

Baseline

• Simple Heuristic– 例如只看 Meta 只看關鍵字決定是否為 Blog

• http://antisplog.net/ - 66%

Page 12: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

12/20

HTML 補充 (1)• 連接到一個外部的樣式表• 一個外部的樣式表可以通過 HTML 的 LINK 元素連接到 HTML

文檔中

<LINK REL=StyleSheet HREF="style.css" TYPE="text/css" MEDIA=screen><LINK REL=StyleSheet HREF="color-8b.css" TYPE="text/css" TITLE="8-bit

Color Style" MEDIA="screen, print"><LINK REL="Alternate StyleSheet" HREF="color-24b.css" TYPE="text/css"

TITLE="24-bit Color Style" MEDIA="screen, print">

• <LINK> 標記是放置在文檔的 HEAD 部分。可選的 TYPE 屬性用於指定媒體類型

• text/css 是一個層叠樣式表 -- 允許瀏覽器忽略它們不支持的樣式表類型。爲 CSS 文件配置服務器而將 text/css 當作Content-type 內容發送出去也是一個好注意

Page 13: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

13/20

HTML 補充 (2) META• <META NAME="keywords" CONTENT="key,word,keyword, 關鍵字 ">

告訴搜尋引擎這篇網頁的關鍵字。

• <META NAME="description" CONTENT=" 好棒的網頁啊! ">告訴搜尋引擎這篇網頁的內容或摘要。

• <META NAME="generator" CONTENT="Notepad">告訴搜尋引擎這篇網頁是用什麼軟體製作的。

• <META NAME=“author” CONTENT=“ 洋蔥 ">

• <META NAME=“copyright” CONTENT=“ 著作權屬洋蔥所有 ">

• <META NAME="expires" CONTENT="31 December 2002">告訴搜尋引擎這篇網頁何時需要從登錄資料庫中刪除。

Page 14: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

14/20

SVM

• 處理分類問題 , 兩類 {1, -1}• 需要 training

x1 x2 x3 x4 x5

N1 -1 1 2 3 5 0.2

N2 1 0.8 3 7 0.2 9

Page 15: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

15/20

Feature Vector

• A set of features as { f1, f2 ..fm }

• Every document has a specific signature for these features represented by a weight as { w(f1), w(f2) ... w(fm) }

• Web page classification has required– Using different types of features (feature types)– Selecting the best (feature selection)– Encoding their values in different ways (feature

representation)

Page 16: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

16/20

Feature Types• Traditional Features

– Bag-of-words– Web tags

• Local and non-local links– Bag-of-anchors– Bag-of-links

• N-gram characters –(multilingual)

• Feature Representation– Normalized TF

[0, 1] in binary• Feature Selection

– Mutual Information

Feature types and their feature vector sizes

Page 17: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

17/20

BLOG 首頁偵測

Page 18: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

18/20

BLOG 子頁偵測

Page 19: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

19/20

SPAM 偵測

Page 20: 1/20 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari | Tim Finin | Anupam Joshi University of Maryland, Baltimore County

20/20

Discussion

• 錯誤分析– blogs in languages other than English

http://www.starfrosch.ch/starfrosch/– those not allow direct commenting on posts (only

using permalink)– ones that appeared to be only fuzzily classifiable (e.g.

http://www.standeatherage.com/)– blogs with no posts– non-post pages (such as author bios)– functional pages on Blog sites