detecting spammers and content promoters in online video social networks fabrício benevenuto ∗,...

DETECTING SPAMMERS AND CONTENT PROMOTERS IN ONLINE VIDEO SOCIAL NETWORKS

Fabrício Benevenuto∗, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, and Marcos

Gonçalves

Computer Science Department, Federal University of Minas Gerais Belo Horizonte, Brazil

(SIR’09)

Speaker : Yi-Ling Tai

Date : 2009/09/28

OUTLINE

Introduction User test collection Analyzing user behavior attributes Detecting spammer and promoters

Evaluation metrics Experimental setup Classification Reducing attribute set

Conclusions

INTRODUCTION

YouTube is the most popular Online video social networks.

It allows users to post a video as a response to a discussion topic.

These features open opportunities for users to introduce polluted content into the system.

Pollution – spread advertise to generate sales disseminate pornography compromise system reputation

INTRODUCTION

Users cannot easily identify the pollution before watching, which also consumes system resources, especially bandwidth.

This paper address the issue of detecting video spammers and promoters.

Spammers post an unrelated video as response to a popular

video topic to increase the likelihood of the response being viewed.

Promoters post a large number of responses to boost the

rank of the video topic.

INTRODUCTION

Toward this end - crawl a large user data set from YouTube. “manually” classified user as legitimate,

spammers and promoters. study attributes to distinguish different types of

polluters . use a supervised classification algorithm to

detect spammers and promoters.

USER TEST COLLECTION

A YouTube video is a responded video or a video topic if it has at least one video response.

A YouTube user is a responsive user if she has posted at least one video response.

A responded user is someone who posted at least one responded video.

Polluter is used to refer to either a spammer or a promoter.

CRAWLING YOUTUBE

User interactions can be represented by a video response user graph.G = (X, Y)

X is the union of all users who posted or received video responses.

(x1, x2) is a directed arc in Y, if user x1 has responded to a video contributed by user x2

To build the graph, this paper build a crawler that implements Algorithm 1.

CRAWLING YOUTUBE

The sampling starts from a set of 88 seeds, consisting of the owners of the top-100 most responded videos of all time.

The crawler follows links gathering information on a number of different attributes.

264,460 users 381,616 responded videos 701,950 video responses

CRAWLING YOUTUBE

BUILDING A TEST COLLECTION

The main goal is to study the patterns and characteristics of each class of users.

The collection should include the properties having a significant number of users of all three

categories including, but not restricting to large amounts of

pollution including a large number of legitimate users with

different behavior randomly sampling may not achieve these

properties.


three strategies for user selection1. different levels of interaction

Four groups of users based on their in and out-degrees

100 users were randomly selected from each group

2. Aiming at the test collection with polluters Browsed responses of top 100 most responded

videos, selecting suspect users.

3. randomly selected 300 users Who posted video responses to the top 100 most

responded videos To minimize a possible bias by strategy2


Each selected user was then manually classified.

Three volunteers analyzed all video responses of each user to classify her into one of categories.

Volunteers were instructed to favor legitimate users.

ANALYZING USER BEHAVIOR ATTRIBUTES

We considered three attribute sets Video attributes

Duration, number of views, commentaries receivedRating, number of times to be selected favoriteNumber of honor and external links

Three video groups of the user All video uploaded by the user Video responses Responded videos which this user response to

summing up 42 video attributes for each user

ANALYZING USER BEHAVIOR ATTRIBUTES User attributes

number of friends, number of videos uploaded, number of videos watched,number of videos added as favorite, numbers of video responses posted and received, numbers of subscriptions and subscribers, average time between video uploads, maximum number of videos uploaded in 24 hours.


Social network attributes Clustering coefficient

cc(i), is the ratio of the number of existing edges between i’s neighbors to the maximum possible number

Betweenness Reciprocity

Assortativity The ratio between the node (in/out) degree and the

average (in/out)degree of its neighbors. UserRank

ANALYZING USER BEHAVIOR ATTRIBUTES two well known feature selection methods.

Information gain (Chi Squared)

EVALUATION METRICS

use the standard information retrieval metrics Recall Precision Micro-F1

first computing global precision and recall values for all classes.

then calculating F1 Macro-F1

first calculating F1 values for each class in isolation then averaging over all classes

EVALUATION METRICS

confusion matrix

EXPERIMENTAL SETUP

libSVM - an open source SVM package allows searching for the best classifier

parameters using the training data provides a series of optimizations, including

normalization of all numerical attributes.

5-fold cross-validation. repeated 5 times with different seeds used to

shuffle the original data set. producing 25 different results for each test.

TWO CLASSIFICATION STRATEGIES

flat classification promoters (P), spammers (S), and legitimate

users (L)

hierarchical strategy first separate promoters (P) from non-promoters

(NP) heavy (HP) and light promoters (LP) legitimate users (L) and spammers (S)

FLAT CLASSIFICATION

confusion matrix obtained

The numbers presented are percentages relative to the total number of users in each class.

The diagonal indicates the recall in each class. no promoter was classified as legitimate user. 3.87% -

videos actually acquired popularity. harder to distinguish them from spammers.

FLAT CLASSIFICATION

41.91% - Legitimate users post their video responses to

popular responded videos(a typical behavior of spammers).

Micro-F1 = 87.5, with per-class F1 values are 90.8, 63.7, and 92.3

Macro-F1 = 82.2

HIERARCHICAL CLASSIfiCATION

Binary classification J parameter - one can give priority to one

class (e.g., spammers) over the other (e.g., legitimate users) .

promoters VS non-promoters

Macro-F1 = 93.44 Micro-F1 = 99.17

NON-PROMOTERS

We trained the classifier with the original training data without promoters.

with J=1

J = 0.1 - 24% VS 1% J = 3.0 - 71% VS 9%

The best solution depends on the system administrator’s objectives.

HEAVY AND LIGHT PROMOTERS

Aggressiveness Maximum number of video responses posted in a

24-hour period.

k-means clustering algorithm was used to separate promoters into two clusters.

Average aggressiveness Light promoters = 15.78 (CV=0.63) Heavy promoters = 107.54 (CV=0.61)

HEAVY AND LIGHT PROMOTERS

Binary classification retrained with the original training data

containing only promoters

REDUCING THE ATTRIBUTE SET

Two scenarios - Decreasing order of position in the χ2 ranking Evaluating classification when subsets of 10

attributes occupying contiguous positions

CONCLUSIONS

An effective solution to detect spammers and promoters in online video social networks.

Flat classification approach provides alternative to simply considering all users as legitimate.

Hierarchical approach explores different classification tradeoffs and provides more flexibility for the application.

Finally, we can produce significant benefits with only a small subset of less expensive attributes.

Spammers and promoters will evolve and adapt to anti-pollution strategies, periodical assessment of the classification process may be necessary.

detecting spammers and content promoters in online video social networks fabrício benevenuto ∗,...

Documents