using rank propagation for spam detection (webkdd 2006)

74
Using rank propagation and Probabilistic counting for Link-Based Spam Detection L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates Motivation Spam pages characterization Truncated PageRank Counting supporters Experiments Conclusions Using rank propagation and Probabilistic counting for Link-Based Spam Detection Luca Becchetti 1 , Carlos Castillo 1 ,Debora Donato 1 , Stefano Leonardi 1 and Ricardo Baeza-Yates 2 1. Universit` a di Roma “La Sapienza” – Rome, Italy 2. Yahoo! Research – Barcelona, Spain and Santiago, Chile August 20th, 2006

Upload: carlos-castillo

Post on 28-Jan-2015

108 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Using rank propagation and Probabilisticcounting for Link-Based Spam Detection

Luca Becchetti1 Carlos Castillo1Debora Donato1Stefano Leonardi1 and Ricardo Baeza-Yates2

1 Universita di Roma ldquoLa Sapienzardquo ndash Rome Italy2 Yahoo Research ndash Barcelona Spain and Santiago Chile

August 20th 2006

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation

2 Spam pages characterization

3 Truncated PageRank

4 Counting supporters

5 Experiments

6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information

+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn

+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 2: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation

2 Spam pages characterization

3 Truncated PageRank

4 Counting supporters

5 Experiments

6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information

+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn

+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 3: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information

+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn

+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 4: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information

+ Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn

+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 5: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn

+ On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 6: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

What is on the Web

Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V-4-gra + Get rich now now now

Graphic wwwmilliondollarhomepagecom

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 7: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (keywords + links)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 8: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Web spam (mostly keywords)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 9: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 10: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fake search engine

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 11: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 12: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem ldquonormalrdquo pages that are spam

Some content is introduced

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 13: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Problem borderline pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 14: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 15: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 16: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 17: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Definitions

Any deliberate action that is meant to triggeran unjustifiably favorable relevance or importance

for some Web page considering the pagersquos true value[Gyongyi and Garcia-Molina 2005]

any attempt to deceive a search enginersquos relevancy algorithm

or simply

anything that would not be doneif search engines did not exist

[Perkins 2001]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 18: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 19: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Link farms

Single-level farms can be detected by searching groups ofnodes sharing their out-links [Gibson et al 2005]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 20: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 21: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Linkminusbased spam

[Fetterly et al 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages

ldquoin a number of these distributions outlier values areassociated with web spamrdquo

Research goal

Statistical analysis of link-based spam

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 22: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Idea count ldquosupportersrdquo at different distances

Number of different nodes at a given distance

UK 18 mill nodes EUINT 860000 nodes

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

00

01

02

03

5 10 15 20 25 30

Freq

uenc

y

Distance

Average distance Average distance149 clicks 100 clicks

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 23: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 24: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

High and low-ranked pages are different

1 5 10 15 200

2

4

6

8

10

12

x 104

Distance

Num

ber o

f Nod

es

Top 0minus10Top 40minus50Top 60minus70

Areas below the curves are equal if we are in the samestrongly-connected component

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 25: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Metrics

Graph algorithms

Streamed algorithms

Symmetric algorithms

All shortest paths centrality betweenness clustering coefficient

(Strongly) connected componentsApproximate count of neighborsPageRank Truncated PageRank Linear RankHITS Salsa TrustRank

Breadth-first and depth-first searchCount of neighbors

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 26: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 27: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 28: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 29: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General functional ranking

Let P the row-normalized version of the citation matrix of agraph G = (V E )A functional ranking [Baeza-Yates et al 2006] is alink-based ranking algorithm to compute a scoring vector Wof the form

W =infinsum

t=0

damping(t)

NPt

There are many choices for damping(t) including simply alinear function that is as good as PageRank in practice

damping(t) = (1minus α)αt

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 30: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 31: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank

Reduce the direct contribution of the first levels of links

damping(t) =

0 t le T

Cαt t gt T

V No extra reading of the graph after PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 32: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for

9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 33: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes 0 lt α lt 1 damping factor Tge minus1 distance fortruncation

1 for i 1 N do Initialization2 R[i] larr (1minus α)((αT+1)N)3 if Tge 0 then4 Score[i] larr 05 else Calculate normal PageRank6 Score[i] larr R[i]7 end if8 end for9 distance = 110 while not converged do11 Aux larr 012 for src 1 N do Follow links in the graph13 for all link from src to dest do14 Aux[dest] larr Aux[dest] + R[src]outdegree(src)15 end for16 end for17 for i 1 N do Apply damping factor α18 R[i] larr Aux[i] timesα19 if distance gt T then Add to ranking value20 Score[i] larr Score[i] + R[i]21 end if22 end for23 distance = distance +124 end while25 return Score

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 34: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Truncated PageRank vs PageRank

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

1

10minus8

10minus6

10minus4

10minus9

10minus8

10minus7

10minus6

10minus5

10minus4

10minus3

Normal PageRank

Tru

ncat

ed P

ageR

ank

T=

4

Comparing PageRank and Truncated PageRank with T = 1and T = 4The correlation is high and decreases as more levels aretruncated

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 35: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 36: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 37: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Probabilistic counting

100010

100010

110000

000110

000011

100010

100011

111100111111

100011

Count bits setto estimatesupporters

Target page

Propagation ofbits using the

ldquoORrdquo operation

100010

Improvement of ANF algorithm [Palmer et al 2002] based onprobabilistic counting [Flajolet and Martin 1985]

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 38: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for

4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 39: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for

13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 40: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

General algorithm

Require N number of nodes d distance k bits1 for node 1 N bit 1 k do2 INIT(nodebit)3 end for4 for distance 1 d do Iteration step5 Aux larr 0k

6 for src 1 N do Follow links in the graph7 for all links from src to dest do8 Aux[dest] larr Aux[dest] OR V[srcmiddot]9 end for

10 end for11 V larr Aux12 end for13 for node 1 N do Estimate supporters14 Supporters[node] larr ESTIMATE( V[nodemiddot] )15 end for16 return Supporters

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 41: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability ε

by the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 42: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 43: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)

Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 44: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node varies

This means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 45: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Our estimator

Initialize all bits to one with probability εby the independence of the i minus th component Xi rsquos we have

P[Xi = 1] = 1minus (1minus ε)neighbors(node)

Estimator neighbors(node) = log(1minusε)

(1minus ones(node)

k

)Problem neighbors(node) can vary by orders of magnitudesas node variesThis means that for some values of ε the computed value ofones(node) might be k (or 0 depending on neighbors(node))with relatively high probability

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 46: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Adaptive estimator

if we knew neighbors(node) and chose ε = 1neighbors(node) we

would get

ones(node) (

1minus 1

e

)k 063k

Adaptive estimation

Repeat the above process for ε = 12 14 18 and lookfor the transitions from more than (1minus 1e)k ones to lessthan (1minus 1e)k ones

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 47: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 48: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or less

less than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 49: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Convergence

5 10 15 200

10

20

30

40

50

60

70

80

90

100

Iteration

Frac

tion

of n

odes

with

est

imat

es

d=1d=2d=3d=4d=5d=6d=7d=8

15 iterations for estimating the neighbors at distance 4 or lessless than 25 iterations for all distances up to 8

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 50: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Error rate

1 2 3 4 5 6 7 80

01

02

03

04

05

512 btimesi 768 btimesi 832 btimesi 960 btimesi 1152 btimesi 1216 btimesi 1344 btimesi 1408 btimesi

576 btimesi1152 btimesi

512 btimesi768 btimesi

832 btimesi

960 btimesi

1152 btimesi

1216 btimesi1344 btimesi 1408 btimesi

Distance

Ave

rage

Rel

ativ

e E

rror

Ours 64 bits epsilonminusonly estimatorOurs 64 bits combined estimatorANF 24 bits times 24 iterations (576 btimesi)ANF 24 bits times 48 iterations (1152 btimesi)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 51: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 52: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 53: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Test collection

UK collection

185 million pages downloaded from the UK domain

5344 hosts manually classified (6 of the hosts)

Classified entire hosts

V A few hosts are mixed spam and non-spam pages

X More coverage sample covers 32 of the pages

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 54: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 55: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 56: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Automatic classifier

We extracted (for the home page and the page withmaximum PageRank) PageRank Truncated PageRank at2 4 Supporters at 2 4

We measured

Precision = of spam hosts classified as spam

of hosts classified as spam

Recall = of spam hosts classified as spam

of spam hosts

and the two types of errors in spam classification

False positive rate = of normal hosts classified as spam

of normal hosts

False negative rate = of spam hosts classified as normal

of spam hosts

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 57: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Single-technique classifier

Classifier based on TrustRank uses as features the PageRankthe estimated non-spam mass and theestimated non-spam mass divided by PageRank

Classifier based on Truncated PageRank uses as features thePageRank the Truncated PageRank withtruncation distance t = 2 3 4 (with t = 1 itwould be just based on in-degree) and theTruncated PageRank divided by PageRank

Classifier based on Estimation of Supporters uses as featuresthe PageRank the estimation of supporters at agiven distance d = 2 3 4 and the estimation ofsupporters divided by PageRank

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 58: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=5)

Classifiers Spam class False False(pruning with M = 5) Prec Recall Pos Neg

TrustRank 082 050 21 50

Trunc PageRank t = 2 085 050 16 50Trunc PageRank t = 3 084 047 16 53Trunc PageRank t = 4 079 045 22 55

Est Supporters d = 2 078 060 32 40Est Supporters d = 3 083 064 24 36Est Supporters d = 4 086 064 20 36

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 59: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Comparison of single-technique classifier (M=30)

Classifiers Spam class False False(pruning with M = 30) Prec Recall Pos Neg

TrustRank 080 049 23 51

Trunc PageRank t = 2 082 043 18 57Trunc PageRank t = 3 081 042 18 58Trunc PageRank t = 4 077 043 24 57

Est Supporters d = 2 076 052 31 48Est Supporters d = 3 083 057 21 43Est Supporters d = 4 080 057 26 43

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 60: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Combined classifier

Spam class False FalsePruning Rules Precision Recall Pos Neg

M=5 49 087 080 20 20M=30 31 088 076 18 24

No pruning 189 085 079 26 21

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 61: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

1 Motivation2 Spam pages characterization3 Truncated PageRank4 Counting supporters5 Experiments6 Conclusions

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 62: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Summary of classifiers

(a) Precision and recall of spam detection

(b) Error rates of the spam classifiers

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 63: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 64: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 65: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 66: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 67: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 68: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 69: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Conclusions

V Link-based statistics to detect 80 of spam

X No magic bullet in link analysis

X Precision still low compared to e-mail spam filters

V Measure both home page and max PageRank page

V Host-based counts are important

Next step combine link analysis and content analysis

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 70: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Thank you

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 71: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Baeza-Yates R Boldi P and Castillo C (2006)

Generalizing PageRank Damping functions for link-basedranking algorithms

In Proceedings of SIGIR Seattle Washington USA ACMPress

Becchetti L Castillo C Donato D Leonardi S andBaeza-Yates R (2006)

Using rank propagation and probabilistic counting forlink-based spam detection

In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD) Pennsylvania USA ACM Press

Benczur A A Csalogany K Sarlos T and Uher M(2005)

Spamrank fully automatic link spam detection

In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web Chiba Japan

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 72: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Fetterly D Manasse M and Najork M (2004)

Spam damn spam and statistics Using statistical analysis tolocate spam web pages

In Proceedings of the seventh workshop on the Web anddatabases (WebDB) pages 1ndash6 Paris France

Flajolet P and Martin N G (1985)

Probabilistic counting algorithms for data base applications

Journal of Computer and System Sciences 31(2)182ndash209

Gibson D Kumar R and Tomkins A (2005)

Discovering large dense subgraphs in massive graphs

In VLDB rsquo05 Proceedings of the 31st international conferenceon Very large data bases pages 721ndash732 VLDB Endowment

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 73: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Gyongyi Z and Garcia-Molina H (2005)

Web spam taxonomy

In First International Workshop on Adversarial InformationRetrieval on the Web

Gyongyi Z Molina H G and Pedersen J (2004)

Combating web spam with trustrank

In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB) pages 576ndash587 TorontoCanada Morgan Kaufmann

Newman M E Strogatz S H and Watts D J (2001)

Random graphs with arbitrary degree distributions and theirapplications

Phys Rev E Stat Nonlin Soft Matter Phys 64(2 Pt 2)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions
Page 74: Using Rank Propagation for Spam Detection (WebKDD 2006)

Using rankpropagation and

Probabilisticcounting forLink-Based

Spam Detection

L BecchettiC CastilloD Donato

S Leonardi andR Baeza-Yates

Motivation

Spam pagescharacterization

TruncatedPageRank

Countingsupporters

Experiments

Conclusions

Ntoulas A Najork M Manasse M and Fetterly D (2006)

Detecting spam web pages through content analysis

In Proceedings of the World Wide Web conference pages83ndash92 Edinburgh Scotland

Palmer C R Gibbons P B and Faloutsos C (2002)

ANF a fast and scalable tool for data mining in massivegraphs

In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining pages81ndash90 New York NY USA ACM Press

Perkins A (2001)

The classification of search engine spam

Available online athttpwwwsilverdisccoukarticlesspam-classification

  • Motivation
  • Spam pages characterization
  • Truncated PageRank
  • Counting supporters
  • Experiments
  • Conclusions